Adds a single corpus text file of new training data to the custom language model.
Use multiple requests to submit multiple corpus text files.
Only the owner of a custom model can use this method to add a corpus to the model.

Submit a plain text file that contains sample sentences from the domain of interest to enable the service to extract words in context.
The more sentences you add that represent the context in which speakers use words from the domain, the better the service's recognition accuracy.
Adding a corpus does not affect the custom model until you train the model for the new data by using the Train a custom model method.

Use the following guidelines to prepare a corpus text file:

Provide a plain text file that is encoded in UTF-8 if it contains non-ASCII characters.
The service assumes UTF-8 encoding if it encounters such characters.

Include each sentence of the corpus on its own line, terminating each line with a carriage return.
Including multiple sentences on the same line can degrade accuracy.

Use consistent capitalization for words in the corpus.
The words resource is case-sensitive; mix upper- and lowercase letters and use capitalization only when intended.

Beware of typographical errors.
The service assumes that typos are new words; unless you correct them before training the model, the service adds them to the model's vocabulary.
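The formatting guidelines above (UTF-8 encoding, one sentence per line) can be applied when generating the corpus file. The following is a minimal sketch, not part of the service: `write_corpus` is a hypothetical helper, and it assumes that a plain line feed at the end of each line satisfies the "carriage return" guideline.

```python
def write_corpus(sentences, path):
    """Write one sentence per line to a UTF-8 text file.

    Hypothetical helper for illustration. Returns the number of lines written.
    """
    written = 0
    # newline="\n" keeps the line ending the same on every platform
    # (assumption: a line feed is an acceptable line terminator).
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for sentence in sentences:
            # Collapse internal whitespace, including stray line breaks,
            # so that each sentence occupies exactly one line.
            line = " ".join(sentence.split())
            if not line:
                continue  # skip empty entries
            fh.write(line + "\n")
            written += 1
    return written
```

The helper does not split text into sentences or correct capitalization and typos; those steps still need to happen before the sentences are passed in.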
The service automatically does the following:

Converts numbers to their equivalent words.
For example, 500 becomes five hundred, and 0.15 becomes zero point fifteen.

Removes the following punctuation and special characters:

! @ # $ % ^ & * - + = ~ _ . , ; : ( ) < > [ ] { }

Ignores phrases enclosed in ( ) (parentheses), < > (angle brackets), [ ] (square brackets), and { } (curly braces).

Converts tokens that include certain symbols to meaningful strings.
For example, the service:

Converts a $ (dollar sign) followed by a number to its string representation.
For example, $100 becomes one hundred dollars.

Converts a % (percent sign) preceded by a number to its string representation.
For example, 100% becomes one hundred percent.

This list is not exhaustive; the service makes similar adjustments for other characters as needed.
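Because the service ignores phrases enclosed in brackets, it can be useful to flag such phrases locally before uploading a corpus. The sketch below is only a local approximation, not the service's own logic: `find_ignored_phrases` is a hypothetical helper, and its regular expression does not handle nested brackets.

```python
import re

# Matches a phrase enclosed in ( ), < >, [ ], or { }.
# Local approximation only; nested brackets are not handled.
BRACKETED = re.compile(r"\([^)]*\)|<[^>]*>|\[[^\]]*\]|\{[^}]*\}")


def find_ignored_phrases(line):
    """Return the bracketed phrases in a corpus line, in order of appearance."""
    return BRACKETED.findall(line)


if __name__ == "__main__":
    sample = "Take 500 mg (twice daily) of [brand name] now"
    for phrase in find_ignored_phrases(sample):
        print("likely ignored by the service:", phrase)
```

Running such a check over each line shows which text is unlikely to contribute words to the model, so it can be rewritten without brackets if it matters.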
The call returns an HTTP 201 response code if the corpus is valid. It then asynchronously pre-processes the contents of the corpus and automatically extracts new words that it finds.
This can take on the order of a minute or two to complete, depending on the total number of words and the number of new words in the corpus, as well as the current load on the service.
You cannot submit requests to add additional corpora or words to the custom model, or to train the model, until the service's analysis of the corpus for the current request completes.
Use the List corpora method to check the status of the analysis.
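Since the analysis runs asynchronously, a client typically polls the corpus status before sending further requests. The following is a minimal sketch of such a loop. Everything in it is an assumption for illustration: `fetch_status` stands for any callable that wraps the List corpora method and returns the status string of the corpus, and the default `done_status` value of `"analyzed"` should be checked against the service's reference.

```python
import time


def wait_for_corpus(fetch_status, done_status="analyzed", interval=5.0,
                    timeout=300.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the corpus reports done_status or the timeout elapses.

    fetch_status: hypothetical callable returning the current status string.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == done_status:
            return status
        if clock() >= deadline:
            raise TimeoutError("corpus analysis did not finish; last status: %r" % status)
        sleep(interval)  # wait before asking again
```

An interval of a few seconds fits the stated processing time of a minute or two; requests to add words or to train the model would follow only after the function returns.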
The service auto-populates the model's words resource with any word that is not found in its base vocabulary; these are referred to as out-of-vocabulary (OOV) words.
You can use the List custom words method to examine the words resource, and use the other words methods to eliminate typos and modify how words are pronounced as needed.

To add a corpus file that has the same name as an existing corpus, set the allow_overwrite query parameter to true; otherwise, the request fails.
Overwriting an existing corpus causes the service to process the corpus text file and extract OOV words anew.
Before doing so, it removes any OOV words associated with the existing corpus from the model's words resource unless they were also added by another corpus or they have been modified in some way with the Add custom words or Add a custom word method.
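The allow_overwrite setting travels as a query parameter on the request URL. The sketch below shows one way to build such a URL. The path layout `/v1/customizations/{customization_id}/corpora/{corpus_name}` is an assumption that this text does not confirm, and `corpus_url` is a hypothetical helper; only the `allow_overwrite` parameter name comes from the description above.

```python
from urllib.parse import quote, urlencode


def corpus_url(base_url, customization_id, corpus_name, allow_overwrite=False):
    """Build the URL for adding a corpus (path layout is an assumption)."""
    path = "/v1/customizations/{}/corpora/{}".format(
        quote(customization_id, safe=""),  # escape characters unsafe in a path
        quote(corpus_name, safe=""),
    )
    url = base_url.rstrip("/") + path
    if allow_overwrite:
        # Only sent when overwriting; without it, a name clash makes the request fail.
        url += "?" + urlencode({"allow_overwrite": "true"})
    return url
```

Leaving the parameter off by default guards against replacing an existing corpus by accident.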
The service limits the overall amount of data that you can add to a custom model to a maximum of 10 million total words from all corpora combined.
Also, you can add no more than 30 thousand new words to a model; this includes words that the service extracts from corpora and words that you add directly.
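A rough local check can show whether a new corpus is likely to fit within the 10 million word limit. The sketch below is an estimate under a stated assumption: it counts whitespace-separated tokens, which may differ from how the service counts words after its own normalization. The names `count_words` and `fits_word_limit` are hypothetical.

```python
MAX_CORPUS_WORDS = 10_000_000  # total across all corpora, as stated above


def count_words(lines):
    """Count whitespace-separated tokens (an approximation of the service's count)."""
    return sum(len(line.split()) for line in lines)


def fits_word_limit(existing_words, new_lines, limit=MAX_CORPUS_WORDS):
    """Return True if the new lines keep the combined total within the limit."""
    return existing_words + count_words(new_lines) <= limit
```

The 30 thousand new-word limit cannot be estimated this way, because it depends on which words are absent from the service's base vocabulary.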