docs/en/stack/ml/nlp/ml-nlp-lang-ident.asciidoc
Lines changed: 114 additions & 2 deletions
@@ -24,9 +24,10 @@ language traditionally uses. These languages are marked in the supported
 languages table (see below) with the `Latn` subtag. {lang-ident-cap} supports
 Unicode input.

+
 [discrete]
 [[ml-lang-ident-supported-languages]]
-=== Supported languages
+== Supported languages

 The table below contains the ISO codes and the English names of the languages
 that {lang-ident} supports. If a language has a 2-letter `ISO 639-1` code, the
@@ -82,8 +83,119 @@ script.
 <!-- lint enable -->
 ////

+
+[discrete]
+[[ml-lang-ident-example]]
+== Example of {lang-ident}
+
+In the following example, we feed the {lang-ident} trained model a short
+Hungarian text that contains diacritics and a couple of English words. The
+model identifies the text correctly as Hungarian with high probability.
+
+[source,js]
+----------------------------------
+POST _ingest/pipeline/_simulate
+{
+  "pipeline":{
+    "processors":[
+      {
+        "inference":{
+          "model_id":"lang_ident_model_1", <1>
+          "inference_config":{
+            "classification":{
+              "num_top_classes":5 <2>
+            }
+          },
+          "field_map":{
+          }
+        }
+      }
+    ]
+  },
+  "docs":[
+    {
+      "_source":{ <3>
+        "text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
+      }
+    }
+  ]
+}
+----------------------------------
+//NOTCONSOLE
+
+<1> ID of the {lang-ident} trained model.
+<2> Specifies the number of languages to report by descending order of
+probability.
+<3> The source object that contains the text to identify.
+
+
+In the example above, the `num_top_classes` value indicates that only the top
+five languages (that is to say, the ones with the highest probability) are
+reported.
+
+The request returns the following response:
+
+[source,js]
+----------------------------------
+{
+  "docs" : [
+    {
+      "doc" : {
+        "_index" : "_index",
+        "_type" : "_doc",
+        "_id" : "_id",
+        "_source" : {
+          "text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.",
+          "ml" : {
+            "inference" : {
+              "top_classes" : [ <1>
+                {
+                  "class_name" : "hu",
+                  "class_probability" : 0.9999936063740517,
+                  "class_score" : 0.9999936063740517
+                },
+                {
+                  "class_name" : "lv",
+                  "class_probability" : 2.5020248433413966E-6,
+                  "class_score" : 2.5020248433413966E-6
+                },
+                {
+                  "class_name" : "is",
+                  "class_probability" : 1.0150420723037688E-6,
+                  "class_score" : 1.0150420723037688E-6
+                },
+                {
+                  "class_name" : "ga",
+                  "class_probability" : 6.67935962773335E-7,
+                  "class_score" : 6.67935962773335E-7
+                },
+                {
+                  "class_name" : "tr",
+                  "class_probability" : 5.591166324774555E-7,
+                  "class_score" : 5.591166324774555E-7
+                }
+              ],
+              "predicted_value" : "hu", <2>
+              "model_id" : "lang_ident_model_1"
+            }
+          }
+        },
+        "_ingest" : {
+          "timestamp" : "2020-01-22T14:25:14.644912Z"
+        }
+      }
+    }
+  ]
+}
+----------------------------------
+//NOTCONSOLE
+
+<1> Contains scores for the most probable languages.
+<2> The ISO identifier of the language with the highest probability.
+
+
 [discrete]
 [[ml-lang-ident-readings]]
-=== Further reading
+== Further reading

 * {blog-ref}multilingual-search-using-language-identification-in-elasticsearch[Multilingual search using {lang-ident} in {es}]
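
As a companion to the request and response added in this change, the following is a minimal Python sketch of how a client might assemble the simulate body and read the result. It is not part of the diff. The helper names `build_simulate_body` and `top_languages` are hypothetical, the sample response is abridged from the one shown above, and no HTTP call is made; sending the body to `_ingest/pipeline/_simulate` is left to whatever client you use.

```python
import json


def build_simulate_body(text, num_top_classes=5, model_id="lang_ident_model_1"):
    """Build the JSON body for POST _ingest/pipeline/_simulate (hypothetical helper)."""
    return {
        "pipeline": {
            "processors": [
                {
                    "inference": {
                        "model_id": model_id,
                        "inference_config": {
                            "classification": {"num_top_classes": num_top_classes}
                        },
                        "field_map": {},
                    }
                }
            ]
        },
        "docs": [{"_source": {"text": text}}],
    }


def top_languages(response):
    """Return one (predicted_value, [(class_name, class_probability), ...]) per doc."""
    results = []
    for entry in response["docs"]:
        inference = entry["doc"]["_source"]["ml"]["inference"]
        classes = [
            (c["class_name"], c["class_probability"])
            for c in inference["top_classes"]
        ]
        results.append((inference["predicted_value"], classes))
    return results


# Abridged copy of the response shown in the diff (first two classes only).
sample_response = {
    "docs": [
        {
            "doc": {
                "_source": {
                    "text": "Sziasztok! Ez egy rövid magyar szöveg.",
                    "ml": {
                        "inference": {
                            "top_classes": [
                                {"class_name": "hu", "class_probability": 0.9999936063740517},
                                {"class_name": "lv", "class_probability": 2.5020248433413966e-06},
                            ],
                            "predicted_value": "hu",
                            "model_id": "lang_ident_model_1",
                        }
                    },
                }
            }
        }
    ]
}

if __name__ == "__main__":
    body = build_simulate_body("Sziasztok! Ez egy rövid magyar szöveg.")
    print(json.dumps(body, ensure_ascii=False, indent=2))
    for predicted, classes in top_languages(sample_response):
        print(predicted, classes)
```

Keeping the body builder separate from the response parser means the parsing step can be checked against a recorded response, such as the one in this change, without a running cluster.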