in the **Inference configuration** section. For example, to include the top five
+language predictions:
++
+--
+[source,js]
+----
+{
+"classification":{
+"num_top_classes":5
+}
+}
+----
+// NOTCONSOLE
+--
 .. Click **Add** to save the processor.
 . Optional: Add a {ref}/set-processor.html[set processor] to index the ingest
 timestamp.
 .. Click **Add a processor** and select the **Set** processor type.
-.. Choose a name for the field (such as `timestamp`) and set its value to
+.. Choose a name for the field (such as `event.ingested`) and set its value to
 `{{{_ingest.timestamp}}}`. For more details, refer to
 {ref}/ingest.html#access-ingest-metadata[Access ingest metadata in a processor].
 .. Click **Add** to save the processor.
 . To test the pipeline, click **Add documents**.
-.. In the **Documents** tab, provide a sample document for testing. For example,
-to test a trained model that performs named entity recognition (NER):
+.. In the **Documents** tab, provide a sample document for testing.
 +
 --
+For example, to test a trained model that performs named entity recognition
+(NER):
+
 [source,js]
 ----
 [
@@ -52,8 +86,28 @@ to test a trained model that performs named entity recognition (NER):
 ]
 ----
 // NOTCONSOLE
+
+To test a trained model that performs {lang-ident}:
+
+[source,js]
+----
+[
+{
+"_source":{
+"message":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
+}
+}
+]
+----
+// NOTCONSOLE
 --
 .. Click **Run the pipeline** and verify the pipeline worked as expected.
++
+--
+In the {lang-ident} example, the predicted value is the ISO identifier of the
+language with the highest probability. In this case, it should be `hu` for
+Hungarian.
+--
 .. If everything looks correct, close the panel, and click **Create
docs/en/stack/ml/nlp/ml-nlp-lang-ident.asciidoc
Lines changed: 2 additions & 113 deletions
@@ -5,10 +5,8 @@
 {lang-ident-cap} enables you to determine the language of text.
 
 A {lang-ident} model is provided in your cluster, which you can use in an
-{ref}/inference-processor.html[{infer} processor] of an ingest pipeline by
-using its model ID (`lang_ident_model_1`). The input field name is `text`. If
-you want to run {lang-ident} on a field with a different name, you must map your
-field name to `text` in the ingest processor settings.
+{infer} processor of an ingest pipeline by using its model ID
+(`lang_ident_model_1`). For an example, refer to <<ml-nlp-inference>>.
 
 The longer the text passed into the {lang-ident} model, the more accurately the
 model can identify the language. It is fairly accurate on short samples (for
@@ -76,115 +74,6 @@ script.
 | hmn | Hmong | ny | Chichewa | |
 |===
 
-[discrete]
-[[ml-lang-ident-example]]
-=== Example of {lang-ident}
-
-In the following example, we feed the {lang-ident} trained model a short
-Hungarian text that contains diacritics and a couple of English words. The
-model identifies the text correctly as Hungarian with high probability.
-
-[source,js]
-----------------------------------
-POST _ingest/pipeline/_simulate
-{
-"pipeline":{
-"processors":[
-{
-"inference":{
-"model_id":"lang_ident_model_1", <1>
-"inference_config":{
-"classification":{
-"num_top_classes":5 <2>
-}
-},
-"field_map":{
-
-}
-}
-}
-]
-},
-"docs":[
-{
-"_source":{ <3>
-"text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
-}
-}
-]
-}
-----------------------------------
-//NOTCONSOLE
-
-<1> ID of the {lang-ident} trained model.
-<2> Specifies the number of languages to report by descending order of
-probability.
-<3> The source object that contains the text to identify.
-
-
-In the example above, the `num_top_classes` value indicates that only the top
-five languages (that is to say, the ones with the highest probability) are
-reported.
-
-The request returns the following response:
-
-[source,js]
-----------------------------------
-{
-"docs" : [
-{
-"doc" : {
-"_index" : "_index",
-"_type" : "_doc",
-"_id" : "_id",
-"_source" : {
-"text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.",
-"ml" : {
-"inference" : {
-"top_classes" : [ <1>
-{
-"class_name" : "hu",
-"class_probability" : 0.9999936063740517,
-"class_score" : 0.9999936063740517
-},
-{
-"class_name" : "lv",
-"class_probability" : 2.5020248433413966E-6,
-"class_score" : 2.5020248433413966E-6
-},
-{
-"class_name" : "is",
-"class_probability" : 1.0150420723037688E-6,
-"class_score" : 1.0150420723037688E-6
-},
-{
-"class_name" : "ga",
-"class_probability" : 6.67935962773335E-7,
-"class_score" : 6.67935962773335E-7
-},
-{
-"class_name" : "tr",
-"class_probability" : 5.591166324774555E-7,
-"class_score" : 5.591166324774555E-7
-}
-],
-"predicted_value" : "hu", <2>
-"model_id" : "lang_ident_model_1"
-}
-}
-},
-"_ingest" : {
-"timestamp" : "2020-01-22T14:25:14.644912Z"
-}
-}
-}
-]
-}
-----------------------------------
-//NOTCONSOLE
-
-<1> Contains scores for the most probable languages.
-<2> The ISO identifier of the language with the highest probability.
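
Note on the field name: the removed text states that the model's input field is `text` and that any other field must be mapped to it, whereas the new sample document uses `message`. The following Python sketch is not part of the changed files. It illustrates, under assumptions, how a simulate request body with such a mapping could be assembled and how the top classes could be read from a response shaped like the removed sample. The helper names `build_simulate_request` and `top_languages` are hypothetical, and the `field_map` direction (document field to model field) is assumed from the removed description.

```python
import json


def build_simulate_request(text, source_field="message", num_top_classes=5):
    """Build a body for POST _ingest/pipeline/_simulate (hypothetical helper)."""
    inference = {
        "model_id": "lang_ident_model_1",
        "inference_config": {
            "classification": {"num_top_classes": num_top_classes}
        },
    }
    # Assumption: the model reads its input from `text`, so any other
    # document field is mapped onto it.
    if source_field != "text":
        inference["field_map"] = {source_field: "text"}
    return {
        "pipeline": {"processors": [{"inference": inference}]},
        "docs": [{"_source": {source_field: text}}],
    }


def top_languages(response):
    """Return (predicted_value, [(class_name, class_probability), ...])
    for the first simulated document (hypothetical helper)."""
    inference = response["docs"][0]["doc"]["_source"]["ml"]["inference"]
    classes = [
        (c["class_name"], c["class_probability"])
        for c in inference.get("top_classes", [])
    ]
    return inference["predicted_value"], classes


if __name__ == "__main__":
    body = build_simulate_request("Sziasztok! Ez egy rövid magyar szöveg.")
    print(json.dumps(body, indent=2, ensure_ascii=False))
```

The body printed here would be sent with whatever HTTP client is already in use; the sketch deliberately stops short of making the request.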