Commit 5a60ef9

[DOCS] Adds language identification to NLP inference examples (elastic#1929)

1 parent 83acb87 commit 5a60ef9

File tree

4 files changed

+114 -131 lines changed

docs/en/stack/ml/nlp/ml-nlp-inference.asciidoc

Lines changed: 112 additions & 18 deletions
@@ -27,20 +27,54 @@ image::images/ml-nlp-pipeline-ner.png[Creating a pipeline in the Stack Managemen
 . Add an {ref}/inference-processor.html[inference processor] to your pipeline:
 .. Click **Add a processor** and select the **Inference** processor type.
 .. Set **Model ID** to the name of your trained model, for example
-`elastic__distilbert-base-cased-finetuned-conll03-english`.
+`elastic__distilbert-base-cased-finetuned-conll03-english` or
+`lang_ident_model_1`.
+.. If you use the {lang-ident} model (`lang_ident_model_1`) that is provided in
+your cluster:
+... The input field name is assumed to be `text`. If you want to identify
+languages in a field with a different name, you must map your field name to
+`text` in the **Field map** section. For example:
++
+--
+[source,js]
+----
+{
+  "message": "text"
+}
+----
+// NOTCONSOLE
+--
+... You can also optionally add
+{ref}/inference-processor.html#inference-processor-classification-opt[classification configuration options]
+in the **Inference configuration** section. For example, to include the top five
+language predictions:
++
+--
+[source,js]
+----
+{
+  "classification":{
+    "num_top_classes":5
+  }
+}
+----
+// NOTCONSOLE
+--
 .. Click **Add** to save the processor.
 . Optional: Add a {ref}/set-processor.html[set processor] to index the ingest
 timestamp.
 .. Click **Add a processor** and select the **Set** processor type.
-.. Choose a name for the field (such as `timestamp`) and set its value to
+.. Choose a name for the field (such as `event.ingested`) and set its value to
 `{{{_ingest.timestamp}}}`. For more details, refer to
 {ref}/ingest.html#access-ingest-metadata[Access ingest metadata in a processor].
 .. Click **Add** to save the processor.
 . To test the pipeline, click **Add documents**.
-.. In the **Documents** tab, provide a sample document for testing. For example,
-to test a trained model that performs named entity recognition (NER):
+.. In the **Documents** tab, provide a sample document for testing.
 +
 --
+For example, to test a trained model that performs named entity recognition
+(NER):
+
 [source,js]
 ----
 [
@@ -52,8 +86,28 @@ to test a trained model that performs named entity recognition (NER):
 ]
 ----
 // NOTCONSOLE
+
+To test a trained model that performs {lang-ident}:
+
+[source,js]
+----
+[
+  {
+    "_source":{
+      "message":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
+    }
+  }
+]
+----
+// NOTCONSOLE
 --
 .. Click **Run the pipeline** and verify the pipeline worked as expected.
++
+--
+In the {lang-ident} example, the predicted value is the ISO identifier of the
+language with the highest probability. In this case, it should be `hu` for
+Hungarian.
+--
 .. If everything looks correct, close the panel, and click **Create
 pipeline**. The pipeline is now ready for use.

@@ -76,24 +130,24 @@ PUT ner-test
       "ml.inference.predicted_value": {"type": "annotated_text"},
       "ml.inference.model_id": {"type": "keyword"},
       "text_field": {"type": "text"},
-      "timestamp": {"type": "date"}
+      "event.ingested": {"type": "date"}
     }
   }
 }
 ----
+// TEST[skip:TBD]

-TIP: The `annotated_text` data type in this example is included in the
+TIP: To use the `annotated_text` data type in this example, you must install the
 {plugins}/mapper-annotated-text.html[mapper annotated text plugin]. For more
 installation details, refer to
 {cloud}/ec-adding-elastic-plugins.html[Add plugins provided with {ess}].

-
 You can then use the new pipeline to index some documents. For example, use a
-bulk indexing request with the `pipeline` query parameter:
+bulk indexing request with the `pipeline` query parameter for your NER pipeline:

 [source,console]
 ----
-POST /_bulk?pipeline=ner
+POST /_bulk?pipeline=my-ner-pipeline
 {"create":{"_index":"ner-test","_id":"1"}}
 {"text_field":"Hello, my name is Josh and I live in Berlin."}
 {"create":{"_index":"ner-test","_id":"2"}}
@@ -105,26 +159,66 @@ POST /_bulk?pipeline=ner
 {"create":{"_index":"ner-test","_id":"5"}}
 {"text_field":"Elasticsearch is built using Lucene, an open source search library."}
 ----
+// TEST[skip:TBD]
+
+Or use an individual indexing request with the `pipeline` query parameter for
+your {lang-ident} pipeline:
+
+[source,console]
+----
+POST lang-test/_doc?pipeline=my-lang-pipeline
+{
+  "message": "Mon pays ce n'est pas un pays, c'est l'hiver"
+}
+----
+// TEST[skip:TBD]
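Both requests above rely on the `pipeline` query parameter; the bulk variant additionally requires a newline-delimited body in which each action line is followed by its document source. A minimal sketch, not part of this commit, of assembling such a body (the document texts are taken from the example, the IDs are illustrative):

```python
import json

# Two of the documents from the documented bulk example.
texts = [
    "Hello, my name is Josh and I live in Berlin.",
    "Elasticsearch is built using Lucene, an open source search library.",
]

lines = []
for doc_id, text in enumerate(texts, start=1):
    # Action line first, then the document source on the next line.
    lines.append(json.dumps({"create": {"_index": "ner-test", "_id": str(doc_id)}}))
    lines.append(json.dumps({"text_field": text}))

# The bulk API requires the body to end with a newline.
body = "\n".join(lines) + "\n"
```

A client would POST this body to `/_bulk?pipeline=my-ner-pipeline` with the `application/x-ndjson` content type.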

 You can also use NLP pipelines when you are reindexing documents to a new
-destination. Refer to
-{ref}/docs-reindex.html#reindex-with-an-ingest-pipeline[Reindex with an ingest pipeline].
+destination. For example, since the
+{kibana-ref}/get-started.html#gs-get-data-into-kibana[sample web logs data set]
+contains a `message` text field, you can reindex it with your {lang-ident}
+pipeline:
+
+[source,console]
+----
+POST _reindex
+{
+  "source": {
+    "index": "kibana_sample_data_logs"
+  },
+  "dest": {
+    "index": "lang-test",
+    "pipeline": "my-lang-pipeline"
+  }
+}
+----
+// TEST[skip:TBD]
+
+However, those web log messages are unlikely to contain enough words for the
+model to accurately identify the language.

 [discrete]
 [[ml-nlp-inference-discover]]
 == View the results

-You can verify the results of the pipeline in **Discover**:
+Before you can verify the results of the pipelines, you must
+{kibana-ref}/data-views.html[create data views]. Then you can explore your data
+in **Discover**:

 [role="screenshot"]
-image::images/ml-nlp-discover-ner.png[An expanded view of predicted values in the Discover app,align="center"]
+image::images/ml-nlp-discover-ner.png[A document from the NER pipeline in the Discover app,align="center"]

 The `ml.inference.predicted_value` field contains the output from the inference
-processor. In this example, two documents were found to contain the `Elastic`
-organization entity.
+processor. In this NER example, there are two documents that contain the
+`Elastic` organization entity.
+
+In this {lang-ident} example, the `ml.inference.predicted_value` contains the
+ISO identifier of the language with the highest probability and the
+`ml.inference.top_classes` fields contain the top five most probable languages
+and their scores:

-NOTE: When you view the index for the first time in {kib}, you must
-{kibana-ref}/data-views.html[create a data view].
+[role="screenshot"]
+image::images/ml-nlp-discover-lang.png[A document from the {lang-ident} pipeline in the Discover app,align="center"]

 To learn more about ingest pipelines and all of the other processors that you
-can add, refer to {ref}/ingest.html[Ingest pipelines].
+can add, refer to {ref}/ingest.html[Ingest pipelines].
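The `ml.inference.top_classes` field holds the most probable languages in descending order of score, capped by `num_top_classes`. A toy sketch of that selection, using made-up probabilities rather than real model output:

```python
import heapq

# Made-up class probabilities over ISO language codes; not actual
# lang_ident_model_1 output.
probabilities = {
    "hu": 0.97, "lv": 0.012, "is": 0.008,
    "ga": 0.006, "tr": 0.003, "en": 0.001,
}

def top_classes(probs, num_top_classes):
    """Return the most probable classes, highest probability first."""
    return heapq.nlargest(num_top_classes, probs.items(), key=lambda kv: kv[1])

top = top_classes(probabilities, 5)
# predicted_value corresponds to the single most probable class.
predicted_value = top[0][0]
```

With these illustrative numbers, `predicted_value` is `hu` and `top` lists the five highest-scoring languages, mirroring what the processor writes into `ml.inference`.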

docs/en/stack/ml/nlp/ml-nlp-lang-ident.asciidoc

Lines changed: 2 additions & 113 deletions
@@ -5,10 +5,8 @@
 {lang-ident-cap} enables you to determine the language of text.

 A {lang-ident} model is provided in your cluster, which you can use in an
-{ref}/inference-processor.html[{infer} processor] of an ingest pipeline by
-using its model ID (`lang_ident_model_1`). The input field name is `text`. If
-you want to run {lang-ident} on a field with a different name, you must map your
-field name to `text` in the ingest processor settings.
+{infer} processor of an ingest pipeline by using its model ID
+(`lang_ident_model_1`). For an example, refer to <<ml-nlp-inference>>.

 The longer the text passed into the {lang-ident} model, the more accurately the
 model can identify the language. It is fairly accurate on short samples (for
@@ -76,115 +74,6 @@ script.
 | hmn | Hmong | ny | Chichewa | |
 |===

-[discrete]
-[[ml-lang-ident-example]]
-=== Example of {lang-ident}
-
-In the following example, we feed the {lang-ident} trained model a short
-Hungarian text that contains diacritics and a couple of English words. The
-model identifies the text correctly as Hungarian with high probability.
-
-[source,js]
-----------------------------------
-POST _ingest/pipeline/_simulate
-{
-  "pipeline":{
-    "processors":[
-      {
-        "inference":{
-          "model_id":"lang_ident_model_1", <1>
-          "inference_config":{
-            "classification":{
-              "num_top_classes":5 <2>
-            }
-          },
-          "field_map":{
-          }
-        }
-      }
-    ]
-  },
-  "docs":[
-    {
-      "_source":{ <3>
-        "text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
-      }
-    }
-  ]
-}
-----------------------------------
-//NOTCONSOLE
-
-<1> ID of the {lang-ident} trained model.
-<2> Specifies the number of languages to report by descending order of
-probability.
-<3> The source object that contains the text to identify.
-
-In the example above, the `num_top_classes` value indicates that only the top
-five languages (that is to say, the ones with the highest probability) are
-reported.
-
-The request returns the following response:
-
-[source,js]
-----------------------------------
-{
-  "docs" : [
-    {
-      "doc" : {
-        "_index" : "_index",
-        "_type" : "_doc",
-        "_id" : "_id",
-        "_source" : {
-          "text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.",
-          "ml" : {
-            "inference" : {
-              "top_classes" : [ <1>
-                {
-                  "class_name" : "hu",
-                  "class_probability" : 0.9999936063740517,
-                  "class_score" : 0.9999936063740517
-                },
-                {
-                  "class_name" : "lv",
-                  "class_probability" : 2.5020248433413966E-6,
-                  "class_score" : 2.5020248433413966E-6
-                },
-                {
-                  "class_name" : "is",
-                  "class_probability" : 1.0150420723037688E-6,
-                  "class_score" : 1.0150420723037688E-6
-                },
-                {
-                  "class_name" : "ga",
-                  "class_probability" : 6.67935962773335E-7,
-                  "class_score" : 6.67935962773335E-7
-                },
-                {
-                  "class_name" : "tr",
-                  "class_probability" : 5.591166324774555E-7,
-                  "class_score" : 5.591166324774555E-7
-                }
-              ],
-              "predicted_value" : "hu", <2>
-              "model_id" : "lang_ident_model_1"
-            }
-          }
-        },
-        "_ingest" : {
-          "timestamp" : "2020-01-22T14:25:14.644912Z"
-        }
-      }
-    }
-  ]
-}
-----------------------------------
-//NOTCONSOLE
-
-<1> Contains scores for the most probable languages.
-<2> The ISO identifier of the language with the highest probability.

 [discrete]
 [[ml-lang-ident-readings]]
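The example removed here (now covered in <<ml-nlp-inference>>) returned its prediction nested under `docs[].doc._source.ml.inference`. A small sketch, using a trimmed, hand-written response of that shape, of reading the predicted language back out:

```python
# Hand-written, trimmed _simulate response shaped like the removed example;
# probabilities are abbreviated, not real model output.
response = {
    "docs": [
        {
            "doc": {
                "_source": {
                    "ml": {
                        "inference": {
                            "top_classes": [
                                {"class_name": "hu", "class_probability": 0.9999936},
                                {"class_name": "lv", "class_probability": 2.5e-6},
                            ],
                            "predicted_value": "hu",
                            "model_id": "lang_ident_model_1",
                        }
                    }
                }
            }
        }
    ]
}

# Walk the nesting: docs[].doc._source.ml.inference
inference = response["docs"][0]["doc"]["_source"]["ml"]["inference"]
predicted = inference["predicted_value"]
best = inference["top_classes"][0]["class_name"]
```

Here `predicted` and `best` agree, since `predicted_value` is the highest-scoring entry in `top_classes`.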

0 commit comments