Commit 5a60ef9

[DOCS] Adds language identification to NLP inference examples (elastic#1929)

1 parent 83acb87 commit 5a60ef9

File tree

4 files changed

+114 -131 lines changed

docs/en/stack/ml/nlp/ml-nlp-inference.asciidoc

Lines changed: 112 additions & 18 deletions
@@ -27,20 +27,54 @@ image::images/ml-nlp-pipeline-ner.png[Creating a pipeline in the Stack Managemen
 . Add an {ref}/inference-processor.html[inference processor] to your pipeline:
 .. Click **Add a processor** and select the **Inference** processor type.
 .. Set **Model ID** to the name of your trained model, for example
-`elastic__distilbert-base-cased-finetuned-conll03-english`.
+`elastic__distilbert-base-cased-finetuned-conll03-english` or
+`lang_ident_model_1`.
+.. If you use the {lang-ident} model (`lang_ident_model_1`) that is provided in
+your cluster:
+... The input field name is assumed to be `text`. If you want to identify
+languages in a field with a different name, you must map your field name to
+`text` in the **Field map** section. For example:
++
+--
+[source,js]
+----
+{
+  "message": "text"
+}
+----
+// NOTCONSOLE
+--
+... You can also optionally add
+{ref}/inference-processor.html#inference-processor-classification-opt[classification configuration options]
+in the **Inference configuration** section. For example, to include the top five
+language predictions:
++
+--
+[source,js]
+----
+{
+  "classification":{
+    "num_top_classes":5
+  }
+}
+----
+// NOTCONSOLE
+--
 .. Click **Add** to save the processor.
 . Optional: Add a {ref}/set-processor.html[set processor] to index the ingest
 timestamp.
 .. Click **Add a processor** and select the **Set** processor type.
-.. Choose a name for the field (such as `timestamp`) and set its value to
+.. Choose a name for the field (such as `event.ingested`) and set its value to
 `{{{_ingest.timestamp}}}`. For more details, refer to
 {ref}/ingest.html#access-ingest-metadata[Access ingest metadata in a processor].
 .. Click **Add** to save the processor.
 . To test the pipeline, click **Add documents**.
-.. In the **Documents** tab, provide a sample document for testing. For example,
-to test a trained model that performs named entity recognition (NER):
+.. In the **Documents** tab, provide a sample document for testing.
 +
 --
+For example, to test a trained model that performs named entity recognition
+(NER):
+
 [source,js]
 ----
 [
@@ -52,8 +86,28 @@ to test a trained model that performs named entity recognition (NER):
 ]
 ----
 // NOTCONSOLE
+
+To test a trained model that performs {lang-ident}:
+
+[source,js]
+----
+[
+  {
+    "_source":{
+      "message":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
+    }
+  }
+]
+----
+// NOTCONSOLE
 --
 .. Click **Run the pipeline** and verify the pipeline worked as expected.
++
+--
+In the {lang-ident} example, the predicted value is the ISO identifier of the
+language with the highest probability. In this case, it should be `hu` for
+Hungarian.
+--
 .. If everything looks correct, close the panel, and click **Create
 pipeline**. The pipeline is now ready for use.

@@ -76,24 +130,24 @@ PUT ner-test
       "ml.inference.predicted_value": {"type": "annotated_text"},
       "ml.inference.model_id": {"type": "keyword"},
       "text_field": {"type": "text"},
-      "timestamp": {"type": "date"}
+      "event.ingested": {"type": "date"}
     }
   }
 }
 ----
+// TEST[skip:TBD]

-TIP: The `annotated_text` data type in this example is included in the
+TIP: To use the `annotated_text` data type in this example, you must install the
 {plugins}/mapper-annotated-text.html[mapper annotated text plugin]. For more
 installation details, refer to
 {cloud}/ec-adding-elastic-plugins.html[Add plugins provided with {ess}].

-
 You can then use the new pipeline to index some documents. For example, use a
-bulk indexing request with the `pipeline` query parameter:
+bulk indexing request with the `pipeline` query parameter for your NER pipeline:

 [source,console]
 ----
-POST /_bulk?pipeline=ner
+POST /_bulk?pipeline=my-ner-pipeline
 {"create":{"_index":"ner-test","_id":"1"}}
 {"text_field":"Hello, my name is Josh and I live in Berlin."}
 {"create":{"_index":"ner-test","_id":"2"}}
@@ -105,26 +159,66 @@ POST /_bulk?pipeline=ner
 {"create":{"_index":"ner-test","_id":"5"}}
 {"text_field":"Elasticsearch is built using Lucene, an open source search library."}
 ----
+// TEST[skip:TBD]
+
+Or use an individual indexing request with the `pipeline` query parameter for
+your {lang-ident} pipeline:
+
+[source,console]
+----
+POST lang-test/_doc?pipeline=my-lang-pipeline
+{
+  "message": "Mon pays ce n'est pas un pays, c'est l'hiver"
+}
+----
+// TEST[skip:TBD]
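Both requests above rely on the `pipeline` query parameter; the bulk variant additionally requires a newline-delimited body in which each action line is followed by its document source. A minimal sketch, not part of this commit, of assembling such a body (the document texts are taken from the example, the IDs are illustrative):

```python
import json

# Two of the documents from the documented bulk example.
texts = [
    "Hello, my name is Josh and I live in Berlin.",
    "Elasticsearch is built using Lucene, an open source search library.",
]

lines = []
for doc_id, text in enumerate(texts, start=1):
    # Action line first, then the document source on the next line.
    lines.append(json.dumps({"create": {"_index": "ner-test", "_id": str(doc_id)}}))
    lines.append(json.dumps({"text_field": text}))

# The bulk API requires the body to end with a newline.
body = "\n".join(lines) + "\n"
```

A client would POST this body to `/_bulk?pipeline=my-ner-pipeline` with the `application/x-ndjson` content type.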

 You can also use NLP pipelines when you are reindexing documents to a new
-destination. Refer to
-{ref}/docs-reindex.html#reindex-with-an-ingest-pipeline[Reindex with an ingest pipeline].
+destination. For example, since the
+{kibana-ref}/get-started.html#gs-get-data-into-kibana[sample web logs data set]
+contains a `message` text field, you can reindex it with your {lang-ident}
+pipeline:
+
+[source,console]
+----
+POST _reindex
+{
+  "source": {
+    "index": "kibana_sample_data_logs"
+  },
+  "dest": {
+    "index": "lang-test",
+    "pipeline": "my-lang-pipeline"
+  }
+}
+----
+// TEST[skip:TBD]
+
+However, those web log messages are unlikely to contain enough words for the
+model to accurately identify the language.

 [discrete]
 [[ml-nlp-inference-discover]]
 == View the results

-You can verify the results of the pipeline in **Discover**:
+Before you can verify the results of the pipelines, you must
+{kibana-ref}/data-views.html[create data views]. Then you can explore your data
+in **Discover**:

 [role="screenshot"]
-image::images/ml-nlp-discover-ner.png[An expanded view of predicted values in the Discover app,align="center"]
+image::images/ml-nlp-discover-ner.png[A document from the NER pipeline in the Discover app,align="center"]

 The `ml.inference.predicted_value` field contains the output from the inference
-processor. In this example, two documents were found to contain the `Elastic`
-organization entity.
+processor. In this NER example, there are two documents that contain the
+`Elastic` organization entity.
+
+In this {lang-ident} example, the `ml.inference.predicted_value` contains the
+ISO identifier of the language with the highest probability and the
+`ml.inference.top_classes` fields contain the top five most probable languages
+and their scores:

-NOTE: When you view the index for the first time in {kib}, you must
-{kibana-ref}/data-views.html[create a data view].
+[role="screenshot"]
+image::images/ml-nlp-discover-lang.png[A document from the {lang-ident} pipeline in the Discover app,align="center"]

 To learn more about ingest pipelines and all of the other processors that you
-can add, refer to {ref}/ingest.html[Ingest pipelines].
+can add, refer to {ref}/ingest.html[Ingest pipelines].
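The `ml.inference.top_classes` field holds the most probable languages in descending order of score, capped by `num_top_classes`. A toy sketch of that selection, using made-up probabilities rather than real model output:

```python
import heapq

# Made-up class probabilities over ISO language codes; not actual
# lang_ident_model_1 output.
probabilities = {
    "hu": 0.97, "lv": 0.012, "is": 0.008,
    "ga": 0.006, "tr": 0.003, "en": 0.001,
}

def top_classes(probs, num_top_classes):
    """Return the most probable classes, highest probability first."""
    return heapq.nlargest(num_top_classes, probs.items(), key=lambda kv: kv[1])

top = top_classes(probabilities, 5)
# predicted_value corresponds to the single most probable class.
predicted_value = top[0][0]
```

With these illustrative numbers, `predicted_value` is `hu` and `top` lists the five highest-scoring languages, mirroring what the processor writes into `ml.inference`.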

docs/en/stack/ml/nlp/ml-nlp-lang-ident.asciidoc

Lines changed: 2 additions & 113 deletions
@@ -5,10 +5,8 @@
 {lang-ident-cap} enables you to determine the language of text.

 A {lang-ident} model is provided in your cluster, which you can use in an
-{ref}/inference-processor.html[{infer} processor] of an ingest pipeline by
-using its model ID (`lang_ident_model_1`). The input field name is `text`. If
-you want to run {lang-ident} on a field with a different name, you must map your
-field name to `text` in the ingest processor settings.
+{infer} processor of an ingest pipeline by using its model ID
+(`lang_ident_model_1`). For an example, refer to <<ml-nlp-inference>>.

 The longer the text passed into the {lang-ident} model, the more accurately the
 model can identify the language. It is fairly accurate on short samples (for
@@ -76,115 +74,6 @@ script.
 | hmn | Hmong | ny | Chichewa | |
 |===

-[discrete]
-[[ml-lang-ident-example]]
-=== Example of {lang-ident}
-
-In the following example, we feed the {lang-ident} trained model a short
-Hungarian text that contains diacritics and a couple of English words. The
-model identifies the text correctly as Hungarian with high probability.
-
-[source,js]
-----------------------------------
-POST _ingest/pipeline/_simulate
-{
-  "pipeline":{
-    "processors":[
-      {
-        "inference":{
-          "model_id":"lang_ident_model_1", <1>
-          "inference_config":{
-            "classification":{
-              "num_top_classes":5 <2>
-            }
-          },
-          "field_map":{
-          }
-        }
-      }
-    ]
-  },
-  "docs":[
-    {
-      "_source":{ <3>
-        "text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
-      }
-    }
-  ]
-}
-----------------------------------
-//NOTCONSOLE
-
-<1> ID of the {lang-ident} trained model.
-<2> Specifies the number of languages to report by descending order of
-probability.
-<3> The source object that contains the text to identify.
-
-In the example above, the `num_top_classes` value indicates that only the top
-five languages (that is to say, the ones with the highest probability) are
-reported.
-
-The request returns the following response:
-
-[source,js]
-----------------------------------
-{
-  "docs" : [
-    {
-      "doc" : {
-        "_index" : "_index",
-        "_type" : "_doc",
-        "_id" : "_id",
-        "_source" : {
-          "text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.",
-          "ml" : {
-            "inference" : {
-              "top_classes" : [ <1>
-                {
-                  "class_name" : "hu",
-                  "class_probability" : 0.9999936063740517,
-                  "class_score" : 0.9999936063740517
-                },
-                {
-                  "class_name" : "lv",
-                  "class_probability" : 2.5020248433413966E-6,
-                  "class_score" : 2.5020248433413966E-6
-                },
-                {
-                  "class_name" : "is",
-                  "class_probability" : 1.0150420723037688E-6,
-                  "class_score" : 1.0150420723037688E-6
-                },
-                {
-                  "class_name" : "ga",
-                  "class_probability" : 6.67935962773335E-7,
-                  "class_score" : 6.67935962773335E-7
-                },
-                {
-                  "class_name" : "tr",
-                  "class_probability" : 5.591166324774555E-7,
-                  "class_score" : 5.591166324774555E-7
-                }
-              ],
-              "predicted_value" : "hu", <2>
-              "model_id" : "lang_ident_model_1"
-            }
-          }
-        },
-        "_ingest" : {
-          "timestamp" : "2020-01-22T14:25:14.644912Z"
-        }
-      }
-    }
-  ]
-}
-----------------------------------
-//NOTCONSOLE
-
-<1> Contains scores for the most probable languages.
-<2> The ISO identifier of the language with the highest probability.

 [discrete]
 [[ml-lang-ident-readings]]
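The example removed here (now covered in <<ml-nlp-inference>>) returned its prediction nested under `docs[].doc._source.ml.inference`. A small sketch, using a trimmed, hand-written response of that shape, of reading the predicted language back out:

```python
# Hand-written, trimmed _simulate response shaped like the removed example;
# probabilities are abbreviated, not real model output.
response = {
    "docs": [
        {
            "doc": {
                "_source": {
                    "ml": {
                        "inference": {
                            "top_classes": [
                                {"class_name": "hu", "class_probability": 0.9999936},
                                {"class_name": "lv", "class_probability": 2.5e-6},
                            ],
                            "predicted_value": "hu",
                            "model_id": "lang_ident_model_1",
                        }
                    }
                }
            }
        }
    ]
}

# Walk the nesting: docs[].doc._source.ml.inference
inference = response["docs"][0]["doc"]["_source"]["ml"]["inference"]
predicted = inference["predicted_value"]
best = inference["top_classes"][0]["class_name"]
```

Here `predicted` and `best` agree, since `predicted_value` is the highest-scoring entry in `top_classes`.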

0 commit comments