Commit 7ac7c7b

szabosteve authored and mergify[bot] committed
[DOCS] Adds section about tokens to ELSER conceptual (#2568)
* [DOCS] Adds section about tokens to ELSER conceptual.
* [DOCS] Adds 'discrete' flag to section.

(cherry picked from commit f9c8a20)

# Conflicts:
#	docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
1 parent 9415319 commit 7ac7c7b

1 file changed: +23 -0 lines changed

docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc

Lines changed: 23 additions & 0 deletions
@@ -20,13 +20,36 @@ meaning and user intent, rather than exact keyword matches.
 ELSER is an out-of-domain model which means it does not require fine-tuning on
 your own data, making it adaptable for various use cases out of the box.

+
+[discrete]
+[[elser-tokens]]
+== Tokens - not synonyms
+
 ELSER expands the indexed and searched passages into collections of terms that
 are learned to co-occur frequently within a diverse set of training data. The
 terms that the text is expanded into by the model _are not_ synonyms for the
+<<<<<<< HEAD
 search terms; they are learned associations. These expanded terms are weighted
 as some of them are more significant than others. Then the {es}
 {ref}/rank-features.html[rank features field type] is used to store the terms
 and weights at index time, and to search against later.
+=======
+search terms; they are learned associations capturing relevance. These expanded
+terms are weighted as some of them are more significant than others. Then the
+{es} {ref}/sparse-vector.html[sparse vector]
+(or {ref}/rank-features.html[rank features]) field type is used to store the
+terms and weights at index time, and to search against later.
+>>>>>>> f9c8a202 ([DOCS] Adds section about tokens to ELSER conceptual (#2568))
+
+This approach provides a more understandable search experience compared to
+vector embeddings. However, attempting to directly interpret the tokens and
+weights can be misleading, as the expansion essentially results in a vector in a
+very high-dimensional space. Consequently, certain tokens, especially those with
+low weight, contain information that is intertwined with other low-weight tokens
+in the representation. In this regard, they function similarly to a dense vector
+representation, making it challenging to separate their individual
+contributions. This complexity can potentially lead to misinterpretations if not
+carefully considered during analysis.


 [discrete]
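
The added section describes text being expanded into weighted tokens that are stored at index time and matched at search time. As a rough illustration only, the Python sketch below models each expansion as a token-to-weight map and scores a query against a document as a dot product over shared tokens. The tokens and weights are invented for the example (they are not real ELSER output), and the dot-product scoring is an assumption made for illustration, not a statement of how {es} computes relevance.

```python
# Hypothetical sparse expansions: token -> weight.
# These values are made up for illustration; they are not ELSER output.
doc = {"search": 1.5, "semantic": 1.0, "retrieval": 0.5, "index": 0.25}
query = {"semantic": 2.0, "search": 1.0, "meaning": 0.5, "index": 0.5}


def sparse_dot(query_tokens, doc_tokens):
    """Sum the weight products over tokens present in both maps."""
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


def contributions(query_tokens, doc_tokens):
    """Per-token share of the score, largest contribution first."""
    shared = {
        token: query_tokens[token] * doc_tokens[token]
        for token in query_tokens
        if token in doc_tokens
    }
    return dict(sorted(shared.items(), key=lambda item: item[1], reverse=True))


score = sparse_dot(query, doc)
breakdown = contributions(query, doc)
```

In this toy data the low-weight `index` token adds only a small amount to the total, which mirrors the section's caution: individual low-weight tokens say little on their own, so a per-token breakdown like this should not be over-interpreted.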
