Commit 4a2ba67

szabosteve authored and mergify[bot] committed
[DOCS] Adds section about tokens to ELSER conceptual (#2568)
* [DOCS] Adds section about tokens to ELSER conceptual.
* [DOCS] Adds 'discrete' flag to section.

(cherry picked from commit f9c8a20)
1 parent ef2e45d commit 4a2ba67

1 file changed: +18 -3 lines changed


docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc

Lines changed: 18 additions & 3 deletions
@@ -20,15 +20,30 @@ meaning and user intent, rather than exact keyword matches.
 ELSER is an out-of-domain model which means it does not require fine-tuning on
 your own data, making it adaptable for various use cases out of the box.
 
+
+[discrete]
+[[elser-tokens]]
+== Tokens - not synonyms
+
 ELSER expands the indexed and searched passages into collections of terms that
 are learned to co-occur frequently within a diverse set of training data. The
 terms that the text is expanded into by the model _are not_ synonyms for the
-search terms; they are learned associations. These expanded terms are weighted
-as some of them are more significant than others. Then the {es}
-{ref}/sparse-vector.html[sparse vector]
+search terms; they are learned associations capturing relevance. These expanded
+terms are weighted as some of them are more significant than others. Then the
+{es} {ref}/sparse-vector.html[sparse vector]
 (or {ref}/rank-features.html[rank features]) field type is used to store the
 terms and weights at index time, and to search against later.
 
+This approach provides a more understandable search experience compared to
+vector embeddings. However, attempting to directly interpret the tokens and
+weights can be misleading, as the expansion essentially results in a vector in a
+very high-dimensional space. Consequently, certain tokens, especially those with
+low weight, contain information that is intertwined with other low-weight tokens
+in the representation. In this regard, they function similarly to a dense vector
+representation, making it challenging to separate their individual
+contributions. This complexity can potentially lead to misinterpretations if not
+carefully considered during analysis.
+
 
 [discrete]
 [[elser-req]]
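
The added section describes passages and queries being expanded into weighted terms that are stored at index time and searched against later. A minimal sketch of that idea follows. It assumes relevance is scored as a plain dot product over the tokens shared by the query and document expansions, which is a simplification and may not match how {es} actually scores these fields. The tokens, weights, and the helper name `sparse_dot` are hypothetical and are not real ELSER output.

```python
from typing import Dict

# Hypothetical expansions (token -> weight). The values are invented for
# illustration only; they are not produced by ELSER.
doc_expansion: Dict[str, float] = {
    "battery": 2.0,
    "charge": 1.5,
    "phone": 1.0,
    "power": 0.5,
}
query_expansion: Dict[str, float] = {
    "battery": 1.0,
    "power": 2.0,
    "laptop": 1.5,
}


def sparse_dot(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Sum the products of weights for tokens present in both expansions.

    Assumption: a simple dot product stands in for the real scoring function.
    """
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)


score = sparse_dot(query_expansion, doc_expansion)
print(score)
```

Only the shared tokens ("battery" and "power" here) contribute to the score. Because the score is a sum over many such terms, the contribution of a single low-weight token says little on its own, which is consistent with the caution in the new paragraph about interpreting individual tokens.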
