Commit e52bc93

[8.9] [DOCS] Adds section about tokens to ELSER conceptual (backport #2568) (#2571)
Co-authored-by: István Zoltán Szabó <[email protected]>
1 parent 9415319 commit e52bc93

1 file changed: +19 -4 lines changed

docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc

Lines changed: 19 additions & 4 deletions
@@ -20,13 +20,28 @@ meaning and user intent, rather than exact keyword matches.
 ELSER is an out-of-domain model which means it does not require fine-tuning on
 your own data, making it adaptable for various use cases out of the box.

+
+[discrete]
+[[elser-tokens]]
+== Tokens - not synonyms
+
 ELSER expands the indexed and searched passages into collections of terms that
 are learned to co-occur frequently within a diverse set of training data. The
 terms that the text is expanded into by the model _are not_ synonyms for the
-search terms; they are learned associations. These expanded terms are weighted
-as some of them are more significant than others. Then the {es}
-{ref}/rank-features.html[rank features field type] is used to store the terms
-and weights at index time, and to search against later.
+search terms; they are learned associations capturing relevance. These expanded
+terms are weighted as some of them are more significant than others. Then the
+{es} {ref}/rank-features.html[rank features] field type is used to store the
+terms and weights at index time, and to search against later.
+
+This approach provides a more understandable search experience compared to
+vector embeddings. However, attempting to directly interpret the tokens and
+weights can be misleading, as the expansion essentially results in a vector in a
+very high-dimensional space. Consequently, certain tokens, especially those with
+low weight, contain information that is intertwined with other low-weight tokens
+in the representation. In this regard, they function similarly to a dense vector
+representation, making it challenging to separate their individual
+contributions. This complexity can potentially lead to misinterpretations if not
+carefully considered during analysis.


 [discrete]
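
The paragraph added in this commit describes the expansion as a vector in a very high-dimensional space, where low-weight tokens carry information jointly. A minimal Python sketch of that idea follows. The tokens and weights are invented for illustration, and the score is assumed to be a plain dot product over shared tokens, which is a simplification and not necessarily how {es} scores rank features.

```python
# Toy illustration only: the tokens, weights, and scoring function below are
# assumptions made for this sketch, not output from ELSER or Elasticsearch.

def dot(query, doc):
    """Sum the weight products over tokens present in both expansions."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


def contributions(query, doc):
    """Return (token, contribution) pairs for shared tokens, largest first."""
    parts = {t: w * doc[t] for t, w in query.items() if t in doc}
    return sorted(parts.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical expansions: one or two prominent tokens plus several
# low-weight ones, as a sparse token -> weight mapping.
query = {"pain": 2.0, "muscle": 1.5, "relief": 0.5,
         "injury": 0.5, "sore": 0.5, "gym": 0.5}
doc = {"muscle": 1.0, "relief": 1.0, "injury": 1.0,
       "sore": 1.0, "gym": 1.0, "stretch": 2.0}

score = dot(query, doc)
parts = contributions(query, doc)

# Combined contribution of the tokens that each add less than 1.0.
low_weight_total = sum(value for _, value in parts if value < 1.0)

print(score)             # 3.5
print(parts[0])          # ('muscle', 1.5)
print(low_weight_total)  # 2.0
```

In this toy case the four low-weight tokens together contribute more to the score than the single most prominent token, which is why reading any one of them in isolation can mislead.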
