Skip to content

Add missing tokenizers #2877

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 9, 2024
Merged

Add missing tokenizers #2877

merged 2 commits into from
Sep 9, 2024

Conversation

pquentin
Copy link
Member

@pquentin pquentin commented Sep 8, 2024

And also:

  • Sort tokenizers alphabetically
  • Move NoriTokenizer to its own file like other plugins

I can split this in multiple commits if that would help review.

@l-trotta
Copy link
Contributor

l-trotta commented Sep 8, 2024

Where did you find PathHierarchyPascalCaseTokenizer? asking because I didn't find it in the server code or the documentation so I'm wondering if I'm missing some references. Also taking a look at the server code and discovering how many tokenizers/analyzers/normalizers we're missing... :')

@pquentin
Copy link
Member Author

pquentin commented Sep 9, 2024

Where did you find PathHierarchyPascalCaseTokenizer?

There are currently two ways to use a Path hierarchy tokenizer: with path_hierarchy (snake_case, the default) and PathHierarchy (PascalCase, presumably for legacy reasons). OK, definitely for legacy reasons: elastic/elasticsearch#15785. (Looks like Elasticsearch developers call it camelCase instead, but the idea is the same.)

When I initially added the PascalCase version I was simply fixing tests and did not stop to understand why we had such a test. But now I'll make sure to skip that test in the flights recorder.

Copy link
Contributor

@l-trotta l-trotta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pquentin pquentin merged commit cb2c9a0 into main Sep 9, 2024
6 checks passed
@pquentin pquentin deleted the tokenizers branch September 9, 2024 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants