Several improvements to token filter types #4291

Merged
merged 8 commits into from
Apr 29, 2025
Changes from 7 commits
40 changes: 40 additions & 0 deletions specification/_types/analysis/StopWords.ts
@@ -17,6 +17,46 @@
* under the License.
*/

export enum StopWordLanguage {
_arabic_,
_armenian_,
_basque_,
_bengali_,
_brazilian_,
_bulgarian_,
_catalan_,
_cjk_,
_czech_,
_danish_,
_dutch_,
_english_,
_estonian_,
_finnish_,
_french_,
_galician_,
_german_,
_greek_,
_hindi_,
_hungarian_,
_indonesian_,
_irish_,
_italian_,
_latvian_,
_lithuanian_,
_norwegian_,
_persian_,
_portuguese_,
_romanian_,
_russian_,
_serbian_,
_sorani_,
_spanish_,
_swedish_,
_thai_,
_turkish_,
_none_
}

/**
* Language value, such as _arabic_ or _thai_. Defaults to _english_.
* Each language value corresponds to a predefined list of stop words in Lucene. See Stop words by language for supported language values and their stop words.

10 changes: 5 additions & 5 deletions specification/_types/analysis/analyzers.ts
@@ -23,7 +23,7 @@ import { IcuAnalyzer } from './icu-plugin'
import { KuromojiAnalyzer } from './kuromoji-plugin'
import { SnowballLanguage } from './languages'
import { NoriDecompoundMode } from './nori-plugin'
- import { StopWords } from './StopWords'
+ import { StopWords, StopWordLanguage } from './StopWords'

export class CustomAnalyzer {
type: 'custom'
@@ -56,7 +56,7 @@ export class FingerprintAnalyzer {
*
* @server_default _none_
*/
- stopwords?: StopWords
+ stopwords?: StopWordLanguage | string[]

Member

This will cause inconveniences for users in statically typed languages.

Could we please continue using the StopWords type here?

StopWords is currently defined as:

export type StopWords = string | string[]

Changing that to:

export type StopWords = StopWordLanguage | string[]

while keeping this file "as is", should do the trick.

What do you think?
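
A minimal sketch of the suggestion above, written as plain TypeScript rather than in the specification's own dialect. The string-literal union stands in for the StopWordLanguage enum (only three of its members are listed), and ExampleAnalyzer and describeStopWords are hypothetical names used only for illustration.

```typescript
// Stand-in for the StopWordLanguage enum; only a few members are listed here.
type StopWordLanguage = '_english_' | '_french_' | '_none_'

// The alias is narrowed in one place, as proposed in the comment above.
type StopWords = StopWordLanguage | string[]

// Hypothetical analyzer: it keeps referring to the alias, so its own
// definition does not change when the alias is narrowed.
interface ExampleAnalyzer {
  type: 'example'
  stopwords?: StopWords
}

// Hypothetical helper showing how a consumer can tell the two shapes apart.
function describeStopWords(value: StopWords): string {
  return Array.isArray(value)
    ? `custom list of ${value.length} word(s)`
    : `predefined list ${value}`
}

const byLanguage: ExampleAnalyzer = { type: 'example', stopwords: '_french_' }
const byList: ExampleAnalyzer = { type: 'example', stopwords: ['a', 'an', 'the'] }
```

With this shape, statically typed consumers see a single named type on every analyzer, and the set of accepted values is controlled from the alias alone.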

Contributor

I didn't comment on this because the Java client simplifies enum | string to string, so I assumed other statically typed clients would do something similar.

Member

For .NET I don’t do this since enums are way nicer to use.

In this case the union also includes string[]. Do you simplify that as well?

Contributor

Yeah, it also works there: enum | string[] -> string[].
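
One way a client could collapse the union into a single string[] type, as described above, is sketched below. This is an illustration only and not the Java client's actual code. normalizeStopWords is a hypothetical helper, and the literal union again stands in for the StopWordLanguage enum.

```typescript
// Stand-in for the StopWordLanguage enum; only a few members are listed here.
type StopWordLanguage = '_english_' | '_thai_' | '_none_'
type StopWords = StopWordLanguage | string[]

// Hypothetical helper: collapse "enum | string[]" into plain string[] by
// wrapping a single language value in a one-element list.
function normalizeStopWords(value: StopWords): string[] {
  return Array.isArray(value) ? [...value] : [value]
}

const fromLanguage = normalizeStopWords('_thai_')
const fromList = normalizeStopWords(['foo', 'bar'])
```

The trade-off is the one raised in the thread: a single string[] type is simple to consume, but the client no longer surfaces the named language values the way an enum would.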

Member Author

@flobernd I think I got what you mean. Made a change in db8c130. Let me know if that looks better to you.

Member

Perfect now! Thank you @JoshMock

/**
* The path to a file containing stop words.
*/
@@ -357,7 +357,7 @@ export class PatternAnalyzer {
*
* @server_default _none_
*/
- stopwords?: StopWords
+ stopwords?: StopWordLanguage | string[]
/**
* The path to a file containing stop words.
*/
@@ -394,7 +394,7 @@ export class StandardAnalyzer {
*
* @server_default _none_
*/
- stopwords?: StopWords
+ stopwords?: StopWordLanguage | string[]
/**
* The path to a file containing stop words.
*/
@@ -411,7 +411,7 @@ export class StopAnalyzer {
*
* @server_default _none_
*/
- stopwords?: StopWords
+ stopwords?: StopWordLanguage | string[]
/**
* The path to a file containing stop words.
*/

6 changes: 6 additions & 0 deletions specification/_types/analysis/kuromoji-plugin.ts
@@ -19,6 +19,7 @@

import { integer } from '@_types/Numeric'
import { CharFilterBase } from './char_filters'
import { StopWords } from './StopWords'
import { TokenizerBase } from './tokenizers'
import { TokenFilterBase } from './token_filters'

@@ -28,6 +29,11 @@ export class KuromojiAnalyzer {
user_dictionary?: string
}

export class JaStopTokenFilter extends TokenFilterBase {
type: 'ja_stop'
stopwords?: StopWords
}

export class KuromojiIterationMarkCharFilter extends CharFilterBase {
type: 'kuromoji_iteration_mark'
normalize_kana: boolean

5 changes: 5 additions & 0 deletions specification/_types/analysis/languages.ts
@@ -18,25 +18,30 @@
*/

export enum SnowballLanguage {
Arabic,
Armenian,
Basque,
Catalan,
Danish,
Dutch,
English,
Estonian,
Finnish,
French,
German,
German2,
Hungarian,
Italian,
Irish,
Kp,
Lithuanian,
Lovins,
Norwegian,
Porter,
Portuguese,
Romanian,
Russian,
Serbian,
Spanish,
Swedish,
Turkish

7 changes: 7 additions & 0 deletions specification/_types/analysis/nori-plugin.ts
@@ -18,6 +18,7 @@
*/

import { TokenizerBase } from './tokenizers'
import { TokenFilterBase } from './token_filters'

export enum NoriDecompoundMode {
discard,
@@ -32,3 +33,9 @@ export class NoriTokenizer extends TokenizerBase {
user_dictionary?: string
user_dictionary_rules?: string[]
}

export class NoriPartOfSpeechTokenFilter extends TokenFilterBase {
type: 'nori_part_of_speech'
/** An array of part-of-speech tags that should be removed. */
stoptags?: string[]
}
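
As a rough illustration of how the two filter types added in this change might be used, the sketch below models them as plain TypeScript interfaces and builds one definition of each. The interfaces are stand-ins for the specification classes, StopWords is written out as quoted in the review thread, and the filter names, the summarize helper, and the sample values are hypothetical.

```typescript
// StopWords as quoted in the review thread, before it was narrowed.
type StopWords = string | string[]

// Plain-TypeScript stand-ins for the two classes added in this change.
interface JaStopTokenFilter {
  type: 'ja_stop'
  stopwords?: StopWords
}

interface NoriPartOfSpeechTokenFilter {
  type: 'nori_part_of_speech'
  /** An array of part-of-speech tags that should be removed. */
  stoptags?: string[]
}

type ExampleTokenFilter = JaStopTokenFilter | NoriPartOfSpeechTokenFilter

// Hypothetical filter names and illustrative values, not taken from the PR.
const filters: Record<string, ExampleTokenFilter> = {
  my_ja_stop: { type: 'ja_stop', stopwords: ['_japanese_', 'ストップ'] },
  my_pos_filter: { type: 'nori_part_of_speech', stoptags: ['E', 'IC', 'J'] }
}

// The literal `type` field acts as a discriminant, so each branch sees
// only the properties that belong to that filter.
function summarize(filter: ExampleTokenFilter): string {
  return filter.type === 'ja_stop'
    ? `ja_stop with stopwords ${JSON.stringify(filter.stopwords ?? [])}`
    : `nori_part_of_speech removing ${(filter.stoptags ?? []).length} tag(s)`
}
```

Both properties are optional in the diff, so a definition that carries only the type field also type-checks.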