vocab : BailingMoE : change possessive quantifiers to greedy #12677


Merged: 5 commits, Apr 2, 2025

Conversation

@CISC (Collaborator) commented Mar 31, 2025

The possessive quantifiers are causing weird issues, and atomic grouping does not seem to be supported, so revert to greedy.

See the following reports:

@bartowski1182 @nicoboss

@CISC CISC requested a review from ggerganov March 31, 2025 20:43
@ggerganov (Member)
Did you do tokenizer tests to make sure the results match with the reference tokenizer?

@CISC (Collaborator, Author) commented Apr 1, 2025

> Did you do tokenizer tests to make sure the results match with the reference tokenizer?

I did some basic tests, but I will admit that I'm not entirely sure how to set up a proper test. Any pointers?

@ggerganov (Member)

You need to run:

python convert_hf_to_gguf_update.py <hf_token>

This will download the reference tokenizers for all models into models/tokenizers and generate test files in models/ggml-vocab-...inp/out.

After that, create a "vocab-only" GGUF model:

# this is for llama - update to create one for the Bailing model
python3 convert_hf_to_gguf.py models/tokenizers/llama-spm/ --outfile models/ggml-vocab-llama-spm.gguf --vocab-only

Run the test-tokenizer-0 tool using the GGUF vocab and generated test files.
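Conceptually, test-tokenizer-0 tokenizes each test input with the C++ tokenizer and compares the resulting token IDs against the reference output generated by the update script. A minimal sketch of that comparison (hypothetical token IDs and helper name, not llama.cpp's actual file format):

```python
def tokens_match(cpp_tokens: list[int], reference_tokens: list[int]) -> bool:
    """Return True when the C++ tokenizer output matches the reference."""
    if len(cpp_tokens) != len(reference_tokens):
        return False
    return all(a == b for a, b in zip(cpp_tokens, reference_tokens))

# Hypothetical token IDs for the same input text:
reference = [1, 15043, 3186]   # from the HF reference tokenizer
cpp_out   = [1, 15043, 3186]   # from the llama.cpp tokenizer

assert tokens_match(cpp_out, reference)
assert not tokens_match(cpp_out, reference + [2])
print("token streams match")
```

If the pre-tokenizer regex splits the text differently (e.g. possessive vs greedy quantifiers), the two token streams diverge and the test fails.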

@CISC (Collaborator, Author) commented Apr 1, 2025

> Run the test-tokenizer-0 tool using the GGUF vocab and generated test files.

Ah, I missed the relationship between these files; I see now. Added them for future tests.

main : tokenizing: 'ggml-vocab-ling-plus.gguf.inp'
main : text size: 1944
main : tokenized in 1.466 ms (cpp)
main : tokens: 814
main : tokens written to 'ggml-vocab-ling-plus.gguf.inp.tokcpp'

Tests passed

@CISC (Collaborator, Author) commented Apr 1, 2025

@ggerganov gentle ping :)

@ggerganov (Member)

Nice, but I didn't mean for you to commit the generated test files. At some point we stopped source-controlling them because they add non-negligible data (in this case 5MB). The idea is to just generate and test them locally.

@ggerganov (Member) left a review:

Merge after removing the vocab files.

@CISC CISC merged commit 83a88bd into ggml-org:master Apr 2, 2025
47 checks passed
@CISC CISC deleted the fix-bailing-vocab-regex branch April 2, 2025 09:21
@bartowski1182 (Contributor) commented Apr 3, 2025

Huh, it seems to be CUDA-related? If I switch to CPU only it's able to tokenize no problem.

False alarm all around; I was accidentally using an older build on my GPU. Ignore me :) Thanks so much for the fix!
