vocab : BailingMoE : change possessive quantifiers to greedy #12677


Merged: 5 commits, Apr 2, 2025

Conversation

@CISC (Collaborator) commented Mar 31, 2025

The possessive quantifiers are causing weird issues, and atomic grouping does not seem to be supported, so revert to greedy.

See the following reports:

@bartowski1182 @nicoboss

@CISC CISC requested a review from ggerganov March 31, 2025 20:43
@ggerganov (Member)
Did you do tokenizer tests to make sure the results match with the reference tokenizer?

@CISC (Collaborator, Author) commented Apr 1, 2025

> Did you do tokenizer tests to make sure the results match with the reference tokenizer?

I did some basic tests, but I will admit that I'm not entirely sure how to set up a proper test. Any pointers?

@ggerganov (Member)

You need to run:

python convert_hf_to_gguf_update.py <hf_token>

This will download the reference tokenizers for all models into models/tokenizers and generate test files in models/ggml-vocab-...inp/out.

After that, create a "vocab-only" GGUF model:

# this is for llama - update to create one for the Bailing model
python3 convert_hf_to_gguf.py models/tokenizers/llama-spm/ --outfile models/ggml-vocab-llama-spm.gguf --vocab-only

Run the test-tokenizer-0 tool using the GGUF vocab and generated test files.
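Conceptually, test-tokenizer-0 tokenizes each test input with the C++ tokenizer and compares the resulting token IDs against the reference output generated by the update script. A minimal sketch of that comparison (hypothetical token IDs and helper name, not llama.cpp's actual file format):

```python
def tokens_match(cpp_tokens: list[int], reference_tokens: list[int]) -> bool:
    """Return True when the C++ tokenizer output matches the reference."""
    if len(cpp_tokens) != len(reference_tokens):
        return False
    return all(a == b for a, b in zip(cpp_tokens, reference_tokens))

# Hypothetical token IDs for the same input text:
reference = [1, 15043, 3186]   # from the HF reference tokenizer
cpp_out   = [1, 15043, 3186]   # from the llama.cpp tokenizer

assert tokens_match(cpp_out, reference)
assert not tokens_match(cpp_out, reference + [2])
print("token streams match")
```

If the pre-tokenizer regex splits the text differently (e.g. possessive vs greedy quantifiers), the two token streams diverge and the test fails.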

@CISC (Collaborator, Author) commented Apr 1, 2025

> Run the test-tokenizer-0 tool using the GGUF vocab and generated test files.

Ah, I missed the relationship between these files; I see now. Added them for future tests.

main : tokenizing: 'ggml-vocab-ling-plus.gguf.inp'
main : text size: 1944
main : tokenized in 1.466 ms (cpp)
main : tokens: 814
main : tokens written to 'ggml-vocab-ling-plus.gguf.inp.tokcpp'

Tests passed

@CISC (Collaborator, Author) commented Apr 1, 2025

@ggerganov gentle ping :)

@ggerganov (Member)

Nice, but I didn't mean for you to commit the generated test files. At some point we stopped source-controlling them because they add non-negligible data (in this case 5MB). The idea is to just generate and test them locally.

@ggerganov (Member) left a review:

Merge after removing the vocab files.

@CISC CISC merged commit 83a88bd into ggml-org:master Apr 2, 2025
47 checks passed
@CISC CISC deleted the fix-bailing-vocab-regex branch April 2, 2025 09:21
@bartowski1182 (Contributor) commented Apr 3, 2025

Huh, it seems to be CUDA-related? If I switch to CPU only it's able to tokenize no problem.

False alarm all around; I was accidentally using an older build on my GPU. Ignore me :) Thanks so much for the fix!
