BERT tokenizer fixes #6498
Conversation
Now the BERT tokenizer actually uses the SEP and CLS tokens from SpecialVocab.
When this is not set in HF `tokenizer_config.json`, it should default to true.
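As a rough illustration of what "using the SEP and CLS tokens" means for a BERT-style tokenizer, here is a minimal Python sketch (not the llama.cpp implementation): the encoded sequence is bracketed with the CLS and SEP IDs taken from the model's special-token metadata rather than a hard-coded BOS/EOS pair. The names `wordpiece_encode`, `cls_id`, and `sep_id` are hypothetical stand-ins for the vocab encoder and the SpecialVocab values written at conversion time.

```python
def tokenize_bert(text, wordpiece_encode, cls_id, sep_id, add_special=True):
    """Hypothetical helper: wordpiece_encode maps text to plain token IDs;
    cls_id/sep_id come from the model's special-token metadata."""
    ids = wordpiece_encode(text)           # plain WordPiece token IDs
    if add_special:
        ids = [cls_id] + ids + [sep_id]    # [CLS] ... [SEP], as HF's BertTokenizer does
    return ids

# Example: with BERT's conventional IDs, the output starts with 101 and ends with 102.
# tokenize_bert("hello world", enc, cls_id=101, sep_id=102)
```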
Looks great. Makes more sense to give CLS and SEP first-class treatment.
Also, in terms of backwards compatibility, this should work as long as the model uses the default CLS = 101 and SEP = 102 numbers, right?
Since we were already writing `cls_token_id` and `seperator_token_id` to the GGUF, this should be fully backwards compatible with previous conversions of BERT models. Nomic BERT will not be backwards compatible due to the tensor shape change.
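A minimal sketch of the backwards-compatibility point above, assuming the metadata is exposed as a plain dict: if the GGUF already carries the CLS/SEP keys, use them; otherwise fall back to BERT's conventional defaults (CLS = 101, SEP = 102). The key names follow the `tokenizer.ggml.*` convention used for other special tokens; the `metadata` dict and the function itself are illustrative, not llama.cpp's loader.

```python
def resolve_cls_sep(metadata: dict) -> tuple[int, int]:
    """Read CLS/SEP IDs from GGUF-style metadata, falling back to BERT defaults."""
    cls_id = metadata.get("tokenizer.ggml.cls_token_id", 101)        # default [CLS]
    sep_id = metadata.get("tokenizer.ggml.seperator_token_id", 102)  # default [SEP]
    return cls_id, sep_id
```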
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor (see the sketch after this list)
* llama_tokenize: handle added special tokens like HF does
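A hedged sketch of the "pad vocab instead of slicing embedding tensor" item: when the embedding matrix has more rows than the tokenizer vocab (Nomic BERT's embedding size is padded beyond the tokenizer's vocab), the conversion can append placeholder tokens rather than cutting the tensor down. The names `tokens` and `n_embd_rows` are illustrative, not the conversion script's actual variables.

```python
def pad_vocab_to_tensor(tokens: list[str], n_embd_rows: int) -> list[str]:
    """Pad the token list with dummy entries so it matches the embedding row count."""
    if len(tokens) > n_embd_rows:
        raise ValueError("vocab is larger than the embedding tensor")
    pad = [f"[PAD{i}]" for i in range(len(tokens), n_embd_rows)]  # placeholder tokens
    return tokens + pad
```

Padding keeps every row of the original embedding tensor intact, so token IDs written to the GGUF still index the same rows as in the source checkpoint.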
This reverts commit 1b67731.
This reverts commit 82e6483.
This reverts commit ae342c7.
Changes to the conversion script:
Changes to llama.cpp tokenization: