BERT tokenizer fixes #6498

Merged: 9 commits into master on Apr 9, 2024

Conversation

cebtenzzre (Collaborator)

Changes to the conversion script:

  • LlamaHfVocab is for Llama vocabs only, so use a function based on _set_vocab_gpt2 instead.
  • The token type count should always be 2 (Sentence A and Sentence B), which is a different concept from BPE token types.
  • Do not slice the Nomic BERT token embeddings tensor as get_basic_vocab correctly pads the vocab size based on the model config.
  • Do not write token scores, as they are optional and have no meaning for BERT.
  • Do not make up BOS/EOS values based on SEP/CLS; instead rely on SpecialVocab to write SEP and CLS automatically.

Changes to llama.cpp tokenization:

  • Add llama_token_cls/llama_token_sep functions, since we do not set BOS or EOS for BERT anymore.
  • Rename llama_tokenize parameters add_bos->add_special (since it now controls SEP and CLS) and special->parse_special (for clarity); a usage sketch follows this list.
  • Move handling of special_add_* arguments into llama_tokenize. Just like HF transformers, the decision on whether to add special tokens is made by the tokenizer based on add_special and the values of special_add_*. This seems like a good time to make this change, but this isn't specifically needed for BERT and could be done in a separate PR.
  • For consistency, honor special_add_eos for the SPM vocab type. We assert that this is false in the examples that check the result of llama_should_add_bos, since it can be assumed these examples do not expect llama_tokenize to add EOS to the prompt.
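
For readers of the new API, here is a minimal usage sketch (not part of the diff): the renamed parameters and the llama_token_cls/llama_token_sep accessors are the ones introduced by this PR, while the surrounding arguments follow llama.h; a loaded model and a fixed-size token buffer are assumed.

```c
#include <string.h>

#include "llama.h"

// Minimal sketch: tokenize with the renamed parameters. With add_special = true
// the tokenizer itself decides which special tokens to add, based on the vocab
// type and the special_add_* metadata stored in the GGUF.
static int32_t tokenize_example(const struct llama_model * model, const char * text) {
    llama_token tokens[512];

    int32_t n = llama_tokenize(
            model,
            text, (int32_t) strlen(text),
            tokens, 512,
            /*add_special   =*/ true,    // was: add_bos
            /*parse_special =*/ false);  // was: special
    if (n < 0) {
        return n; // buffer too small
    }

    // New accessors for BERT-style special tokens (BOS/EOS are no longer set):
    const llama_token cls = llama_token_cls(model);
    const llama_token sep = llama_token_sep(model);
    (void) cls; (void) sep;

    return n;
}
```

For a BERT vocab this means CLS/SEP are added by llama_tokenize itself; for an SPM vocab, add_special controls BOS (and EOS when special_add_eos is set), as described above.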

cebtenzzre requested review from iamlemec and ggerganov on April 4, 2024.
iamlemec (Collaborator) left a comment:


Looks great. Makes more sense to give CLS and SEP first class treatment.

Also, in terms of backwards compatibility, this should work as long as the model uses the default CLS = 101 and SEP = 102 numbers, right?

cebtenzzre (Collaborator, Author)

> Also, in terms of backwards compatibility, this should work as long as the model uses the default CLS = 101 and SEP = 102 numbers, right?

Since we were already writing cls_token_id and seperator_token_id to the GGUF, this should be fully backwards compatible with previous conversions of BERT models. Nomic BERT will not be backwards compatible due to the tensor shape change.
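
To double-check what an existing conversion contains, here is a rough sketch using the GGUF C API from ggml (reader-side code, not part of this PR). The key names assume the usual tokenizer.ggml.* prefix and the u32 type that gguf-py writes, including the historical "seperator" spelling.

```c
#include <stdio.h>

#include "ggml.h"

// Rough sketch: print the CLS/SEP token IDs stored in a converted BERT GGUF,
// which are the values llama_token_cls/llama_token_sep ultimately return.
static void check_cls_sep(const char * fname) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to open %s\n", fname);
        return;
    }

    const int cls_idx = gguf_find_key(ctx, "tokenizer.ggml.cls_token_id");
    const int sep_idx = gguf_find_key(ctx, "tokenizer.ggml.seperator_token_id");

    if (cls_idx >= 0) printf("CLS = %u\n", gguf_get_val_u32(ctx, cls_idx));
    if (sep_idx >= 0) printf("SEP = %u\n", gguf_get_val_u32(ctx, sep_idx));

    gguf_free(ctx);
}
```

For a stock BERT tokenizer these print the familiar 101 and 102 mentioned above.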

cebtenzzre merged commit 1b67731 into master on Apr 9, 2024.
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request on Apr 17, 2024:
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does