BERT tokenizer fixes #6498

Merged: 9 commits into master on Apr 9, 2024

Conversation

cebtenzzre (Collaborator)

Changes to the conversion script:

  • LlamaHfVocab is for Llama vocabs only, so use a function based on _set_vocab_gpt2 instead.
  • The token type count should always be 2 (Sentence A and Sentence B), which is a different concept from BPE token types.
  • Do not slice the Nomic BERT token embeddings tensor as get_basic_vocab correctly pads the vocab size based on the model config.
  • Do not write token scores, as they are optional and have no meaning for BERT.
  • Do not make up BOS/EOS values based on SEP/CLS; instead rely on SpecialVocab to write SEP and CLS automatically.

Changes to llama.cpp tokenization:

  • Add llama_token_cls/llama_token_sep functions, since we do not set BOS or EOS for BERT anymore.
  • Rename llama_tokenize parameters add_bos->add_special (since it now controls SEP and CLS) and special->parse_special (for clarity); a usage sketch follows this list.
  • Move handling of special_add_* arguments into llama_tokenize. Just like HF transformers, the decision on whether to add special tokens is made by the tokenizer based on add_special and the values of special_add_*. This seems like a good time to make this change, but this isn't specifically needed for BERT and could be done in a separate PR.
  • For consistency, honor special_add_eos for the SPM vocab type. We assert that this is false in the examples that check the result of llama_should_add_bos, since it can be assumed these examples do not expect llama_tokenize to add EOS to the prompt.
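
For readers of the new API, here is a minimal usage sketch (not part of the diff): the renamed parameters and the llama_token_cls/llama_token_sep accessors are the ones introduced by this PR, while the surrounding arguments follow llama.h; a loaded model and a fixed-size token buffer are assumed.

```c
#include <string.h>

#include "llama.h"

// Minimal sketch: tokenize with the renamed parameters. With add_special = true
// the tokenizer itself decides which special tokens to add, based on the vocab
// type and the special_add_* metadata stored in the GGUF.
static int32_t tokenize_example(const struct llama_model * model, const char * text) {
    llama_token tokens[512];

    int32_t n = llama_tokenize(
            model,
            text, (int32_t) strlen(text),
            tokens, 512,
            /*add_special   =*/ true,    // was: add_bos
            /*parse_special =*/ false);  // was: special
    if (n < 0) {
        return n; // buffer too small
    }

    // New accessors for BERT-style special tokens (BOS/EOS are no longer set):
    const llama_token cls = llama_token_cls(model);
    const llama_token sep = llama_token_sep(model);
    (void) cls; (void) sep;

    return n;
}
```

For a BERT vocab this means CLS/SEP are added by llama_tokenize itself; for an SPM vocab, add_special controls BOS (and EOS when special_add_eos is set), as described above.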

cebtenzzre requested review from iamlemec and ggerganov on April 4, 2024.
iamlemec (Collaborator) left a comment:


Looks great. Makes more sense to give CLS and SEP first class treatment.

Also, in terms of backwards compatibility, this should work as long as the model uses the default CLS = 101 and SEP = 102 numbers, right?

cebtenzzre (Collaborator, Author)

> Also, in terms of backwards compatibility, this should work as long as the model uses the default CLS = 101 and SEP = 102 numbers, right?

Since we were already writing cls_token_id and seperator_token_id to the GGUF, this should be fully backwards compatible with previous conversions of BERT models. Nomic BERT will not be backwards compatible due to the tensor shape change.
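
To double-check what an existing conversion contains, here is a rough sketch using the GGUF C API from ggml (reader-side code, not part of this PR). The key names assume the usual tokenizer.ggml.* prefix and the u32 type that gguf-py writes, including the historical "seperator" spelling.

```c
#include <stdio.h>

#include "ggml.h"

// Rough sketch: print the CLS/SEP token IDs stored in a converted BERT GGUF,
// which are the values llama_token_cls/llama_token_sep ultimately return.
static void check_cls_sep(const char * fname) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to open %s\n", fname);
        return;
    }

    const int cls_idx = gguf_find_key(ctx, "tokenizer.ggml.cls_token_id");
    const int sep_idx = gguf_find_key(ctx, "tokenizer.ggml.seperator_token_id");

    if (cls_idx >= 0) printf("CLS = %u\n", gguf_get_val_u32(ctx, cls_idx));
    if (sep_idx >= 0) printf("SEP = %u\n", gguf_get_val_u32(ctx, sep_idx));

    gguf_free(ctx);
}
```

For a stock BERT tokenizer these print the familiar 101 and 102 mentioned above.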

cebtenzzre merged commit 1b67731 into master on Apr 9, 2024.
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request on Apr 17, 2024:
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does