Replies: 2 comments 4 replies
-
Hard to say - provide repro instructions using the latest version of llama.cpp
-
Ok, I tried converting the weights myself. This is what I did:

`python convert-hf-to-gguf.py --outtype f16 ../Phi-3-mini-4k-instruct`
`./quantize ../Phi-3-mini-4k-instruct/ggml-model-f16.gguf 15 4`

Then I loaded the model through llama-cpp-python and tested it again, but I get the same issue: the model spits out the EOS in multiple pieces.

What can I try to help identify the issue? And (out of curiosity) how can I use the chat template specified in the GGUF for the interactive mode of main.cpp?
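One way to narrow this down is to check whether the converted GGUF actually maps `<|end|>` to a single special token id, and whether that id matches the EOS stored in the model metadata. Below is a minimal sketch using llama-cpp-python; the model path is a placeholder for the quantized file produced above, and the `special=True` keyword on `tokenize()` assumes a reasonably recent llama-cpp-python release.

```python
# Sketch: verify that <|end|> tokenizes to one id and matches the metadata EOS.
# The model path is illustrative; point it at the quantized GGUF produced above.
from llama_cpp import Llama

llm = Llama(model_path="../Phi-3-mini-4k-instruct/ggml-model-q4_k_m.gguf", verbose=False)

# special=True lets the tokenizer parse special tokens as single ids
# (the keyword is available in recent llama-cpp-python releases).
ids = llm.tokenize(b"<|end|>", add_bos=False, special=True)

print("ids for <|end|>:", ids)            # a healthy conversion should give exactly one id
print("eos id in metadata:", llm.token_eos())
```

If `<|end|>` still tokenizes into several ids even with `special=True`, the problem is likely in the conversion (tokenizer / special-token metadata) rather than in the sampling code.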
-
I am using llama-cpp-python to generate text from Phi-3 (note that this issue is also present in llama3-instruct, zephyr, and others).

When the model outputs the EOS (for example, Phi-3 has `<|end|>`), instead of emitting it as a single token, it breaks the EOS into many pieces: `<|`, then `end`, then `|>`.

A second thing I am noticing is many "special token hallucinations" in my Python program that are not present when using ollama. For example, instead of the EOS the model emits a weird `<|endtext` that is not even closed, or straight up switches to impersonating the user with `<|user|>` without printing the EOS.

Is this an issue solvable by providing a particular argument to main.cpp?
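Not an answer to the root cause, but one mitigation on the llama-cpp-python side is to pass the special-token strings as stop sequences, so generation halts even when the EOS comes out as text fragments or the model drifts into a `<|user|>` turn. A rough sketch follows; the model path, prompt, and stop list are assumptions for illustration.

```python
# Workaround sketch: stop on the special-token text itself, since stop sequences
# are matched against the decoded output rather than against single token ids.
from llama_cpp import Llama

llm = Llama(model_path="./phi-3-mini-4k-instruct-q4.gguf", verbose=False)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what an EOS token is in one sentence."}],
    stop=["<|end|>", "<|user|>", "<|endoftext|>"],  # trim stray special-token text
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```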