Replies: 2 comments 4 replies
-
Hard to say - provide repro instructions using the latest version of llama.cpp
-
Ok, I tried converting the weights myself. This is what I did:

`python convert-hf-to-gguf.py --outtype f16 ../Phi-3-mini-4k-instruct`
`./quantize ../Phi-3-mini-4k-instruct/ggml-model-f16.gguf 15 4`

Then I loaded the model through llama-cpp-python and tested it again, but I get the same issue: the model spits out the EOS in multiple pieces.

What can I try to help identify the issue? And (out of curiosity) how can I use the chat template specified in the GGUF for the interactive mode of main.cpp?
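One way to narrow this down is to check whether the converted GGUF actually maps `<|end|>` to a single special token id, and whether that id matches the EOS stored in the model metadata. Below is a minimal sketch using llama-cpp-python; the model path is a placeholder for the quantized file produced above, and the `special=True` keyword on `tokenize()` assumes a reasonably recent llama-cpp-python release.

```python
# Sketch: verify that <|end|> tokenizes to one id and matches the metadata EOS.
# The model path is illustrative; point it at the quantized GGUF produced above.
from llama_cpp import Llama

llm = Llama(model_path="../Phi-3-mini-4k-instruct/ggml-model-q4_k_m.gguf", verbose=False)

# special=True lets the tokenizer parse special tokens as single ids
# (the keyword is available in recent llama-cpp-python releases).
ids = llm.tokenize(b"<|end|>", add_bos=False, special=True)

print("ids for <|end|>:", ids)            # a healthy conversion should give exactly one id
print("eos id in metadata:", llm.token_eos())
```

If `<|end|>` still tokenizes into several ids even with `special=True`, the problem is likely in the conversion (tokenizer / special-token metadata) rather than in the sampling code.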
-
I am using llama-cpp-python to generate text from Phi-3 (note that this issue is also present in llama3-instruct, zephyr, and others).

When the model outputs the EOS (for example, Phi-3 has `<|end|>`), instead of emitting it as a single token, it breaks the EOS into many pieces: `<|`, then `end`, then `|>`.

A second thing I am noticing is many "special token hallucinations" in my Python program that are not present when using ollama. For example, instead of the EOS the model emits a weird `<|endtext` that is not even closed, or straight up switches to impersonating the user with `<|user|>` without printing the EOS.

Is this an issue solvable by providing a particular argument to main.cpp?
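Not an answer to the root cause, but one mitigation on the llama-cpp-python side is to pass the special-token strings as stop sequences, so generation halts even when the EOS comes out as text fragments or the model drifts into a `<|user|>` turn. A rough sketch follows; the model path, prompt, and stop list are assumptions for illustration.

```python
# Workaround sketch: stop on the special-token text itself, since stop sequences
# are matched against the decoded output rather than against single token ids.
from llama_cpp import Llama

llm = Llama(model_path="./phi-3-mini-4k-instruct-q4.gguf", verbose=False)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what an EOS token is in one sentence."}],
    stop=["<|end|>", "<|user|>", "<|endoftext|>"],  # trim stray special-token text
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```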