fix deepseek bug in stream mode #4118
Conversation
`examples/server/server.cpp` (Outdated)

```diff
@@ -985,6 +985,16 @@ struct llama_server_context
                     slot.multibyte_pending = 0;
                 }
             }
+            else if (token_str.size() == 2)
```
The previous assumption was that only single-byte tokens can have unfinished UTF-8 sequences, hence the check `if (token_str.size() == 1)`. If the assumption no longer holds, the algorithm should be changed rather than adding a special case for size 2.

If `generated_text` (no matter its length) has an unfinished sequence at the end, sending an event should be delayed. With this approach, `multibyte_pending` can be removed from `llama_client_slot`.
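A minimal sketch of that simplification, assuming an illustrative `sent_count` field (the name is not from the PR):

```cpp
// Sketch: llama_client_slot no longer tracks multibyte_pending; it only
// remembers how much of generated_text has already been streamed.
struct llama_client_slot {
    std::string generated_text;  // full decoded output so far
    size_t      sent_count = 0;  // bytes of generated_text already sent
    // ... other fields unchanged
};
```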
Updated using a new function `is_valid_utf8` to remove `multibyte_pending` from `llama_client_slot`.
This is not optimal. You check `token_str`. If it completes a code point started by the previous token, it will itself be invalid, but `generated_text` will be valid and ready to be sent. Example:

- `token_str == "\x98\x80"`
- `generated_text == "\xF0\x9F\x98\x80"` (the UTF-8 encoding of U+1F600 😀)

So the latter should be checked rather than the former.
Also, to avoid getting stuck when the model for whatever reason generates an invalid sequence of bytes, completeness should be checked instead of validity. An incomplete sequence ends with an initial byte in the range 0xC2..0xF4 followed by zero or more continuation bytes in the range 0x80..0xBF, where the number of continuation bytes is less than the initial byte requires. Examples of incomplete sequences (the last initial byte is in bold; see the sketch after this list):

- 0x61 0xCE 0xB1 **0xD0** (needs 1 continuation byte, has 0)
- 0xD0 0xB1 0x62 **0xE1** 0x83 (needs 2 continuation bytes, has 1)
- 0x63 0xCE 0xB3 **0xF0** 0x9D 0x94 (needs 3 continuation bytes, has 2)
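A minimal C++ sketch of that completeness check; the helper name `ends_with_incomplete_utf8` is hypothetical, not something defined in this PR:

```cpp
#include <cstddef>
#include <string>

// Returns true if `text` ends with an incomplete UTF-8 sequence: an initial
// byte in 0xC2..0xF4 followed by fewer continuation bytes (0x80..0xBF) than
// that initial byte requires.
static bool ends_with_incomplete_utf8(const std::string & text) {
    // Walk backwards over at most 3 trailing continuation bytes.
    size_t cont = 0;
    size_t i = text.size();
    while (i > 0 && cont < 3) {
        const unsigned char c = (unsigned char) text[i - 1];
        if (c < 0x80 || c > 0xBF) {
            break; // not a continuation byte
        }
        ++cont;
        --i;
    }
    if (i == 0) {
        return false; // only continuation bytes: invalid, but not incomplete
    }
    const unsigned char first = (unsigned char) text[i - 1];
    size_t need = 0;
    if      (first >= 0xC2 && first <= 0xDF) { need = 1; } // 2-byte code point
    else if (first >= 0xE0 && first <= 0xEF) { need = 2; } // 3-byte code point
    else if (first >= 0xF0 && first <= 0xF4) { need = 3; } // 4-byte code point
    else {
        return false; // preceding byte is not an initial byte: not incomplete
    }
    return cont < need;
}
```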
When `generated_text` ends in such an incomplete sequence, wait for the next token. Otherwise (and if `token_str` is not empty), we have the next fragment, even if it is invalid UTF-8. Examples of invalid but not incomplete sequences (a usage sketch follows the list):

- 0x9D 0x94 0xA1 (continuation bytes without an initial byte)
- 0x65 0x9D 0x94 0xA2 (continuation bytes without a proper starter byte)
- 0xF0 0x9D 0x66 (incomplete code point, but not at the end)
- 0x67 0xFE (disallowed byte at the end)
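A hedged usage sketch under those rules; `slot.sent_count` and `send_partial_response` are illustrative names, not the PR's actual API:

```cpp
// Append the new token, then flush only when the buffer does not end in an
// incomplete sequence, so clients never receive a split code point.
slot.generated_text += token_str;
if (!token_str.empty() && !ends_with_incomplete_utf8(slot.generated_text)) {
    const std::string fragment = slot.generated_text.substr(slot.sent_count);
    slot.sent_count = slot.generated_text.size();
    send_partial_response(slot, fragment); // may still contain invalid UTF-8
}
// otherwise: wait for the next token to complete the trailing code point
```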
What to do with invalid UTF-8 is a separate question. If we just try to send it to the client as is, the JSON encoder will presumably replace invalid parts with the replacement character U+FFFD: https://github.com/ggerganov/llama.cpp/blob/7e4ea5beff567f53be92f75f9089e6f11fa5dabd/examples/server/json.hpp#L17930 This is an acceptable way of dealing with it; more sophisticated options can be left out of scope for this fix.
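For reference, a small self-contained sketch of that behavior with nlohmann::json (the library bundled as `examples/server/json.hpp`); `error_handler_t::replace` substitutes U+FFFD for invalid bytes on dump, whereas the strict default would throw:

```cpp
#include <iostream>
#include "json.hpp" // nlohmann::json, bundled with examples/server

int main() {
    nlohmann::json j;
    j["content"] = "\x9D\x94\xA1"; // invalid UTF-8 fragment
    // Replace invalid sequences with U+FFFD instead of throwing
    // json::type_error.316 (the strict default behavior).
    std::cout << j.dump(-1, ' ', false,
                        nlohmann::json::error_handler_t::replace)
              << std::endl;
    return 0;
}
```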
Note that this approach is also not optimal, but it is good enough. Theoretically, `token_str` and `generated_text` can have an incomplete sequence at the end and still have new complete code points to send, though tokenizer vocabularies are unlikely to contain any tokens that would make this possible.
I implemented the idea that I described in an earlier comment here. You can use it as a reference, or we can replace this pull request with my implementation.
Sorry for the lack of reply; the past two weeks have been very busy. I believe referencing that part of the code will solve this issue, so I will close this PR for now.
The deepseek model may directly return a token for which `token_str.size() == 2` after `llama_token_to_piece`, so it skips the `multibyte_pending` handling (see the sketch below).
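For context, a simplified paraphrase of the pre-patch logic (not a verbatim quote of server.cpp): only a single-byte `token_str` is inspected for the start of a multi-byte code point, so a two-byte token falls through.

```cpp
// Simplified sketch of the pre-patch check: only a single-byte token_str
// is inspected for the start of a multi-byte UTF-8 code point.
if (slot.multibyte_pending > 0) {
    slot.multibyte_pending -= token_str.size();
} else if (token_str.size() == 1) {
    const char c = token_str[0];
    if      ((c & 0xE0) == 0xC0) { slot.multibyte_pending = 1; } // 110xxxxx
    else if ((c & 0xF0) == 0xE0) { slot.multibyte_pending = 2; } // 1110xxxx
    else if ((c & 0xF8) == 0xF0) { slot.multibyte_pending = 3; } // 11110xxx
    else                         { slot.multibyte_pending = 0; }
}
// A deepseek token of size 2 (e.g. the first two bytes of a 4-byte emoji)
// takes neither branch, so multibyte_pending stays 0 and the incomplete
// bytes are streamed to the client.
```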
Before patch:

After patch: