fix deepseek bug in stream mode #4118
Conversation
`examples/server/server.cpp` (Outdated)

```diff
@@ -985,6 +985,16 @@ struct llama_server_context
                     slot.multibyte_pending = 0;
                 }
             }
+            else if (token_str.size() == 2)
```
The previous assumption was that only single-byte tokens can have unfinished UTF-8 sequences, hence the check `if (token_str.size() == 1)`. If the assumption no longer holds, the algorithm should be changed rather than adding a special case for size 2.

If `generated_text` (no matter its length) has an unfinished sequence at the end, sending an event should be delayed. With this approach, `multibyte_pending` can be removed from `llama_client_slot`.
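A minimal sketch of that simplification, assuming an illustrative `sent_count` field (the name is not from the PR):

```cpp
// Sketch: llama_client_slot no longer tracks multibyte_pending; it only
// remembers how much of generated_text has already been streamed.
struct llama_client_slot {
    std::string generated_text;  // full decoded output so far
    size_t      sent_count = 0;  // bytes of generated_text already sent
    // ... other fields unchanged
};
```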
Updated using a new function `is_valid_utf8` to remove `multibyte_pending` from `llama_client_slot`.
This is not optimal. You check `token_str`. If it completes a code point started by the previous token, it will itself be invalid, but `generated_text` will be valid and ready to be sent. Example:

- `token_str == "\x98\x80"`
- `generated_text == "\xF0\x9F\x98\x80"` (the UTF-8 encoding of U+1F600 😀)

So the latter should be checked rather than the former.
Also, to avoid getting stuck when the model for whatever reason generates an invalid sequence of bytes, completeness should be checked instead of validity. An incomplete sequence ends with an initial byte in the range 0xC2..0xF4 followed by zero or more continuation bytes in the range 0x80..0xBF, where the number of continuation bytes is less than the initial byte requires. Examples of incomplete sequences (the last initial byte is in bold; see the sketch after this list):

- 0x61 0xCE 0xB1 **0xD0** (needs 1 continuation byte, has 0)
- 0xD0 0xB1 0x62 **0xE1** 0x83 (needs 2 continuation bytes, has 1)
- 0x63 0xCE 0xB3 **0xF0** 0x9D 0x94 (needs 3 continuation bytes, has 2)
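A minimal C++ sketch of that completeness check; the helper name `ends_with_incomplete_utf8` is hypothetical, not something defined in this PR:

```cpp
#include <cstddef>
#include <string>

// Returns true if `text` ends with an incomplete UTF-8 sequence: an initial
// byte in 0xC2..0xF4 followed by fewer continuation bytes (0x80..0xBF) than
// that initial byte requires.
static bool ends_with_incomplete_utf8(const std::string & text) {
    // Walk backwards over at most 3 trailing continuation bytes.
    size_t cont = 0;
    size_t i = text.size();
    while (i > 0 && cont < 3) {
        const unsigned char c = (unsigned char) text[i - 1];
        if (c < 0x80 || c > 0xBF) {
            break; // not a continuation byte
        }
        ++cont;
        --i;
    }
    if (i == 0) {
        return false; // only continuation bytes: invalid, but not incomplete
    }
    const unsigned char first = (unsigned char) text[i - 1];
    size_t need = 0;
    if      (first >= 0xC2 && first <= 0xDF) { need = 1; } // 2-byte code point
    else if (first >= 0xE0 && first <= 0xEF) { need = 2; } // 3-byte code point
    else if (first >= 0xF0 && first <= 0xF4) { need = 3; } // 4-byte code point
    else {
        return false; // preceding byte is not an initial byte: not incomplete
    }
    return cont < need;
}
```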
When `generated_text` ends in such an incomplete sequence, wait for the next token. Otherwise (and if `token_str` is not empty), we have the next fragment, even if it is invalid UTF-8. Examples of invalid but not incomplete sequences (a usage sketch follows the list):

- 0x9D 0x94 0xA1 (continuation bytes without an initial byte)
- 0x65 0x9D 0x94 0xA2 (continuation bytes without a proper starter byte)
- 0xF0 0x9D 0x66 (incomplete code point, but not at the end)
- 0x67 0xFE (disallowed byte at the end)
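A hedged usage sketch under those rules; `slot.sent_count` and `send_partial_response` are illustrative names, not the PR's actual API:

```cpp
// Append the new token, then flush only when the buffer does not end in an
// incomplete sequence, so clients never receive a split code point.
slot.generated_text += token_str;
if (!token_str.empty() && !ends_with_incomplete_utf8(slot.generated_text)) {
    const std::string fragment = slot.generated_text.substr(slot.sent_count);
    slot.sent_count = slot.generated_text.size();
    send_partial_response(slot, fragment); // may still contain invalid UTF-8
}
// otherwise: wait for the next token to complete the trailing code point
```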
What to do with invalid UTF-8 is a separate question. If we just try to send it to the client as is, the JSON encoder will presumably replace invalid parts with the replacement character U+FFFD: https://github.com/ggerganov/llama.cpp/blob/7e4ea5beff567f53be92f75f9089e6f11fa5dabd/examples/server/json.hpp#L17930 This is an acceptable way of dealing with it; more sophisticated options can be left out of scope for this fix.
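For reference, a small self-contained sketch of that behavior with nlohmann::json (the library bundled as `examples/server/json.hpp`); `error_handler_t::replace` substitutes U+FFFD for invalid bytes on dump, whereas the strict default would throw:

```cpp
#include <iostream>
#include "json.hpp" // nlohmann::json, bundled with examples/server

int main() {
    nlohmann::json j;
    j["content"] = "\x9D\x94\xA1"; // invalid UTF-8 fragment
    // Replace invalid sequences with U+FFFD instead of throwing
    // json::type_error.316 (the strict default behavior).
    std::cout << j.dump(-1, ' ', false,
                        nlohmann::json::error_handler_t::replace)
              << std::endl;
    return 0;
}
```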
Note that this approach is also not optimal, but it is good enough. Theoretically, `token_str` and `generated_text` can have an incomplete sequence at the end and still have new complete code points to send, though tokenizer vocabularies are unlikely to contain any tokens that would make this possible.
I implemented the idea that I described in an earlier comment here. You can use it as a reference, or we can replace this pull request with my implementation.
Sorry for the lack of reply; the past two weeks have been very busy. I believe referencing that part of the code will solve this issue, so I will close this PR for now.
The deepseek model may directly return a token for which `token_str.size() == 2` after `llama_token_to_piece`, so it skips the `multibyte_pending` handling (see the sketch below).
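For context, a simplified paraphrase of the pre-patch logic (not a verbatim quote of server.cpp): only a single-byte `token_str` is inspected for the start of a multi-byte code point, so a two-byte token falls through.

```cpp
// Simplified sketch of the pre-patch check: only a single-byte token_str
// is inspected for the start of a multi-byte UTF-8 code point.
if (slot.multibyte_pending > 0) {
    slot.multibyte_pending -= token_str.size();
} else if (token_str.size() == 1) {
    const char c = token_str[0];
    if      ((c & 0xE0) == 0xC0) { slot.multibyte_pending = 1; } // 110xxxxx
    else if ((c & 0xF0) == 0xE0) { slot.multibyte_pending = 2; } // 1110xxxx
    else if ((c & 0xF8) == 0xF0) { slot.multibyte_pending = 3; } // 11110xxx
    else                         { slot.multibyte_pending = 0; }
}
// A deepseek token of size 2 (e.g. the first two bytes of a 4-byte emoji)
// takes neither branch, so multibyte_pending stays 0 and the incomplete
// bytes are streamed to the client.
```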
Before patch:

After patch: