fix deepseek bug in stream mode #4118

Closed
qhduan wants to merge 2 commits

Conversation

qhduan commented Nov 17, 2023

The deepseek model may return a token_str with token_str.size() == 2 directly from llama_token_to_piece.

This skips the multibyte_pending handling:

    bool process_token(completion_token_output &result, llama_client_slot &slot) {
        // remember which tokens were sampled - used for repetition penalties during sampling
        const std::string token_str = llama_token_to_piece(ctx, result.tok);
        slot.sampled = result.tok;

        // search stop word and delete it
        slot.generated_text += token_str;
        slot.has_next_token = true;

        if (slot.multibyte_pending > 0)
        {
            slot.multibyte_pending -= token_str.size();
        }
        else if (token_str.size() == 1) // MAY SKIP THIS BECAUSE token_str.size() == 2
        {
            const char c = token_str[0];
            if ((c & 0xE0) == 0xC0)
            {
                // 2-byte characters: 110xxxxx 10xxxxxx
                slot.multibyte_pending = 1;
            }
            else if ((c & 0xF0) == 0xE0)
            {
                // 3-byte characters: 1110xxxx 10xxxxxx 10xxxxxx
                slot.multibyte_pending = 2;
            }
            else if ((c & 0xF8) == 0xF0)
            {
                // 4-byte characters: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                slot.multibyte_pending = 3;
            }
            else
            {
                slot.multibyte_pending = 0;
            }
        }
        else if (token_str.size() == 2)  // PATCH HERE
        {
            const char c0 = token_str[0];
            const char c1 = token_str[1];
            // a 3-byte character (1110xxxx 10xxxxxx 10xxxxxx) split as its
            // lead byte plus first continuation byte: one byte still pending
            if (((c0 & 0xF0) == 0xE0) && ((c1 & 0xC0) == 0x80))
            {
                slot.multibyte_pending = 1;
            }
        }
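
To illustrate the failure mode (a hypothetical repro, not part of the PR): 函 is U+51FD, encoded as the three bytes E5 87 BD. The output below suggests the deepseek tokenizer emits such characters as a 2-byte piece followed by a 1-byte piece; the old size() == 1 branch then never fires, multibyte_pending stays 0, and the first fragment is streamed as invalid UTF-8.

    #include <cstdio>
    #include <string>

    int main() {
        // assumed split: first two bytes of 函 (U+51FD = E5 87 BD);
        // "\xBD" would arrive in the next piece
        const std::string token_str = "\xE5\x87";
        int multibyte_pending = 0;
        if (token_str.size() == 1) {
            // the pre-patch logic only set multibyte_pending here
            multibyte_pending = 2;
        }
        // multibyte_pending is still 0, so the fragment is flushed immediately
        std::printf("pending = %d (fragment sent as invalid UTF-8)\n", multibyte_pending);
        return 0;
    }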

Before the patch:

{'content': '当然', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': ',', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '这是', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '一个', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '用', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': 'Python', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '实现', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '的', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '快速', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '排', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '序', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '�', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '�', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '数', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '�', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '�', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '\n```', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': 'python', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': ' ', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '', 'generation_settings': {'frequency_penalty': 0.0, 'grammar': '', 'ignore_eos': False, 'logit_bias': [], 'min_p': 0.05000000074505806, 'mirostat': 0, 'mirostat_eta': 0.10000000149011612, 'mirostat_tau': 5.0, 'model': '../deepseek-coder-1.3b-instruct.Q4_K_M.gguf', 'n_ctx': 512, 'n_keep': 0, 'n_predict': 20, 'n_probs': 0, 'penalize_nl': True, 'presence_penalty': 0.0, 'repeat_last_n': 64, 'repeat_penalty': 1.100000023841858, 'seed': 4294967295, 'stop': ['</s>', '\n###'], 'stream': True, 'temp': 0.699999988079071, 'tfs_z': 1.0, 'top_k': 1, 'top_p': 0.949999988079071, 'typical_p': 1.0}, 'model': '../deepseek-coder-1.3b-instruct.Q4_K_M.gguf', 'prompt': '### INSTRUCTION:\n用python写一个快速排序函数,写详细点的中文注释\n### Response:\n', 'slot_id': 0, 'stop': True, 'stopped_eos': False, 'stopped_limit': True, 'stopped_word': False, 'stopping_word': '', 'timings': {'predicted_ms': 370.632, 'predicted_n': 20, 'predicted_per_second': 53.96188132703059, 'predicted_per_token_ms': 18.5316, 'prompt_ms': 204.254, 'prompt_n': 30, 'prompt_per_second': 146.87594857383453, 'prompt_per_token_ms': 6.808466666666666}, 'tokens_cached': 50, 'tokens_evaluated': 30, 'tokens_predicted': 20, 'truncated': False}

After the patch:

{'content': '当然', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': ',', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '这是', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '一个', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '用', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': 'Python', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '实现', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '的', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '快速', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '排', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '序', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '函', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '数', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': ':', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '\n```', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': 'python', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': ' ', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '', 'multimodal': False, 'slot_id': 0, 'stop': False}
{'content': '', 'generation_settings': {'frequency_penalty': 0.0, 'grammar': '', 'ignore_eos': False, 'logit_bias': [], 'min_p': 0.05000000074505806, 'mirostat': 0, 'mirostat_eta': 0.10000000149011612, 'mirostat_tau': 5.0, 'model': '../deepseek-coder-1.3b-instruct.Q4_K_M.gguf', 'n_ctx': 512, 'n_keep': 0, 'n_predict': 20, 'n_probs': 0, 'penalize_nl': True, 'presence_penalty': 0.0, 'repeat_last_n': 64, 'repeat_penalty': 1.100000023841858, 'seed': 4294967295, 'stop': ['</s>', '\n###'], 'stream': True, 'temp': 0.699999988079071, 'tfs_z': 1.0, 'top_k': 1, 'top_p': 0.949999988079071, 'typical_p': 1.0}, 'model': '../deepseek-coder-1.3b-instruct.Q4_K_M.gguf', 'prompt': '### INSTRUCTION:\n用python写一个快速排序函数,写详细点的中文注释\n### Response:\n', 'slot_id': 0, 'stop': True, 'stopped_eos': False, 'stopped_limit': True, 'stopped_word': False, 'stopping_word': '', 'timings': {'predicted_ms': 365.661, 'predicted_n': 20, 'predicted_per_second': 54.69546930080047, 'predicted_per_token_ms': 18.28305, 'prompt_ms': 203.587, 'prompt_n': 30, 'prompt_per_second': 147.35714952329965, 'prompt_per_token_ms': 6.786233333333333}, 'tokens_cached': 50, 'tokens_evaluated': 30, 'tokens_predicted': 20, 'truncated': False}

A Contributor commented on the diff at:

    @@ -985,6 +985,16 @@ struct llama_server_context
                slot.multibyte_pending = 0;
            }
        }
        else if (token_str.size() == 2)

The previous assumption was that only single-byte tokens can have unfinished UTF-8 sequences, hence the check if (token_str.size() == 1). If the assumption no longer holds, the algorithm should be changed rather than adding a special case for size 2.

If generated_text (no matter its length) has an unfinished sequence at the end, sending an event should be delayed. With this approach, multibyte_pending can be removed from llama_client_slot.

qhduan (Author) replied:

Updated: added a new function is_valid_utf8 so that multibyte_pending can be removed from llama_client_slot.
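
A minimal sketch of what such a validity check might look like (an illustration, not necessarily the PR's exact implementation; it skips overlong-encoding and surrogate checks for brevity):

    #include <string>

    // returns true if s is entirely well-formed UTF-8
    static bool is_valid_utf8(const std::string & s) {
        size_t i = 0;
        while (i < s.size()) {
            const unsigned char c = s[i];
            size_t n; // total bytes in this code point
            if      (c <= 0x7F)          n = 1; // 0xxxxxxx (ASCII)
            else if ((c & 0xE0) == 0xC0) n = 2; // 110xxxxx
            else if ((c & 0xF0) == 0xE0) n = 3; // 1110xxxx
            else if ((c & 0xF8) == 0xF0) n = 4; // 11110xxx
            else return false;                  // stray continuation or invalid byte
            if (i + n > s.size()) return false; // truncated at the end
            for (size_t k = 1; k < n; k++) {
                // continuation bytes must match 10xxxxxx
                if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
            }
            i += n;
        }
        return true;
    }

With a check like this, the server can buffer generated text and only emit a stream event once the accumulated string validates.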

shibe2 (Contributor) commented Nov 18, 2023

This is not optimal. You check token_str. If it completes a code point started by the previous token, token_str itself will be invalid, but generated_text will be valid and ready to be sent. Example:

token_str == "\x98\x80"
generated_text == "\xF0\x9F\x98\x80"

So, the latter should be checked rather than the former.

Also, to avoid getting stuck when the model for whatever reason generates an invalid sequence of bytes, completeness should be checked instead of validity. An incomplete sequence contains, at the end, an initial byte in the range 0xC2..0xF4 followed by 0 or more continuation bytes in the range 0x80..0xBF, with fewer continuation bytes than the initial byte requires. Examples of incomplete sequences (each unfinished sequence starts at the last initial byte; a runnable sketch follows the examples below):

  • 0x61 0xCE 0xB1 0xD0 (need 1 continuation byte, have 0)
  • 0xD0 0xB1 0x62 0xE1 0x83 (need 2 continuation bytes, have 1)
  • 0x63 0xCE 0xB3 0xF0 0x9D 0x94 (need 3 continuation bytes, have 2)

When generated_text is such an incomplete sequence, wait for the next token. Otherwise (and if token_str is not empty), we have the next fragment, even if it is invalid UTF-8. Examples of sequences that are invalid but not incomplete:

  • 0x9D 0x94 0xA1 (continuation bytes without initial byte)
  • 0x65 0x9D 0x94 0xA2 (continuation bytes without proper starter byte)
  • 0xF0 0x9D 0x66 (incomplete code point, but not at the end)
  • 0x67 0xFE (disallowed byte at the end)
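
A runnable sketch of that completeness check (the function name is hypothetical, not from this PR; it implements the byte ranges described above, with the examples from this comment as test cases):

    #include <cassert>
    #include <string>

    // true if s ends with an incomplete UTF-8 sequence: an initial byte in
    // 0xC2..0xF4 followed by fewer continuation bytes (0x80..0xBF) than the
    // initial byte requires
    static bool ends_with_incomplete_utf8(const std::string & s) {
        // scan backwards over at most 3 trailing continuation bytes
        size_t cont = 0;
        size_t i = s.size();
        while (i > 0 && cont < 3) {
            const unsigned char c = s[i - 1];
            if ((c & 0xC0) != 0x80) break; // not a continuation byte
            cont++;
            i--;
        }
        if (i == 0) return false; // continuation bytes with nothing before them
        const unsigned char lead = s[i - 1];
        if (lead < 0xC2 || lead > 0xF4) return false; // not a valid initial byte
        const size_t need = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        return cont < need; // incomplete only if continuation bytes are missing
    }

    int main() {
        // incomplete sequences from the examples above: hold these back
        assert( ends_with_incomplete_utf8("\x61\xCE\xB1\xD0"));
        assert( ends_with_incomplete_utf8("\xD0\xB1\x62\xE1\x83"));
        assert( ends_with_incomplete_utf8("\x63\xCE\xB3\xF0\x9D\x94"));
        // invalid but not incomplete: send these on (sanitized downstream)
        assert(!ends_with_incomplete_utf8("\x9D\x94\xA1"));
        assert(!ends_with_incomplete_utf8("\x65\x9D\x94\xA2"));
        assert(!ends_with_incomplete_utf8("\xF0\x9D\x66"));
        assert(!ends_with_incomplete_utf8("\x67\xFE"));
        return 0;
    }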

What to do with invalid UTF-8 is a separate question. If we just try to send it to the client as is, the JSON encoder will presumably replace invalid parts with the replacement character 0xFFFD: https://github.com/ggerganov/llama.cpp/blob/7e4ea5beff567f53be92f75f9089e6f11fa5dabd/examples/server/json.hpp#L17930 This is an acceptable way of dealing with it. More sophisticated options can be left out of scope for this fix.
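
For reference, a small demo of that behavior with nlohmann::json (the library bundled as examples/server/json.hpp; the include path here is assumed). Passing error_handler_t::replace to dump() substitutes U+FFFD for invalid byte sequences, whereas the default strict handler throws:

    #include <cstdio>
    #include <string>
    #include "json.hpp" // nlohmann::json single header (path assumed)

    int main() {
        nlohmann::json j;
        j["content"] = "\xE5\x87"; // incomplete UTF-8 fragment
        // error_handler_t::strict (the default) would throw type_error.316;
        // replace swaps the invalid bytes for U+FFFD
        const std::string out = j.dump(-1, ' ', false,
                                       nlohmann::json::error_handler_t::replace);
        std::printf("%s\n", out.c_str()); // content now holds replacement character(s)
        return 0;
    }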

Note that this approach is also not optimal, but it is good enough. Theoretically, token_str and generated_text can have an incomplete sequence at the end but still contain new complete code points to send, though tokenizer vocabularies are unlikely to contain any tokens that would make this possible.

shibe2 linked an issue on Nov 19, 2023 that may be closed by this pull request.
shibe2 (Contributor) commented Nov 19, 2023

I implemented the idea that I described in an earlier comment here. You can use it as a reference, or we can replace this pull request with my implementation.

qhduan (Author) commented Dec 5, 2023

> I implemented the idea that I described in an earlier comment here. You can use it as a reference, or we can replace this pull request with my implementation.

Sorry for the lack of a reply; the past two weeks have been very busy. I believe pulling in that code will solve this, so I will close this PR for now.

Successfully merging this pull request may close these issues.

The server output some unicode characters as <?>