Commit 26f1df5

Fix the penultimate token sometimes being lost with SSE streaming (ggml-org#1031)
The token immediately before an EOT token was lost when SSE streaming was enabled, if that token was contained entirely within a stop sequence.

As an example of when this could happen, consider the prompt: "Type the phrase 'pleas' once." In a Llama 3-derived model, 'pleas' tokenizes as 'ple' + 'as'. The token 'as' is contained within this instruct-mode stop sequence: `<|eot_id|><|start_header_id|>assistant<|end_header_id|>`, because of the word 'assistant'. Since `string_contains_sequence_substring` returns True for 'as', this token is added to `tokenReserve` instead of being streamed immediately. If the '<|eot_id|>' token was generated next, the text in `tokenReserve` would be discarded.
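The failure mode can be simulated with a minimal sketch, assuming simplified stop handling. The names `tokenReserve` and `string_contains_sequence_substring` mirror the commit message, but this is an illustrative reconstruction, not the actual koboldcpp implementation:

```python
# Assumed stop sequences for this sketch: the instruct-mode stop sequence
# from the commit message, plus the bare eot token that ends generation.
STOP_SEQUENCES = [
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
    "<|eot_id|>",
]

def string_contains_sequence_substring(token: str, stops: list[str]) -> bool:
    """True if token appears inside any stop sequence, so it might be the
    start of one and must be held back rather than streamed immediately."""
    return any(token in stop for stop in stops)

def stream_tokens(tokens, stops=STOP_SEQUENCES, fixed=True):
    """Yield the text chunks that would be sent over SSE."""
    token_reserve = ""
    for token in tokens:
        stopping = token in stops               # e.g. '<|eot_id|>' ends generation
        token_str = "" if stopping else token   # the stop marker itself is never streamed
        if not stopping and string_contains_sequence_substring(token_str, stops):
            token_reserve += token_str          # hold back: a stop sequence may be forming
        else:
            # Pre-fix flush condition:  tokenStr != ""
            # Post-fix flush condition: tokenStr != "" or tokenReserve != ""
            if token_str != "" or (fixed and token_reserve != ""):
                yield token_reserve + token_str
                token_reserve = ""
        if stopping:
            break
```

With the fix, streaming `['ple', 'as', '<|eot_id|>']` emits both 'ple' and 'as'; without it, 'as' sits in `token_reserve` when the EOT token arrives and is silently dropped.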
1 parent 948646f commit 26f1df5

File tree

1 file changed: +1 −1 lines changed


koboldcpp.py

Lines changed: 1 addition & 1 deletion
@@ -1447,7 +1447,7 @@ async def handle_sse_stream(self, genparams, api_format):
                 tokenReserve += tokenStr
                 await asyncio.sleep(async_sleep_short) #if a stop sequence could trigger soon, do not send output
             else:
-                if tokenStr!="":
+                if tokenStr!="" or tokenReserve!="":
                     tokenStr = tokenReserve + tokenStr
                     tokenReserve = ""

0 commit comments