
Support models that don't split stream chunks in tokens #8235


Merged: 2 commits into stanfordnlp:main on May 17, 2025

Conversation

chenmoneygithub (Collaborator)

Resolve #8211

Gemini models don't split stream chunks into 1-2 tokens per chunk; instead, each chunk can contain many tokens. We are reworking the stream listeners' logic to support this case.

Side note: I do understand the rationale for buffering tokens for safety checks, but why not stream back the original tokens for consistency with other models? The current behavior also creates a burden for app developers who want to render the stream in a UI.
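A rough sketch of the chunk-size-agnostic listener logic described above (a hypothetical simplification, not DSPy's actual implementation; the class name and the end-marker string are assumptions). Instead of assuming each chunk holds 1-2 tokens, it buffers incoming text and scans for the field's end marker, emitting everything confirmed to precede it:

```python
class ChunkAgnosticListener:
    """Consumes stream chunks of any size and emits text incrementally."""

    def __init__(self, end_marker: str = "[[ ## completed ## ]]"):
        self.end_marker = end_marker
        self.buffer = ""
        self.done = False

    def receive(self, chunk_text: str) -> str:
        """Consume one chunk (any number of tokens); return text safe to emit."""
        if self.done:
            return ""
        self.buffer += chunk_text
        idx = self.buffer.find(self.end_marker)
        if idx != -1:
            # Marker found: emit everything before it and stop.
            out = self.buffer[:idx]
            self.buffer = ""
            self.done = True
            return out
        # Hold back a tail in case the marker straddles a chunk
        # boundary; emit the rest immediately.
        keep = len(self.end_marker) - 1
        out = self.buffer[:-keep] if len(self.buffer) > keep else ""
        self.buffer = self.buffer[len(out):]
        return out
```

The same listener then yields identical output whether the provider streams token-by-token or, as with Gemini, delivers many tokens per chunk.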

@okhat okhat merged commit 5d31cd1 into stanfordnlp:main May 17, 2025
3 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug] Google Models do not stream chunks using StreamListener