Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client #13196
Conversation
Can you add chat_template_kwargs as a CLI argument as well?
I added it. I tested it using the updated command (you might want to check the escaping of the double quotes).
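A plausible shape of that invocation (a sketch, not the exact command from the comment; single quotes spare you from escaping the inner double quotes):

```bash
# Pass the extra jinja kwargs as a JSON string; single quotes avoid escaping:
llama-server --jinja --chat-template-kwargs '{"enable_thinking": false}'

# Equivalent form with double quotes, escaping the inner ones:
llama-server --jinja --chat-template-kwargs "{\"enable_thinking\": false}"
```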
Very useful for the Qwen3 series. +1 for this feature!
The --chat-template-kwargs option does not work from the CLI: error: invalid argument: --chat-template-kwargs
@ggerganov is there any reason why this PR has not been accepted and merged yet?
This PR is implemented only for llama-server and its webui. llama-cli has unresolved bugs that prevent me from enabling this feature there.
Hope you'll integrate it for the CLI environment soon, thanks!
It would be nice to have an enable_thinking checkbox or something like that in the llama.cpp webui too.
@celsowm Lack of eyes on this area would be my guess. With 438 open PRs (many obsolete), I've kind of come to accept that I'll need to pull in the PRs of interest to me when building.
vLLM and SGLang got this feature the day Qwen3 was released. Meanwhile, many useful enhancement and fix PRs become obsolete simply because of merge delays here in the llama.cpp community. Really sad about that.
This is so necessary when dealing with Qwen3! Can't wait to see this merged and be able to use the latest version with this <3
Thanks @matteoserva, and sorry for the slow review!
It looks to be working well, you guys are so cool!
@ngxson shouldn't this be merged?
Thanks to all contributors, looking forward to it.
I think this PR can also be applied to deepseek-r1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template (sketch below).
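A minimal sketch of that idea (the file name is hypothetical, and the empty `<think>` block is my reading of the intended insertion):

```bash
# Append the suggested snippet to a local copy of the DeepSeek template,
# then serve with it (file name is a placeholder):
cat >> deepseek_r1_template.jinja <<'EOF'
{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}
EOF
llama-server --jinja --chat-template-file deepseek_r1_template.jinja
```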
This is not needed, as there is already a new reasoning-effort param.
The new reasoning param requires restarting the server to change its value. With this PR you can set it per request, e.g. as sketched below. There is ongoing effort to implement this in the future: #13272
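For example, a per-request override could look like this (endpoint and port are llama-server defaults, assumed here):

```bash
# Disable thinking for this request only, via chat_template_kwargs:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "chat_template_kwargs": {"enable_thinking": false}
      }'
```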
I've built a server version of this branch and I'm missing the KV cache % usage from the /metrics endpoint. Flags used for the build: `CXXFLAGS="-march=core2 -mtune=generic" cmake ..`. Sample output of the /metrics endpoint: `# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.`
It may be unrelated to this branch and more related to changes in master (the KV cache refactor). Can you reproduce it with master?
This PR implements support for setting additional jinja parameters. An example of this is `enable_thinking` in the Qwen3 models template.

Main features:
- Set the extra jinja parameters from the command line with `--chat-template-kwargs` or the corresponding environment variable (see the sketch after this list)
- Set them per request with the `chat_template_kwargs` parameter
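A sketch of the launch-time form; the environment variable name is an assumption based on llama.cpp's usual LLAMA_ARG_* mapping of CLI flags:

```bash
# CLI flag form:
llama-server --jinja --chat-template-kwargs '{"enable_thinking": false}'

# Assumed environment-variable form (name follows the LLAMA_ARG_* convention):
LLAMA_ARG_CHAT_TEMPLATE_KWARGS='{"enable_thinking": false}' llama-server --jinja
```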
Notice:

After server: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) #13771, the preferred way to disable thinking with a command line argument is now `--reasoning-budget 0`. The command line setting can anyway be overridden by passing `chat_template_kwargs` with the request to the OAI-compatible API (see the sketch below).

Other info:
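Put together, a sketch of the two levels (model path, port, and the override direction are assumptions based on the description above):

```bash
# Launch with thinking disabled globally (per #13771):
llama-server --jinja --reasoning-budget 0 -m model.gguf

# Re-enable thinking for a single request by overriding via chat_template_kwargs:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"chat_template_kwargs":{"enable_thinking":true}}'
```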
The official template is still only partially compatible, so I modified it to use only supported features. It's here: https://pastebin.com/16ZpCLHk and https://pastebin.com/GGuTbFRc. It should be loaded with `llama-server --jinja --chat-template-file {template_file}`.
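For example (assuming pastebin's usual /raw/ URL scheme and a placeholder file name):

```bash
# Fetch the modified template and load it:
curl -L -o qwen3_template.jinja https://pastebin.com/raw/GGuTbFRc
llama-server --jinja --chat-template-file qwen3_template.jinja
```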
It fixes #13160 and #13189
Test it with:
{"prompt":"\n<|im_start|>user\nGive me a short introduction to large language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"}