Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client #13196

Open

matteoserva wants to merge 16 commits into master from enable_thinking

Conversation

matteoserva
Contributor

@matteoserva matteoserva commented Apr 29, 2025

This PR adds support for passing additional Jinja template parameters.
An example is enable_thinking in the Qwen3 chat template.

Main features:

  • Setting Jinja variables from the command line using --chat_template_kwargs or the corresponding environment variable
  • Setting variables per request in the OpenAI-compatible API using the chat_template_kwargs parameter (see the sketch after this list)
  • Compatibility with the vLLM API
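
As a minimal sketch of the per-request usage (assuming llama-server is running locally on port 8080 with --jinja and a Qwen3 model loaded; not taken verbatim from this PR):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}'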

Other info

The official template is still only partially compatible. I modified it to use only supported features.
It's here: https://pastebin.com/16ZpCLHk https://pastebin.com/GGuTbFRc
It should be loaded with: llama-server --jinja --chat-template-file {template_file}
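
A complete server invocation might look like the following (a sketch; the model file name and template path are placeholders, not part of this PR):

llama-server -m ./Qwen3-8B-Q4_K_M.gguf --jinja --chat-template-file ./qwen3_template.txt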

It fixes #13160 and #13189

Test it with:

  • enable_thinking=false. Expected: {"prompt":"\n<|im_start|>user\nGive me a short introduction to large language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"}
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": false}
}'
  • enable_thinking=true
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": true}
}'
  • enable_thinking undefined
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5
}'

@matteoserva matteoserva requested a review from ngxson as a code owner April 29, 2025 18:58
@matteoserva matteoserva marked this pull request as draft April 29, 2025 18:58
@rhjdvsgsgks
Contributor

Can you add chat_template_kwargs as a CLI argument as well?

@matteoserva
Contributor Author

matteoserva commented Apr 30, 2025

Can you add chat_template_kwargs as a CLI argument as well?

I added it. I tested it using the updated command (you might want to check the escaping of the double quotes):
--chat_template_kwargs "{\"enable_thinking\":false}" --jinja --chat-template-file qwen/qwen3_template.txt
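
On a POSIX shell, single quotes avoid the double-quote escaping entirely (a usage sketch based on the command above; the template path is unchanged):

llama-server --jinja --chat-template-file qwen/qwen3_template.txt --chat_template_kwargs '{"enable_thinking":false}'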

@matteoserva matteoserva force-pushed the enable_thinking branch 2 times, most recently from d1861c4 to 01b58b5 Compare April 30, 2025 15:58
@matteoserva matteoserva changed the title [RFC] handling jinja extra template kwargs (Qwen3 enable_thinking feature) Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client Apr 30, 2025
@matteoserva matteoserva marked this pull request as ready for review April 30, 2025 15:59
@neolee

neolee commented May 1, 2025

Very useful for the Qwen3 series. +1 for this feature!

@xiaomi102

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs
git clone --depth=1 -b enable_thinking https://github.com/matteoserva/llama.cpp

@celsowm

celsowm commented May 9, 2025

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@matteoserva
Contributor Author

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs

This PR is implemented only for llama-server and its webui.

llama-cli has unresolved bugs that prevent me from enabling this feature.

@xiaomi102

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs

This PR is implemented only for llama-server and its webui.

llama-cli has unresolved bugs that prevent me from enabling this feature.

Hope you'll integrate it for the CLI environment soon, thanks!

@matteoserva
Contributor Author

Hope you'll integrate it for the CLI environment soon, thanks!

I'll open a new PR soon for llama-cli. The code is ready but it's blocked by #13402 and #13404

@matteoserva matteoserva force-pushed the enable_thinking branch 2 times, most recently from d44d099 to 5b3de5d Compare May 15, 2025 14:09
@celsowm

celsowm commented May 15, 2025

It would be nice to have an enable_thinking checkbox or something like that in the llama.cpp web UI too.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 15, 2025

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@celsowm Lack of eyes on this area would be my guess.

With 438 open PRs (many obsolete), I've kind of come to accept I'll need to pull in some PRs of interest to me when building.

@neolee

neolee commented May 16, 2025

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@celsowm Lack of eyes on this area would be my guess.

With 438 open PRs (many obsolete), I've kind of come to accept I'll need to pull in some PRs of interest to me when building.

vLLM and SGLang got this feature on the day Qwen3 was released. Meanwhile, many useful enhancement and fix PRs become obsolete just because of merge delays here in the llama.cpp community. Really sad about that.

@ggerganov ggerganov requested a review from ochafik May 16, 2025 05:28
@matteoserva matteoserva requested a review from ggerganov May 16, 2025 06:52
@Neath

Neath commented May 16, 2025

This is so necessary when dealing with Qwen3! Can't wait to see this merged and be able to use the latest version with this <3

Collaborator

@ochafik ochafik left a comment


Thanks @matteoserva, and sorry for the slow review!

@exxocism

exxocism commented Jun 3, 2025

It looks like it's working well, you guys are so cool.

@aviallon
Contributor

aviallon commented Jun 5, 2025

@ngxson shouldn't this be merged?

@zghnwsq

zghnwsq commented Jun 6, 2025

Thanks to all contributors, looking forward to it

@squirrelfish

I think this PR can also be applied to DeepSeek-R1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

@aviallon
Contributor

aviallon commented Jun 6, 2025

I think this PR can also be applied to DeepSeek-R1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

This is not needed, as there is a new reasoning effort param already.

@matteoserva
Contributor Author

I think this PR can also be applied to DeepSeek-R1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

This is not needed, as there is a new reasoning effort param already.

The new reasoning param requires restarting the server to change the value. With this PR you can set it per request.

There is an ongoing effort to implement this in the future: #13272

@rasbid

rasbid commented Jun 9, 2025

I've built a server version of this branch and I'm missing the KV cache % usage from the /metrics endpoint.
Does anyone know why this is happening?

Flags used for build:

# GGML_NATIVE=OFF: don't auto-detect CPU
CXXFLAGS="-march=core2 -mtune=generic" cmake .. \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_AVX=OFF -DGGML_AVX2=OFF \
  -DGGML_AVX512=OFF -DGGML_AVX_VNNI=OFF \
  -DGGML_FMA=OFF -DGGML_F16C=OFF \
  -DGGML_AMX_TILE=OFF -DGGML_AMX_INT8=OFF -DGGML_AMX_BF16=OFF \
  -DGGML_SSE42=ON \
  -DCMAKE_BUILD_TYPE=Release

Sample output of /metrics endpoint:

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 143
# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total 8.168
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 673
# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total 29.567
# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 673
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode 1
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 17.5073
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 22.7619
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0

@aviallon
Contributor

aviallon commented Jun 9, 2025

I've built a server version of this branch and I'm missing the KV cache % usage from the /metrics endpoint.
Does anyone know why this is happening?

It may be unrelated to this branch and more related to changes in master (the KV cache refactor). Can you reproduce it with master?

@ggerganov
Member

@rasbid These metrics have been removed recently - see the changelog: #9291


Successfully merging this pull request may close these issues.

Misc. bug: Qwen 3.0 "enable_thinking" parameter not working