Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client #13196

Open

matteoserva wants to merge 16 commits into master from enable_thinking

Conversation

matteoserva
Contributor

@matteoserva matteoserva commented Apr 29, 2025

This PR adds support for passing additional Jinja template parameters.
An example is enable_thinking in the Qwen3 chat template.

Main features:

  • Setting Jinja variables from the command line using --chat_template_kwargs or the corresponding environment variable
  • Setting variables per request in the OpenAI-compatible API using the chat_template_kwargs parameter (see the sketch after this list)
  • Compatibility with the vLLM API
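
As a minimal sketch of the per-request usage (assuming llama-server is running locally on port 8080 with --jinja and a Qwen3 model loaded; not taken verbatim from this PR):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}'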

Other info

The official template is still only partially compatible. I modified it to use only supported features.
It's here: https://pastebin.com/16ZpCLHk https://pastebin.com/GGuTbFRc
It should be loaded with: llama-server --jinja --chat-template-file {template_file}
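
A complete server invocation might look like the following (a sketch; the model file name and template path are placeholders, not part of this PR):

llama-server -m ./Qwen3-8B-Q4_K_M.gguf --jinja --chat-template-file ./qwen3_template.txt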

It fixes #13160 and #13189

Test it with:

  • enable_thinking=false. Expected: {"prompt":"\n<|im_start|>user\nGive me a short introduction to large language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"}
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": false}
}'
  • enable_thinking=true
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": true}
}'
  • enable_thinking undefined
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5
}'

@matteoserva matteoserva requested a review from ngxson as a code owner April 29, 2025 18:58
@matteoserva matteoserva marked this pull request as draft April 29, 2025 18:58
@rhjdvsgsgks
Contributor

Can you add chat_template_kwargs as a CLI argument as well?

@matteoserva
Contributor Author

matteoserva commented Apr 30, 2025

Can you add chat_template_kwargs as a CLI argument as well?

I added it. I tested it using the updated command (you might want to check the escaping of the double quotes):
--chat_template_kwargs "{\"enable_thinking\":false}" --jinja --chat-template-file qwen/qwen3_template.txt
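
On a POSIX shell, single quotes avoid the double-quote escaping entirely (a usage sketch based on the command above; the template path is unchanged):

llama-server --jinja --chat-template-file qwen/qwen3_template.txt --chat_template_kwargs '{"enable_thinking":false}'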

@matteoserva matteoserva force-pushed the enable_thinking branch 2 times, most recently from d1861c4 to 01b58b5 Compare April 30, 2025 15:58
@matteoserva matteoserva changed the title [RFC] handling jinja extra template kwargs (Qwen3 enable_thinking feature) Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client Apr 30, 2025
@matteoserva matteoserva marked this pull request as ready for review April 30, 2025 15:59
@neolee

neolee commented May 1, 2025

Very useful for the Qwen3 series. +1 for this feature!

@xiaomi102

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs
git clone --depth=1 -b enable_thinking https://github.com/matteoserva/llama.cpp

@celsowm

celsowm commented May 9, 2025

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@matteoserva
Contributor Author

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs

This PR is implemented only for llama-server and its webui.

llama-cli has unresolved bugs that prevent me from enabling this feature.

@xiaomi102

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs

This PR is implemented only for llama-server and its webui.

llama-cli has unresolved bugs that prevent me from enabling this feature.

Hope you'll integrate it for the CLI environment soon, thanks!

@matteoserva
Contributor Author

Hope you'll integrate it for the CLI environment soon, thanks!

I'll open a new PR soon for llama-cli. The code is ready but it's blocked by #13402 and #13404

@matteoserva matteoserva force-pushed the enable_thinking branch 2 times, most recently from d44d099 to 5b3de5d Compare May 15, 2025 14:09
@celsowm

celsowm commented May 15, 2025

It would be nice to have an enable_thinking checkbox or something like that in the llama.cpp web UI too.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented May 15, 2025

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@celsowm Lack of eyes on this area would be my guess.

With 438 open PRs (many obsolete), I've kind of come to accept I'll need to pull in some PRs of interest to me when building.

@neolee

neolee commented May 16, 2025

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@celsowm Lack of eyes on this area would be my guess.

With 438 open PRs (many obsolete), I've kind of come to accept I'll need to pull in some PRs of interest to me when building.

vLLM and SGLang got this feature on the day Qwen3 was released. Meanwhile, many useful enhancement and fix PRs become obsolete just because of merge delays here in the llama.cpp community. Really sad about that.

@ggerganov ggerganov requested a review from ochafik May 16, 2025 05:28
@matteoserva matteoserva requested a review from ggerganov May 16, 2025 06:52
@Neath

Neath commented May 16, 2025

This is so necessary when dealing with Qwen3! Can't wait to see this merged and be able to use the latest version with this <3

Collaborator

@ochafik ochafik left a comment


Thanks @matteoserva, and sorry for the slow review!

@exxocism

exxocism commented Jun 3, 2025

It looks like it's working well, you guys are so cool.

@aviallon
Contributor

aviallon commented Jun 5, 2025

@ngxson shouldn't this be merged?

@zghnwsq

zghnwsq commented Jun 6, 2025

Thanks to all contributors, looking forward to it

@squirrelfish

I think this PR can also be applied to DeepSeek-R1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

@aviallon
Contributor

aviallon commented Jun 6, 2025

I think this PR can also be applied to DeepSeek-R1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

This is not needed, as there is a new reasoning effort param already.

@matteoserva
Contributor Author

I think this PR can also be applied to DeepSeek-R1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

This is not needed, as there is a new reasoning effort param already.

The new reasoning param requires restarting the server to change the value. With this PR you can set it per request.

There is an ongoing effort to implement this in the future: #13272

@rasbid

rasbid commented Jun 9, 2025

I've built a server version of this branch and I'm missing the KV cache % usage from the /metrics endpoint.
Does anyone know why this is happening?

Flags used for build:

# GGML_NATIVE=OFF: don't auto-detect CPU
CXXFLAGS="-march=core2 -mtune=generic" cmake .. \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_AVX=OFF -DGGML_AVX2=OFF \
  -DGGML_AVX512=OFF -DGGML_AVX_VNNI=OFF \
  -DGGML_FMA=OFF -DGGML_F16C=OFF \
  -DGGML_AMX_TILE=OFF -DGGML_AMX_INT8=OFF -DGGML_AMX_BF16=OFF \
  -DGGML_SSE42=ON \
  -DCMAKE_BUILD_TYPE=Release

Sample output of /metrics endpoint:

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 143
# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total 8.168
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 673
# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total 29.567
# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 673
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode 1
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 17.5073
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 22.7619
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0

@aviallon
Contributor

aviallon commented Jun 9, 2025

I've built a server version of this branch and I'm missing the KV cache % usage from the /metrics endpoint.
Does anyone know why this is happening?

It may be unrelated to this branch and more related to changes in master (the KV cache refactor). Can you reproduce it with master?

@ggerganov
Member

@rasbid These metrics have been removed recently - see the changelog: #9291


Successfully merging this pull request may close these issues.

Misc. bug: Qwen 3.0 "enable_thinking" parameter not working