|`--keep N`| number of tokens to keep from the initial prompt (default: 0, -1 = all) |
|`--numa TYPE`| attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
|`-dev, --device <dev1,dev2,..>`| comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
|`--list-devices`| print list of available devices and exit |
|`--override-tensor, -ot <tensor name pattern>=<buffer type>,...`| override tensor buffer type |
|`-ngl, --gpu-layers, --n-gpu-layers N`| number of layers to store in VRAM<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
|`-sm, --split-mode {none,layer,row}`| how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
|`-ts, --tensor-split N0,N1,N2,...`| fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
|`--control-vector-layer-range START END`| layer range to apply the control vector(s) to, start and end inclusive |
|`-m, --model FNAME`| model path (default: `models/$filename` with filename from `--hf-file` or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)<br/>(env: LLAMA_ARG_MODEL) |
|`-mu, --model-url MODEL_URL`| model download url (default: unused)<br/>(env: LLAMA_ARG_MODEL_URL) |
|`-hf, -hfr, --hf-repo <user>/<model>[:quant]`| Hugging Face model repository; quant is optional, case-insensitive, and defaults to Q4_K_M, falling back to the first file in the repo if Q4_K_M doesn't exist.<br/>mmproj is also downloaded automatically if available; to disable, add --no-mmproj<br/>example: unsloth/phi-4-GGUF:q4_k_m<br/>(default: unused)<br/>(env: LLAMA_ARG_HF_REPO) |
|`-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]`| Same as --hf-repo, but for the draft model (default: unused)<br/>(env: LLAMA_ARG_HFD_REPO) |
|`-hff, --hf-file FILE`| Hugging Face model file. If specified, it will override the quant in --hf-repo (default: unused)<br/>(env: LLAMA_ARG_HF_FILE) |
|`-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]`| Hugging Face model repository for the vocoder model (default: unused)<br/>(env: LLAMA_ARG_HF_REPO_V) |
|`-hffv, --hf-file-v FILE`| Hugging Face model file for the vocoder model (default: unused)<br/>(env: LLAMA_ARG_HF_FILE_V) |
|`-hft, --hf-token TOKEN`| Hugging Face access token (default: value from HF_TOKEN environment variable)<br/>(env: HF_TOKEN) |
|`-v, --verbose, --log-verbose`| Set verbosity level to infinity (i.e. log all messages, useful for debugging) |
|`-lv, --verbosity, --log-verbosity N`| Set the verbosity threshold. Messages with a higher verbosity will be ignored.<br/>(env: LLAMA_LOG_VERBOSITY) |
|`--log-prefix`| Enable prefix in log messages<br/>(env: LLAMA_LOG_PREFIX) |
|`--log-timestamps`| Enable timestamps in log messages<br/>(env: LLAMA_LOG_TIMESTAMPS) |
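
Putting a few of the common params together, a typical launch might look like the following sketch. It assumes the server binary is named `llama-server` and reuses the Hugging Face repo already given as the example for `-hf`; adjust both for your setup.

```bash
# Show which devices are available for offloading (see -dev / --device).
llama-server --list-devices

# Download unsloth/phi-4-GGUF at the Q4_K_M quant from Hugging Face and
# store 99 layers in VRAM.
llama-server -hf unsloth/phi-4-GGUF:q4_k_m -ngl 99
```
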
**Sampling params**

| Argument | Explanation |
| -------- | ----------- |
|`--samplers SAMPLERS`| samplers that will be used for generation, in the order given, separated by ';'<br/>(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature) |
|`-s, --seed SEED`| RNG seed (default: -1, use random seed for -1) |
|`--sampling-seq, --sampler-seq SEQUENCE`| simplified sequence for samplers that will be used (default: edskypmxt) |
|`--ignore-eos`| ignore end of stream token and continue generating (implies --logit-bias EOS-inf) |
|`--grammar GRAMMAR`| BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '') |
|`--grammar-file FNAME`| file to read grammar from |
|`-j, --json-schema SCHEMA`| JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object<br/>For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
|`-jf, --json-schema-file FILE`| File containing a JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object<br/>For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
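
To illustrate how the sampler flags relate, the two invocations below are meant to request the same reduced sampler chain. The model path is just the documented default, and the single-letter codes are inferred by lining up the default `--samplers` list with the default `--sampling-seq` value (e = penalties, d = dry, s = top_n_sigma, k = top_k, y = typ_p, p = top_p, m = min_p, x = xtc, t = temperature); treat this as a sketch rather than authoritative documentation.

```bash
# Explicit sampler chain, separated by ';'.
llama-server -m models/7B/ggml-model-f16.gguf --seed 42 \
  --samplers "top_k;top_p;min_p;temperature"

# The same chain written as a simplified sampler sequence.
llama-server -m models/7B/ggml-model-f16.gguf --seed 42 \
  --sampling-seq kpmt
```
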
**Example-specific params**

| Argument | Explanation |
| -------- | ----------- |
|`--no-context-shift`| disables context shift on infinite text generation (default: disabled)<br/>(env: LLAMA_ARG_NO_CONTEXT_SHIFT) |
|`-sp, --special`| special tokens output enabled (default: false) |
|`--no-warmup`| skip warming up the model with an empty run |
|`--spm-infill`| use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. (default: disabled) |
|`--pooling {none,mean,cls,last,rank}`| pooling type for embeddings, use model default if unspecified<br/>(env: LLAMA_ARG_POOLING) |
|`--mmproj FILE`| path to a multimodal projector file. see tools/mtmd/README.md<br/>note: if -hf is used, this argument can be omitted<br/>(env: LLAMA_ARG_MMPROJ) |
|`--mmproj-url URL`| URL to a multimodal projector file. see tools/mtmd/README.md<br/>(env: LLAMA_ARG_MMPROJ_URL) |
|`--no-mmproj`| explicitly disable multimodal projector, useful when using -hf<br/>(env: LLAMA_ARG_NO_MMPROJ) |
|`--no-mmproj-offload`| do not offload multimodal projector to GPU<br/>(env: LLAMA_ARG_NO_MMPROJ_OFFLOAD) |
|`-a, --alias STRING`| set alias for model name (to be used by REST API)<br/>(env: LLAMA_ARG_ALIAS) |
|`--host HOST`| ip address to listen, or bind to a UNIX socket if the address ends with .sock (default: 127.0.0.1)<br/>(env: LLAMA_ARG_HOST) |
|`--port PORT`| port to listen (default: 8080)<br/>(env: LLAMA_ARG_PORT) |
|`--path PATH`| path to serve static files from (default: )<br/>(env: LLAMA_ARG_STATIC_PATH) |
|`--no-webui`| Disable the Web UI (default: enabled)<br/>(env: LLAMA_ARG_NO_WEBUI) |
|`--props`| enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
|`--slot-save-path PATH`| path to save slot kv cache (default: disabled) |
|`--jinja`| use jinja template for chat (default: disabled)<br/>(env: LLAMA_ARG_JINJA) |
|`--reasoning-format FORMAT`| reasoning format (default: deepseek; allowed values: deepseek, none)<br/>controls whether thought tags are extracted from the response, and in which format they're returned. 'none' leaves thoughts unparsed in `message.content`, 'deepseek' puts them in `message.reasoning_content` (for DeepSeek R1 & Command R7B only).<br/>only supported for non-streamed responses<br/>(env: LLAMA_ARG_THINK) |
|`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
|`--chat-template-file JINJA_TEMPLATE_FILE`| set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
|`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)<br/> |
|`--lora-init-without-apply`| load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
|`--draft-max, --draft, --draft-n N`| number of tokens to draft for speculative decoding (default: 16)<br/>(env: LLAMA_ARG_DRAFT_MAX) |
|`--draft-min, --draft-n-min N`| minimum number of draft tokens to use for speculative decoding (default: 5)<br/>(env: LLAMA_ARG_DRAFT_MIN) |
|`-cd, --ctx-size-draft N`| size of the prompt context for the draft model (default: 0, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE_DRAFT) |
|`-devd, --device-draft <dev1,dev2,..>`| comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
|`-ngld, --gpu-layers-draft, --n-gpu-layers-draft N`| number of layers to store in VRAM for the draft model<br/>(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) |
|`-md, --model-draft FNAME`| draft model for speculative decoding (default: unused)<br/>(env: LLAMA_ARG_MODEL_DRAFT) |
|`-mv, --model-vocoder FNAME`| vocoder model for audio generation (default: unused) |
|`--tts-use-guide-tokens`| Use guide tokens to improve TTS word recall |
|`--embd-bge-small-en-default`| use default bge-small-en-v1.5 model (note: can download weights from the internet) |
|`--embd-e5-small-en-default`| use default e5-small-v2 model (note: can download weights from the internet) |
|`--embd-gte-small-default`| use default gte-small model (note: can download weights from the internet) |
|`--fim-qwen-1.5b-default`| use default Qwen 2.5 Coder 1.5B (note: can download weights from the internet) |
|`--fim-qwen-3b-default`| use default Qwen 2.5 Coder 3B (note: can download weights from the internet) |
|`--fim-qwen-7b-default`| use default Qwen 2.5 Coder 7B (note: can download weights from the internet) |
|`--fim-qwen-7b-spec`| use Qwen 2.5 Coder 7B + 0.5B draft for speculative decoding (note: can download weights from the internet) |
|`--fim-qwen-14b-spec`| use Qwen 2.5 Coder 14B + 0.5B draft for speculative decoding (note: can download weights from the internet) |
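
As a sketch of how some of the example-specific params combine, the launches below use placeholder `.gguf` file names (they are not files shipped with the project):

```bash
# Speculative decoding: a main model plus a smaller draft model, drafting up to
# 16 tokens per step and at least 5 (the documented defaults), with the draft
# model's layers kept in VRAM.
llama-server -m main-model.gguf -md draft-model.gguf \
  --draft-max 16 --draft-min 5 -ngld 99

# Multimodal: attach a projector file explicitly, keep it off the GPU, and bind
# to a UNIX socket instead of a TCP port (addresses ending in .sock).
llama-server -m main-model.gguf --mmproj mmproj.gguf --no-mmproj-offload \
  --host ./llama-server.sock
```
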
Note: If both a command line argument and an environment variable are set for the same param, the argument takes precedence over the env var.
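
For example, assuming the same `llama-server` binary name as in the sketches above:

```bash
# LLAMA_ARG_PORT is set, but the explicit --port flag wins, so the server listens on 8080.
LLAMA_ARG_PORT=9000 llama-server -m models/7B/ggml-model-f16.gguf --port 8080
```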