Commit 0e21835

ggerganov authored and arthw committed
server : enable KV cache defrag by default (ggml-org#10233)
ggml-ci
1 parent c7d7613 commit 0e21835

2 files changed: 12 additions, 10 deletions

common/common.h

Lines changed: 1 addition & 1 deletion
@@ -178,7 +178,7 @@ struct common_params {
     float yarn_beta_fast = 32.0f; // YaRN low correction dim
     float yarn_beta_slow = 1.0f; // YaRN high correction dim
     int32_t yarn_orig_ctx = 0; // YaRN original context length
-    float defrag_thold = -1.0f; // KV cache defragmentation threshold
+    float defrag_thold = 0.1f; // KV cache defragmentation threshold

     struct cpu_params cpuparams;
     struct cpu_params cpuparams_batch;
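
To make the effect of the new default concrete, here is a minimal sketch of how a threshold like `defrag_thold` could gate defragmentation. Only the field name, its old and new defaults, and the "negative means disabled" rule come from this commit. The `defrag_config` struct, the helper functions, and the fragmentation metric (the fraction of allocated cells that are unused) are assumptions made for illustration and may not match the actual llama.cpp logic.

```cpp
#include <cstddef>

// Hypothetical stand-in for the field changed in common/common.h.
struct defrag_config {
    float defrag_thold = 0.1f; // new default in this commit; previously -1.0f
};

// Per the README row for --defrag-thold, a value below 0 disables defragmentation.
inline bool defrag_enabled(const defrag_config & cfg) {
    return cfg.defrag_thold >= 0.0f;
}

// ASSUMPTION: fragmentation is modeled as the fraction of allocated KV cells
// that are currently unused. The real metric may differ.
inline float fragmentation(std::size_t used_cells, std::size_t total_cells) {
    if (total_cells == 0) {
        return 0.0f;
    }
    return 1.0f - static_cast<float>(used_cells) / static_cast<float>(total_cells);
}

// Defragment only when the feature is enabled and the (assumed) fragmentation
// exceeds the configured threshold.
inline bool should_defrag(const defrag_config & cfg, std::size_t used_cells, std::size_t total_cells) {
    return defrag_enabled(cfg) && fragmentation(used_cells, total_cells) > cfg.defrag_thold;
}
```

Under these assumptions, a default-constructed config triggers defragmentation once more than 10% of the cells are unused, while setting `defrag_thold` to a negative value restores the previous always-off behavior.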

examples/server/README.md

Lines changed: 11 additions & 9 deletions
@@ -39,7 +39,7 @@ The project is under active development, and we are [looking for feedback and co
 | `--cpu-strict-batch <0\|1>` | use strict CPU placement (default: same as --cpu-strict) |
 | `--prio-batch N` | set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: 0)<br/> |
 | `--poll-batch <0\|1>` | use polling to wait for work (default: same as --poll) |
-| `-c, --ctx-size N` | size of the prompt context (default: 0, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE) |
+| `-c, --ctx-size N` | size of the prompt context (default: 4096, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE) |
 | `-n, --predict, --n-predict N` | number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)<br/>(env: LLAMA_ARG_N_PREDICT) |
 | `-b, --batch-size N` | logical maximum batch size (default: 2048)<br/>(env: LLAMA_ARG_BATCH) |
 | `-ub, --ubatch-size N` | physical maximum batch size (default: 512)<br/>(env: LLAMA_ARG_UBATCH) |
@@ -64,7 +64,7 @@ The project is under active development, and we are [looking for feedback and co
 | `-nkvo, --no-kv-offload` | disable KV offload<br/>(env: LLAMA_ARG_NO_KV_OFFLOAD) |
 | `-ctk, --cache-type-k TYPE` | KV cache data type for K (default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_K) |
 | `-ctv, --cache-type-v TYPE` | KV cache data type for V (default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_V) |
-| `-dt, --defrag-thold N` | KV cache defragmentation threshold (default: -1.0, < 0 - disabled)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
+| `-dt, --defrag-thold N` | KV cache defragmentation threshold (default: 0.1, < 0 - disabled)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
 | `-np, --parallel N` | number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
 | `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
 | `--no-mmap` | do not memory-map model (slower load but may reduce pageouts if not using mlock)<br/>(env: LLAMA_ARG_NO_MMAP) |
@@ -99,25 +99,27 @@ The project is under active development, and we are [looking for feedback and co

 | Argument | Explanation |
 | -------- | ----------- |
-| `--samplers SAMPLERS` | samplers that will be used for generation in the order, separated by ';'<br/>(default: top_k;typ_p;top_p;min_p;temperature) |
+| `--samplers SAMPLERS` | samplers that will be used for generation in the order, separated by ';'<br/>(default: dry;top_k;typ_p;top_p;min_p;xtc;temperature) |
 | `-s, --seed SEED` | RNG seed (default: -1, use random seed for -1) |
-| `--sampling-seq SEQUENCE` | simplified sequence for samplers that will be used (default: kfypmt) |
+| `--sampling-seq SEQUENCE` | simplified sequence for samplers that will be used (default: dkypmxt) |
 | `--ignore-eos` | ignore end of stream token and continue generating (implies --logit-bias EOS-inf) |
 | `--penalize-nl` | penalize newline tokens (default: false) |
 | `--temp N` | temperature (default: 0.8) |
 | `--top-k N` | top-k sampling (default: 40, 0 = disabled) |
 | `--top-p N` | top-p sampling (default: 0.9, 1.0 = disabled) |
 | `--min-p N` | min-p sampling (default: 0.1, 0.0 = disabled) |
+| `--xtc-probability N` | xtc probability (default: 0.0, 0.0 = disabled) |
+| `--xtc-threshold N` | xtc threshold (default: 0.1, 1.0 = disabled) |
 | `--typical N` | locally typical sampling, parameter p (default: 1.0, 1.0 = disabled) |
 | `--repeat-last-n N` | last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size) |
 | `--repeat-penalty N` | penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled) |
 | `--presence-penalty N` | repeat alpha presence penalty (default: 0.0, 0.0 = disabled) |
 | `--frequency-penalty N` | repeat alpha frequency penalty (default: 0.0, 0.0 = disabled) |
-| `--dry-multiplier N` | DRY sampling multiplier (default: 0.0, 0.0 = disabled) |
-| `--dry-base N` | DRY sampling base value (default: 1.75) |
-| `--dry-allowed-length N` | allowed length for DRY sampling (default: 2) |
-| `--dry-penalty-last-n N` | DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size) |
-| `--dry-sequence-breaker STRING` | add sequence breaker for DRY sampling, clearing out default breakers (`['\n', ':', '"', '*']`) in the process; use `"none"` to not use any sequence breakers
+| `--dry-multiplier N` | set DRY sampling multiplier (default: 0.0, 0.0 = disabled) |
+| `--dry-base N` | set DRY sampling base value (default: 1.75) |
+| `--dry-allowed-length N` | set allowed length for DRY sampling (default: 2) |
+| `--dry-penalty-last-n N` | set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size) |
+| `--dry-sequence-breaker STRING` | add sequence breaker for DRY sampling, clearing out default breakers ('\n', ':', '"', '*') in the process; use "none" to not use any sequence breakers<br/> |
 | `--dynatemp-range N` | dynamic temperature range (default: 0.0, 0.0 = disabled) |
 | `--dynatemp-exp N` | dynamic temperature exponent (default: 1.0) |
 | `--mirostat N` | use Mirostat sampling.<br/>Top K, Nucleus and Locally Typical samplers are ignored if used.<br/>(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) |
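
The new `--sampling-seq` default (`dkypmxt`) and the new `--samplers` default (`dry;top_k;typ_p;top_p;min_p;xtc;temperature`) appear to list the same samplers in the same order. The sketch below expands a short sequence into the long form. The letter-to-name pairing is inferred by aligning those two defaults, and the function names are hypothetical, so treat this as an illustration and not as the server's actual parser.

```cpp
#include <string>

// Letter-to-name pairing inferred from the two defaults in the README diff:
//   "dkypmxt"  <->  "dry;top_k;typ_p;top_p;min_p;xtc;temperature"
// ASSUMPTION: each letter maps to exactly one sampler name, as listed here.
inline std::string sampler_name_for(char c) {
    switch (c) {
        case 'd': return "dry";
        case 'k': return "top_k";
        case 'y': return "typ_p";
        case 'p': return "top_p";
        case 'm': return "min_p";
        case 'x': return "xtc";
        case 't': return "temperature";
        default:  return ""; // unknown letter in this sketch
    }
}

// Expand a short sequence into the ';'-separated form used by --samplers.
// Letters this sketch does not recognize are skipped.
inline std::string expand_sampling_seq(const std::string & seq) {
    std::string out;
    for (char c : seq) {
        const std::string name = sampler_name_for(c);
        if (name.empty()) {
            continue;
        }
        if (!out.empty()) {
            out += ';';
        }
        out += name;
    }
    return out;
}
```

With this mapping, expanding `dkypmxt` reproduces the new `--samplers` default string exactly, which is a quick consistency check on the two README rows.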
