server: args for draft model cache types (#11200) #13782
@ggerganov forgot to merge?
@ngxson was added automatically on pull request creation, so merging probably needs 2 approvals. I'm not sure what I can do about the failed arm64/ppc64/riscv64 builds, as I have no access to that kind of hardware. Looking at the build logs, all of them failed during the initial apt operations (maybe the apt caches/mirrors for these architectures were down at check time, or something similar):
E: Failed to fetch https://security.ubuntu.com/ubuntu/dists/noble-security/main/binary-riscv64/Packages 404 Not Found [IP: 52.252.163.49 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
libcurl4t64:riscv64 : Depends: libgssapi-krb5-2:riscv64 (>= 1.17) but it is not going to be installed
libssh-4:riscv64 : Depends: libgssapi-krb5-2:riscv64 (>= 1.17) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
No, this can be merged now, but I just want to make sure it's not on hold for some reason first. :)
Yes, these failures are irrelevant; they have since been fixed in CI.
Yes, thanks for the reminder.
Should fix #11200 while keeping the f16 default from #10586.
New command line arguments:
-ctkd, --cache-type-k-draft TYPE
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)
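
As a rough illustration of how the two new options might be wired into a launcher script, here is a minimal Python sketch. The helper names `draft_cache_args` and `draft_cache_env` are hypothetical and not part of llama.cpp, and the model file names are placeholders. Only the option names, environment variable names, allowed values, and defaults are taken from the help text above.

```python
# Allowed values, copied from the help text above.
ALLOWED_CACHE_TYPES = (
    "f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1",
)


def draft_cache_args(type_k="f16", type_v="f16"):
    """Hypothetical helper: CLI arguments selecting the draft model's K/V cache types."""
    for name, value in (("K", type_k), ("V", type_v)):
        if value not in ALLOWED_CACHE_TYPES:
            raise ValueError(f"unsupported draft {name} cache type: {value!r}")
    return ["--cache-type-k-draft", type_k, "--cache-type-v-draft", type_v]


def draft_cache_env(type_k="f16", type_v="f16"):
    """Hypothetical helper: the same settings expressed as environment variables."""
    draft_cache_args(type_k, type_v)  # reuse the validation above
    return {
        "LLAMA_ARG_CACHE_TYPE_K_DRAFT": type_k,
        "LLAMA_ARG_CACHE_TYPE_V_DRAFT": type_v,
    }


# Placeholder model paths; substitute real GGUF files.
command = ["llama-server", "-m", "main.gguf", "-md", "draft.gguf"]
command += draft_cache_args("q8_0", "q8_0")
print(" ".join(command))
```

Validating against the allowed list before launching is only a convenience for catching typos early; the server performs its own argument checking.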