Releases: ngxson/llama.cpp
b5535
arm64: optimize q4_k_q8_k kernel with i8mm (#13886)

This PR improves the q4_k_q8_k gemm kernel with the arm64 i8mm instruction.

Tested on neoverse-n2 with the llama3 8b q4_k_m quantization model.

- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
    -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    --no-mmap -fa \
    -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
    -npl 1,2,4,8,16,32 \
    -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |      S_TG t/s       |
|       |        |      | original | this pr  | original | this pr  |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
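The uplift ranges quoted in this release note can be checked against the benchmark table with a few lines of arithmetic. The sketch below is not part of the PR: the helper names `uplift_percent` and `uplift_table` are hypothetical, and the only inputs are the throughput figures copied from the table. It assumes uplift is defined as `new / original - 1`.

```python
# Throughput in tokens/s, copied from the llama-batched-bench table.
# Each entry maps batch size B -> (original, this PR).
S_PP = {
    1: (110.12, 147.83),
    2: (121.16, 172.42),
    4: (120.15, 169.75),
    8: (130.97, 196.81),
    16: (131.01, 196.88),
    32: (130.85, 196.51),
}
S_TG = {
    1: (24.36, 24.28),
    2: (46.36, 47.93),
    4: (74.68, 84.00),
    8: (91.04, 114.74),
    16: (101.43, 135.79),
    32: (106.97, 147.29),
}


def uplift_percent(original: float, new: float) -> float:
    """Relative change of `new` over `original`, in percent (assumed definition)."""
    return (new / original - 1.0) * 100.0


def uplift_table(results: dict) -> dict:
    """Map each batch size to its uplift in percent."""
    return {b: uplift_percent(orig, new) for b, (orig, new) in results.items()}


pp_uplift = uplift_table(S_PP)
tg_uplift = uplift_table(S_TG)

if __name__ == "__main__":
    for b in sorted(S_PP):
        print(f"B={b:>2}  S_PP {pp_uplift[b]:+6.1f}%  S_TG {tg_uplift[b]:+6.1f}%")
```

Run as a script, this prints one line per batch size. The prompt-processing uplift falls within the quoted 34% to 50% range for every batch size, and the token-generation uplift within 12% to 37% only from batch size 4 upward; at batch size 1 the S_TG figures are essentially unchanged.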
b5534
cmake: Factor out CPU architecture detection (#13883)

* cmake: Define function for querying architecture

  The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
b5533
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Al…
b5530
llama : add RobertaForSequenceClassification reranker support (#13875)
b5529
ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)

* F32-Mamba-SVE
* F32-Mamba-SVE
* Resolve test errors-1
* Resolve test errors-2
* F32-vec-SVE
* F32-vec-SVE
* F32-vec-SVE
b5527
llama : fix KV shift for qwen2vl (#13870)

* llama : fix KV shift for qwen2vl
* add ref to the PR
b5524
llama : add support for BertForSequenceClassification reranker (#13858)

* convert: add support for BertForSequenceClassification
* add support for reranking using BertForSequenceClassification
* merge checks of eos and sep
* fix lint

---------

Co-authored-by: dinhhuy
b5523
convert: small addition to support LlamaModel (#13838)

Co-authored-by: dinhhuy
b5519
CUDA: fix FA tg at long context for CC >= 8.9 (#13852)
b5517
CANN: Add SOC TYPE printing in cmake configuration (#13837)