Releases: ngxson/llama.cpp
Releases · ngxson/llama.cpp
b4093
scripts: update compare-llama-bench.py (#10319)
b4091
cmake : fix ppc64 check (whisper/0) ggml-ci
b4088
AVX BF16 and single scale quant optimizations (#10212) * use 128 bit loads (i've tried 256->128 to death and its slower) * double accumulator * avx bf16 vec dot * +3% q4_0 inference * +7% tg +5% pp compared to master * slower f16c version, kep for reference * 256b version, also slow. i tried :) * revert f16 * faster with madd * split to functions * Q8_0 and IQ4_NL, 5-7% faster * fix potential overflow (performance reduced) * 16 bit add for q4_0 only * merge
b4085
server : (web UI) add copy button for code block, fix api key (#10242) * server : (web ui) add copy btn for code blocks * fix problem with api key * use settings-modal-short-input component * always show copy btn for code snippet
b4082
sycl: Use syclcompat::dp4a (#10267) * sycl: Use syclcompat::dp4a * Using the syclcompat version allow the compiler to optimize the operation with native function * Update news section * Update CI Windows oneAPI version to 2025.0 * Reword doc * Call syclcompat::dp4a inside dpct::dp4a This reverts commit 90cb61d692d61360b46954a1c7f780bd2e569b73.
b4081
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921) * backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels --------- Co-authored-by: Diego Devesa <[email protected]>
b4080
ggml : build backends as libraries (#10256) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: R0CKSTAR <[email protected]>
b4079
CUDA: no -sm row for very small matrices (#10185)
b4078
speculative : fix out-of-bounds access (#10289)
b4077
vulkan: Optimize binary ops (#10270) Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0/src1 are the same dimensions and additional modulus for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that have a fast path when the calculation isn't needed or can be done more cheaply.