Releases: ngxson/llama.cpp
b5504
kv-cells : track min/max used cells and per-sequence positions (#13808)

* kv-cells : track min/max used cells and per-sequence positions

  ggml-ci

* kv-cells : fix pos-modification updates for seq_pos

  ggml-ci

* kv-cells : add comments

  ggml-ci
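
To illustrate the kind of bookkeeping this entry describes, here is a toy sketch of a cell table that can report the range of used cells and the range of positions held per sequence. It is not the llama.cpp data structure: the class `CellTracker` and all of its methods are hypothetical names, and it recomputes the ranges on demand for brevity instead of maintaining them incrementally.

```python
class CellTracker:
    """Toy stand-in for a KV-cell table (hypothetical, not llama.cpp code)."""

    def __init__(self, n_cells):
        self.pos = [None] * n_cells                  # position held by each cell, None if empty
        self.seqs = [set() for _ in range(n_cells)]  # sequence ids that share each cell
        self.used = set()                            # indices of non-empty cells

    def set(self, cell, pos, seq_id):
        """Store a position in a cell and attach it to a sequence."""
        self.pos[cell] = pos
        self.seqs[cell].add(seq_id)
        self.used.add(cell)

    def clear(self, cell):
        """Empty a cell and detach it from every sequence."""
        self.pos[cell] = None
        self.seqs[cell].clear()
        self.used.discard(cell)

    def used_range(self):
        """Return (min, max) used cell index, or None when nothing is used."""
        if not self.used:
            return None
        return min(self.used), max(self.used)

    def seq_pos_range(self, seq_id):
        """Return (min, max) position stored for seq_id, or None if absent."""
        ps = [self.pos[c] for c in self.used if seq_id in self.seqs[c]]
        return (min(ps), max(ps)) if ps else None
```

Knowing the used range lets a caller skip cells that are certainly empty, and the per-sequence range answers "which positions does this sequence currently hold" without the caller scanning the table itself.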
b5503
sampling : make sure samplers return at least 1 token (#13822)

* sampling : min-p should always return at least one token

  ggml-ci

* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0

  ggml-ci
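
The guarantee named in this entry can be sketched as follows. This is a minimal illustration, not the llama.cpp sampler: `min_p_filter` is a hypothetical function, and it assumes the usual min-p rule of keeping every token whose probability is at least `p` times the largest probability. The point it shows is the fallback, where the best candidate survives even when `min_keep` is 0 and the threshold would otherwise reject everything.

```python
def min_p_filter(probs, p, min_keep=0):
    """Return indices of the kept tokens, most probable first.

    Hypothetical sketch: keeps tokens with prob >= p * max(probs),
    but never returns fewer than max(min_keep, 1) tokens.
    """
    if not probs:
        return []
    threshold = p * max(probs)
    # Candidate indices sorted by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [i for i in order if probs[i] >= threshold]
    # Fallback: guarantee at least one token even when min_keep == 0.
    floor = max(min_keep, 1)
    if len(kept) < floor:
        kept = order[:floor]
    return kept
```

With `probs = [0.5, 0.3, 0.2]` and `p = 0.5`, the threshold is 0.25 and two tokens remain. With an out-of-range `p = 2.0`, the threshold exceeds every probability, and the fallback still returns the single most probable token.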
b5502
llama : validate seq id batch input (#13809)

* llama : validate seq id batch input

  ggml-ci

* cont : fix the fix

  ggml-ci
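
The entry does not say what the validation checks. As an assumption, the sketch below treats it as a range check: every sequence id attached to a token must lie in `[0, n_seq_max)`. The function `invalid_seq_ids` and the parameter `n_seq_max` are hypothetical names chosen for illustration, and the batch is modelled as a plain list of per-token id lists.

```python
def invalid_seq_ids(batch_seq_ids, n_seq_max):
    """Return (token_index, seq_id) pairs whose seq_id is out of range.

    Hypothetical sketch: batch_seq_ids[i] lists the sequence ids
    that token i belongs to; valid ids satisfy 0 <= id < n_seq_max.
    """
    bad = []
    for token_idx, ids in enumerate(batch_seq_ids):
        for seq_id in ids:
            if not 0 <= seq_id < n_seq_max:
                bad.append((token_idx, seq_id))
    return bad
```

A caller would reject the whole batch when the returned list is non-empty, so that a bad id is reported up front instead of being used as an index later.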
b5501
server: --offline mode (#13804)

* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
b5499
cuda : avoid cuGetErrorString (#13791) ggml-ci
b5498
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)

* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support

  ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows

  ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

  This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes

  ggml-ci

* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
b5497
server: fix streaming crashes (#13786)

* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
b5495
`server`: fix format of streamed tool call deltas (diff name, fix id …
b5494
server: fix regression on streamed non-chat completion w/ stops (#13785)

* more forgiving message diffs: partial stop words aren't erased, full stops are
* Add (slow) server test for completion + stream + stop
b5493
examples : allow extracting embeddings from decoder contexts (#13797) ggml-ci