Releases · ggml-org/llama.cpp

12 Jun 10:19

7d51644

b5646 Latest

server : re-enable SWA speculative decoding (#14131)

ggml-ci

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-06-12T10:19:54Z

llama-b5646-bin-macos-arm64.zip
sha256:c799eef09cee027e8a076c7c496f0115b110aab6aa7f34fb440d938215e5b1f6
10.4 MB 2025-06-12T10:20:05Z

llama-b5646-bin-macos-x64.zip
sha256:8e50c5bf8944a0313812d278beddecd0ec8edccf735addd95166494cc363e959
25.4 MB 2025-06-12T10:20:06Z

llama-b5646-bin-ubuntu-vulkan-x64.zip
sha256:3c82c4cce30080388b789e62be7c13764ba530f324177f124beb8d53b2c93880
20 MB 2025-06-12T10:20:07Z

llama-b5646-bin-ubuntu-x64.zip
sha256:e7e2c672db3a0a0043a6fe7664fab09c3e856ddb2abd86ab04032e89af033967
12.2 MB 2025-06-12T10:20:08Z

llama-b5646-bin-win-cpu-arm64.zip
sha256:8757e07c676eb1e60a9e061b66dccf5885b7ea7bca530074a53ba363932347a4
10.7 MB 2025-06-12T10:20:09Z

llama-b5646-bin-win-cpu-x64.zip
sha256:bf4e253dcc342ce37bba260fe8309d9de5c3f59bfb084f6c144530aa354cad70
13.5 MB 2025-06-12T10:20:10Z

llama-b5646-bin-win-cuda-12.4-x64.zip
sha256:a6218dc9821f7b30a9e4a10ff23d67cc1c9f5ab6a1761f13725bc03c2e816f8c
126 MB 2025-06-12T10:20:11Z

llama-b5646-bin-win-hip-radeon-x64.zip
sha256:040865f979ebf0b4284ff8994ed7c45745891e831779d91644f4756a458dc9ba
298 MB 2025-06-12T10:20:15Z

llama-b5646-bin-win-opencl-adreno-arm64.zip
sha256:28122a412201551ff54b0203b0cbaa50a93d797b42210cf98dfb5c67361d26a0
11.1 MB 2025-06-12T10:20:24Z

Source code (zip)
2025-06-12T08:51:38Z

Source code (tar.gz)
2025-06-12T08:51:38Z
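
Each binary asset above is listed with a sha256 digest, so a downloaded archive can be checked against it before unpacking. The following is a minimal sketch of such a check using Python's standard `hashlib`; the helper names `sha256_of` and `matches_listing` are illustrative and not part of llama.cpp or its release tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a 'sha256:<hex>' value as shown in the asset list."""
    # Drop the 'sha256:' prefix if present and normalize case and whitespace.
    expected = listed.split(":", 1)[-1].strip().lower()
    return sha256_of(Path(path)) == expected
```

For example, `matches_listing("llama-b5646-bin-ubuntu-x64.zip", "sha256:e7e2c672db3a0a0043a6fe7664fab09c3e856ddb2abd86ab04032e89af033967")` would be expected to return `True` for an intact download of that asset, assuming the file sits in the current directory.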

12 Jun 10:09

github-actions

b5645

f6e1a7a

context : simplify output counting logic during decode (#14142)

* batch : remove logits_all flag

ggml-ci

* context : simplify output counting logic during decode

ggml-ci

* cont : fix comments

Assets 15

12 Jun 09:53

github-actions

b5644

c3ee46f

batch : remove logits_all flag (#14141)

ggml-ci

Assets 15

12 Jun 08:38

github-actions

b5642

9596506

kv-cache : fix split_equal handling in unified implementation (#14130)

ggml-ci

Assets 15

12 Jun 08:12

github-actions

b5641

a20b2b0

context : round n_tokens to next multiple of n_seqs when reserving (#…
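
The commit title describes rounding `n_tokens` up to a multiple of `n_seqs`. As a rough illustration of that arithmetic only (not the actual llama.cpp code, which is C++ and not shown here), the sketch below uses a hypothetical helper `round_up_to_multiple` and assumes a value that is already a multiple is left unchanged.

```python
def round_up_to_multiple(n_tokens, n_seqs):
    """Round n_tokens up to the nearest multiple of n_seqs (hypothetical helper)."""
    if n_seqs <= 0:
        raise ValueError("n_seqs must be positive")
    # Integer ceiling division, then scale back up to a multiple of n_seqs.
    return ((n_tokens + n_seqs - 1) // n_seqs) * n_seqs


print(round_up_to_multiple(10, 4))  # 12
print(round_up_to_multiple(12, 4))  # 12 (already a multiple)
```

With this rounding, the reserved token count divides evenly across the sequences.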

Assets 15

11 Jun 20:49

github-actions

b5640

2e89f76

common: fix issue with regex_escape routine on windows (#14133)

Assets 15

11 Jun 19:28

github-actions

b5639

532802f

Implement GGML_CPU_ALL_VARIANTS for ARM (#14080)

* ggml-cpu: Factor out feature detection build from x86

* ggml-cpu: Add ARM feature detection and scoring

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time
activation of features, we rely on GGML_USE_<FEAT>, which needs to be set
in cmake, instead of GGML_<FEAT>, which users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH,
rather than with individual flags.

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM

This works as on x86; however, to pass arch flags around within cmake, we
use GGML_INTERNAL_<FEAT>, as we don't have GGML_<FEAT>.

Some features are optional, so we may need to build multiple backends
per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring
function sort out which one can be used.

* ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now

The other platforms will need their own specific variants.

This also fixes the bug that the variant-building branch was always
being executed as the else-branch of GGML_NATIVE=OFF. The branch is
moved to an elseif-branch, which restores the previous behavior.
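
The commit message above describes building several CPU backend variants per arch version and letting a scoring function decide which one can be used. The sketch below illustrates that general idea only: a variant scores zero when the host lacks one of its required features, and otherwise scores higher the more features it uses. The `Variant` type, the scoring rule, the feature sets, and the name `armv8.0_1` are assumptions made for illustration; the real logic lives in the C++ ARM feature-detection code and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str            # e.g. "armv8.2_1" (naming pattern from the commit message)
    required: frozenset  # features the variant was compiled with (hypothetical sets)


def score(variant, host_features):
    """Return 0 if the host cannot run the variant, else 1 + number of features used."""
    if not variant.required <= host_features:
        return 0
    return 1 + len(variant.required)


def pick_variant(variants, host_features):
    """Pick the highest-scoring usable variant, or None if none is usable."""
    best = max(variants, key=lambda v: score(v, host_features), default=None)
    if best is None or score(best, host_features) == 0:
        return None
    return best


variants = [
    Variant("armv8.0_1", frozenset()),
    Variant("armv8.2_1", frozenset({"dotprod"})),
    Variant("armv8.2_2", frozenset({"dotprod", "fp16"})),
]
print(pick_variant(variants, {"dotprod"}).name)  # armv8.2_1
```

Under this assumed rule, a host that reports only `dotprod` skips `armv8.2_2` because `fp16` is missing, and prefers `armv8.2_1` over the baseline because it uses more of the available features.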

Assets 15

11 Jun 17:20

github-actions

b5638

d4e0d95

chore : clean up relative source dir paths (#14128)

Assets 15

11 Jun 16:22

github-actions

b5637

cc66a7f

tests : add test-tokenizers-repo (#14017)

Assets 15

11 Jun 15:27

github-actions

b5636

bd248d4

vulkan: Better thread-safety for command pools/buffers (#14116)

This change moves the command pool/buffer tracking into a vk_command_pool
structure. There are two instances per context (for compute+transfer) and
two instances per device for operations that don't go through a context.
This should prevent separate contexts from stomping on each other.

Assets 15
