Releases · ggml-org/llama.cpp

30 May 11:10

291f2b6

b5540

llama : add support for DistilBert (#13907)

* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <[email protected]>

30 May 09:10

github-actions

b5539

2c90da4

llama : use llm_build_granite for minicpm (#13911)

29 May 23:45

github-actions

b5538

ec9e030

cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890)

29 May 20:28

github-actions

b5537

e83ba3e

llama : add support for jina-reranker-v2 (#13900)

29 May 13:35

github-actions

b5535

54a2c7a

arm64: optimize q4_k_q8_k kernel with i8mm (#13886)

This PR speeds up the q4_k_q8_k GEMM kernel using the arm64 i8mm instructions.

Tested on Neoverse-N2 with a Llama 3 8B model quantized as Q4_K_M:
- 34% to 50% S_PP (prompt processing speed) uplift for all batch sizes
- 12% to 37% S_TG (text generation speed) uplift for batch sizes 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
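
The uplift ranges quoted above can be re-derived from the table. The following is a minimal sketch, not part of the PR: the helper names `uplift_percent` and `uplifts` are hypothetical, and the rows are copied by hand from the benchmark output.

```python
# Throughput rows copied from the benchmark table above:
# (batch, S_PP original, S_PP this PR, S_TG original, S_TG this PR), in tokens/s.
ROWS = [
    (1, 110.12, 147.83, 24.36, 24.28),
    (2, 121.16, 172.42, 46.36, 47.93),
    (4, 120.15, 169.75, 74.68, 84.00),
    (8, 130.97, 196.81, 91.04, 114.74),
    (16, 131.01, 196.88, 101.43, 135.79),
    (32, 130.85, 196.51, 106.97, 147.29),
]


def uplift_percent(original, new):
    """Relative throughput change of `new` over `original`, in percent."""
    return (new / original - 1.0) * 100.0


def uplifts(rows):
    """Return (batch, S_PP uplift %, S_TG uplift %) for each table row."""
    return [
        (batch, uplift_percent(pp0, pp1), uplift_percent(tg0, tg1))
        for batch, pp0, pp1, tg0, tg1 in rows
    ]


if __name__ == "__main__":
    for batch, pp, tg in uplifts(ROWS):
        print(f"B={batch:>2}  S_PP {pp:+6.1f}%  S_TG {tg:+6.1f}%")
```

With these rows, S_TG barely moves at batch sizes 1 and 2, which is consistent with the PR limiting its S_TG claim to batch size 4 and above.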

29 May 13:30

github-actions

b5534

21fcc21

cmake: Factor out CPU architecture detection (#13883)

* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function

29 May 10:44

github-actions

b5533

dd8ba93

ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Al…

29 May 10:42

github-actions

b5532

66c9206

tests : remove json.hpp from a test (#13880)

ggml-ci

29 May 07:41

github-actions

b5530

6385b84

llama : add RobertaForSequenceClassification reranker support (#13875)

29 May 07:00

github-actions

b5529

1b8fb81

ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)

* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
