
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot #12064

Merged: 5 commits merged into ggml-org:master on Feb 28, 2025

Conversation

Vithulep (Contributor)

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q2_K_q8_K vector dot on the Arm architecture. Similar SVE support was proposed in PRs #7433 and #11227.

This PR contains the SVE implementation of the vector dot product used with Q2_K quantization.
Accuracy and performance were measured by running a Q2_K-quantized mistral-7b-v01 model on Graviton 3E (Perf 21 XL).
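For orientation, the core pattern in kernels like this is an SVE int8 dot product that accumulates 4-way byte products into 32-bit lanes. Below is a minimal, self-contained sketch of that pattern using the ACLE intrinsics from arm_sve.h; it is illustrative only (function and variable names are mine), not the PR's actual kernel, which also has to unpack the 2-bit quants and apply the block scales.

#include <arm_sve.h>
#include <stdint.h>

// Illustrative only: dot product of two int8 arrays of length n.
// svdot accumulates groups of four int8 products into int32 lanes.
static int32_t sve_dot_i8(const int8_t * a, const int8_t * b, int n) {
    svint32_t acc = svdup_n_s32(0);
    const int vl = (int) svcntb();             // SVE vector length in bytes
    for (int i = 0; i < n; i += vl) {
        svbool_t pg = svwhilelt_b8(i, n);      // predicate covers the tail
        svint8_t va = svld1_s8(pg, a + i);     // inactive lanes load as zero
        svint8_t vb = svld1_s8(pg, b + i);
        acc = svdot_s32(acc, va, vb);          // zero lanes contribute nothing
    }
    return (int32_t) svaddv_s32(svptrue_b32(), acc);  // horizontal reduction
}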

Performance

With this PR, the SVE implementation is approximately 1.03x to 1.09x faster than the NEON implementation.

  • Decoding throughput (TPOT)

| Threads | NEON (original), t/s | This PR (SVE), t/s | Ratio |
|--------:|---------------------:|-------------------:|------:|
| 2       | 4.31                 | 4.67               | 1.08  |
| 4       | 8.43                 | 9.17               | 1.09  |
| 8       | 16.24                | 17.56              | 1.08  |
| 16      | 30.04                | 32.24              | 1.07  |
| 32      | 50.06                | 53.12              | 1.06  |
| 48      | 58.05                | 59.78              | 1.03  |

The command used to measure performance is:

./llama-bench  -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48
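Here -p 64 sets the prompt size in tokens and -n the number of tokens to generate; llama-bench accepts repeated flags and comma-separated lists, so -t 2,4,8,16,32,48 sweeps the thread counts shown in the table above.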

Perplexity

I ran perplexity with both the NEON (original) and the SVE (this PR) implementations; the summary is below.

| NEON (original)    | SVE (this PR)      |
|--------------------|--------------------|
| 3.1285 +/- 0.40252 | 3.1289 +/- 0.40320 |

This change does not appear to have any impact on accuracy.
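For reference, perplexity in llama.cpp is computed with the llama-perplexity tool; a hypothetical invocation (placeholder model and test-set paths, thread count chosen arbitrarily) looks like:

./llama-perplexity -m ${PATH_TO_MODEL} -f ${PATH_TO_TEST_TEXT} -t 48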

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 25, 2025

svint32_t sumi1 = svdup_n_s32(0);

for (int j = 0; j < QK_K/256; ++j) {
Member

These loops seem redundant - are they needed? Can you simplify by knowing that this will always be a single iteration?

Vithulep (Contributor, Author)

Yes, this loop always runs exactly once, and the code was already written assuming a single iteration. I can remove the loop.
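For illustration, a minimal before/after sketch of the simplification being discussed, assuming ggml's default QK_K of 256 (so QK_K/256 == 1) and with the kernel body elided:

#include <arm_sve.h>

#define QK_K 256  // ggml's default super-block size, so QK_K/256 == 1

static inline svint32_t sum_before(void) {
    svint32_t sumi1 = svdup_n_s32(0);
    for (int j = 0; j < QK_K/256; ++j) {  // trip count is the constant 1
        // ... accumulate partial sums into sumi1 ...
    }
    return sumi1;
}

static inline svint32_t sum_after(void) {
    svint32_t sumi1 = svdup_n_s32(0);
    {   // loop removed; braces retained so the body's locals stay scoped
        // ... accumulate partial sums into sumi1 ...
    }
    return sumi1;
}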

@ggerganov ggerganov merged commit 05e6f5a into ggml-org:master Feb 28, 2025
47 checks passed
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025

ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (ggml-org#12064)

* Added SVE support for Q2_K quantized models

* Use 4-space indentation in the switch cases

* Removed comment lines

* Remove the loop; retain the curly braces for better understanding of the code

* Remove the comment line added for the q3_k_q8_k kernel

---------

Co-authored-by: vithulep <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025 (same commit message as above)
mostlyuseful pushed a commit to mostlyuseful/llama.cpp that referenced this pull request May 12, 2025 (same commit message as above)
Labels: ggml (changes relating to the ggml tensor library for machine learning)