
Small (and unreliable) AVX2 ggml_vec_dot_q4_K_q8_K improvement - remove instruction dependency #2819


Merged: 1 commit merged into ggml-org:master on Aug 28, 2023

Conversation

@hydroo (Contributor) commented Aug 26, 2023

Hi,
This is my first PR. Any help is appreciated.

The change removes an instruction dependency in the AVX2 ggml_vec_dot_q4_K_q8_K kernel.
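For readers unfamiliar with the idea, here is a minimal sketch of what removing such a dependency chain looks like, written with plain AVX2/FMA intrinsics on hypothetical float arrays rather than the actual Q4_K/Q8_K integer kernel. The function names and the assumption that `n` is a multiple of 16 are illustrative only and not part of this PR:

```c
#include <immintrin.h>

// One accumulator: every fmadd depends on the previous one, forming a serial
// dependency chain whose throughput is limited by the FMA latency.
static __m256 dot_one_chain(const float *x, const float *y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i), acc);
    }
    return acc; // the caller would horizontally sum the 8 lanes
}

// Two independent accumulators: the two fmadd chains do not depend on each
// other, so the out-of-order core can overlap them and hide FMA latency
// (as long as the loads are served from cache rather than stalling on memory).
static __m256 dot_two_chains(const float *x, const float *y, int n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i),     _mm256_loadu_ps(y + i),     acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i + 8), _mm256_loadu_ps(y + i + 8), acc1);
    }
    return _mm256_add_ps(acc0, acc1); // combine the independent partial sums
}
```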

E.g. with llama-2-13b-chat.ggmlv3.q4_K_S.bin:

bin\RelWithDebInfo\main.exe -m d:\projects\ai\models\llama-2-13b-chat.ggmlv3.q4_K_S.bin -t 24 -n 128 -c 4096 --color -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a short poem on GPUs and memory bandwidth[/INST]"  

[...]
llama_print_timings:        load time =  2295.29 ms
llama_print_timings:      sample time =    15.70 ms /   128 runs   (    0.12 ms per token,  8154.42 tokens per second)
llama_print_timings: prompt eval time =  1298.96 ms /    33 tokens (   39.36 ms per token,    25.40 tokens per second)
llama_print_timings:        eval time = 17234.00 ms /   127 runs   (  135.70 ms per token,     7.37 tokens per second)
llama_print_timings:       total time = 18587.70 ms

vs before:

llama_print_timings:        load time =  1128.65 ms
llama_print_timings:      sample time =    15.84 ms /   128 runs   (    0.12 ms per token,  8083.36 tokens per second)
llama_print_timings: prompt eval time =  1334.17 ms /    33 tokens (   40.43 ms per token,    24.73 tokens per second)
llama_print_timings:        eval time = 19414.74 ms /   127 runs   (  152.87 ms per token,     6.54 tokens per second)

Measured on a 13900K with -t 24.

Edit:
While I observe an improvement of 5-10% with -t 24 on MSVC and models of ~34B params and more, @slaren and @ikawrakow could not reproduce it.
Most benchmarks showed differences within measurement error.

@hydroo changed the title from "Small AVX2 ggml_vec_dot_q4_K_q8_K change for 10% more t/s" to "Small AVX2 ggml_vec_dot_q4_K_q8_K improvement for 10% more t/s" on Aug 26, 2023
@hydroo (Contributor, Author) commented Aug 26, 2023

I'll dig a bit into the other quants and QK_K != 256.
In the meantime I'm interested in any feedback.
Thanks!

@ggerganov (Member)

Nice - let's keep this PR up for a bit and see if other people observe similar improvements.

@ghost commented Aug 26, 2023

Thanks for the clarification.

@slaren (Member) commented Aug 26, 2023

You need to merge master first before testing; the source branch is ~~very outdated~~ not anymore. However, this PR only affects x86 processors with AVX2, so you won't be able to test these changes on an Android ARM device.

@hydroo (Contributor, Author) commented Aug 26, 2023

This is partially my bad because I started the change before the GGUF merge and wanted to try it out.
I'll switch my workflow to GGUF now.

@slaren (Member) commented Aug 26, 2023

Quick test with 13900K/24 threads/7B:

Master:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 58.37 ± 0.14 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | pp 64 | 55.96 ± 0.20 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 18.02 ± 0.18 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | tg 32 | 17.32 ± 0.06 |

build: c7d92e6 (1081)

PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 57.46 ± 0.13 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | pp 64 | 55.20 ± 0.18 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 18.15 ± 0.13 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | tg 32 | 17.27 ± 0.07 |

build: ff83017 (1082)

Not sure if I am doing something wrong; maybe the difference is bigger with 13B?

@slaren (Member) commented Aug 26, 2023

Similar results with 13B:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | pp 64 | 30.77 ± 0.02 |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | tg 24 | 9.39 ± 0.04 |

build: c7d92e6 (1081)

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | pp 64 | 30.80 ± 0.05 |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | tg 24 | 9.50 ± 0.10 |

build: ff83017 (1082)

Compiler: g++ (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0

@hydroo (Contributor, Author) commented Aug 26, 2023

I'm doing more tests now. Would you mind sharing your command?
I'm using MSVC 2022 (19.35.32215.0).

@slaren (Member) commented Aug 26, 2023

These are the commands I used:

./llama-bench -m models/7B/ggml-model-Q4_K_S.gguf -m models/7B/ggml-model-Q4_K_M.gguf -n 32 -p 64 -t 24 
./llama-bench -m models/13B/ggml-model-Q4_K_S.gguf -n 24 -p 64 -t 24    

@hydroo (Contributor, Author) commented Aug 26, 2023

Where is this llama-bench from?
I'm still getting pretty reliable speedups, as quoted above.
For 7B Q4_K_S I get 11.9-12.1 t/s vs 13.3-13.6 t/s with the change.

Interestingly, your t/s is generally way higher than mine, considering we have the same CPU. That's strange.

@slaren (Member) commented Aug 26, 2023

llama-bench is an example in this repository. If you used cmake to build, you should have the executable.

The difference may easily be explained by the different compilers; MSVC is notoriously slower than GCC these days.

@slaren (Member) commented Aug 27, 2023

Results under native Windows with MSVC:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 51.41 ± 0.19 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 16.64 ± 0.03 |

build: 25423e9 (1090)

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 51.62 ± 0.11 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 16.22 ± 0.07 |

build: ff83017 (1082)

Compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30151 for x64

@ikawrakow (Contributor) commented Aug 27, 2023

I wrote the code that is being optimized in this PR. I wrote it the way it is after experimenting with what is the fastest way to do this calculation (including testing the version that is being proposed here). I tested on a Ryzen 7950X, a Ryzen 5950X, and an i9-9960X. The version that landed in the code was a tiny bit faster than what is being proposed (and still is on these 3 machines). E.g., on the Ryzen 7950X:

Master:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | pp 64 | 84.48 ± 1.11 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | tg 32 | 14.86 ± 0.02 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | pp 64 | 46.78 ± 0.38 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | tg 32 | 15.14 ± 0.01 |

PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | pp 64 | 73.44 ± 15.16 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | tg 32 | 14.76 ± 0.02 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | pp 64 | 45.78 ± 0.40 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | tg 32 | 15.08 ± 0.13 |

Update: I noticed the huge fluctuation in pp 64 with 16 threads, so here is another run with the PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | pp 64 | 84.01 ± 0.68 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | tg 32 | 14.73 ± 0.04 |

@hydroo (Contributor, Author) commented Aug 28, 2023

Running with -t 24 and boost enabled may introduce effects that we don't want to measure.
I'm doing the following benchmarks with locked clocks (no boost) and -t 8 to stay on the P-cores.
(I ran every measurement twice and merged the tables.)

codellama-13b-instruct.Q4_K_S.gguf
Before:
llama-bench -m codellama-13b-instruct.Q4_K_S.gguf -n 24 -p 64 -t 8

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 10.09 ± 0.31 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 11.41 ± 0.29 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.12 ± 0.33 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.41 ± 0.33 |

After:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 10.63 ± 0.53 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 12.29 ± 0.99 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.61 ± 0.16 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.40 ± 0.29 |

PP has ~20% variance, more than the proposed optimization.
TG has ~7%, which is borderline.
Overall, for this model, the PR doesn't look beneficial.

codellama-34b-instruct.Q4_K_S.gguf
Before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.51 ± 0.20 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.61 ± 0.36 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 2.18 ± 0.08 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 2.09 ± 0.10 |

After:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.19 ± 0.22 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.78 ± 0.24 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 1.98 ± 0.02 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 2.02 ± 0.04 |

PP is again unreliable.
TG is better by ~5-10%.

In VTune I observe ~0.5% fewer instructions retired in the function.
The generated code is hard to grok; I couldn't immediately make out major differences.

@ikawrakow, would you mind trying a larger model?
Or do you think it's too slow to be a realistic use case, and therefore we shouldn't compare those?

@ikawrakow (Contributor)

I haven't downloaded codellama-34B yet, but here is what I get with LLaMA-v1-30B:

Master:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | pp 64 | 17.62 ± 0.11 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | tg 32 | 3.15 ± 0.00 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | pp 64 | 9.68 ± 0.06 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | tg 32 | 3.25 ± 0.01 |

PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | pp 64 | 17.72 ± 0.09 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | tg 32 | 3.15 ± 0.00 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | pp 64 | 9.70 ± 0.13 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | tg 32 | 3.25 ± 0.00 |

Here one might think that the PR is a tiny bit faster for prompt processing, but if so, it is well below 1% and nowhere near 10%.

The reason one cannot achieve much is that on these CPUs the calculation is totally memory bound. On the Ryzen 7950X I get the highest TG performance with 4 threads. PP is fastest at 16 threads (but performance is only ~80% of linear scaling with the number of threads), and then goes down at 32 threads, as hyper-threading cannot really benefit from memory load stalls because the memory bandwidth is basically saturated (and the extra threads just start stepping on each other's feet when loading data from memory).

The way you have re-ordered the instructions, one might expect better performance due to the eliminated dependency, but if the load of q8h stalls, there are now more instructions to execute after q8h becomes available compared to the version on the master branch. As a result, it is not easy to reason about which version will perform better. And then the compiler comes along and reorders the instructions anyway.
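To put a rough number on the memory-bandwidth argument (a back-of-the-envelope sketch, assuming each generated token streams the entire ~3.59 GiB Q4_K_S 7B weight file once and taking the ~18 t/s TG figure from the 13900K tables above):

$$3.59\ \mathrm{GiB/token} \times 18\ \mathrm{tokens/s} \approx 65\ \mathrm{GiB/s}$$

That is already in the ballpark of what a dual-channel DDR5 desktop sustains in practice, which is consistent with TG barely moving regardless of how the dot-product instructions are ordered.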

@hydroo (Contributor, Author) commented Aug 28, 2023

Makes perfect sense.
I'm fine with closing this PR then.
Thanks for your efforts!

@ikawrakow (Contributor)

You can merge it, but please change the title/description. Neither @slaren nor I observe anywhere near the claimed 10% improvement.

@hydroo changed the title from "Small AVX2 ggml_vec_dot_q4_K_q8_K improvement for 10% more t/s" to "Small (and unreliable) AVX2 ggml_vec_dot_q4_K_q8_K improvement - remove instruction dependency" on Aug 28, 2023
@hydroo (Contributor, Author) commented Aug 28, 2023

Thanks!
I've updated the description.

@ggerganov merged commit 3af6b86 into ggml-org:master on Aug 28, 2023
@ggerganov (Member)

I also did a few quick experiments on Linux with an AMD Ryzen 9 5950X (16 cores) and can confirm @ikawrakow's observations. If I look really, really hard, I think there is a tiny bit of improvement in PP, but overall not much difference.

@wtarreau (Contributor)

It possibly changes the memory access patterns on some machines, which can have a huge effect depending on data dependency.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023