
Small (and unreliable) AVX2 ggml_vec_dot_q4_K_q8_K improvement - remove instruction dependency #2819


Merged: 1 commit merged into ggml-org:master on Aug 28, 2023

Conversation

@hydroo (Contributor) commented Aug 26, 2023

Hi,
This is my first PR. Any help is appreciated.

The change removes an instruction dependency in the AVX2 ggml_vec_dot_q4_K_q8_K kernel.
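For readers unfamiliar with the idea, here is a minimal sketch of what removing such a dependency chain looks like, written with plain AVX2/FMA intrinsics on hypothetical float arrays rather than the actual Q4_K/Q8_K integer kernel. The function names and the assumption that `n` is a multiple of 16 are illustrative only and not part of this PR:

```c
#include <immintrin.h>

// One accumulator: every fmadd depends on the previous one, forming a serial
// dependency chain whose throughput is limited by the FMA latency.
static __m256 dot_one_chain(const float *x, const float *y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i), acc);
    }
    return acc; // the caller would horizontally sum the 8 lanes
}

// Two independent accumulators: the two fmadd chains do not depend on each
// other, so the out-of-order core can overlap them and hide FMA latency
// (as long as the loads are served from cache rather than stalling on memory).
static __m256 dot_two_chains(const float *x, const float *y, int n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i),     _mm256_loadu_ps(y + i),     acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i + 8), _mm256_loadu_ps(y + i + 8), acc1);
    }
    return _mm256_add_ps(acc0, acc1); // combine the independent partial sums
}
```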

E.g. with llama-2-13b-chat.ggmlv3.q4_K_S.bin:

bin\RelWithDebInfo\main.exe -m d:\projects\ai\models\llama-2-13b-chat.ggmlv3.q4_K_S.bin -t 24 -n 128 -c 4096 --color -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a short poem on GPUs and memory bandwidth[/INST]"  

[...]
llama_print_timings:        load time =  2295.29 ms
llama_print_timings:      sample time =    15.70 ms /   128 runs   (    0.12 ms per token,  8154.42 tokens per second)
llama_print_timings: prompt eval time =  1298.96 ms /    33 tokens (   39.36 ms per token,    25.40 tokens per second)
llama_print_timings:        eval time = 17234.00 ms /   127 runs   (  135.70 ms per token,     7.37 tokens per second)
llama_print_timings:       total time = 18587.70 ms

vs before:

llama_print_timings:        load time =  1128.65 ms
llama_print_timings:      sample time =    15.84 ms /   128 runs   (    0.12 ms per token,  8083.36 tokens per second)
llama_print_timings: prompt eval time =  1334.17 ms /    33 tokens (   40.43 ms per token,    24.73 tokens per second)
llama_print_timings:        eval time = 19414.74 ms /   127 runs   (  152.87 ms per token,     6.54 tokens per second)

Measured on a 13900K with -t 24.

Edit:
While I observe an improvement of 5-10% with -t 24 on MSVC and models of ~34B params and more, @slaren and @ikawrakow could not reproduce it.
Most benchmarks showed differences within measurement error.

@hydroo changed the title from "Small AVX2 ggml_vec_dot_q4_K_q8_K change for 10% more t/s" to "Small AVX2 ggml_vec_dot_q4_K_q8_K improvement for 10% more t/s" on Aug 26, 2023
@hydroo (Contributor, Author) commented Aug 26, 2023

I'll dig a bit into the other quants and QK_K != 256.
In the meantime I'm interested in any feedback.
Thanks!

@ggerganov (Member)

Nice - let's keep this PR up for a bit and see if other people observe similar improvements.

@ghost commented Aug 26, 2023

Thanks for the clarification.

@slaren (Member) commented Aug 26, 2023

You need to merge master first before testing; the source branch is ~~very outdated~~ not anymore. However, this PR only affects x86 processors with AVX2, so you won't be able to test these changes on an Android ARM device.

@hydroo (Contributor, Author) commented Aug 26, 2023

This is partially my bad because I started the change before the GGUF merge and wanted to try it out.
I'll switch my workflow to GGUF now.

@slaren (Member) commented Aug 26, 2023

Quick test with 13900K/24 threads/7B:

Master:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 58.37 ± 0.14 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | pp 64 | 55.96 ± 0.20 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 18.02 ± 0.18 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | tg 32 | 17.32 ± 0.06 |

build: c7d92e6 (1081)

PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 57.46 ± 0.13 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | pp 64 | 55.20 ± 0.18 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 18.15 ± 0.13 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 24 | tg 32 | 17.27 ± 0.07 |

build: ff83017 (1082)

Not sure if I am doing something wrong; maybe the difference is bigger with 13B?

@slaren (Member) commented Aug 26, 2023

Similar results with 13B:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | pp 64 | 30.77 ± 0.02 |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | tg 24 | 9.39 ± 0.04 |

build: c7d92e6 (1081)

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | pp 64 | 30.80 ± 0.05 |
| LLaMA 13B mostly Q4_K - Small | 6.90 GiB | 13.02 B | CPU | 24 | tg 24 | 9.50 ± 0.10 |

build: ff83017 (1082)

Compiler: g++ (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0

@hydroo (Contributor, Author) commented Aug 26, 2023

I'm doing more tests now. Would you mind sharing your command?
I'm using MSVC 2022 (19.35.32215.0).

@slaren (Member) commented Aug 26, 2023

These are the commands I used:

./llama-bench -m models/7B/ggml-model-Q4_K_S.gguf -m models/7B/ggml-model-Q4_K_M.gguf -n 32 -p 64 -t 24 
./llama-bench -m models/13B/ggml-model-Q4_K_S.gguf -n 24 -p 64 -t 24    

@hydroo (Contributor, Author) commented Aug 26, 2023

Where is this llama-bench from?
I'm still getting pretty reliable speedups, as quoted above.
For 7B Q4_K_S I get 11.9-12.1 t/s vs 13.3-13.6 t/s with the change.

Interestingly, your t/s is generally way higher than mine, considering we have the same CPU. That's strange.

@slaren (Member) commented Aug 26, 2023

llama-bench is an example in this repository. If you used cmake to build, you should have the executable.

The difference may easily be explained by the different compilers; MSVC is notoriously slower than GCC these days.

@slaren (Member) commented Aug 27, 2023

Results under native Windows with MSVC:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 51.41 ± 0.19 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 16.64 ± 0.03 |

build: 25423e9 (1090)

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | pp 64 | 51.62 ± 0.11 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 24 | tg 32 | 16.22 ± 0.07 |

build: ff83017 (1082)

Compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30151 for x64

@ikawrakow (Contributor) commented Aug 27, 2023

I wrote the code that is being optimized in this PR. I wrote it the way it is after experimenting with what is the fastest way to do this calculation (including testing the version that is being proposed here). I tested on a Ryzen 7950X, a Ryzen 5950X, and an i9-9960X. The version that landed in the code was a tiny bit faster than what is being proposed (and still is on these 3 machines). E.g., on the Ryzen 7950X:

Master:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | pp 64 | 84.48 ± 1.11 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | tg 32 | 14.86 ± 0.02 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | pp 64 | 46.78 ± 0.38 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | tg 32 | 15.14 ± 0.01 |

PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | pp 64 | 73.44 ± 15.16 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | tg 32 | 14.76 ± 0.02 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | pp 64 | 45.78 ± 0.40 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 8 | tg 32 | 15.08 ± 0.13 |

Update: I noticed the huge fluctuation in pp 64 with 16 threads, so here is another run with the PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | pp 64 | 84.01 ± 0.68 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | 6.74 B | CPU | 16 | tg 32 | 14.73 ± 0.04 |

@hydroo (Contributor, Author) commented Aug 28, 2023

Running with -t 24 and boost enabled may introduce effects that we don't want to measure.
I'm doing the following benchmarks with locked clocks (no boost) and -t 8 to stay on the P-cores.
(I ran every measurement twice and merged the tables.)

codellama-13b-instruct.Q4_K_S.gguf
Before:
llama-bench -m codellama-13b-instruct.Q4_K_S.gguf -n 24 -p 64 -t 8

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 10.09 ± 0.31 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 11.41 ± 0.29 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.12 ± 0.33 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.41 ± 0.33 |

After:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 10.63 ± 0.53 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | pp 64 | 12.29 ± 0.99 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.61 ± 0.16 |
| LLaMA 13B mostly Q4_K - Small | 7.08 GiB | 13.02 B | CPU | 8 | tg 24 | 5.40 ± 0.29 |

PP has ~20% variance, more than the proposed optimization.
TG has ~7%, which is borderline.
Overall, for this model, the PR doesn't look beneficial.

codellama-34b-instruct.Q4_K_S.gguf
Before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.51 ± 0.20 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.61 ± 0.36 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 2.18 ± 0.08 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 2.09 ± 0.10 |

After:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.19 ± 0.22 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | pp 64 | 5.78 ± 0.24 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 1.98 ± 0.02 |
| LLaMA 34B mostly Q4_K - Small | 17.83 GiB | 33.74 B | CPU | 8 | tg 24 | 2.02 ± 0.04 |

PP is again unreliable.
TG is better by ~5-10%.

In VTune I observe ~0.5% fewer instructions retired in the function.
The generated code is hard to grok; I couldn't immediately make out major differences.

@ikawrakow, would you mind trying a larger model?
Or do you think it's too slow to be a realistic use case, and therefore we shouldn't compare those?

@ikawrakow (Contributor)

I haven't downloaded codellama-34B yet, but here is what I get with LLaMA-v1-30B:

Master:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | pp 64 | 17.62 ± 0.11 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | tg 32 | 3.15 ± 0.00 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | pp 64 | 9.68 ± 0.06 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | tg 32 | 3.25 ± 0.01 |

PR:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | pp 64 | 17.72 ± 0.09 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 16 | tg 32 | 3.15 ± 0.00 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | pp 64 | 9.70 ± 0.13 |
| LLaMA 30B mostly Q4_K - Small | 17.17 GiB | 32.53 B | CPU | 8 | tg 32 | 3.25 ± 0.00 |

Here one might think that the PR is a tiny bit faster for prompt processing, but if so, it is well below 1% and nowhere near 10%.

The reason one cannot achieve much is that on these CPUs the calculation is totally memory bound. On the Ryzen 7950X I get the highest TG performance with 4 threads. PP is fastest at 16 threads (but performance is only ~80% of linear scaling with the number of threads), and then goes down at 32 threads, as hyper-threading cannot really benefit from memory load stalls because the memory bandwidth is basically saturated (and the extra threads just start stepping on each other's feet when loading data from memory).

The way you have re-ordered the instructions, one might expect better performance due to the eliminated dependency, but if the load of q8h stalls, there are now more instructions to execute after q8h becomes available compared to the version on the master branch. As a result, it is not easy to reason about which version will perform better. And then the compiler comes along and reorders the instructions anyway.
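To put a rough number on the memory-bandwidth argument (a back-of-the-envelope sketch, assuming each generated token streams the entire ~3.59 GiB Q4_K_S 7B weight file once and taking the ~18 t/s TG figure from the 13900K tables above):

$$3.59\ \mathrm{GiB/token} \times 18\ \mathrm{tokens/s} \approx 65\ \mathrm{GiB/s}$$

That is already in the ballpark of what a dual-channel DDR5 desktop sustains in practice, which is consistent with TG barely moving regardless of how the dot-product instructions are ordered.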

@hydroo (Contributor, Author) commented Aug 28, 2023

Makes perfect sense.
I'm fine with closing this PR then.
Thanks for your efforts!

@ikawrakow (Contributor)

You can merge it, but please change the title/description. Neither @slaren nor I observe anywhere near the claimed 10% improvement.

@hydroo changed the title from "Small AVX2 ggml_vec_dot_q4_K_q8_K improvement for 10% more t/s" to "Small (and unreliable) AVX2 ggml_vec_dot_q4_K_q8_K improvement - remove instruction dependency" on Aug 28, 2023
@hydroo (Contributor, Author) commented Aug 28, 2023

Thanks!
I've updated the description.

@ggerganov merged commit 3af6b86 into ggml-org:master on Aug 28, 2023
@ggerganov (Member)

I also did a few quick experiments on Linux with an AMD Ryzen 9 5950X (16 cores) and can confirm @ikawrakow's observations. If I look really, really hard, I think there is a tiny bit of improvement in PP, but overall not much difference.

@wtarreau (Contributor)

It possibly changes the memory access patterns on some machines, which can have a huge effect depending on data dependency.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023