Small (and unreliable) AVX2 ggml_vec_dot_q4_K_q8_K improvement - remove instruction dependency #2819
Conversation
I'll dig a bit into the other quants and QK_K != 256. |
Nice - let's keep this PR up for a bit and see if other people observe similar improvements. |
Thanks for the clarification. |
You need to merge master first before testing. |
This is partially my bad because I initiated the change before the gguf merge, and wanted to try it out. |
Quick test with 13900K/24 threads/7B:
Master: build: c7d92e6 (1081)
PR: build: ff83017 (1082)
Not sure if I am doing something wrong; maybe the difference is bigger with 13B? |
Similar results with 13B:
Master: build: c7d92e6 (1081)
PR: build: ff83017 (1082)
Compiler: |
I'm doing more tests now. Would you mind sharing your command? |
These are the commands I used: ./llama-bench -m models/7B/ggml-model-Q4_K_S.gguf -m models/7B/ggml-model-Q4_K_M.gguf -n 32 -p 64 -t 24
./llama-bench -m models/13B/ggml-model-Q4_K_S.gguf -n 24 -p 64 -t 24 |
Where is this llama-bench from? Interestingly your t/s is generally way higher than mine, considering we have the same CPU. That's strange. |
The difference may easily be explained by the different compilers; MSVC is notoriously slower than GCC these days. |
Results under native Windows with MSVC (please ignore the broken UTF-8):
build: 25423e9 (1090)
build: ff83017 (1082)
Compiler: |
I wrote the code that is being optimized in this PR. I wrote it the way it is after experimenting to find the fastest way to do this calculation (including testing the version that is being proposed here). I was testing on a Ryzen 7950X, a Ryzen 5950X, and an i9-9960X. The version that landed in the code was a tiny bit faster than what is being proposed (and still is on these 3 computers). E.g., on the Ryzen 7950X:
Master:
PR:
Update: noticed the huge fluctuation in |
Running with -t 24 and boost may introduce effects that we don't want to measure.
codellama-13b-instruct.Q4_K_S.gguf
After:
The PP has ~20% variance - more than the proposed optimization.
codellama-34b-instruct.Q4_K_S.gguf
After:
PP is again unreliable. In VTune I observe ~0.5% fewer instructions retired in the function. @ikawrakow would you mind trying a larger model? |
I haven't downloaded codellama-34B yet, but here is what I get with LLaMA-v1-30B:
Master:
PR:
Here one might think that the PR is a tiny bit faster for prompt processing, but if so, it is well below 1% and nowhere near 10%. The reason one cannot achieve much is that on these CPUs the calculation is totally memory bound. On the Ryzen 7950X I get the highest TG performance with 4 threads. PP is fastest at 16 threads (but performance is only ~80% of linear scaling with the number of threads), and then goes down at 32 threads, as hyper-threading cannot really benefit from memory load stalls because the memory bandwidth is basically saturated (and the extra threads just start stepping on each other's feet when loading data from memory). The way you have re-ordered the instructions, one might expect to get better performance due to the eliminated dependency, but if loading of |
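To make the memory-bound argument concrete, here is a minimal back-of-envelope sketch in C. All figures in it are illustrative assumptions, not measurements from this thread: if generating each token has to stream essentially all weights from DRAM, the token rate is capped by bandwidth divided by weight size, no matter how the dot-product instructions are scheduled.

```c
// Back-of-envelope only: the bandwidth and model-size figures below are
// assumed for illustration, not taken from this PR.
#include <stdio.h>

int main(void) {
    const double bandwidth_gb_s = 60.0; // assumed sustained DRAM bandwidth
    const double weights_gb     = 7.3;  // assumed Q4_K weight size of a ~13B model

    // If every generated token reads all weights once, this is the ceiling.
    printf("memory-bound TG upper limit: ~%.1f tokens/s\n",
           bandwidth_gb_s / weights_gb);
    return 0;
}
```

With numbers in that ballpark the kernel spends most of its time waiting on memory, which matches the observation that adding threads beyond the bandwidth saturation point stops helping.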
Makes perfect sense. |
You can merge it, but please change the title/description. Neither @slaren nor I am observing anywhere near the 10% improvement claimed. |
Thanks! |
I also did a few quick experiments on Linux on an AMD Ryzen 9 5950X 16-core processor and can confirm @ikawrakow's observations. If I look really, really hard, I think there is a tiny bit of improvement in PP, but overall not much difference. |
It possibly changes the memory access patterns on some machines, which can have a huge effect depending on data dependency. |
Hi,
This is my first PR. Any help is appreciated.
The change reduces instruction dependency.
E.g. with llama-2-13b-chat.ggmlv3.q4_K_S.bin:
vs before:
Measured on a 13900K with -t 24.
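As context for what "removing an instruction dependency" means here, below is a generic, simplified AVX2 sketch - a hypothetical helper, not the actual ggml_vec_dot_q4_K_q8_K kernel or this PR's diff. It keeps two independent accumulators so consecutive adds do not form one long serial chain that every iteration has to wait on (saturation in _mm256_maddubs_epi16 is ignored for brevity).

```c
// Hypothetical sketch, not the real ggml kernel: dot product of unsigned and
// signed 8-bit vectors with AVX2, assuming n is a multiple of 64.
#include <immintrin.h>
#include <stdint.h>

int32_t dot_u8_i8_avx2(const uint8_t *x, const int8_t *y, int n) {
    const __m256i ones = _mm256_set1_epi16(1);
    // Two accumulators -> two independent dependency chains that the
    // out-of-order core can overlap.
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();

    for (int i = 0; i < n; i += 64) {
        __m256i p0 = _mm256_maddubs_epi16(
            _mm256_loadu_si256((const __m256i *)(x + i)),
            _mm256_loadu_si256((const __m256i *)(y + i)));
        __m256i p1 = _mm256_maddubs_epi16(
            _mm256_loadu_si256((const __m256i *)(x + i + 32)),
            _mm256_loadu_si256((const __m256i *)(y + i + 32)));
        // These two adds do not depend on each other; with a single
        // accumulator, each iteration would wait for the previous add.
        acc0 = _mm256_add_epi32(acc0, _mm256_madd_epi16(ones, p0));
        acc1 = _mm256_add_epi32(acc1, _mm256_madd_epi16(ones, p1));
    }

    // Combine the two chains once at the end, then reduce horizontally.
    __m256i acc = _mm256_add_epi32(acc0, acc1);
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sum);
}
```

Whether this pattern helps in practice depends on whether the loop is limited by the latency of the accumulate chain or, as discussed in the comments above, by memory bandwidth.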
Edit:
While I observe an improvement of 5-10% with -t 24 on MSVC with models of ~34B params and larger, @slaren and @ikawrakow could not reproduce it.
Most benchmarks showed differences within measurement error.