
Speed up Q4_K #2322

Merged: ikawrakow merged 1 commit into master from ik/cuda-q4k on Jul 23, 2023

Conversation

ikawrakow (Contributor)

  • Improves performance for LLAMA_CUDA_FORCE_DMMV=ON on older GPUs. E.g., on a GTX-1660 the TG-128 time is 34.5 ms/t with this PR vs 37.1 ms/t on master. The change is performance neutral on modern cards (RTX-4080).
  • For LLAMA_CUDA_FORCE_DMMV=OFF:
    • Recovers Q4_K performance on modern GPUs. E.g., on my RTX-4080 the TG-128 time drops to 8.4 ms/t from 9.15 ms/t and is again comparable to Q4_0 and to Q4_K with LLAMA_CUDA_FORCE_DMMV=ON.
    • On a somewhat older card (GTX-1660) the TG-128 time is reduced from 35.5 ms/t to 29.5 ms/t. This is better than the LLAMA_CUDA_FORCE_DMMV=ON performance.

I'm noticing that the LLAMA_CUDA_FORCE_DMMV=OFF CUDA version is completely broken for a k-quants block size of 64 (it does not even compile). I guess I will be fixing this in a separate PR.

Oh, I also fixed an annoying warning about g_compute_capabilities being unused, which is triggered when LLAMA_CUDA_FORCE_DMMV=ON.

@JohannesGaessler (Collaborator)

> I'm noticing that the LLAMA_CUDA_FORCE_DMMV=OFF CUDA version is completely broken for a k-quants block size of 64 (it does not even compile). I guess I will be fixing this in a separate PR.

Yes, I did not implement support for a k-quants block size of 64 because I was under the impression that it is only intended as a temporary fix anyway.

@JohannesGaessler (Collaborator) commented Jul 22, 2023

The new kernels are faster on my systems:

| GPU      | Model     | Test  | t/s master | t/s PR | Speedup |
|----------|-----------|-------|------------|--------|---------|
| RTX 3090 | 7b q4_k_s | tg128 | 102.81     | 116.34 | 1.13    |
| RTX 3090 | 7b q4_k_m | tg128 | 98.28      | 111.54 | 1.13    |
| P40      | 7b q4_k_s | tg128 | 33.11      | 36.09  | 1.09    |
| P40      | 7b q4_k_m | tg128 | 32.00      | 34.00  | 1.06    |

Edit: these numbers are for LLAMA_CUDA_FORCE_DMMV=OFF.

Comment on lines +1570 to +1576
```cpp
if (j < 2) {
    aux[0] = scales[j+0] & 0x3f3f;
    aux[1] = scales[j+2] & 0x3f3f;
} else {
    aux[0] = ((scales[j+2] >> 0) & 0x0f0f) | ((scales[j-2] & 0xc0c0) >> 2);
    aux[1] = ((scales[j+2] >> 4) & 0x0f0f) | ((scales[j-0] & 0xc0c0) >> 2);
}
```

Collaborator:
There is probably still potential for optimization here: conditional statements are very slow on GPUs, so if this could somehow be rewritten to work without a conditional statement I suspect it would be faster.
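
As a generic illustration (not code from this PR), a data-dependent branch can sometimes be replaced by computing both candidates and blending them with a mask derived from the condition:

```cpp
#include <cstdint>

// Generic sketch of a branchless select: build an all-ones/all-zeros mask from
// the condition and blend the two candidate values, so every thread in a warp
// executes the same instruction stream regardless of the condition.
__device__ __forceinline__ uint16_t select_u16(bool cond, uint16_t a, uint16_t b) {
    const uint16_t mask = (uint16_t)(-(int)cond); // 0xFFFF if cond is true, 0x0000 otherwise
    return (uint16_t)((a & mask) | (b & (uint16_t)~mask));
}
```

In practice the compiler often predicates short branches like this on its own, so whether an explicit rewrite actually helps has to be measured.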

Contributor Author:

I know branches are slow, but how do you arrange 16 scales/mins in 12 bytes such that there is no branch? One way is what is being done in the LLAMA_CUDA_FORCE_DMMV=ON case, where a thread processes quants from the 0...63 + 128...191 range or from the 64...127 + 192...255 range, which is branchless. I did try the same approach for LLAMA_CUDA_FORCE_DMMV=OFF, but it is slightly slower on my RTX-4080 (8.6 ms/t vs 8.4 ms/t as it is here), so my guess is that the memory access pattern is just as important as being branchless.

Another possibility would be to change how the 16 scales/mins are stored in the 12 bytes, but that would invalidate all Q4_K quantized models out there, and I don't think people would be happy with that. The way the scales/mins are stored seemed to work best on the CPU, and when I was developing the k-quants the CPU was still the main focus of llama.cpp.
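
For reference, the packing of the 12 scale/min bytes can be sketched with a scalar unpacking routine in the spirit of the CPU-side k-quants code (an illustrative reconstruction, not code from this PR):

```cpp
#include <stdint.h>

// Sketch of the Q4_K scale/min packing: 8 six-bit scales and 8 six-bit mins
// stored in 12 bytes q[0..11]. j is the sub-block index (0..7).
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        // sub-blocks 0..3: scales in the low 6 bits of q[0..3],
        // mins in the low 6 bits of q[4..7]
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        // sub-blocks 4..7: low 4 bits live in q[8..11] (scale = low nibble,
        // min = high nibble); the top 2 bits are taken from the otherwise
        // unused high bits of q[0..3] (for scales) and q[4..7] (for mins)
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j]     >> 6) << 4);
    }
}
```

The CUDA code above does the same unpacking, but reads the bytes as uint16_t so that two adjacent sub-blocks are handled at once.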

Collaborator:
I pushed a quick implementation of what I meant here. The performance seems to be worse than this PR though.

```cpp
for (int i = 0; i < QR4_K; ++i) {
    const int isc = bq8_offset + i;
    const uint16_t * scales = (const uint16_t *)bq4_K->scales;
    uint16_t aux[2];
```

Collaborator:

I think a comment explaining the bit magic would be useful.
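
For instance, something along these lines (a sketch of what such a comment could say, not text from the PR):

```cpp
// Q4_K packs 8 six-bit sub-block scales and 8 six-bit sub-block mins into 12 bytes.
// The bytes are read as uint16_t, so each word covers two adjacent sub-blocks.
// For the first four sub-blocks (j < 2, two sub-blocks per j) the scales sit in the
// low 6 bits of bytes 0..3 and the mins in the low 6 bits of bytes 4..7 (mask 0x3f3f).
// For the last four sub-blocks the low 4 bits come from bytes 8..11 (scale in the
// low nibble, min in the high nibble) and the top 2 bits are recovered from the
// otherwise unused high bits of bytes 0..3 (scales) and 4..7 (mins): mask 0xc0c0,
// shifted down by 2 into bit positions 4..5.
```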

@JohannesGaessler (Collaborator)

I tested LLAMA_CUDA_FORCE_DMMV=ON as well: the performance on my RTX 3090 is ~30% lower, and on my P40 it's ~10% lower (than the LLAMA_CUDA_FORCE_DMMV=OFF performance with this PR).

@ikawrakow (Contributor, Author)

> I tested LLAMA_CUDA_FORCE_DMMV=ON as well: the performance on my RTX 3090 is ~30% lower, and on my P40 it's ~10% lower (than the LLAMA_CUDA_FORCE_DMMV=OFF performance with this PR).

Did you use LLAMA_CUDA_KQUANTS_ITER=2 on the 3090?

@JohannesGaessler (Collaborator)

Actually, I forgot about that option entirely, so it should have been 2, as per the default. Setting it to 1 reduced performance on both cards.
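
(For context: LLAMA_CUDA_KQUANTS_ITER sets the K_QUANTS_PER_ITERATION compile-time constant used by the k-quants CUDA kernels; the default guard looks roughly like the following, reproduced from memory as a sketch:)

```cpp
// If the build option LLAMA_CUDA_KQUANTS_ITER does not define
// K_QUANTS_PER_ITERATION, the kernels fall back to 2; only 1 or 2 are valid.
#ifndef K_QUANTS_PER_ITERATION
#define K_QUANTS_PER_ITERATION 2
#else
static_assert(K_QUANTS_PER_ITERATION == 1 || K_QUANTS_PER_ITERATION == 2,
              "K_QUANTS_PER_ITERATION must be 1 or 2");
#endif
```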

ikawrakow merged commit d2a4366 into master on Jul 23, 2023.
ikawrakow deleted the ik/cuda-q4k branch on July 23, 2023 at 05:49.