Some more Q4_K and Q5_K speedup on CUDA #2346
Conversation
Q4_K: GTX-1660: 29.5 ms/t -> 25.6 ms/t, RTX-4080: 8.40 ms/t -> 8.25 ms/t
Q5_K: GTX-1660: 36.7 ms/t -> 35.6 ms/t, RTX-4080: 9.8 ms/t -> 9.5 ms/t
ggml-cuda.cu (outdated)

```diff
@@ -169,7 +169,7 @@ typedef struct {
 } block_q3_K;
 //static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + QK_K / 8 + K_SCALE_SIZE, "wrong q3_K block size/padding");

-#define QR4_K 2
+#define QR4_K 4
```
Why are you changing this? This is supposed to be 2 because each byte contains 2 values that represent the lower bits of the quantized value.
I changed it so I can reuse your dot product template but have 4 products per kernel invocation instead of 2. With `QR4_K = 2` we "visit" two groups of 4 quants that are 32 positions apart in each call to `vec_dot_q4_K_q8_1`. With `QR4_K = 4` we process 4 groups of 4 quants that are 16 positions apart. This gives a minor performance gain on my RTX-4080, but results in quite a significant speedup on the GTX-1660. On the RTX-4080, `Q4_K` and `Q4_0` were approximately equal before this change, but on the GTX-1660 `Q4_K` was ~15% slower than `Q4_0`, and that bothered me. After this change, `Q4_K` is very slightly faster than `Q4_0` on the RTX-4080 and very slightly slower on the GTX-1660.
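For context, a minimal sketch of the `Q4_K` nibble layout this refers to, assuming the standard layout where each byte of `qs` packs one quant in its low nibble and the quant 32 positions later in its high nibble; the helper name and signature are illustrative, not code from ggml-cuda.cu:

```cuda
// Minimal illustration (not the actual vec_dot_q4_K_q8_1 code): with QR4_K = 2,
// one 32-bit load from a Q4_K block yields two groups of four 4-bit quants that
// sit 32 positions apart, because each byte packs quant k (low nibble) and
// quant k+32 (high nibble) of the same 64-quant chunk.
__device__ __forceinline__ void unpack_q4_pair(const int * qs32, const int k,
                                               int * low_group, int * high_group) {
    const int packed = qs32[k];                  // bytes 4*k .. 4*k+3 of the chunk
    *low_group  = (packed >> 0) & 0x0F0F0F0F;    // quants 4*k    .. 4*k+3
    *high_group = (packed >> 4) & 0x0F0F0F0F;    // quants 4*k+32 .. 4*k+35
}
```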
Okay. I think this should be done somehow differently. Changing `QR4_K` in turn changes `QI4_K`, which is supposed to represent how many integers are needed to store the lower quantization bits.
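For reference, the relevant defines look roughly like this (paraphrased from ggml-cuda.cu at the time, not a verbatim quote), which is why changing `QR4_K` propagates into `QI4_K`:

```cuda
#define QK_K  256                   // quants per super-block
#define QR4_K 2                     // two 4-bit values packed per byte
#define QI4_K (QK_K / (4*QR4_K))    // = 32: number of 32-bit ints holding the low bits
// With QR4_K = 4 this drops to 16, which no longer matches the 128 bytes
// (32 ints) of 4-bit data actually stored in block_q4_K::qs.
```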
Basically this is editing a property of the data structure when I think the correct way to go about it would be to define a property of the kernel.
You say that `QR_X` is a property of the data structure and not a property of the kernel. But then this property is used in the template `mul_mat_vec_q` via `QI_X = QK_X / QR_X`, which is passed as a template parameter. So it is not really just a data property. We have `blocks_per_warp = WARP_SIZE / qi`, so that defines how we want the threads in a warp to access the data. Given that we could have arranged the bits in a completely different way, we basically arrive at the conclusion that no, `QR_X`/`QI_X` are not related to the data but just define how we want to access it (well, at least I arrive at that conclusion).

Nevertheless, I removed the change in `QR4_K` and `QR5_K` with the last commit. I arrive at the desired result by using `QI4_K/2` and `QI5_K/2` as the template parameters; I hope you like this better.
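For readers following along, here is a simplified, hedged sketch of a `mul_mat_vec_q`-style kernel to show the role `qi` plays in this exchange; the names, the functor parameter, and the launch geometry are paraphrased from memory rather than copied from ggml-cuda.cu:

```cuda
// Simplified sketch: blocks_per_warp = WARP_SIZE / qi decides how the 32 threads
// of a warp are spread over the quantized blocks, and iqs = threadIdx.x % qi picks
// which 32-bit integer of a block a given thread loads.
#define WARP_SIZE 32
#define QK8_1     32   // values per block on the quantized-activation (q8_1) side

template <int qk, int qi, typename block_q_t, typename block_q8_t, typename vec_dot_t>
__global__ void mul_mat_vec_q_sketch(const block_q_t * x, const block_q8_t * y,
                                     float * dst, const int ncols, const int nrows,
                                     vec_dot_t vec_dot) {
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    if (row >= nrows) {
        return;
    }

    const int blocks_per_row  = ncols / qk;      // quantized blocks per matrix row
    const int blocks_per_warp = WARP_SIZE / qi;  // blocks one warp covers per loop iteration

    float tmp = 0.0f;  // partial dot product accumulated by this thread

    for (int i = 0; i < blocks_per_row; i += blocks_per_warp) {
        const int ibx = row*blocks_per_row + i + threadIdx.x / qi;  // x block for this thread
        const int iby = (i + threadIdx.x / qi) * (qk / QK8_1);      // first matching y block
        const int iqs = threadIdx.x % qi;                           // int index inside the x block

        tmp += vec_dot(&x[ibx], &y[iby], iqs);
    }

    // reduce the per-thread partial sums across the warp
    for (int mask = WARP_SIZE/2; mask > 0; mask >>= 1) {
        tmp += __shfl_xor_sync(0xffffffff, tmp, mask, 32);
    }
    if (threadIdx.x == 0) {
        dst[row] = tmp;
    }
}
```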
I'm only passing the defines as template parameters because I think bundling them together with the block types and kernels would make the code more convoluted; I understand them to be tied to the data structures.

The way I intended `blocks_per_warp` to be used is to spread the blocks across the threads in a warp in such a way that each thread accesses one 32-bit integer from the quantized matrix. `QR_X` then controls how many integers from the `q8_1` blocks those first integers need to be combined with for the dot product. The point of this design is that it results in relatively good memory coalescing by default, because the threads access consecutive integers within the x blocks. Strictly speaking, `QR_X` and `QI_X` are not necessary for this since you could derive the same logic from `QK_X`, but I thought the code would be more intuitive this way.

> Nevertheless, I removed the change in QR4_K and QR5_K with the last commit. I arrive at the desired result by using QI4_K/2 and QI5_K/2 as the template parameters, hope you like this better.

I think this is acceptable if you add a small comment like "QI4_K/2 in order to process two integers per thread"; I think the ideal solution would be to extend the template, but that can still be done once it actually becomes necessary.
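A hedged sketch of what the launch site could then look like; it assumes the `mul_mat_vec_q`, `block_q4_K`, and `vec_dot_q4_K_q8_1` definitions from ggml-cuda.cu and uses placeholder grid/block dimensions, so treat it as an illustration rather than the PR's actual code:

```cuda
static void mul_mat_vec_q4_K_q8_1_cuda(const void * vx, const void * vy, float * dst,
                                       const int ncols, const int nrows, cudaStream_t stream) {
    const dim3 block_nums(1, nrows, 1);      // placeholder grid: one warp per matrix row
    const dim3 block_dims(WARP_SIZE, 1, 1);
    // QI4_K/2 in order to process two integers per thread
    mul_mat_vec_q<QK_K, QI4_K/2, block_q4_K, vec_dot_q4_K_q8_1>
        <<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols, nrows);
}
```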
ggml-cuda.cu (outdated)

```cuda
const int dot1 = __dp4a(vi2, ui2, __dp4a(vi1, ui1, 0));
const int dot2 = __dp4a(0x01010101, ui2, __dp4a(0x01010101, ui1, 0));

sumf_d += d8i * (dot1 * sc[i]); // SIMD dot product
```
This comment is on the wrong line since the actual dot product happens further up.
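For readers of the snippet above, a scalar reference of what `__dp4a` computes (illustrative only; `dp4a_ref` is a made-up helper, not part of ggml-cuda.cu): `__dp4a(a, b, c)` multiplies the four packed 8-bit lanes of `a` and `b` pairwise, sums the products, and adds `c`, so `dot2` with `0x01010101` reduces to the plain byte sum of the eight `q8_1` values.

```cuda
#include <stdint.h>

// Scalar emulation of the __dp4a intrinsic (signed variant), for illustration.
static int dp4a_ref(const int a, const int b, const int c) {
    int sum = c;
    for (int i = 0; i < 4; ++i) {
        const int8_t ai = (int8_t)(a >> (8*i));  // i-th byte of a as a signed value
        const int8_t bi = (int8_t)(b >> (8*i));  // i-th byte of b as a signed value
        sum += ai * bi;
    }
    return sum;
}
```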
@ikawrakow which k-quants do you plan to touch in the immediate future (next 2-3 days)? I want to finalize my matrix-matrix multiplication PR, and as part of that PR I'm decoupling the actual dot products from the code that loads the primitives from the blocks. It would be convenient for me to know your plans in order to minimize rebasing.
I have another PR basically ready. It fixes the
Quite minor on the modern card (RTX-4080), but fairly significant on the older card that I have available (GTX-1660). For `Q5_K` the same bit fiddling for computing the scales is added as in #2322. For both, we gain somewhat if we do 2 dot products per 32 quants in the dot product kernel. Both changes apply to `LLAMA_CUDA_FORCE_DMMV=OFF`.

`Q5_K_S` (32 layers fit on the GPU).