
Some more Q4_K and Q5_K speedup on CUDA #2346


Merged · 6 commits into master on Jul 23, 2023

Conversation

ikawrakow (Contributor)

The speedup is quite minor on the modern card (RTX-4080), but fairly significant on the older card I have available (GTX-1660). For Q5_K, the same bit fiddling for computing the scales is added as in #2322. For both quant types, we gain somewhat by doing 2 dot products per 32 quants in the dot product kernel (a rough sketch of the idea follows the tables below). Both changes apply to builds with LLAMA_CUDA_FORCE_DMMV=OFF.

• TG-128 in ms/t on GTX-1660 (6 GB VRAM, so no 13B results; for Q5_K_S, 32 layers fit on the GPU):

| Quants | Master | PR | Speedup |
|--------|--------|------|---------|
| Q4_K_S | 29.5   | 25.6 | 15.2%   |
| Q5_K_S | 42.8   | 35.5 | 20.6%   |
  • TG-128 in ms/t on RTX-4080:

| Quants       | Master | PR    | Speedup |
|--------------|--------|-------|---------|
| Q4_K_S - 7B  | 8.40   | 8.23  | 2.1%    |
| Q5_K_S - 7B  | 10.42  | 9.52  | 9.5%    |
| Q4_K_S - 13B | 14.64  | 14.38 | 1.8%    |
| Q5_K_S - 13B | 17.89  | 16.80 | 6.5%    |
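
A minimal sketch of the "2 dot products per 32 quants" idea (the function name, argument layout, and nibble unpacking below are simplified stand-ins, not the actual vec_dot_q4_K_q8_1 implementation):

```cuda
// Illustrative sketch only: each call chains two __dp4a SIMD dot products,
// so one invocation covers two groups of 4 quants instead of one.
// Requires a device target with __dp4a support (compute capability >= 6.1).
static __device__ float vec_dot_two_groups_sketch(const int * vx, const int * vy,
                                                  const float d, const float d8) {
    // vx[0] packs eight 4-bit quants; split it into two groups of four.
    const int vi1 = (vx[0] >> 0) & 0x0F0F0F0F;   // low nibbles
    const int vi2 = (vx[0] >> 4) & 0x0F0F0F0F;   // high nibbles

    // Two SIMD dot products against the matching q8_1 integers, chained
    // through the accumulator argument of __dp4a.
    int sumi = __dp4a(vi1, vy[0], 0);
    sumi     = __dp4a(vi2, vy[1], sumi);

    return d * d8 * sumi;   // apply the block scales
}
```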

Kawrakow added 4 commits July 23, 2023 09:42
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080:  9.8 ms/t ->  9.5 ms/t
ggml-cuda.cu Outdated
```diff
@@ -169,7 +169,7 @@ typedef struct {
 } block_q3_K;
 //static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + QK_K / 8 + K_SCALE_SIZE, "wrong q3_K block size/padding");
 
-#define QR4_K 2
+#define QR4_K 4
```
Collaborator

Why are you changing this? This is supposed to be 2 because each byte contains 2 values that represent the lower bits of the quantized value.

ikawrakow (Contributor Author)

I changed it so that I can reuse your dot product template but have 4 products per kernel invocation instead of 2. With QR4_K = 2 we "visit" two groups of 4 quants that are 32 positions apart in each call to vec_dot_q4_K_q8_1. With QR4_K = 4 we process 4 groups of 4 quants that are 16 positions apart. This gives a minor performance gain on my RTX-4080, but results in quite a significant speedup on the GTX-1660. On the RTX-4080, Q4_K and Q4_0 were approximately equal before this change, but on the GTX-1660 Q4_K was ~15% slower than Q4_0, and that bothered me. After this change, Q4_K is very slightly faster than Q4_0 on the RTX-4080 and very slightly slower on the GTX-1660.
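
To make the two access patterns concrete, here is a tiny host-side illustration (the indexing is simplified and the stride formula is my own shorthand, not the actual kernel index math):

```cuda
// Illustrative only: which quant positions one call touches for QR4_K = 2
// versus QR4_K = 4, with iqs = 0 standing in for the thread's starting index.
#include <cstdio>

int main() {
    const int iqs = 0;
    const int qr_values[2] = {2, 4};
    for (int k = 0; k < 2; ++k) {
        const int qr     = qr_values[k];
        const int stride = 64 / qr;   // groups are 32 positions apart for 2, 16 for 4
        printf("QR4_K = %d -> groups of 4 quants at:", qr);
        for (int i = 0; i < qr; ++i) {
            printf(" [%d..%d]", iqs + i * stride, iqs + i * stride + 3);
        }
        printf("\n");
    }
    return 0;
}
```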

Collaborator

Okay. I think this should be done differently. Changing QR4_K in turn changes QI4_K, which is supposed to represent how many integers are needed to store the lower quantization bits.

Collaborator

Basically this is editing a property of the data structure when I think the correct way to go about it would be to define a property of the kernel.

ikawrakow (Contributor Author)

You say that QR_X is a property of the data structure and not a property of the kernel. But this property is used in the mul_mat_vec_q template via QI_X = QK_X / QR_X, which is passed as a template parameter, so it is not really just a data property. We have blocks_per_warp = WARP_SIZE / qi, which defines how we want the threads in a warp to access the data. Given that we could have arranged the bits in a completely different way, we basically arrive at the conclusion that no, QR_X/QI_X are not tied to the data but just define how we want to access it (well, at least I arrive at that conclusion).

Nevertheless, I removed the change to QR4_K and QR5_K with the last commit. I arrive at the desired result by using QI4_K/2 and QI5_K/2 as the template parameters; I hope you like this better.

Collaborator

I'm only passing the defines as template parameters because I think bundling them together with the block types and kernels would make the code more convoluted; I understand them to be tied to the data structures.

The way I intended blocks_per_warp to be used is to spread the blocks across the threads in a warp in such a way that each thread accesses one 32-bit integer from the quantized matrix. QR_X then controls how many integers from the q8_1 blocks each of those integers needs to be combined with for the dot product. The point of this design is that it gives relatively good memory coalescing by default, because the threads access consecutive integers within the x blocks. Strictly speaking, QR_X and QI_X are not necessary for this since you could derive the same logic from QK_X, but I thought the code would be more intuitive this way.

> Nevertheless, I removed the change to QR4_K and QR5_K with the last commit. I arrive at the desired result by using QI4_K/2 and QI5_K/2 as the template parameters; I hope you like this better.

I think this is acceptable if you add a small comment like "QI4_K/2 in order to process two integers per thread"; I think the ideal solution would be to extend the template but that can still be done once it actually becomes necessary.
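
As a small numeric sketch of where the "two integers per thread" comes from (QI4_K is expanded from the defines as I read them for a QK_K = 256 build; this is not code from the PR):

```cuda
// Compile-time sanity sketch, assuming QK_K = 256 and the usual defines.
constexpr int QK_K      = 256;
constexpr int WARP_SIZE = 32;
constexpr int QR4_K     = 2;                     // two 4-bit quants per byte
constexpr int QI4_K     = QK_K / (4 * QR4_K);    // 32 ints of low bits per block
constexpr int qi        = QI4_K / 2;             // the halved template parameter
constexpr int blocks_per_warp = WARP_SIZE / qi;  // 2 blocks per warp -> 16 threads
                                                 // per block, each covering 2 of
                                                 // the block's 32 integers
static_assert(qi == 16 && blocks_per_warp == 2, "two integers per thread");
```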

ggml-cuda.cu Outdated
```cuda
const int dot1 = __dp4a(vi2, ui2, __dp4a(vi1, ui1, 0));
const int dot2 = __dp4a(0x01010101, ui2, __dp4a(0x01010101, ui1, 0));

sumf_d += d8i * (dot1 * sc[i]); // SIMD dot product
```
Collaborator

This comment is on the wrong line since the actual dot product happens further up.

@JohannesGaessler (Collaborator)

@ikawrakow which k-quants do you plan to touch in the immediate future (next 2-3 days)? I want to finalize my matrix-matrix multiplication PR, and as part of that PR I'm decoupling the actual dot products from the code that loads the primitives from the blocks. It would be convenient for me to know your plans in order to minimize rebasing.

@ikawrakow (Contributor Author)

> @ikawrakow which k-quants do you plan to touch in the immediate future (next 2-3 days)? I want to finalize my matrix-matrix multiplication PR, and as part of that PR I'm decoupling the actual dot products from the code that loads the primitives from the blocks. It would be convenient for me to know your plans in order to minimize rebasing.

I have another PR basically ready. It fixes the Q4_K and Q5_K dot products for a block size of 64. Then I should be done for now with the CUDA code.

ikawrakow merged commit 2f9cf97 into master on Jul 23, 2023
ikawrakow deleted the ik/cuda_q5k branch on July 23, 2023 21:19