Some more Q4_K and Q5_K speedup on CUDA #2346
Conversation
Q4_K: GTX-1660: 29.5 ms/t -> 25.6 ms/t, RTX-4080: 8.40 ms/t -> 8.25 ms/t
Q5_K: GTX-1660: 36.7 ms/t -> 35.6 ms/t, RTX-4080: 9.8 ms/t -> 9.5 ms/t
ggml-cuda.cu (outdated)

```diff
@@ -169,7 +169,7 @@ typedef struct {
 } block_q3_K;
 //static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + QK_K / 8 + K_SCALE_SIZE, "wrong q3_K block size/padding");

-#define QR4_K 2
+#define QR4_K 4
```
Why are you changing this? This is supposed to be 2 because each byte contains 2 values that represent the lower bits of the quantized value.
I changed it so I can reuse your dot product template but have 4 products per kernel invocation instead of 2. With `QR4_K = 2` we "visit" two groups of 4 quants that are 32 positions apart in each call to `vec_dot_q4_K_q8_1`. With `QR4_K = 4` we process 4 groups of 4 quants that are 16 positions apart. This gives a minor performance gain on my RTX-4080, but results in quite a significant speedup on the GTX-1660. On the RTX-4080, `Q4_K` and `Q4_0` were approximately equal before this change, but on the GTX-1660 `Q4_K` was ~15% slower than `Q4_0`, and that bothered me. After this change, `Q4_K` is very slightly faster than `Q4_0` on the RTX-4080 and very slightly slower on the GTX-1660.
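For context, a minimal sketch of the `Q4_K` nibble layout this refers to, assuming the standard layout where each byte of `qs` packs one quant in its low nibble and the quant 32 positions later in its high nibble; the helper name and signature are illustrative, not code from ggml-cuda.cu:

```cuda
// Minimal illustration (not the actual vec_dot_q4_K_q8_1 code): with QR4_K = 2,
// one 32-bit load from a Q4_K block yields two groups of four 4-bit quants that
// sit 32 positions apart, because each byte packs quant k (low nibble) and
// quant k+32 (high nibble) of the same 64-quant chunk.
__device__ __forceinline__ void unpack_q4_pair(const int * qs32, const int k,
                                               int * low_group, int * high_group) {
    const int packed = qs32[k];                  // bytes 4*k .. 4*k+3 of the chunk
    *low_group  = (packed >> 0) & 0x0F0F0F0F;    // quants 4*k    .. 4*k+3
    *high_group = (packed >> 4) & 0x0F0F0F0F;    // quants 4*k+32 .. 4*k+35
}
```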
Okay. I think this should be done somehow differently. Changing `QR4_K` in turn changes `QI4_K`, which is supposed to represent how many integers are needed to store the lower quantization bits.
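For reference, the relevant defines look roughly like this (paraphrased from ggml-cuda.cu at the time, not a verbatim quote), which is why changing `QR4_K` propagates into `QI4_K`:

```cuda
#define QK_K  256                   // quants per super-block
#define QR4_K 2                     // two 4-bit values packed per byte
#define QI4_K (QK_K / (4*QR4_K))    // = 32: number of 32-bit ints holding the low bits
// With QR4_K = 4 this drops to 16, which no longer matches the 128 bytes
// (32 ints) of 4-bit data actually stored in block_q4_K::qs.
```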
Basically this is editing a property of the data structure when I think the correct way to go about it would be to define a property of the kernel.
You say that `QR_X` is a property of the data structure and not a property of the kernel. But then this property is used in the template `mul_mat_vec_q` via `QI_X = QK_X / QR_X`, which is passed as a template parameter. So it is not really just a data property. We have `blocks_per_warp = WARP_SIZE / qi`, so that defines how we want the threads in a warp to access the data. Given that we could have arranged the bits in a completely different way, we basically arrive at the conclusion that no, `QR_X`/`QI_X` are not related to the data but just define how we want to access it (well, at least I arrive at that conclusion).

Nevertheless, I removed the change in `QR4_K` and `QR5_K` with the last commit. I arrive at the desired result by using `QI4_K/2` and `QI5_K/2` as the template parameters; I hope you like this better.
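For readers following along, here is a simplified, hedged sketch of a `mul_mat_vec_q`-style kernel to show the role `qi` plays in this exchange; the names, the functor parameter, and the launch geometry are paraphrased from memory rather than copied from ggml-cuda.cu:

```cuda
// Simplified sketch: blocks_per_warp = WARP_SIZE / qi decides how the 32 threads
// of a warp are spread over the quantized blocks, and iqs = threadIdx.x % qi picks
// which 32-bit integer of a block a given thread loads.
#define WARP_SIZE 32
#define QK8_1     32   // values per block on the quantized-activation (q8_1) side

template <int qk, int qi, typename block_q_t, typename block_q8_t, typename vec_dot_t>
__global__ void mul_mat_vec_q_sketch(const block_q_t * x, const block_q8_t * y,
                                     float * dst, const int ncols, const int nrows,
                                     vec_dot_t vec_dot) {
    const int row = blockIdx.y*blockDim.y + threadIdx.y;
    if (row >= nrows) {
        return;
    }

    const int blocks_per_row  = ncols / qk;      // quantized blocks per matrix row
    const int blocks_per_warp = WARP_SIZE / qi;  // blocks one warp covers per loop iteration

    float tmp = 0.0f;  // partial dot product accumulated by this thread

    for (int i = 0; i < blocks_per_row; i += blocks_per_warp) {
        const int ibx = row*blocks_per_row + i + threadIdx.x / qi;  // x block for this thread
        const int iby = (i + threadIdx.x / qi) * (qk / QK8_1);      // first matching y block
        const int iqs = threadIdx.x % qi;                           // int index inside the x block

        tmp += vec_dot(&x[ibx], &y[iby], iqs);
    }

    // reduce the per-thread partial sums across the warp
    for (int mask = WARP_SIZE/2; mask > 0; mask >>= 1) {
        tmp += __shfl_xor_sync(0xffffffff, tmp, mask, 32);
    }
    if (threadIdx.x == 0) {
        dst[row] = tmp;
    }
}
```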
I'm only passing the defines as template parameters because I think bundling them together with the block types and kernels would make the code more convoluted; I understand them to be tied to the data structures.

The way I intended `blocks_per_warp` to be used is to spread the blocks across the threads in a warp in such a way that each thread accesses one 32-bit integer from the quantized matrix. `QR_X` then controls how many integers from the `q8_1` blocks those first integers need to be combined with for the dot product. The point of this design is that it results in relatively good memory coalescing by default, because the threads access consecutive integers within the x blocks. Strictly speaking, `QR_X` and `QI_X` are not necessary for this since you could derive the same logic from `QK_X`, but I thought the code would be more intuitive this way.

> Nevertheless, I removed the change in QR4_K and QR5_K with the last commit. I arrive at the desired result by using QI4_K/2 and QI5_K/2 as the template parameters, hope you like this better.

I think this is acceptable if you add a small comment like "QI4_K/2 in order to process two integers per thread"; I think the ideal solution would be to extend the template, but that can still be done once it actually becomes necessary.
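A hedged sketch of what the launch site could then look like; it assumes the `mul_mat_vec_q`, `block_q4_K`, and `vec_dot_q4_K_q8_1` definitions from ggml-cuda.cu and uses placeholder grid/block dimensions, so treat it as an illustration rather than the PR's actual code:

```cuda
static void mul_mat_vec_q4_K_q8_1_cuda(const void * vx, const void * vy, float * dst,
                                       const int ncols, const int nrows, cudaStream_t stream) {
    const dim3 block_nums(1, nrows, 1);      // placeholder grid: one warp per matrix row
    const dim3 block_dims(WARP_SIZE, 1, 1);
    // QI4_K/2 in order to process two integers per thread
    mul_mat_vec_q<QK_K, QI4_K/2, block_q4_K, vec_dot_q4_K_q8_1>
        <<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols, nrows);
}
```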
ggml-cuda.cu (outdated)

```cuda
const int dot1 = __dp4a(vi2, ui2, __dp4a(vi1, ui1, 0));
const int dot2 = __dp4a(0x01010101, ui2, __dp4a(0x01010101, ui1, 0));

sumf_d += d8i * (dot1 * sc[i]); // SIMD dot product
```
This comment is on the wrong line since the actual dot product happens further up.
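For readers of the snippet above, a scalar reference of what `__dp4a` computes (illustrative only; `dp4a_ref` is a made-up helper, not part of ggml-cuda.cu): `__dp4a(a, b, c)` multiplies the four packed 8-bit lanes of `a` and `b` pairwise, sums the products, and adds `c`, so `dot2` with `0x01010101` reduces to the plain byte sum of the eight `q8_1` values.

```cuda
#include <stdint.h>

// Scalar emulation of the __dp4a intrinsic (signed variant), for illustration.
static int dp4a_ref(const int a, const int b, const int c) {
    int sum = c;
    for (int i = 0; i < 4; ++i) {
        const int8_t ai = (int8_t)(a >> (8*i));  // i-th byte of a as a signed value
        const int8_t bi = (int8_t)(b >> (8*i));  // i-th byte of b as a signed value
        sum += ai * bi;
    }
    return sum;
}
```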
@ikawrakow which k-quants do you plan to touch in the immediate future (next 2-3 days)? I want to finalize my matrix-matrix multiplication PR, and as part of that PR I'm decoupling the actual dot products from the code that loads the primitives from the blocks. It would be convenient for me to know your plans in order to minimize rebasing.
I have another PR basically ready. It fixes the
Quite minor on the modern card (RTX-4080), but fairly significant on the older card that I have available (GTX-1660). For `Q5_K` the same bit fiddling for computing the scales is added as in #2322. For both, we gain somewhat if we do 2 dot products per 32 quants in the dot product kernel. Both changes apply to `LLAMA_CUDA_FORCE_DMMV=OFF`.

`Q5_K_S` (32 layers fit on the GPU).