Speed up Q4_K #2322
Conversation
Yes, I did not implement support for a k-quant block size of 64 because I was under the impression that this is only intended as a temporary fix anyways.
The new kernels are faster on my systems:
Edit: these numbers are for
// Unpack four of the 6-bit scales/mins packed into the 12-byte scales[] array,
// viewed here as 16-bit words.
if (j < 2) {
    // First 8 bytes: each 6-bit value is stored directly in the low 6 bits of its byte.
    aux[0] = scales[j+0] & 0x3f3f;
    aux[1] = scales[j+2] & 0x3f3f;
} else {
    // Last 4 bytes only hold 4 bits per value; the missing top 2 bits live in the
    // high bits (0xc0) of the first 8 bytes and are shifted down into place.
    aux[0] = ((scales[j+2] >> 0) & 0x0f0f) | ((scales[j-2] & 0xc0c0) >> 2);
    aux[1] = ((scales[j+2] >> 4) & 0x0f0f) | ((scales[j-0] & 0xc0c0) >> 2);
}
There is probably still potential for optimization here: conditional statements are very slow on GPUs, so if this could somehow be rewritten to work without a conditional statement, I suspect it would be faster.
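One branchless possibility, as a minimal untested sketch rather than code from this PR: since j appears to range over 0..3 here, j & 1 equals j - 2 whenever j >= 2, so both packings can be computed from in-bounds loads and the result selected with a mask instead of a branch. The helper name and signature below are made up for illustration.

#include <cstdint>

// Hypothetical branchless variant of the scale/min unpacking above (untested sketch).
// Assumes j is in 0..3, as implied by the indices used in the original code.
static __device__ __forceinline__ void unpack_scales_branchless(const uint16_t * scales, int j, uint16_t * aux) {
    const uint16_t a = scales[j];        // scales[j-0] in the original
    const uint16_t b = scales[j + 2];
    const uint16_t c = scales[j & 1];    // == scales[j-2] for j >= 2; an in-bounds dummy load for j < 2

    const uint16_t lo0 = a & 0x3f3f;                                   // j < 2 path
    const uint16_t lo1 = b & 0x3f3f;
    const uint16_t hi0 = ((b >> 0) & 0x0f0f) | ((c & 0xc0c0) >> 2);    // j >= 2 path
    const uint16_t hi1 = ((b >> 4) & 0x0f0f) | ((a & 0xc0c0) >> 2);

    const uint16_t mask = (uint16_t) -(int)(j < 2);                    // 0xffff if j < 2, else 0x0000
    aux[0] = (lo0 & mask) | (hi0 & (uint16_t)~mask);
    aux[1] = (lo1 & mask) | (hi1 & (uint16_t)~mask);
}

Whether this is actually faster is not obvious: the compiler will often predicate a short if/else like the one above anyway, which would explain why the measured differences discussed below are small.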
I know branches are slow, but how do you arrange 16 scales/mins in 12 bytes such that there is no branch? One way is what is being done in the LLAMA_CUDA_FORCE_DMMV=ON case, where a thread processes quants from the 0...63 + 128...191 or 64...127 + 192...255 range, which is branchless. I did try the same approach for LLAMA_CUDA_FORCE_DMMV=OFF, but it is slightly slower on my RTX-4080 (8.6 ms/t vs 8.4 ms/t as it is here), so my guess is that the memory access pattern is just as important as being branchless. Another possibility would be to change how the 16 scales/mins are stored in the 12 bytes, but that would invalidate all Q4_K quantized models out there, and I don't think people would be happy with that. The way the scales/mins are stored seemed to work best on the CPU, and when I was developing the k-quants the CPU was still the main focus of llama.cpp.
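For readers unfamiliar with that packing: on the CPU side the 16 6-bit scales/mins of a Q4_K super-block are unpacked from the 12-byte array roughly as follows (a paraphrase from memory of the get_scale_min_k4 helper in k_quants.c; the exact code may differ slightly).

#include <stdint.h>

// bytes 0-3 : low 6 bits = scales 0-3, top 2 bits = high bits of scales 4-7
// bytes 4-7 : low 6 bits = mins   0-3, top 2 bits = high bits of mins   4-7
// bytes 8-11: low nibble = low 4 bits of scales 4-7, high nibble = low 4 bits of mins 4-7
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        *d = q[j]     & 63;
        *m = q[j + 4] & 63;
    } else {
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j    ] >> 6) << 4);
    }
}

The 16-bit trick in this PR unpacks two of these values at a time by treating the same 12 bytes as six uint16_t words.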
I pushed a quick implementation of what I meant here. The performance seems to be worse than this PR though.
for (int i = 0; i < QR4_K; ++i) {
    const int isc = bq8_offset + i;

    const uint16_t * scales = (const uint16_t *)bq4_K->scales;
    uint16_t aux[2];
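Presumably aux is then reinterpreted byte-wise to obtain the individual scale and min values, along the lines of the sketch below (an assumption about the surrounding kernel, not a quote from the PR):

    const uint8_t * sc = (const uint8_t *)aux;  // two unpacked 6-bit scales
    const uint8_t * m  = sc + 2;                // the two corresponding mins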
I think a comment explaining the bit magic would be useful.
I tested
Did you use
Actually I forgot about that option entirely, so it should have been 2 as per default. Setting it to 1 reduced performance on both cards.
This PR improves Q4_K performance with LLAMA_CUDA_FORCE_DMMV=ON on older GPUs. E.g., on GTX-1660 TG-128 time is 34.5 ms/t with this PR vs 37.1 ms/t on master. The change is performance neutral on modern cards (RTX-4080). It also improves LLAMA_CUDA_FORCE_DMMV=OFF Q4_K performance on modern GPUs. E.g., on my RTX-4080 TG-128 time drops to 8.4 ms/t from 9.15 ms/t and is again comparable to Q4_0 and to Q4_K with LLAMA_CUDA_FORCE_DMMV=ON.

I'm noticing that the LLAMA_CUDA_FORCE_DMMV=OFF CUDA version is completely broken for a k-quants block size of 64 (it does not even compile). I guess I will be fixing this in a separate PR.

Oh, also fixed an annoying warning that g_compute_capabilities is not used, which is triggered when LLAMA_CUDA_FORCE_DMMV=ON.