
Speed up Q4_K #2322

Merged: ikawrakow merged 1 commit into master from ik/cuda-q4k on Jul 23, 2023

Conversation

ikawrakow (Contributor)

  • Improves performance for LLAMA_CUDA_FORCE_DMMV=ON on older GPUs. E.g., on a GTX-1660 the TG-128 time is 34.5 ms/t with this PR vs 37.1 ms/t on master. The change is performance neutral on modern cards (RTX-4080).
  • For LLAMA_CUDA_FORCE_DMMV=OFF:
    • Recovers Q4_K performance on modern GPUs. E.g., on my RTX-4080 the TG-128 time drops to 8.4 ms/t from 9.15 ms/t and is again comparable to Q4_0 and to Q4_K with LLAMA_CUDA_FORCE_DMMV=ON.
    • On a somewhat older card (GTX-1660) the TG-128 time is reduced from 35.5 ms/t to 29.5 ms/t. This is better than the LLAMA_CUDA_FORCE_DMMV=ON performance.

I'm noticing that the LLAMA_CUDA_FORCE_DMMV=OFF CUDA version is completely broken for a k-quants block size of 64 (it does not even compile). I guess I will be fixing this in a separate PR.

Oh, I also fixed an annoying warning about g_compute_capabilities being unused, which is triggered when LLAMA_CUDA_FORCE_DMMV=ON.

@JohannesGaessler (Collaborator)

> I'm noticing that the LLAMA_CUDA_FORCE_DMMV=OFF CUDA version is completely broken for a k-quants block size of 64 (it does not even compile). I guess I will be fixing this in a separate PR.

Yes, I did not implement support for a k-quants block size of 64 because I was under the impression that it is only intended as a temporary fix anyway.

@JohannesGaessler (Collaborator) commented Jul 22, 2023

The new kernels are faster on my systems:

| GPU      | Model     | Test  | t/s master | t/s PR | Speedup |
|----------|-----------|-------|------------|--------|---------|
| RTX 3090 | 7b q4_k_s | tg128 | 102.81     | 116.34 | 1.13    |
| RTX 3090 | 7b q4_k_m | tg128 | 98.28      | 111.54 | 1.13    |
| P40      | 7b q4_k_s | tg128 | 33.11      | 36.09  | 1.09    |
| P40      | 7b q4_k_m | tg128 | 32.00      | 34.00  | 1.06    |

Edit: these numbers are for LLAMA_CUDA_FORCE_DMMV=OFF.

Comment on lines +1570 to +1576
```cpp
if (j < 2) {
    aux[0] = scales[j+0] & 0x3f3f;
    aux[1] = scales[j+2] & 0x3f3f;
} else {
    aux[0] = ((scales[j+2] >> 0) & 0x0f0f) | ((scales[j-2] & 0xc0c0) >> 2);
    aux[1] = ((scales[j+2] >> 4) & 0x0f0f) | ((scales[j-0] & 0xc0c0) >> 2);
}
```

Collaborator:
There is probably still potential for optimization here: conditional statements are very slow on GPUs, so if this could somehow be rewritten to work without a conditional statement I suspect it would be faster.
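
As a generic illustration (not code from this PR), a data-dependent branch can sometimes be replaced by computing both candidates and blending them with a mask derived from the condition:

```cpp
#include <cstdint>

// Generic sketch of a branchless select: build an all-ones/all-zeros mask from
// the condition and blend the two candidate values, so every thread in a warp
// executes the same instruction stream regardless of the condition.
__device__ __forceinline__ uint16_t select_u16(bool cond, uint16_t a, uint16_t b) {
    const uint16_t mask = (uint16_t)(-(int)cond); // 0xFFFF if cond is true, 0x0000 otherwise
    return (uint16_t)((a & mask) | (b & (uint16_t)~mask));
}
```

In practice the compiler often predicates short branches like this on its own, so whether an explicit rewrite actually helps has to be measured.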

Contributor Author:

I know branches are slow, but how do you arrange 16 scales/mins in 12 bytes such that there is no branch? One way is what is being done in the LLAMA_CUDA_FORCE_DMMV=ON case, where a thread processes quants from the 0...63 + 128...191 range or from the 64...127 + 192...255 range, which is branchless. I did try the same approach for LLAMA_CUDA_FORCE_DMMV=OFF, but it is slightly slower on my RTX-4080 (8.6 ms/t vs 8.4 ms/t as it is here), so my guess is that the memory access pattern is just as important as being branchless.

Another possibility would be to change how the 16 scales/mins are stored in the 12 bytes, but that would invalidate all Q4_K quantized models out there, and I don't think people would be happy with that. The way the scales/mins are stored seemed to work best on the CPU, and when I was developing the k-quants the CPU was still the main focus of llama.cpp.
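
For reference, the packing of the 12 scale/min bytes can be sketched with a scalar unpacking routine in the spirit of the CPU-side k-quants code (an illustrative reconstruction, not code from this PR):

```cpp
#include <stdint.h>

// Sketch of the Q4_K scale/min packing: 8 six-bit scales and 8 six-bit mins
// stored in 12 bytes q[0..11]. j is the sub-block index (0..7).
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        // sub-blocks 0..3: scales in the low 6 bits of q[0..3],
        // mins in the low 6 bits of q[4..7]
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        // sub-blocks 4..7: low 4 bits live in q[8..11] (scale = low nibble,
        // min = high nibble); the top 2 bits are taken from the otherwise
        // unused high bits of q[0..3] (for scales) and q[4..7] (for mins)
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j]     >> 6) << 4);
    }
}
```

The CUDA code above does the same unpacking, but reads the bytes as uint16_t so that two adjacent sub-blocks are handled at once.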

Collaborator:
I pushed a quick implementation of what I meant here. The performance seems to be worse than this PR though.

```cpp
for (int i = 0; i < QR4_K; ++i) {
    const int isc = bq8_offset + i;
    const uint16_t * scales = (const uint16_t *)bq4_K->scales;
    uint16_t aux[2];
```

Collaborator:

I think a comment explaining the bit magic would be useful.
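
For instance, something along these lines (a sketch of what such a comment could say, not text from the PR):

```cpp
// Q4_K packs 8 six-bit sub-block scales and 8 six-bit sub-block mins into 12 bytes.
// The bytes are read as uint16_t, so each word covers two adjacent sub-blocks.
// For the first four sub-blocks (j < 2, two sub-blocks per j) the scales sit in the
// low 6 bits of bytes 0..3 and the mins in the low 6 bits of bytes 4..7 (mask 0x3f3f).
// For the last four sub-blocks the low 4 bits come from bytes 8..11 (scale in the
// low nibble, min in the high nibble) and the top 2 bits are recovered from the
// otherwise unused high bits of bytes 0..3 (scales) and 4..7 (mins): mask 0xc0c0,
// shifted down by 2 into bit positions 4..5.
```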

@JohannesGaessler (Collaborator)

I tested LLAMA_CUDA_FORCE_DMMV=ON as well: the performance on my RTX 3090 is ~30% lower, and on my P40 it's ~10% lower (than the LLAMA_CUDA_FORCE_DMMV=OFF performance with this PR).

@ikawrakow (Contributor, Author)

> I tested LLAMA_CUDA_FORCE_DMMV=ON as well: the performance on my RTX 3090 is ~30% lower, and on my P40 it's ~10% lower (than the LLAMA_CUDA_FORCE_DMMV=OFF performance with this PR).

Did you use LLAMA_CUDA_KQUANTS_ITER=2 on the 3090?

@JohannesGaessler (Collaborator)

Actually, I forgot about that option entirely, so it should have been 2, as per the default. Setting it to 1 reduced performance on both cards.
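
(For context: LLAMA_CUDA_KQUANTS_ITER sets the K_QUANTS_PER_ITERATION compile-time constant used by the k-quants CUDA kernels; the default guard looks roughly like the following, reproduced from memory as a sketch:)

```cpp
// If the build option LLAMA_CUDA_KQUANTS_ITER does not define
// K_QUANTS_PER_ITERATION, the kernels fall back to 2; only 1 or 2 are valid.
#ifndef K_QUANTS_PER_ITERATION
#define K_QUANTS_PER_ITERATION 2
#else
static_assert(K_QUANTS_PER_ITERATION == 1 || K_QUANTS_PER_ITERATION == 2,
              "K_QUANTS_PER_ITERATION must be 1 or 2");
#endif
```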

ikawrakow merged commit d2a4366 into master on Jul 23, 2023.
ikawrakow deleted the ik/cuda-q4k branch on July 23, 2023 at 05:49.