CUDA: faster k-quant mul_mat_q kernels #2525

Merged

Conversation

JohannesGaessler
Collaborator

This PR adds faster mul_mat_q kernels for k-quants. The new kernels are optimized for compute (the prompt processing bottleneck) rather than memory bandwidth (the token generation bottleneck). The approach is essentially the same as in #2483: change the order in which the data is iterated to reduce the number of operations, and move as much computation as possible into the data loading, which is executed only once per 32 computations (a sketch of this idea follows the results below). Unfortunately the latter didn't quite work out for assembling q3_K upon loading due to shared memory limits. This is the current performance:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|----------|-----------|------|-----------:|-------:|--------:|
| RTX 3090 | 7b q2_k   | pp   |        746 |   1445 |    1.94 |
| RTX 3090 | 7b q3_k_s | pp   |        579 |    937 |    1.62 |
| RTX 3090 | 7b q4_k_s | pp   |        960 |   1696 |    1.77 |
| RTX 3090 | 7b q5_k_s | pp   |        573 |   1453 |    2.54 |
| RTX 3090 | 7b q6_k   | pp   |        694 |   1408 |    2.03 |
| P40      | 7b q2_k   | pp   |        240 |    626 |    2.61 |
| P40      | 7b q3_k_s | pp   |        205 |    432 |    2.11 |
| P40      | 7b q4_k_s | pp   |        240 |    772 |    3.22 |
| P40      | 7b q5_k_s | pp   |        210 |    474 |    2.26 |
| P40      | 7b q6_k   | pp   |        249 |    680 |    2.73 |

For reference, the speed of cuBLAS is ~1500 t/s on an RTX 3090 and ~500 t/s on a P40.
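For anyone curious what "moving computation into the data loading" looks like in practice, here is a minimal sketch of the idea. It is not the actual mul_mat_q code: the `quant_block` layout, `QK`, `WARP_COLS`, and the launch geometry are assumptions made up for the example. Each block of quantized data is unpacked into shared memory once, and the unpacked values are then reused by ~32 cheap dot products:

```cuda
// Minimal sketch of the optimization idea, NOT the actual mul_mat_q kernels.
// quant_block, QK, and WARP_COLS are illustrative assumptions.
#include <cuda_fp16.h>
#include <stdint.h>

#define QK        32  // quantized values per block
#define WARP_COLS 32  // columns per tile -> each unpack is amortized ~32x

struct quant_block {       // stand-in for a k-quant block
    half    d;             // block scale
    uint8_t qs[QK/2];      // 4-bit quants, two per byte
};

// dst[col*nrows + row] = dot(row of quantized x, col of float y)
// launch: grid (nrows, ceil(ncols_y/WARP_COLS)), block WARP_COLS threads
__global__ void mul_mat_q_sketch(const quant_block *x, const float *y,
                                 float *dst, int nblocks_per_row, int ncols_y) {
    __shared__ int   tile_q[QK]; // unpacked quants, built once per block of x
    __shared__ float tile_d;     // scale converted to float once

    const int row = blockIdx.x;
    const int col = blockIdx.y*WARP_COLS + threadIdx.x;
    float sum = 0.0f;

    for (int kb = 0; kb < nblocks_per_row; ++kb) {
        const quant_block *b = &x[row*nblocks_per_row + kb];

        // load phase: the expensive unpacking happens ONCE here and is
        // then reused by all WARP_COLS threads in the compute phase
        if (threadIdx.x == 0) tile_d = __half2float(b->d);
        if (threadIdx.x < QK/2) {
            const uint8_t q = b->qs[threadIdx.x];
            tile_q[2*threadIdx.x + 0] = (q & 0x0F) - 8; // low nibble
            tile_q[2*threadIdx.x + 1] = (q >>   4) - 8; // high nibble
        }
        __syncthreads();

        // compute phase: plain multiply-adds on the preprocessed data
        if (col < ncols_y) {
            float s = 0.0f;
            for (int k = 0; k < QK; ++k)
                s += tile_q[k]*y[col*nblocks_per_row*QK + kb*QK + k];
            sum += tile_d*s;
        }
        __syncthreads();
    }
    if (col < ncols_y) dst[col*gridDim.x + row] = sum;
}
```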

@Loufe

Loufe commented Aug 5, 2023

You're on fire lately, Johannes!

Quick question: your quoted 1500 t/s on a 3090 etc. with cuBLAS are with which quantization, for a fair comparison?

On another note... As you mention, there seems to be a difference in optimization targets here: prompt processing vs. token generation. Something missing from the simple t/s metric with all these PRs is the impact on prompt processing. I imagine a lot of testing involves a small prompt, so generation t/s is the only important metric. I think prompt processing time (s/t) would be a great extra column to see. I know I tend to get into high token counts for my prompts, personally.

@JohannesGaessler
Collaborator Author

> Quick question: your quoted 1500 t/s on a 3090 etc. with cuBLAS are with which quantization, for a fair comparison?

It doesn't matter; it's essentially the same speed for every quantization type, since the entire matrix is dequantized only once and the computations are then done entirely in 32-bit floating point arithmetic.
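As a hedged illustration of that path (not llama.cpp's actual code; the `block_q8` format and the `dequantize_q8` kernel are toy stand-ins made up for the example): the quantized weights are expanded to fp32 once, and everything after that is a quantization-agnostic SGEMM, so the dominant cost is identical for all types.

```cuda
// Sketch of the dequantize-then-SGEMM path, assuming a toy 8-bit format.
#include <cublas_v2.h>
#include <stdint.h>

struct block_q8 { float d; int8_t qs[32]; }; // illustrative block format

__global__ void dequantize_q8(const block_q8 *x, float *y, int nblocks) {
    const int ib = blockIdx.x; // one 32-thread block per quant block
    if (ib < nblocks)
        y[ib*32 + threadIdx.x] = x[ib].d * x[ib].qs[threadIdx.x];
}

// dst = x^T * y with x (k x m) and y (k x n) in column-major order;
// assumes m*k is a multiple of 32 to keep the launch math simple
void mul_mat_via_cublas(cublasHandle_t handle, const block_q8 *x_q,
                        float *x_f32, const float *y, float *dst,
                        int m, int n, int k, cudaStream_t stream) {
    // 1. dequantize the whole matrix once; this is the only step whose
    //    cost depends on the quantization type, and it runs once per matrix
    dequantize_q8<<<(m*k)/32, 32, 0, stream>>>(x_q, x_f32, (m*k)/32);

    // 2. the dominant cost is a quantization-agnostic fp32 GEMM
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                &alpha, x_f32, k, y, k, &beta, dst, m);
}
```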

@slaren
Member

slaren commented Aug 5, 2023

3090 Ti / WSL2

| Model     | pp t/s |
|-----------|-------:|
| 7b q2_k   |   1404 |
| 7b q3_k   |   1473 |
| 7b q4_k_m |   1521 |
| 7b q5_k_m |   1372 |
| 7b q6_k_m |   1350 |

7b cuBLAS is ~1460 t/s

Btw, there is quite a bit of noise between measurements. These numbers were obtained with perplexity on wiki.test.103 (the first 103 lines / 6144 tokens). It would be good to have a standardized way to test performance.

```cuda
    return dm4f.x*sumf_d - dm4f.y*sumf_m;

#else
    return 0.0f; // only to satisfy the compiler
```
Member

Unrelated, but maybe an assert(false) in here would be good, to make sure that these functions aren't used on incompatible hardware.
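A sketch of what that suggestion could look like in the fallback branch (the exact guard and placement in the real code may differ):

```cuda
// requires <cassert>; assert() is supported in device code
#else
    assert(false); // this path should be unreachable on supported hardware
    return 0.0f;   // only to satisfy the compiler
#endif
```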

@JohannesGaessler JohannesGaessler merged commit f514d1b into ggml-org:master Aug 5, 2023
@JohannesGaessler
Collaborator Author

> On another note... As you mention, there seems to be a difference in optimization targets here: prompt processing vs. token generation. Something missing from the simple t/s metric with all these PRs is the impact on prompt processing. I imagine a lot of testing involves a small prompt, so generation t/s is the only important metric. I think prompt processing time (s/t) would be a great extra column to see. I know I tend to get into high token counts for my prompts, personally.

@Loufe I forgot to say: for most kernels the distinction does not matter, because they are a) I/O bound anyway and b) only take up a very small percentage of the total runtime.
