CUDA: MMQ support for iq4_nl, iq4_xs #8278
Conversation
Funny to see that on an RTX 4090, a higher microbatch size doesn't mean higher speed.
This has nothing to do with MMQ though; the MMQ runtime still goes down by ~3% if you increase the batch size from 512 to 2048. The problem is instead inefficient masking in the FlashAttention kernel where larger batch sizes lead to iteration over more values that are masked out anyways. cuBLAS has the same problem but in return suffers less from dequantization overhead at large batch sizes.
@JohannesGaessler llama-bench is broken now
@JohannesGaessler this broke the server as well
@nitinrathi check #8311
What commit are you on? Line 2189 of …
master is right now on 148ec97
Okay, but as I said: there is no …
Actually, before you try …
@JohannesGaessler My apologies, I am very sorry. Everything works fine after compiling with LLAMA_NO_CACHE.
Thank you for the performance improvements.
@Green-Sky #8311 also works great.
This PR adds MMQ support for iq4_nl and iq4_xs. The data is loaded, converted to 8 bit, and written to shared memory. Because this is the same strategy as for q5_0, the same code can be reused except for the part that loads the data.
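A minimal sketch of that load step, assuming the iq4_nl block layout and non-linear codebook as defined in ggml; the helper name `unpack_iq4_nl_to_q8` is made up for illustration and is not the kernel code from this PR:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define QK4_NL 32

// Non-linear 4-bit codebook used by iq4_nl/iq4_xs (values as defined in ggml).
static __constant__ int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

// iq4_nl block: one fp16 scale plus 32 packed 4-bit codebook indices.
typedef struct {
    half    d;
    uint8_t qs[QK4_NL/2];
} block_iq4_nl;

// Expand 8 packed 4-bit indices (one 32-bit load) into two ints holding
// 4 int8 values each by mapping every nibble through the codebook.
// The resulting 8-bit values can then be written to shared memory and fed
// through the same tile pipeline that q5_0 already uses.
static __device__ __forceinline__ void unpack_iq4_nl_to_q8(
        const uint32_t packed, int32_t & v_low, int32_t & v_high) {
    v_low  = 0;
    v_high = 0;
#pragma unroll
    for (int i = 0; i < 4; ++i) {
        const uint32_t b = (packed >> (8*i)) & 0xFF;
        v_low  |= (uint32_t)(uint8_t) kvalues_iq4nl[b & 0x0F] << (8*i); // low nibbles
        v_high |= (uint32_t)(uint8_t) kvalues_iq4nl[b >>   4] << (8*i); // high nibbles
    }
}
```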
The other iq data types run into shared memory limits with the current MMQ code; supporting them will need a refactor that makes the tile size in the k direction a configurable parameter. Because Ampere/Ada Lovelace consumer cards have 50% more shared memory than Turing, this presumably also means that optimal performance will require separate template instances for Turing and Ampere+. A100s/H100s have even more shared memory, so in principle they would need their own configurations as well, but I am not interested in working on that hardware since I will not be able to afford it anyway.
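Purely as an illustration of that refactor, here is a sketch of what a tile configuration with a configurable k size could look like; the struct name, tile sizes, and shared-memory figures are assumptions, not the values used by the actual MMQ code:

```cuda
#include <cstdint>

// Illustrative tile configuration with the k direction exposed as a template
// parameter. The numbers below are placeholders, not llama.cpp's real tile sizes.
template <int mmq_x, int mmq_y, int mmq_k>
struct mmq_tile_config {
    static constexpr int x = mmq_x; // tile size in the n (batch) direction
    static constexpr int y = mmq_y; // tile size in the m (weight rows) direction
    static constexpr int k = mmq_k; // tile size in the k direction

    // Rough shared-memory footprint: int8 tiles for both operands plus scales.
    static constexpr int smem_bytes =
        (x + y)*k*(int)sizeof(int8_t) + (x + y)*(int)sizeof(float);
};

// Turing has less shared memory per SM than Ampere/Ada consumer cards, so each
// generation would presumably get its own template instance with a different k.
using mmq_config_turing = mmq_tile_config<64, 64, 32>;
using mmq_config_ampere = mmq_tile_config<64, 64, 64>;

static_assert(mmq_config_turing::smem_bytes < 64*1024, "must fit Turing shared memory");
static_assert(mmq_config_ampere::smem_bytes < 96*1024, "must fit Ampere shared memory");
```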
Since I am already working on the MMQ code, I also replaced instances of get_int_from_int8 with refactored and simplified variants that accept void pointers.
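For reference, a sketch of what such a void-pointer load helper can look like; the function name `get_int_from_ptr` is hypothetical and the exact variants in the PR may differ:

```cuda
#include <cstdint>

// Load the i32-th 32-bit word from a quantized block regardless of its element
// type. Taking const void * avoids a separate overload per pointer type
// (int8_t, uint8_t, ...). Assumes at least 2-byte alignment of the source data.
static __device__ __forceinline__ int get_int_from_ptr(const void * x, const int i32) {
    const uint16_t * x16 = (const uint16_t *) x;
    uint32_t ret = x16[2*i32 + 0];
    ret |= (uint32_t) x16[2*i32 + 1] << 16;
    return (int) ret;
}
```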
Performance