Metal: faster Q4_0 and Q4_1 matrix x vector kernels #2212
Conversation
Results on M1 Pro:
Btw, I don't see the suggested changes in #2188 - did you submit the changes for review or are they still "pending"? Here is how the PR looks on my end:

@ggerganov This is interesting. Below is what I see in my browser. I now see it has a "Pending" label on the comments. I did start by using "Add a single comment", but it then somehow got converted to a Review that I did not submit?

Yup, most likely that's the case. I usually get confused the other way around - click on "Add single comment" when I actually intend to do "Start review". But either way - not a great UX :)
I just realized something and I am not sure how we haven't spotted this yet. In the host code we have:

```objc
[encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
[encoder setBuffer:id_src1 offset:offs_src1 atIndex:1];
[encoder setBuffer:id_dst offset:offs_dst atIndex:2];
[encoder setBytes:&ne00 length:sizeof(ne00) atIndex:3];
[encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
[encoder setBytes:&nb00 length:sizeof(nb00) atIndex:5];
[encoder setBytes:&nb01 length:sizeof(nb01) atIndex:6];
[encoder setBytes:&nb02 length:sizeof(nb02) atIndex:7];
[encoder setBytes:&ne10 length:sizeof(ne10) atIndex:8];
[encoder setBytes:&ne11 length:sizeof(ne11) atIndex:9];
[encoder setBytes:&nb10 length:sizeof(nb10) atIndex:10];
[encoder setBytes:&nb11 length:sizeof(nb11) atIndex:11];
[encoder setBytes:&nb12 length:sizeof(nb12) atIndex:12];
[encoder setBytes:&ne0 length:sizeof(ne0) atIndex:13];
[encoder setBytes:&ne1 length:sizeof(ne1) atIndex:14];
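// NOTE (added for clarity): each atIndex:n above is the buffer-argument slot the
// kernel reads from, so these indices have to line up with the argument order
// (or explicit [[buffer(n)]] attributes) in the kernel signature.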
if (src0t == GGML_TYPE_Q4_0) {
[encoder dispatchThreadgroups:MTLSizeMake(ne01 / 8+((ne01 % 8) & 0x01), ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
}
else if (src0t == GGML_TYPE_Q4_1) {
// ... etc
```

These are the arguments we pass to the Metal shaders / kernels, and they should match the definitions in the kernel code. However, the number of arguments in the kernels does not match, and neither does their order:

```metal
kernel void kernel_mul_mat_q4_0_f32(
device const void * src0,
device const float * src1,
device float * dst,
constant int64_t & ne00,
constant int64_t & ne10,
constant int64_t & ne0,
constant int64_t & ne01[[buffer(4)]],
uint2 tgpig[[threadgroup_position_in_grid]],
uint tiisg[[thread_index_in_simdgroup]],
uint sgitg[[simdgroup_index_in_threadgroup]]) {
const int nb = ne00/QK4_0;
// ... etc
```

This would put the wrong values into some of the kernel arguments. Is my understanding wrong, or have we had a problem for quite some time that we somehow haven't noticed?
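For illustration, here is a minimal sketch (not code from this repository) of how the arguments the kernel actually uses could be pinned to the host-side indices with explicit `[[buffer(n)]]` attributes, so that removing unused arguments cannot silently shift the remaining ones; the argument selection and indices below simply mirror the host calls quoted above:

```metal
// Sketch only: every argument the kernel still uses is bound explicitly to the
// index the host passes it at (0..14 above), so dropping unused arguments
// cannot shift the remaining ones.
kernel void kernel_mul_mat_q4_0_f32(
        device const void  * src0 [[buffer(0)]],
        device const float * src1 [[buffer(1)]],
        device       float * dst  [[buffer(2)]],
        constant   int64_t & ne00 [[buffer(3)]],
        constant   int64_t & ne01 [[buffer(4)]],
        constant   int64_t & ne10 [[buffer(8)]],
        constant   int64_t & ne0  [[buffer(13)]],
        uint2 tgpig [[threadgroup_position_in_grid]],
        uint  tiisg [[thread_index_in_simdgroup]],
        uint  sgitg [[simdgroup_index_in_threadgroup]]) {
    // ... body unchanged ...
}
```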
@ikawrakow I saw the PR the same way as ggerganov, so I thought that by "suggested changes" you were referring to removing the old q4_0 kernel. Bad UX! @ggerganov I did notice that the arguments in the kernel don't match, but I was not sure if that's intended...
This was me throwing out unused arguments to the kernels (at some point I noticed that having fewer arguments passed to the kernels slightly improved performance). But it looks like the corresponding change in the host code is missing.
M1 Max 32c:

That's a 7% improvement.
Updated my results for Q4_1 as well - ~25% speed-up indeed 🦙

We can merge this now, or wait to see if we get some M1 / M2 Ultra reports. We can fix the arg stuff in a separate PR.
@lshzh-ww Sorry about the confusion. I fell victim to the brilliant UX. I thought I had posted the comments as they appeared in my browser, but I had not. It must have looked really dumb, me talking about possible additional speedups as per my comments in the tables I posted in your PR, while nobody but me could see those suggestions.
This PR continues along the lines of #2188 and adds two improvements:

- Each thread in a SIMD group now processes a full Q4_0 block (a rough sketch of the idea follows below).
- The same approach is applied to Q4_1.

This leads to a 7-25% speedup compared to master on M2 Max with a 30-core GPU (see table).

Token generation time in milliseconds per token on M2 Max with 30-core GPU:
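As a rough illustration of the first point, here is a minimal sketch (not the actual kernel from this PR; the helper name `dot_q4_0_row` is made up, the block layout is ggml's `block_q4_0`, and a 32-lane Apple SIMD group is assumed) of each lane handling one whole block per pass over the row:

```metal
#include <metal_stdlib>
using namespace metal;

// One Q4_0 block: 32 weights stored as 4-bit quants plus an fp16 scale.
typedef struct {
    half    d;       // block scale
    uint8_t qs[16];  // low nibble of qs[j] = element j, high nibble = element j+16
} block_q4_0;

// Partial dot product of one quantized row with the f32 vector: lane `tiisg`
// of the SIMD group handles blocks tiisg, tiisg+32, tiisg+64, ...
static float dot_q4_0_row(device const block_q4_0 * x,  // quantized row of src0
                          device const float      * y,  // f32 vector (src1)
                          int  nb,                       // blocks per row = ne00/QK4_0
                          uint tiisg) {                  // lane index in the SIMD group
    float sumf = 0.0f;
    for (int ib = tiisg; ib < nb; ib += 32) {            // one full block per lane per iteration
        device const block_q4_0 * b  = x + ib;
        device const float      * yb = y + 32*ib;
        float acc = 0.0f;
        for (int j = 0; j < 16; ++j) {
            const int q = b->qs[j];
            acc += yb[j     ] * (float)((q & 0x0F) - 8);
            acc += yb[j + 16] * (float)((q >>   4) - 8);
        }
        sumf += (float) b->d * acc;
    }
    // reduce the 32 per-lane partial sums across the SIMD group
    return simd_sum(sumf);
}
```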
Some interesting observations:

- The check `if (tiisg == 0 && ((r0 * N_SIMDGROUP + sgitg) * N_DST + row) < ne01)` vs `if (tiisg == 0)` (as it was originally), which is needed to avoid having two separate kernels, leads to a measurable performance degradation (see the sketch after this list).
- For Q4_0, prefetching makes only a small difference: at 7B I measure 19.9 ms/token without prefetch vs 19.8 ms/token with prefetch; at 65B it is 149 ms/token versus 147 ms/token without/with prefetch. Hence, most of the gain in speed comes from letting each thread in a SIMD group process a full block.
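To make the first observation concrete, here is a small hedged sketch of the guarded write-out; the condition is the one quoted above, while the helper `write_row`, its parameters, and the example values of `N_SIMDGROUP` and `N_DST` are made up for illustration:

```metal
#include <metal_stdlib>
using namespace metal;

#define N_SIMDGROUP 2   // SIMD groups per threadgroup (example value)
#define N_DST       4   // rows accumulated per SIMD group (example value)

// Lane 0 of the SIMD group writes the reduced result for one destination row.
// The extra `< ne01` bound allows ne01 that is not a multiple of
// N_SIMDGROUP * N_DST, at the cost of one comparison per written row.
static void write_row(device float * dst, float sum_all, int64_t ne01,
                      int r0, int sgitg, int row, uint tiisg) {
    if (tiisg == 0 && ((r0 * N_SIMDGROUP + sgitg) * N_DST + row) < ne01) {
        dst[(r0 * N_SIMDGROUP + sgitg) * N_DST + row] = sum_all;
    }
}
```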