Skip to content

Metal: faster Q4_0 and Q4_1 matrix x vector kernels #2212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 14, 2023

Conversation

ikawrakow
Copy link
Contributor

This PR continues along the lines of #2188 and adds two improvements:

  • PR metal: new q4_0 mat-vec mul kernel  #2188 was merged without the suggested changes that lead to additional 3-5% speed up. This PR adds this additional speedup for Q4_0.
  • Adding a very similar kernel for matrix vector multiplications for Q4_1 that leads to a 7-25% speedup compared to master on M2 Max with a 30-core GPU (see table).

Token generation time in milliseconds per token on M2 Max with 30-core GPU using

./main -m model -p "I believe the meaning of life is" -c 512 -s 1234 -n 128 -ngl 1 --no-mmap
Model Q4_0 master Q4_0 PR Q4_0 speed up Q4_1 master Q4_1 PR Q4_1 speed up
7B 21.1 19.8 6.6% 22.1 20.7 6.8%
13B 35.5 33.8 5.0% 39.4 34.8 13.2%
33B 83.0 80.5 3.1% 99.8 85.4 16.9%
65B 153 147 4.1% 194 156 24.4%

Some interesting observations:

  • The addition check (if (tiisg == 0 && ((r0 * N_SIMDGROUP + sgitg) * N_DST + row) < ne01) vs if (tiisg == 0) originally) needed to avoid having two separate kernels leads to a measurable performance degradation
  • At least on my M2 Max the prefetch of the next quantized block, which was assumed to be the major factor for the metal: new q4_0 mat-vec mul kernel  #2188 speed up, is a very minor effect. For instance, for Q4_0 at 7B I measure 19.9 ms/token without prefetch vs 19.8 ms/token with prefetch; at 65B it is 149 ms/token versus 147 ms/token without/with prefetch. Hence, most of the gain in speed comes from letting each thread in a SIMD group process a full block.

@ikawrakow ikawrakow requested a review from ggerganov July 13, 2023 09:27
@ggerganov
Copy link
Member

ggerganov commented Jul 13, 2023

Results on M1 Pro for ./main -m $model -p "I believe the meaning of life is" --ignore-eos -s 3 -n 128 -t 8 -ngl 1

model master this PR
7B Q4_0 35.1 ms/tok 32.7 ms/tok
7B Q4_1 46.2 ms/tok 34.9 ms/tok
13B Q4_0 61.2 ms/tok 57.0 ms/tok
13B Q4_1 89.2 ms/tok 61.4 ms/tok

@ggerganov ggerganov added performance Speed related topics high priority Very important issue labels Jul 13, 2023
@ggerganov
Copy link
Member

@ikawrakow

Btw, I don't see the suggested changes in #2188 - did you submit the changes for review or are they still "pending"?

Here is how the PR looks on my end:

image

@ikawrakow
Copy link
Contributor Author

@ggerganov This is interesting. Below is what I see in my browser. I now see it has a "Pending" label on the comments. I did start by using "Add a single comment", but it then somehow got converted to a Review that I did not submit?

Screenshot 2023-07-13 at 12 24 12 PM

@ggerganov
Copy link
Member

Yup, most likely that's the case. I usually get confused the other way around - click on "Add single comment" when I actually intent to do "Start review". But either way - not a great UX :)

@ggerganov
Copy link
Member

I just realized something and I am not sure how we haven't spotted this yet:

In ggml-metal.m we do the following before we dispatch a mul_mat kernel:

...
                                [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
                                [encoder setBuffer:id_src1 offset:offs_src1 atIndex:1];
                                [encoder setBuffer:id_dst  offset:offs_dst  atIndex:2];
                                [encoder setBytes:&ne00 length:sizeof(ne00) atIndex:3];
                                [encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
                                [encoder setBytes:&nb00 length:sizeof(nb00) atIndex:5];
                                [encoder setBytes:&nb01 length:sizeof(nb01) atIndex:6];
                                [encoder setBytes:&nb02 length:sizeof(nb02) atIndex:7];
                                [encoder setBytes:&ne10 length:sizeof(ne10) atIndex:8];
                                [encoder setBytes:&ne11 length:sizeof(ne11) atIndex:9];
                                [encoder setBytes:&nb10 length:sizeof(nb10) atIndex:10];
                                [encoder setBytes:&nb11 length:sizeof(nb11) atIndex:11];
                                [encoder setBytes:&nb12 length:sizeof(nb12) atIndex:12];
                                [encoder setBytes:&ne0  length:sizeof(ne0)  atIndex:13];
                                [encoder setBytes:&ne1  length:sizeof(ne1)  atIndex:14];

                                if (src0t == GGML_TYPE_Q4_0) {
                                    [encoder dispatchThreadgroups:MTLSizeMake(ne01 / 8+((ne01 % 8) & 0x01), ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                                }
                                else if (src0t == GGML_TYPE_Q4_1) {
                               ... etc

These are the arguments we pass to the Metal shaders / kernels and they should match the definitions in ggml-metal.metal.

However, the number of arguments in the kernels do not match, and neither the order matches as well.
For example, the Q4_0 kernel:

kernel void kernel_mul_mat_q4_0_f32(
        device const  void * src0,
        device const float * src1,
        device       float * dst,
        constant   int64_t & ne00,
        constant   int64_t & ne10,
        constant   int64_t & ne0,
        constant   int64_t & ne01[[buffer(4)]],
        uint2 tgpig[[threadgroup_position_in_grid]],
        uint tiisg[[thread_index_in_simdgroup]],
        uint sgitg[[simdgroup_index_in_threadgroup]]) {
    const int nb = ne00/QK4_0;
... etc

This would put ne00 -> ne00 which is OK, but then it will put ne01 -> ne10 which is not OK.
And from then on, everything else, all arguments won't match.

Is my understanding wrong, or we've had a problem for quite some time that we somehow haven't noticed?
I just don't understand if that is the case, then how the Metal code even produces anything meaningful at all.
So maybe I'm missing something

@lshzh-ww
Copy link
Contributor

@ikawrakow I saw the PR the same as ggerganov, so I thought by "suggested changes" you referred to remove the old q4_0 kernel. Bad UX!

@ggerganov I did notice that arguments in kernel don't match, but I was not sure if that's intended...

@ikawrakow
Copy link
Contributor Author

@ggerganov

I just realized something and I am not sure how we haven't spotted this yet:

In ggml-metal.m we do the following before we dispatch a mul_mat kernel:

This was me throwing out unused arguments to the kernels (at some points I noticed that having fewer arguments passed to the kernels slightly improved performance). But it looks like the corresponding change in ggml-metal.m has been lost? In any case, these kernels are never used in matrix-matrix multiplications where it would matter (but it would be useful to fix it anyway).

@lshzh-ww
Copy link
Contributor

M1 Max 32c

model master this PR
7B q4_0 22.15 ms/tok 20.73 ms/tok

That's a 7% improvement.

Copy link
Member

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated my results for Q4_1 as well - ~25% speed-up indeed 🦙

We can merge this now or wait if we get some M1 / M2 Ultra reports.
We can fix the arg stuff in separate PR

@ikawrakow
Copy link
Contributor Author

@ikawrakow I saw the PR the same as ggerganov, so I thought by "suggested changes" you referred to remove the old q4_0 kernel. Bad UX!

@lshzh-ww Sorry about the confusion. I felt victim to the brilliant UX. I thought I had posted the comments as they appeared in my browser, but I had not. It must have looked really dumb me talking about additional possible speedup as per my comments in the tables I posted in your PR, while nobody but me seeing these suggestions.

@ikawrakow ikawrakow merged commit 27ad57a into master Jul 14, 2023
@ikawrakow ikawrakow deleted the ik/metal_new_q4_0_1 branch July 14, 2023 09:56
@ikawrakow ikawrakow mentioned this pull request Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
high priority Very important issue performance Speed related topics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants