sycl : variable sg_size support for mmvq kernels #12336
Conversation
LGTM
Awesome. Thank you!
LGTM
@Alcpz QK_WARP_SIZE is 32 and WARP_SIZE is 16 for Intel GPUs. I suggest reverting this PR, since most of the changes just rename between them.
What do you mean? I've tested the changes on multiple Intel GPUs and there are no issues in the kernels changed here. That's the purpose of these changes.
QK_WARP_SIZE is 32 and WARP_SIZE is 16 for Intel GPUs. QK_WARP_SIZE was introduced in a9554e2. @luoyu-intel Could you provide more info about it?
Yes, I understand what you are saying. The purpose of these changes is just to allow the mmvq kernels (which are also not the path followed by Intel architectures) to support sub group size 16, so we can use WARP_SIZE, which is the variable defined when compiling (GGML_SYCL_WARP_SIZE in the CMakeLists). I left QK_WARP_SIZE in other kernels that are used by Intel outside of these changes. I checked whether the variable could be removed entirely, but that would have introduced errors in the dmmv kernels (like dequantize_mul_mat_vec_q6_k) when the mul mat values are reduced at the end of the kernel. Maybe that's what you are referring to?
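For context, here is a minimal hypothetical sketch of the kind of end-of-kernel reduction that keeps those dmmv kernels tied to a 32-wide subgroup (this is not the actual dequantize_mul_mat_vec_q6_k code; names and structure are illustrative):

```cpp
#include <sycl/sycl.hpp>

// Tree reduction of per-lane partial sums, hard-coded for 32 active lanes.
// On a 16-wide subgroup the first step (mask = 16) would address lanes that
// do not exist, so this pattern cannot simply switch to WARP_SIZE = 16.
float reduce_partial_sums_32(sycl::sub_group sg, float tmp) {
    for (int mask = 32 / 2; mask > 0; mask >>= 1) {
        tmp += sycl::permute_group_by_xor(sg, tmp, mask);  // assumes 32 lanes
    }
    return tmp;  // full sum only when the subgroup really is 32 wide
}
```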
As far as I can remember, QK_WARP_SIZE was created because I didn't rewrite the SYCL kernels for these QK dtypes. I rewrote some SYCL kernels that were directly translated from CUDA to suit Intel's actual GPU warp size. WARP_SIZE=16 may produce incorrect results from these QK kernels.
@Alcpz |
Sorry @NeoZhangJianyu, I don't understand why you are suggesting this. I don't mind reverting these changes if they actually break something, but I still haven't found a case where this happens. If you provide an example of this breaking models, or can point me to a model that is broken due to this change, I will be more than happy to revert and go back to the old implementation to ensure correctness.
Why do you think using 16 to replace 32 won't impact this case? I can't provide a detailed case that proves the side effect for now. I want to revert the code and use QK_WARP_SIZE instead of WARP_SIZE.
Using QK_WARP_SIZE is not "legacy" behavior, it's a workaround. @NeoZhangJianyu I have been testing non-UT cases here and this PR doesn't seem to cause any problems so far. If it turns out to, I will open a PR to revert it.
There are some accuracy issues with real LLMs using Q4_K and Q5_K.
But how will reverting this code help if the functions inside mmvq shouldn't be executed on the Intel device path unless you change ggml_sycl_mul_mat? This is why I am confused about reverting this and why I am asking you for a more detailed explanation. Again, if you know which LLMs have accuracy issues, we should check what is going on there and fix the problems there, not revert some work to see if by chance we get better performance/accuracy.
Could you explain why using 16 to replace 32 won't impact the result? I find there is an accuracy issue in at least 2 models. I hope to check it from a clean base. I don't want to revert the whole PR.
Alright, the key thing here is that switching back to 32, even if it's a small change, makes the key part of the PR (the changes to the loop discussed below) redundant.

Before the PR, changing from 32 to 16 meant the subgroup would skip half of the quants in a superblock, due to how these kernels are designed (one subgroup per workgroup). For Q4_K it is slightly different, as Q4_K was programmed to tackle more than one superblock per workgroup, so you end up with the first 16 threads dealing with the first superblock and the second 16 threads with the second superblock. In practice, what this means is: if you changed 32 to 16, unit tests would pass even though computations were incorrect, because for size m = 256 you would only have a single superblock and the results would be correct.

This loop was added to ensure that all the work is distributed correctly and that the number of operations remains the same. Your small change (16 -> 32) makes that loop pointless, and it would then simply be better to entirely revert the PR.

Can you please tell me which build has the precision issues? That will also help in the future to ensure other PRs don't introduce problems.
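To make the distribution point concrete, here is a minimal hypothetical sketch (plain C++, not the actual mmvq kernel; VALS_PER_SUPERBLOCK and the function names are illustrative):

```cpp
constexpr int VALS_PER_SUPERBLOCK = 32;  // assumed quant positions per superblock

// Pre-change assumption: one lane per position, valid only when sg_size == 32.
// With sg_size == 16, positions 16..31 are simply never read.
float partial_sum_fixed(int lane, const float *quants) {
    return quants[lane];
}

// Post-change idea: each lane strides through the superblock by the subgroup
// size, so every position is covered for both 16- and 32-wide subgroups.
float partial_sum_strided(int lane, int sg_size, const float *quants) {
    float acc = 0.0f;
    for (int i = lane; i < VALS_PER_SUPERBLOCK; i += sg_size) {
        acc += quants[i];
    }
    return acc;
}
```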
Thank you for such a detailed explanation!
#12096 This issue was closed because the user switched to 3.2 and it passed. It would be good if you could check it. Thank you!
@NeoZhangJianyu Seems to be working okay here:
It's ok; as long as we have constructive discussions, I think it's good that you raise your concerns. That issue was opened before I merged this code, so you may be right that there may be a bug somewhere worth investigating.
I tested it with a Q4_0 model. Thank you!
Yes. |
The MMVQ kernels were based on CUDA's mmvq kernels, which were tailored for subgroups of size 32. While bigger subgroups were considered, smaller subgroups were not. The changes allow for more flexibility in these kernels and could reduce the code base size if we make these kernels fast enough on Intel architectures.
The introduced changes distribute more work per thread / workitem if WARP_SIZE < 32.
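As a rough numeric illustration of that scaling (constants are illustrative, not taken from the kernels), the per-lane iteration count over a 32-value superblock doubles when the subgroup shrinks from 32 to 16 lanes:

```cpp
constexpr int vals_per_superblock = 32;  // hypothetical block width
constexpr int iters_per_lane(int warp_size) { return vals_per_superblock / warp_size; }
static_assert(iters_per_lane(32) == 1, "one pass per lane on a 32-wide subgroup");
static_assert(iters_per_lane(16) == 2, "two passes per lane on a 16-wide subgroup");
```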
This PR doesn't affect performance.
QK_WARP_SIZE is still needed for kernels that require subgroup size == 32, so I am only renaming the existing variable for now.
Note: To enable these changes on Intel devices, further changes are required in the entry point of the MUL_MAT op.
Note 2: This code path requires FMA intrinsics that may not be available on all architectures; even more changes are required to make it the default path on Intel devices without potentially losing support on older archs.