[SYCL] Fix the sub group size of Intel #8106
Conversation
Need confirmation from our Codeplay mates to verify on other HW.
Force-pushed from fcbd140 to 7d8c960
@AidanBeltonS @OuadiElfarouki @joeatodd if no regression on your side, I will merge it soon.
The quantize functions assume WARP_SIZE equals the block size (32); there is remaining work for this.
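As a minimal sketch of that coupling (the identifiers and launch shape are assumptions, not the repository's exact code): the Q8_1 path enqueues one sub-group per 32-element block, which maps one element per lane only when WARP_SIZE == 32; a 16-wide sub-group must cover two elements per lane.

#include <sycl/sycl.hpp>

constexpr int QK8_1 = 32; // ggml's Q8_1 block size

// hypothetical launcher illustrating the geometry described above
void launch_quantize_q8_1(sycl::queue & q, int nblocks, int warp_size) {
    q.parallel_for(
        sycl::nd_range<1>(nblocks * warp_size, warp_size),
        [=](sycl::nd_item<1> it) {
            // warp_size == 32: each lane quantizes one element of its block;
            // warp_size == 16: each lane must handle QK8_1 / 16 == 2 elements
            (void) it;
        });
}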
Force-pushed from ed267f2 to e2023a9
Hey, changes look good. Nice to see some more work to split ggml-sycl.cpp into smaller files! 👍
Tested on our side, all fine except for the two small things I've raised.
Hi @joeatodd, please review the change.
I've fixed the src1_ncols bug of mmvq, but there is a remaining accuracy bug when the prompt length is less than 9. (Output screenshots for this PR branch and the master branch with prompt_length=9 were attached.) The master branch is worse than this PR, and I can tell that this is not related to the sub-group size, so I will fix it in the next PR instead of this one.
Looks good, thanks for the changes 😃
}
}

template<int QUANT_BLOCK_TILE>
Shall there be some check of block_size? We don't want all block sizes supported, right?
This function is for Q8_1, so there is only one fixed block size.
Yes, shall we add some check like static_assert(QUANT_BLOCK_TILE == 32)?
QUANT_BLOCK_TILE = QK8_1 / WARP_SIZE. It's not necessary to add this check; the kernel works for any block size with BlockSize % WARP_SIZE == 0.
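A minimal sketch of the divisibility check implied here (assumed identifiers; QK8_1 is ggml's real Q8_1 block-size constant of 32): instead of pinning QUANT_BLOCK_TILE to 32, a compile-time guard can state the actual requirement.

constexpr int QK8_1 = 32;      // ggml's Q8_1 block size
constexpr int WARP_SIZE = 16;  // sub-group size used for Intel in this PR

template <int QUANT_BLOCK_TILE>
void quantize_q8_1(/* pointers and sizes elided */) {
    // weaker, always-valid form of the suggested static_assert
    static_assert(QK8_1 % WARP_SIZE == 0, "block size must be a multiple of WARP_SIZE");
    static_assert(QUANT_BLOCK_TILE == QK8_1 / WARP_SIZE, "tile = elements per work-item");
    // each work-item quantizes QUANT_BLOCK_TILE consecutive values of its block
}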
@qnixsynapse Is your prompt "Hi"? The SYCL backend had this repeat issue a long time ago.
You can check this comment. How about a longer prompt?
I think there are two issues: (a) a short prompt produces repeating tokens; (b) garbage tokens appear when the context length is larger than some value.
@qnixsynapse The first one is confirmed as an existing issue of the master branch. I will look into the second one to see whether it is introduced by this PR.
It's a regression: before this patch it used to work well (although a bit slower). I am still trying to debug. Sorry that I couldn't test it earlier; I was busy testing Gemma models.
I didn't test Q4_K_S models. I will test on an A770.
if (GGML_SYCL_TARGET STREQUAL "NVIDIA")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=nvptx64-nvidia-cuda")
    add_compile_definitions(GGML_SYCL_WARP_SIZE=32)
else()
    add_compile_definitions(GGML_SYCL_WARP_SIZE=16)
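A minimal sketch of how such a definition is typically consumed on the C++ side (assumed usage, not the repository's exact code): a single compile-time constant with a safe default, which kernels can pin via the SYCL 2020 sub-group-size attribute.

#ifndef GGML_SYCL_WARP_SIZE
#define GGML_SYCL_WARP_SIZE 16 // default matches the Intel branch above
#endif

static constexpr int WARP_SIZE = GGML_SYCL_WARP_SIZE;

// a kernel lambda can then request a matching sub-group size, e.g.:
//   [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(WARP_SIZE)]] { ... }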
@qnixsynapse Could you change this size to 32 and test your model?
I've confirmed this bug on Q4_K_S.
It's also broken on Q4_0.
@qnixsynapse BTW, which UI are you using? It looks quite cool.
@airMeng It's Chainlit.
@qnixsynapse WARP_SIZE=32 works fine for me. I can change WARP_SIZE to 32 for Intel GPUs in a new PR to revert this regression. Do you agree?
@luoyu-intel Sure. :) Edit: BTW, I am getting about 30 tokens/sec with iQ4_XS (earlier generation speed was 20 tokens/sec) with a warp_size of 32 and the other portions of this PR, so please don't revert anything else. :)
* use warp_size macro for all sycl kernels
* fix mask of permute_sub_group_by_xor (see the sketch after this list)
* fix rms_norm with correct warp number
* fix rms_norm_f32/group_norm_f32
* move norm to norm.cpp file
* fix quantize bug
* fix mmvq's batch size
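To illustrate the "fix mask of permute_sub_group_by_xor" item (a hedged sketch, not the actual diff): in a butterfly reduction the XOR masks must be derived from the real sub-group width. If the loop bound stays hard-coded at 32/2 while the sub-group is only 16 lanes wide, the first shuffle addresses out-of-range lanes and the result is undefined.

#include <sycl/sycl.hpp>

template <int WARP_SIZE>
static float warp_reduce_sum(float x, sycl::sub_group sg) {
    // lanes exchange values at XOR distances WARP_SIZE/2, WARP_SIZE/4, ..., 1;
    // afterwards every lane holds the sum over the whole sub-group
    for (int mask = WARP_SIZE / 2; mask > 0; mask >>= 1) {
        x += sycl::permute_group_by_xor(sg, x, mask);
    }
    return x;
}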
Changes:
-nan output is an issue from dpcpp: [SYCL] subgroup shuffle bug when sg_size=32 (intel/llvm#14274).
Debug builds can output the same tokens as release builds with this PR (master runs into exceptions):
Release output:
Performance benefit
Intel Arc A770:
37 tokens/s to 39 tokens/s (Windows + 9600K)
38.9 tokens/s to 41.8 tokens/s (Linux + 4th Gen Xeon)