sycl: simplify bin_bcast_kernel #13383
Conversation
LGTM!
Force-pushed from 0aa9325 to 6c2091f
No concerns were raised, and the CI failure is unrelated, so merging.
I know I gave this an approval but this seems to break fp16 DIV operations. I am investigating.
Reverting this seems to fix the issue. ggml-ci started failing too: https://github.com/ggml-org/ci/tree/results/ggml/0e/07f5c1552fada22de01c8a92c5aa56b4765f4e/ggml-6-x86-sycl
Temporarily reverted due to the failing fp16 DIV operation. This reverts commit 02cdd2d. ggml-ci
Ok, thanks for taking a look! @qnixsynapse: I wonder if this could be related to `docs/backend/SYCL.md` line 307 (at 9012eb9).
With DPC++ 2025.2 we should be able to remove this environment variable and set it as a compile flag instead. If you can confirm that's the issue, we could wait for that.
I can confirm setting the env variable fixes the issue.
This PR simplifies the `bin-bcast` kernel and adds a special code path for when all the inputs are contiguous, avoiding unnecessary index calculations. The current bin-bcast launches either a 3D grid or a 1D grid, and the former is often limiting in the number of work-items it can accommodate.
This PR completely flattens the kernel, which also makes it easier to check for contiguous memory accesses. The separate contiguous path also opens up the possibility of vectorization later on, though in my current testing it did not bring a meaningful difference in performance.
This PR also brings a minor but consistent improvement of around 1 tk/s on some models.
Performance compared with the following parameters (with `-mmp 0 -ngl 99 -t 8`):

Intel Lunar Lake 140V iGPU

Intel Data Center Max 1100