sycl: simplify bin_bcast_kernel #13383

Merged: 1 commit merged into ggml-org:master on May 15, 2025
Conversation

AD2605 (Contributor) commented May 8, 2025

This PR simplifies the bin-bcast kernel and adds a special code path for when all the inputs are contiguous, avoiding unnecessary index calculations.

The current bin-bcast implementation launches either a 3D grid or a 1D grid, with the former often limiting the number of work-items it can accommodate.

This PR completely flattens the kernel, which makes it easier to check for contiguous memory accesses. The separate contiguous path also opens the possibility of vectorization later on, though in my testing it did not bring a meaningful difference in performance.

This PR also brings a minor but consistent improvement of around 1 tk/s on some models.
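
For illustration, here is a minimal sketch of the idea: a fully flattened launch with a separate contiguous fast path. This is not the PR's actual kernel; the names (`bin_bcast_flat`, etc.) and the assumption that src0 shares dst's layout are hypothetical simplifications.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

template <typename T, typename Op>
void bin_bcast_flat(sycl::queue &q, const T *src0, const T *src1, T *dst,
                    size_t n,              // total number of dst elements
                    bool all_contiguous,
                    const size_t ne[4],    // dst extents, ne[0] fastest-moving
                    const size_t s1[4],    // src1 strides; 0 marks a broadcast dim
                    Op op) {
    if (all_contiguous) {
        // Fast path: one flat index per work-item, no per-dimension index math.
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> it) {
            const size_t i = it;
            dst[i] = op(src0[i], src1[i]);
        });
    } else {
        // General path: rebuild 4D coordinates from the flat index, then apply
        // src1's strides (zero on broadcast dims). To keep the sketch short,
        // src0 is assumed contiguous and laid out like dst.
        const size_t ne0 = ne[0], ne1 = ne[1], ne2 = ne[2];
        const size_t s10 = s1[0], s11 = s1[1], s12 = s1[2], s13 = s1[3];
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> it) {
            size_t i = it;
            const size_t idx = i;
            const size_t i0 = i % ne0; i /= ne0;
            const size_t i1 = i % ne1; i /= ne1;
            const size_t i2 = i % ne2; i /= ne2;
            const size_t i3 = i;
            dst[idx] = op(src0[idx],
                          src1[i0 * s10 + i1 * s11 + i2 * s12 + i3 * s13]);
        });
    }
    q.wait();
}

int main() {
    sycl::queue q;
    const size_t n = 8;
    const size_t ne[4] = {4, 2, 1, 1}; // dst is 4x2
    const size_t s1[4] = {1, 4, 0, 0}; // src1 shaped exactly like dst here
    float *a = sycl::malloc_shared<float>(n, q);
    float *b = sycl::malloc_shared<float>(n, q);
    float *d = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) { a[i] = float(i + 1); b[i] = 2.0f; }
    bin_bcast_flat(q, a, b, d, n, /*all_contiguous=*/true, ne, s1,
                   [](float x, float y) { return x / y; });
    printf("d[5] = %g\n", d[5]); // (5+1) / 2 = 3
    sycl::free(a, q); sycl::free(b, q); sycl::free(d, q);
    return 0;
}
```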

Performance was compared with the following parameters (-mmp 0 -ngl 99 -t 8):

Intel Lunar Lake 140V iGPU

| Model | This PR (5a0e7a9) tk/s | Master (814f795) tk/s |
| --- | --- | --- |
| qwen2 1.5B Q4_0 | 34.05 ± 0.53 | 33.39 ± 0.52 |
| gemma2 2B Q4_K | 25.00 ± 0.25 | 24.74 ± 0.21 |
| llama 8B Q4_K - Medium | 11.10 ± 1.27 | 10.2 ± 0.70 |

Intel Data Center Max 1100

| Model | This PR (5a0e7a9) tk/s | Master (814f795) tk/s |
| --- | --- | --- |
| qwen2 1.5B Q4_0 | 95.06 ± 0.76 | 92.51 ± 5.88 |
| gemma2 2B Q4_K | 84.62 ± 0.18 | 82.44 ± 0.22 |
| llama 8B Q4_K - Medium | 36.72 ± 0.05 | 36.35 ± 0.07 |

The github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on May 8, 2025.
Rbiessy (Collaborator) left a comment

LGTM!

AD2605 force-pushed the ad/simplify_binbcast branch from 0aa9325 to 6c2091f on May 14, 2025
Rbiessy (Collaborator) commented May 15, 2025

No concerns were raised, and the CI failure is unrelated, so merging.

Rbiessy merged commit 02cdd2d into ggml-org:master on May 15, 2025 (43 of 45 checks passed).
qnixsynapse (Collaborator) commented May 24, 2025

I know I gave this an approval, but this seems to break fp16 DIV operations. I am investigating.

build/bin/test-backend-ops -b SYCL0 -o DIV
Testing 2 devices

Backend 1/2: SYCL0
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_DISABLE_DNN: 0
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|  12.55|    448|    1024|   32|  8096M|     1.6.33276.160000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
  Device description: Intel(R) Arc(TM) A750 Graphics
  Device memory: 7721 MB (7721 MB free)

  DIV(type=f16,ne=[1,1,8,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000110 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1,1],nr=[32,1,1,1]): OK
  DIV(type=f16,ne=[1,1,320,320],nr=[1,1,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,1,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000157 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000118 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,1,1]): [DIV] NMSE = 0.000000126 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[2,1,1,1]): [DIV] NMSE = 0.000000157 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,2,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,2,1]): [DIV] NMSE = 0.000000114 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,1,2]): [DIV] NMSE = 0.000000138 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,2,2]): [DIV] NMSE = 0.000000127 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,2,2,2]): [DIV] NMSE = 0.000000136 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[2,2,2,2]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,1,1,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000135 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,1,1,1],nr=[1,16,16,1]): [DIV] NMSE = 0.000000138 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,16,16,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,1,1,1],nr=[1,256,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1280,1],nr=[16,16,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[16,16,1280,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1920,1],nr=[16,16,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,2560,1],nr=[16,16,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1280,1],nr=[32,32,1,1]): [DIV] NMSE = 0.000000140 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1920,1],nr=[32,32,1,1]): [DIV] NMSE = 0.000000136 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,640,1],nr=[32,32,1,1]): [DIV] NMSE = 0.000000136 > 0.000000100 FAIL
  DIV(type=f16,ne=[5120,1,1,1],nr=[1,256,1,1]): [DIV] NMSE = 0.000000135 > 0.000000100 FAIL
  DIV(type=f16,ne=[640,1,1,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000118 > 0.000000100 FAIL
  DIV(type=f32,ne=[1,1,8,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1,1,1,1],nr=[32,1,1,1]): OK
  DIV(type=f32,ne=[1,1,320,320],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,1,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[2,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,2]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,2]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,2,2]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[2,2,2,2]): OK
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]): OK
  DIV(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]): OK
  DIV(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): OK
  DIV(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): OK
  DIV(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): OK
  DIV(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): OK
  DIV(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): OK
  DIV(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): OK
  DIV(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
  DIV(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]): OK
  5502/5527 tests passed
  Backend SYCL0: FAIL

Backend 2/2: CPU
  Skipping
1/2 backends passed
FAIL

Reverting this seems to fix the issue.
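
For context, test-backend-ops judges each case by normalized mean squared error against the CPU reference, and the fp16 DIV failures above exceed the 1e-7 threshold only marginally. A minimal sketch of that check follows; the formula and threshold match the log above, but the function itself is illustrative, not the test harness's actual code.

```cpp
#include <cstdio>
#include <vector>

// NMSE = sum((a[i] - b[i])^2) / sum(b[i]^2), with b the reference output.
double nmse(const std::vector<double> &a, const std::vector<double> &b) {
    double err = 0.0, ref = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        err += d * d;
        ref += b[i] * b[i];
    }
    return err / ref;
}

int main() {
    // Values chosen to show how small a deviation trips the 1e-7 bar:
    // a relative error of ~3.5e-4 per element already gives NMSE ~1.2e-7.
    std::vector<double> ref = {1.0, 2.0, 3.0};
    std::vector<double> out = {1.00035, 2.0007, 3.00105};
    const double max_nmse = 1e-7; // threshold shown in the log above
    const double e = nmse(out, ref);
    printf("NMSE = %.9f -> %s\n", e, e > max_nmse ? "FAIL" : "OK");
    return 0;
}
```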

ggml-ci started failing too: https://github.com/ggml-org/ci/tree/results/ggml/0e/07f5c1552fada22de01c8a92c5aa56b4765f4e/ggml-6-x86-sycl

qnixsynapse added a commit that referenced this pull request May 24, 2025
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d.

ggml-ci
ggerganov pushed a commit that referenced this pull request May 25, 2025
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d.

ggml-ci
Rbiessy (Collaborator) commented May 26, 2025

Ok, thanks for taking a look!

@qnixsynapse: I wonder if this could be related to SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" not being set, see:

> It is possible to come across some precision issues when running tests that stem from using faster

With DPC++ 2025.2 we should be able to remove this environment variable and set it as a compile flag instead. If you can confirm that's the issue we could wait for that.

qnixsynapse (Collaborator) commented:

> @qnixsynapse: I wonder if this could be related to SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" not being set [...]

I can confirm setting the env variable fixes the issue.
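
For reference, the workaround amounts to setting the variable before the SYCL runtime JIT-compiles any kernels: SYCL_PROGRAM_COMPILE_OPTIONS is a DPC++ environment variable that overrides JIT compile options, and -cl-fp32-correctly-rounded-divide-sqrt requests correctly rounded fp32 divide/sqrt. Exporting it in the shell before launching is the reliable form; the in-process setenv shown below (a POSIX-only sketch, not llama.cpp code) only works if the runtime has not yet read the variable.

```cpp
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <cstdio>

int main() {
    // Set before any kernel is JIT-compiled for the device.
    setenv("SYCL_PROGRAM_COMPILE_OPTIONS",
           "-cl-fp32-correctly-rounded-divide-sqrt", 1);

    sycl::queue q;
    float *x = sycl::malloc_shared<float>(1, q);
    // With the option active, this fp32 division must be correctly rounded.
    q.single_task([=] { x[0] = 1.0f / 3.0f; }).wait();
    printf("%.9f\n", x[0]);
    sycl::free(x, q);
    return 0;
}
```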
