sycl: simplify bin_bcast_kernel #13383

Merged: 1 commit merged into ggml-org:master on May 15, 2025
Conversation

AD2605 (Contributor) commented May 8, 2025

This PR simplifies the bin-bcast kernel and adds a special code path for when all the inputs are contiguous, avoiding unnecessary index calculations.

The current bin-bcast implementation launches either a 3D grid or a 1D grid, with the former often limiting the number of work-items it can accommodate.

This PR completely flattens the kernel, which makes it easier to check for contiguous memory accesses. The separate contiguous path also opens the possibility of vectorization later on, though in my testing it did not bring a meaningful difference in performance.

This PR also brings a minor but consistent improvement of around 1 tk/s on some models.
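
For illustration, here is a minimal sketch of the idea: a fully flattened launch with a separate contiguous fast path. This is not the PR's actual kernel; the names (`bin_bcast_flat`, etc.) and the assumption that src0 shares dst's layout are hypothetical simplifications.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

template <typename T, typename Op>
void bin_bcast_flat(sycl::queue &q, const T *src0, const T *src1, T *dst,
                    size_t n,              // total number of dst elements
                    bool all_contiguous,
                    const size_t ne[4],    // dst extents, ne[0] fastest-moving
                    const size_t s1[4],    // src1 strides; 0 marks a broadcast dim
                    Op op) {
    if (all_contiguous) {
        // Fast path: one flat index per work-item, no per-dimension index math.
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> it) {
            const size_t i = it;
            dst[i] = op(src0[i], src1[i]);
        });
    } else {
        // General path: rebuild 4D coordinates from the flat index, then apply
        // src1's strides (zero on broadcast dims). To keep the sketch short,
        // src0 is assumed contiguous and laid out like dst.
        const size_t ne0 = ne[0], ne1 = ne[1], ne2 = ne[2];
        const size_t s10 = s1[0], s11 = s1[1], s12 = s1[2], s13 = s1[3];
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> it) {
            size_t i = it;
            const size_t idx = i;
            const size_t i0 = i % ne0; i /= ne0;
            const size_t i1 = i % ne1; i /= ne1;
            const size_t i2 = i % ne2; i /= ne2;
            const size_t i3 = i;
            dst[idx] = op(src0[idx],
                          src1[i0 * s10 + i1 * s11 + i2 * s12 + i3 * s13]);
        });
    }
    q.wait();
}

int main() {
    sycl::queue q;
    const size_t n = 8;
    const size_t ne[4] = {4, 2, 1, 1}; // dst is 4x2
    const size_t s1[4] = {1, 4, 0, 0}; // src1 shaped exactly like dst here
    float *a = sycl::malloc_shared<float>(n, q);
    float *b = sycl::malloc_shared<float>(n, q);
    float *d = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) { a[i] = float(i + 1); b[i] = 2.0f; }
    bin_bcast_flat(q, a, b, d, n, /*all_contiguous=*/true, ne, s1,
                   [](float x, float y) { return x / y; });
    printf("d[5] = %g\n", d[5]); // (5+1) / 2 = 3
    sycl::free(a, q); sycl::free(b, q); sycl::free(d, q);
    return 0;
}
```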

Performance was compared with the following parameters (-mmp 0 -ngl 99 -t 8):

Intel Lunar Lake 140V iGPU

| Model | This PR (5a0e7a9) tk/s | Master (814f795) tk/s |
| --- | --- | --- |
| qwen2 1.5B Q4_0 | 34.05 ± 0.53 | 33.39 ± 0.52 |
| gemma2 2B Q4_K | 25.00 ± 0.25 | 24.74 ± 0.21 |
| llama 8B Q4_K - Medium | 11.10 ± 1.27 | 10.2 ± 0.70 |

Intel Data Center Max 1100

| Model | This PR (5a0e7a9) tk/s | Master (814f795) tk/s |
| --- | --- | --- |
| qwen2 1.5B Q4_0 | 95.06 ± 0.76 | 92.51 ± 5.88 |
| gemma2 2B Q4_K | 84.62 ± 0.18 | 82.44 ± 0.22 |
| llama 8B Q4_K - Medium | 36.72 ± 0.05 | 36.35 ± 0.07 |

The github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on May 8, 2025.
Rbiessy (Collaborator) left a comment

LGTM!

AD2605 force-pushed the ad/simplify_binbcast branch from 0aa9325 to 6c2091f on May 14, 2025
Rbiessy (Collaborator) commented May 15, 2025

No concerns were raised, and the CI failure is unrelated, so merging.

Rbiessy merged commit 02cdd2d into ggml-org:master on May 15, 2025 (43 of 45 checks passed).
qnixsynapse (Collaborator) commented May 24, 2025

I know I gave this an approval, but this seems to break fp16 DIV operations. I am investigating.

build/bin/test-backend-ops -b SYCL0 -o DIV
Testing 2 devices

Backend 1/2: SYCL0
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_DISABLE_DNN: 0
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|  12.55|    448|    1024|   32|  8096M|     1.6.33276.160000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
  Device description: Intel(R) Arc(TM) A750 Graphics
  Device memory: 7721 MB (7721 MB free)

  DIV(type=f16,ne=[1,1,8,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000110 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1,1],nr=[32,1,1,1]): OK
  DIV(type=f16,ne=[1,1,320,320],nr=[1,1,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,1,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000157 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000118 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,1,1]): [DIV] NMSE = 0.000000126 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[2,1,1,1]): [DIV] NMSE = 0.000000157 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,2,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,2,1]): [DIV] NMSE = 0.000000114 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,1,2]): [DIV] NMSE = 0.000000138 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,1,2,2]): [DIV] NMSE = 0.000000127 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[1,2,2,2]): [DIV] NMSE = 0.000000136 > 0.000000100 FAIL
  DIV(type=f16,ne=[10,5,4,3],nr=[2,2,2,2]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,1,1,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000135 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,1,1,1],nr=[1,16,16,1]): [DIV] NMSE = 0.000000138 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,16,16,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1280,1,1,1],nr=[1,256,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1280,1],nr=[16,16,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[16,16,1280,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1920,1],nr=[16,16,1,1]): [DIV] NMSE = 0.000000133 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,2560,1],nr=[16,16,1,1]): [DIV] NMSE = 0.000000134 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1280,1],nr=[32,32,1,1]): [DIV] NMSE = 0.000000140 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,1920,1],nr=[32,32,1,1]): [DIV] NMSE = 0.000000136 > 0.000000100 FAIL
  DIV(type=f16,ne=[1,1,640,1],nr=[32,32,1,1]): [DIV] NMSE = 0.000000136 > 0.000000100 FAIL
  DIV(type=f16,ne=[5120,1,1,1],nr=[1,256,1,1]): [DIV] NMSE = 0.000000135 > 0.000000100 FAIL
  DIV(type=f16,ne=[640,1,1,1],nr=[1,1,1,1]): [DIV] NMSE = 0.000000118 > 0.000000100 FAIL
  DIV(type=f32,ne=[1,1,8,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1,1,1,1],nr=[32,1,1,1]): OK
  DIV(type=f32,ne=[1,1,320,320],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,1,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[2,1,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,1,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,1]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,1,2]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,1,2,2]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[1,2,2,2]): OK
  DIV(type=f32,ne=[10,5,4,3],nr=[2,2,2,2]): OK
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]): OK
  DIV(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]): OK
  DIV(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): OK
  DIV(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  DIV(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): OK
  DIV(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): OK
  DIV(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): OK
  DIV(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): OK
  DIV(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): OK
  DIV(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
  DIV(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]): OK
  5502/5527 tests passed
  Backend SYCL0: FAIL

Backend 2/2: CPU
  Skipping
1/2 backends passed
FAIL

Reverting this seems to fix the issue.
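
For context, test-backend-ops judges each case by normalized mean squared error against the CPU reference, and the fp16 DIV failures above exceed the 1e-7 threshold only marginally. A minimal sketch of that check follows; the formula and threshold match the log above, but the function itself is illustrative, not the test harness's actual code.

```cpp
#include <cstdio>
#include <vector>

// NMSE = sum((a[i] - b[i])^2) / sum(b[i]^2), with b the reference output.
double nmse(const std::vector<double> &a, const std::vector<double> &b) {
    double err = 0.0, ref = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        err += d * d;
        ref += b[i] * b[i];
    }
    return err / ref;
}

int main() {
    // Values chosen to show how small a deviation trips the 1e-7 bar:
    // a relative error of ~3.5e-4 per element already gives NMSE ~1.2e-7.
    std::vector<double> ref = {1.0, 2.0, 3.0};
    std::vector<double> out = {1.00035, 2.0007, 3.00105};
    const double max_nmse = 1e-7; // threshold shown in the log above
    const double e = nmse(out, ref);
    printf("NMSE = %.9f -> %s\n", e, e > max_nmse ? "FAIL" : "OK");
    return 0;
}
```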

ggml-ci started failing too: https://github.com/ggml-org/ci/tree/results/ggml/0e/07f5c1552fada22de01c8a92c5aa56b4765f4e/ggml-6-x86-sycl

qnixsynapse added a commit that referenced this pull request May 24, 2025
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d.

ggml-ci
ggerganov pushed a commit that referenced this pull request May 25, 2025
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d.

ggml-ci
Rbiessy (Collaborator) commented May 26, 2025

Ok, thanks for taking a look!

@qnixsynapse: I wonder if this could be related to SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" not being set, see:

> It is possible to come across some precision issues when running tests that stem from using faster

With DPC++ 2025.2 we should be able to remove this environment variable and set it as a compile flag instead. If you can confirm that's the issue we could wait for that.

qnixsynapse (Collaborator) commented:

> @qnixsynapse: I wonder if this could be related to SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" not being set [...]

I can confirm setting the env variable fixes the issue.
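
For reference, the workaround amounts to setting the variable before the SYCL runtime JIT-compiles any kernels: SYCL_PROGRAM_COMPILE_OPTIONS is a DPC++ environment variable that overrides JIT compile options, and -cl-fp32-correctly-rounded-divide-sqrt requests correctly rounded fp32 divide/sqrt. Exporting it in the shell before launching is the reliable form; the in-process setenv shown below (a POSIX-only sketch, not llama.cpp code) only works if the runtime has not yet read the variable.

```cpp
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <cstdio>

int main() {
    // Set before any kernel is JIT-compiled for the device.
    setenv("SYCL_PROGRAM_COMPILE_OPTIONS",
           "-cl-fp32-correctly-rounded-divide-sqrt", 1);

    sycl::queue q;
    float *x = sycl::malloc_shared<float>(1, q);
    // With the option active, this fp32 division must be correctly rounded.
    q.single_task([=] { x[0] = 1.0f / 3.0f; }).wait();
    printf("%.9f\n", x[0]);
    sycl::free(x, q);
    return 0;
}
```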
