
[NFC][SYCL] Replace #pragma unroll with dim_loop in accessor.hpp #6939


Merged: 2 commits merged into sycl from unroll on Oct 3, 2022

Conversation

aelovikov-intel (Contributor)

The utility was introduced in #6560 because `#pragma unroll` doesn't always work and a template-based solution is much more reliable. The original PR only changed the loops that showed an immediate performance difference; other occurrences were missed. This PR updates the remaining ones. Note that I found them by looking into the LLVM IR produced by our device compiler, and having the loops really unrolled improves the readability of such dumps (and most likely code size/perf, although not significantly).

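For context, here is a minimal sketch of how such a template-based unroller can be written. The names and exact signature of the real `sycl::detail::dim_loop` may differ; this is an illustration of the technique, not the library's code:

```cpp
#include <cstddef>
#include <utility>

// Expanding the index pack in a fold expression instantiates the body once
// per index, so the "loop" is unrolled by construction instead of relying
// on an optimizer hint like #pragma unroll.
template <std::size_t... Inds, class F>
constexpr void dim_loop_impl(std::index_sequence<Inds...>, F &&f) {
  (f(Inds), ...);
}

template <std::size_t Count, class F>
constexpr void dim_loop(F &&f) {
  dim_loop_impl(std::make_index_sequence<Count>{}, std::forward<F>(f));
}

// Usage: a guaranteed-unrolled three-element copy.
// int src[3] = {1, 2, 3}, dst[3] = {};
// dim_loop<3>([&](std::size_t i) { dst[i] = src[i]; });
```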
@aelovikov-intel aelovikov-intel requested a review from a team as a code owner October 3, 2022 15:46
bso-intel (Contributor) left a comment:


LGTM

aelovikov-intel (Contributor, Author)

@intel/llvm-gatekeepers PR is ready.

@pvchupin pvchupin merged commit fee486e into intel:sycl Oct 3, 2022
whitneywhtsang added a commit to whitneywhtsang/llvm that referenced this pull request Oct 28, 2022
whitneywhtsang added a commit to whitneywhtsang/llvm that referenced this pull request Nov 3, 2022
@aelovikov-intel aelovikov-intel deleted the unroll branch November 8, 2022 20:53
bader pushed a commit that referenced this pull request Feb 22, 2023
…backend. (#7948)

A performance regression was reported when using `reduce_over_group`
with `sycl::vec`. It was caused by a loop over calls to the scalar
`reduce_over_group` for each of the `sycl::vec` components; the loop was
not unrolled and led to register spills even at -O3.
It was initially possible to fix the performance by using `#pragma unroll`
and declaring `reduce_over_group` with `__attribute__((always_inline))`.
However, the `SYCL_UNROLL` macro that emits `#pragma unroll` has been
removed in favour of `dim_loop` (#6939).
I have used `dim_loop` to fix the loop unrolling. However, in the CUDA
backend, just using `dim_loop` in this way actually makes the performance
worse, because `dim_loop` introduces new non-inlined function calls in the
CUDA backend that lead to register spills. The fix for this coincides with
the fix for several user reports that the CUDA backend is not aggressive
enough with inlining. In this PR I have therefore also increased the
inlining threshold multiplier to 11.
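
For illustration, a sketch of what the `dim_loop`-based fix could look like. `reduce_over_group_vec` is a hypothetical helper name used here for exposition, not the actual library code, and it assumes a `dim_loop` like the sketch earlier in this thread:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper: reduce each component of a sycl::vec over the group
// with one scalar reduce_over_group call per component. Because dim_loop is
// a compile-time loop, the per-component calls are fully unrolled, avoiding
// the register spills seen when an ordinary for-loop stays rolled.
template <typename Group, typename T, int N, typename BinaryOp>
sycl::vec<T, N> reduce_over_group_vec(Group g, sycl::vec<T, N> v, BinaryOp op) {
  sycl::vec<T, N> result{};
  dim_loop<N>([&](std::size_t idx) {
    const int i = static_cast<int>(idx);
    result[i] = sycl::reduce_over_group(g, v[i], op);
  });
  return result;
}
```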

See https://reviews.llvm.org/D142232/new/ for the corresponding upstream
patch (for the inlining threshold change), which includes much more detail
on benchmarking DPC++ CUDA with this change. In short, for DPC++ CUDA there
is no downside apart from a very small increase in compile time in some
cases, while a large number of applications benefit massively from the
increased inlining threshold.

Testing with the OpenCL CPU backend shows that this code change has no
effect there. The change is required for the CUDA backend but should have
no performance effect on other backends.

fixes #6583.

---------

Signed-off-by: JackAKirk <[email protected]>
Co-authored-by: JackAKirk <[email protected]>