[SYCL] Use dim_loop to unroll loops in reduce_over_group in cuda backend. #7948
Conversation
Signed-off-by: JackAKirk <[email protected]>
Use #pragma unroll and always_inline to stop reg spills in reduce_over_group.
Signed-off-by: JackAKirk <[email protected]>
Use dim_loop to stop reg spills in reduce_over_group in cuda backend.
Signed-off-by: JackAKirk <[email protected]>
Done. I've updated the PR description accordingly.
Title changed from "[SYCL] Use dim_loop to stop reg spills in reduce_over_group in cuda backend." to "[SYCL] Use dim_loop to unroll loops in reduce_over_group in cuda backend."
ping @intel/dpcpp-tools-reviewers
Fixes #6583. The esimd and opencl CI just didn't run this time and should be unrelated: they ran before and passed. There was an unrelated hip failure before, but this time the hip CI passed.
@intel/dpcpp-tools-reviewers Would it be possible to get a review?
Please fix the issue reference in the description: it is currently the id of this PR. NOTE: to link PRs with issues, keywords must be used in the description. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword.
I see, thanks. Done.
A performance regression was reported when using reduce_over_group with sycl::vec. It was due to a loop over calls to the scalar reduce_over_group for each of the sycl::vec components; the loop was not unrolled and led to register spills even at -O3. It was initially possible to fix the performance by adding #pragma unroll to this loop and declaring reduce_over_group with __attribute__((always_inline)), roughly as sketched below.
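A minimal sketch of the pattern (not the actual library source; reduce_vec_over_group is a hypothetical stand-in for the sycl::vec overload):

```cpp
#include <sycl/sycl.hpp>

// Hypothetical stand-in for the sycl::vec overload of reduce_over_group.
// The per-component loop calls the scalar overload N times; without unrolling
// and inlining, those calls can spill registers even at -O3.
template <typename Group, typename T, int N, typename BinaryOperation>
__attribute__((always_inline)) sycl::vec<T, N>
reduce_vec_over_group(Group g, const sycl::vec<T, N> &x, BinaryOperation op) {
  sycl::vec<T, N> result{};
#pragma unroll // initial workaround: force the per-component loop to unroll
  for (int i = 0; i < N; ++i)
    result[i] = sycl::reduce_over_group(g, x[i], op); // scalar reduce per component
  return result;
}
```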
However, the SYCL_UNROLL macro that called #pragma unroll has been removed in favour of dim_loop (#6939), so I have used dim_loop to fix the loop unrolling, as sketched below.
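A rough sketch of the intent, assuming a compile-time-unrolled loop helper in the spirit of dim_loop (the exact signature of sycl::detail::dim_loop may differ; unrolled_loop below is written only for this example):

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

#include <sycl/sycl.hpp>

// Stand-in for a dim_loop-style helper: the body is expanded once per index at
// compile time via a fold expression, so there is no runtime loop left to unroll.
template <std::size_t... Is, typename F>
void unrolled_loop_impl(std::index_sequence<Is...>, F &&f) {
  (f(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t Count, typename F>
void unrolled_loop(F &&f) {
  unrolled_loop_impl(std::make_index_sequence<Count>{}, std::forward<F>(f));
}

// Hypothetical sycl::vec overload using the unrolled helper instead of a runtime loop.
template <typename Group, typename T, int N, typename BinaryOperation>
sycl::vec<T, N> reduce_vec_over_group(Group g, const sycl::vec<T, N> &x,
                                      BinaryOperation op) {
  sycl::vec<T, N> result{};
  unrolled_loop<N>([&](auto i) {
    result[i] = sycl::reduce_over_group(g, x[i], op);
  });
  return result;
}
```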
However, in the cuda backend, just using dim_loop in this way actually makes the performance worse: dim_loop introduces new non-inlined function calls in the cuda backend that lead to register spills. The solution to this coincides with the solution to several user reports that the cuda backend is not aggressive enough with inlining, so in this PR I have also increased the inlining threshold multiplier value to 11. See https://reviews.llvm.org/D142232/new/ for the corresponding upstream patch (for the inlining threshold change), which includes much more detail on benchmarking dpc++ cuda with this change. In short, for dpc++ cuda there is no downside other than a very small increase in compile time in some cases, but there is a large benefit from increasing the inlining threshold across a wide range of applications.
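For context, the inlining-threshold change boils down to raising the value returned by a per-target TargetTransformInfo hook for the NVPTX backend; the snippet below only illustrates the shape of that hook (names and surrounding code are assumptions, not copied from the actual patch, see D142232 for the real change):

```cpp
// Illustrative only: LLVM scales its base inline threshold by a per-target
// multiplier queried through TargetTransformInfo. Returning 11 here mirrors
// the value this PR uses so the CUDA backend inlines more aggressively.
struct IllustrativeNVPTXTTI {
  unsigned getInliningThresholdMultiplier() const { return 11; }
};
```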
Testing with the opencl cpu backend shows that this code change has no effect on that backend. The change is required for the cuda backend but should have no performance effect on other backends.
Fixes #6583.
Signed-off-by: JackAKirk <[email protected]>