[SYCL] Use dim_loop to unroll loops in reduce_over_group in cuda backend. #7948


Merged
merged 3 commits into from
Feb 22, 2023

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Jan 6, 2023

A performance regression was reported when using reduce_over_group with sycl::vec.
It was caused by a loop over calls to the scalar reduce_over_group, one per sycl::vec component, that was not unrolled and led to register spills even at -O3.
Initially the performance could be fixed by applying #pragma unroll and declaring reduce_over_group with __attribute__((always_inline)). However, the SYCL_UNROLL macro that invokes #pragma unroll has been removed in favour of dim_loop (#6939).
I have used dim_loop to fix the loop unrolling. However, in the cuda backend, using dim_loop on its own actually makes the performance worse, because dim_loop introduces new non-inlined function calls that lead to register spills. The solution to this coincides with the fix for several user reports that the cuda backend is not aggressive enough with inlining. In this PR I have therefore also increased the inlining threshold multiplier value to 11.

See https://reviews.llvm.org/D142232/new/ for the corresponding upstream PR (for the inlining threshold change), which includes many more details on benchmarking dpc++ cuda with this change. In short, for dpc++ cuda the only downside is a very small increase in compile time in some cases, while there is a massive benefit to increasing the inlining threshold across a large number of applications.

Testing with the opencl cpu backend shows that this code change has no effect there. The change is required for the cuda backend but should have no performance effect on other backends.

fixes #6583.

Signed-off-by: JackAKirk [email protected]

@JackAKirk JackAKirk requested a review from a team as a code owner January 6, 2023 18:04
@JackAKirk JackAKirk requested a review from againull January 6, 2023 18:04
@JackAKirk JackAKirk temporarily deployed to aws January 6, 2023 18:11 — with GitHub Actions Inactive
@JackAKirk JackAKirk marked this pull request as draft January 6, 2023 18:15
@JackAKirk
Copy link
Contributor Author

JackAKirk commented Jan 6, 2023

It seems that dim_loop is favoured over #pragma unroll, which is making the CI build fail here due to warnings. Marked as draft. I will try to get dim_loop to work correctly.

@JackAKirk JackAKirk changed the title [SYCL] Use #pragma unroll and always_inline to stop reg spills in reduce_over_group. [SYCL] Use dim_loop to stop reg spills in reduce_over_group in cuda backend. Feb 3, 2023
@JackAKirk JackAKirk marked this pull request as ready for review February 3, 2023 13:57
@JackAKirk JackAKirk requested a review from a team as a code owner February 3, 2023 13:57
@JackAKirk
Copy link
Contributor Author

> Seems that dim_loop is favoured instead of #pragma unroll, which is making CI build fail here due to warnings. Marked as draft. I will try to get dim_loop to work correctly.

Done. I've updated the PR description accordingly.

@JackAKirk JackAKirk temporarily deployed to aws February 3, 2023 14:18 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to aws February 3, 2023 14:50 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to aws February 6, 2023 16:23 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to aws February 6, 2023 16:54 — with GitHub Actions Inactive
@JackAKirk JackAKirk changed the title [SYCL] Use dim_loop to stop reg spills in reduce_over_group in cuda backend. [SYCL] Use dim_loop to unroll loops in reduce_over_group in cuda backend. Feb 7, 2023
@JackAKirk
Copy link
Contributor Author

ping @intel/dpcpp-tools-reviewers

@JackAKirk JackAKirk closed this Feb 14, 2023
@JackAKirk JackAKirk reopened this Feb 14, 2023
@JackAKirk JackAKirk temporarily deployed to aws February 15, 2023 12:27 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to aws February 15, 2023 21:09 — with GitHub Actions Inactive
@JackAKirk
Copy link
Contributor Author

JackAKirk commented Feb 16, 2023

Fixes #6583. The esimd and opencl CI simply didn't run this time and should be unrelated: they ran before and passed. There was an unrelated hip failure before, but this time the hip CI passed.

@JackAKirk
Copy link
Contributor Author

Fixes #6583.

@JackAKirk
Copy link
Contributor Author

ping @intel/dpcpp-tools-reviewers

@intel/dpcpp-tools-reviewers Would it be possible to get a review?
Thanks

@bader
Copy link
Contributor

bader commented Feb 22, 2023

> fixes #7948.

Please fix the issue reference in the description: #7948 is the id of this PR itself.

NOTE: to link PRs with issues, keywords must be used in the description. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword.

@JackAKirk
Copy link
Contributor Author

> fixes #7948.
>
> Please, fix the issue reference in the description. This is the id of this PR.
>
> NOTE: to link PRs with issues, keywords must be used in the description. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword.

I see, thanks. Done.

Development

Successfully merging this pull request may close these issues.

sycl::double2 type degrades sycl performance on NV GPU with additional generated memory instructions
4 participants