
[SYCL][CUDA] Add group algorithms #2647


Merged
merged 8 commits into intel:sycl from cuda-group-algorithms on Oct 19, 2020

Conversation

Pennycook
Contributor

Adds support for the following SPIR-V instructions to libclc:

  • OpGroupAll, OpGroupAny
  • OpGroupBroadcast
  • OpGroupIAdd, OpGroupFAdd
  • OpGroupFMin, OpGroupUMin, OpGroupSMin
  • OpGroupFMax, OpGroupUMax, OpGroupSMax

At sub-group scope, these operations employ shuffles and other warp
instructions.

At work-group scope, partial results from each sub-group are combined
via shared memory.
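
For readers unfamiliar with this pattern, here is a minimal CUDA sketch of the two-level scheme described above, using an integer sum reduction as an example. It is only an illustration of the technique; the function name and sizes below are not taken from the libclc sources in this PR.

```cuda
// Illustrative two-level reduction: shuffles within each warp (sub-group),
// then shared memory to combine one partial result per warp (work-group).
__device__ int block_reduce_add(int x) {
  // Sub-group scope: tree reduction via warp shuffles.
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    x += __shfl_down_sync(0xffffffff, x, offset);

  // Work-group scope: lane 0 of each warp publishes its partial result.
  __shared__ int partial[32];              // worst case: 1024 / 32 = 32 warps
  const int lane = threadIdx.x % warpSize;
  const int warp = threadIdx.x / warpSize;
  if (lane == 0)
    partial[warp] = x;
  __syncthreads();

  // The first warp combines the partial results.
  const int nwarps = (blockDim.x + warpSize - 1) / warpSize;
  x = (threadIdx.x < nwarps) ? partial[lane] : 0;
  if (warp == 0)
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
      x += __shfl_down_sync(0xffffffff, x, offset);
  return x;  // lane 0 of warp 0 holds the work-group sum
}
```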

The current implementation reserves 512 bytes of shared memory for any kernel
using a group algorithm, which is sufficient to cover the worst case.
Determining the correct amount of shared memory to reserve for a specific
kernel will likely require a dedicated compiler pass.
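
One plausible accounting for the 512-byte worst case, stated here purely as an editorial assumption rather than something asserted by the PR:

```cuda
// Back-of-envelope (assumption, not from the PR): a CUDA work-group has at
// most 1024 threads and the warp size is 32, so there are at most
// 1024 / 32 = 32 sub-groups; 512 B / 32 = 16 B of scratch per sub-group,
// which comfortably covers 8-byte long/double partial results.
__device__ void reserve_group_scratch_example() {
  __shared__ char group_scratch[32 * 16];  // 512 bytes
  (void)group_scratch;
}
```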

Signed-off-by: John Pennycook <[email protected]>

Moves isSupportedDevice into support.h and adds a check for CUDA.
Increases work-group size for some tests to ensure more than one warp.

Signed-off-by: John Pennycook <[email protected]>
Reductions previously failed only because group algorithm support was missing.

Signed-off-by: John Pennycook <[email protected]>
Requires additional mangled entry points:
- OpenCL mangles "half" to "Dh"
- SYCL mangles "half" to "DF16_"

Signed-off-by: John Pennycook <[email protected]>
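
To illustrate why this only needs extra entry points rather than new logic, here is a purely hypothetical sketch: the half-precision body is exposed under two differently-mangled symbols so that callers using either mangling convention link against the same implementation. The names below are invented for illustration and are not the symbols added by this PR.

```cuda
#include <cuda_fp16.h>   // for __half

// Hypothetical sketch: one half-precision implementation, two entry points.
__device__ __half broadcast_half_impl(__half x, unsigned id);  // shared body

// Entry point for callers using the OpenCL-style mangling of "half".
extern "C" __device__ __half opencl_mangled_entry(__half x, unsigned id) {
  return broadcast_half_impl(x, id);
}

// Entry point for callers using the SYCL-style mangling ("DF16_").
extern "C" __device__ __half sycl_mangled_entry(__half x, unsigned id) {
  return broadcast_half_impl(x, id);
}
```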
@Pennycook Pennycook added the enhancement, spec extension, cuda, and libclc labels on Oct 15, 2020
@Pennycook Pennycook requested a review from v-klochkov October 15, 2020 21:39
@Pennycook Pennycook requested review from bader and a team as code owners October 15, 2020 21:39
@Pennycook
Contributor Author

Apologies to reviewers for the giant merge request. I've split things out into multiple commits to allow the libclc changes to be reviewed separately from the changes to the tests, but there wasn't an obvious way to further subdivide the libclc changes.

All of the SPIR-V functionality is defined in terms of one mega instruction: whether it computes a reduction, inclusive scan or exclusive scan is controlled by one parameter; and whether the operation is performed at sub-group or work-group scope is another parameter. The DPC++ implementation of scan depends on broadcast, and so I ended up having to implement everything at once.
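
For context, a rough sketch of the parameterization being described (the declaration below is illustrative and the actual DPC++/libclc builtin signatures differ, but the operand values follow the SPIR-V specification):

```cuda
// One SPIR-V group instruction covers all six combinations via two operands:
//   Scope:          Workgroup = 2, Subgroup = 3
//   GroupOperation: Reduce = 0, InclusiveScan = 1, ExclusiveScan = 2
__device__ int __spirv_GroupIAdd(int scope, int operation, int x);  // sketch

__device__ void group_iadd_examples(int x) {
  int wg_sum   = __spirv_GroupIAdd(/*Workgroup*/ 2, /*Reduce*/ 0, x);
  int sg_iscan = __spirv_GroupIAdd(/*Subgroup*/ 3, /*InclusiveScan*/ 1, x);
  int sg_escan = __spirv_GroupIAdd(/*Subgroup*/ 3, /*ExclusiveScan*/ 2, x);
  (void)wg_sum; (void)sg_iscan; (void)sg_escan;
}
```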

@jbrodman
Contributor

Very nice!

Are there any corner cases that don't work? Do we need to document what is/is not expected to work?

@Pennycook
Contributor Author

Are there any corner cases that don't work?

Not intentionally. It's passing all the tests, and I tried a few extra cases as a sanity check (e.g. non-power-of-2 sub-groups, weird data types) and things seemed to work. I haven't done exhaustive testing, though, so there might still be some bugs.

Do we need to document what is/is not expected to work?

Yeah. I guess we should add "CUDA" to the appropriate rows in https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions#extensions?

Contributor

@bader bader left a comment


Do we need to document what is/is not expected to work?

Yeah. I guess we should add "CUDA" to the appropriate rows in https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions#extensions?

Let's update the documentation in this PR.

@Pennycook Pennycook requested a review from a team as a code owner October 16, 2020 14:51
bader
bader previously approved these changes Oct 16, 2020
Contributor

@bader bader left a comment


libclc part looks good to me. Thanks!
@Naghasan, FYI.

Satisfies clang-format and keeps the RUNx command on one line.

Signed-off-by: John Pennycook <[email protected]>
@v-klochkov v-klochkov merged commit 909459b into intel:sycl Oct 19, 2020
@Pennycook Pennycook deleted the cuda-group-algorithms branch October 19, 2020 20:04