[SYCL][CUDA] Add group algorithms #2647

Pennycook · 2020-10-15T21:39:44Z

Adds support for the following SPIR-V instructions to libclc:

OpGroupAll, OpGroupAny
OpGroupBroadcast
OpGroupIAdd, OpGroupFAdd
OpGroupFMin, OpGroupUMin, OpGroupSMin
OpGroupFMax, OpGroupUMax, OpGroupSMax

At sub-group scope, these operations employ shuffles and other warp
instructions.
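
As an illustration only, the shuffle-based pattern can be modelled on the host. The sketch below is not the libclc code from this PR: `shuffle_down` and `subgroup_reduce_add` are hypothetical names, and the sub-group is represented as a plain vector of lane values. It assumes a tree reduction in which each lane adds the value held by the lane `delta` positions above it, skipping sources outside the sub-group so that non-power-of-2 sizes still work.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a shuffle-down (hypothetical helper): lane i reads the
// value held by lane i + delta. Lanes whose source would fall outside the
// sub-group keep their own value.
std::vector<int> shuffle_down(const std::vector<int>& lanes, std::size_t delta) {
    std::vector<int> out(lanes);
    for (std::size_t i = 0; i + delta < lanes.size(); ++i)
        out[i] = lanes[i + delta];
    return out;
}

// Tree reduction over one sub-group: after the loop, lane 0 holds the sum of
// all lanes. The guard i + delta < n skips out-of-range sources, so the lane
// count does not need to be a power of two.
int subgroup_reduce_add(std::vector<int> lanes) {
    const std::size_t n = lanes.size();
    if (n == 0) return 0;
    std::size_t width = 1;
    while (width < n) width *= 2;  // smallest power of two >= n
    for (std::size_t delta = width / 2; delta > 0; delta /= 2) {
        // Snapshot first, so every lane reads pre-step values.
        const std::vector<int> shifted = shuffle_down(lanes, delta);
        for (std::size_t i = 0; i + delta < n; ++i)
            lanes[i] += shifted[i];
    }
    return lanes[0];
}
```

On a device, each halving step would be a single warp shuffle rather than a copy of the whole vector.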

At work-group scope, partial results from each sub-group are combined
via shared memory.
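
A minimal host-side sketch of the two-phase scheme, under stated assumptions: the sub-group size is fixed at 32 (the CUDA warp size), the `scratch` vector stands in for the reserved shared memory, and `workgroup_reduce_add` is a hypothetical name rather than a libclc entry point.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t kSubGroupSize = 32;  // assumption: one CUDA warp

// Host model of a work-group reduction built from sub-group partials.
int workgroup_reduce_add(const std::vector<int>& items) {
    const std::size_t num_subgroups =
        (items.size() + kSubGroupSize - 1) / kSubGroupSize;
    std::vector<int> scratch(num_subgroups, 0);  // stands in for shared memory

    // Phase 1: each sub-group reduces its own items and writes one partial
    // result to its slot in scratch.
    for (std::size_t sg = 0; sg < num_subgroups; ++sg) {
        const std::size_t begin = sg * kSubGroupSize;
        const std::size_t end = std::min(begin + kSubGroupSize, items.size());
        scratch[sg] = std::accumulate(
            items.begin() + static_cast<std::ptrdiff_t>(begin),
            items.begin() + static_cast<std::ptrdiff_t>(end), 0);
    }

    // A device implementation needs a work-group barrier here so that all
    // partials are visible before they are read back.

    // Phase 2: the partials are combined into the final result.
    return std::accumulate(scratch.begin(), scratch.end(), 0);
}
```

The amount of scratch space scales with the number of sub-groups per work-group and the size of the element type.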

The current implementation reserves 512 bytes of shared memory for any kernel
using a group algorithm, which is sufficient to cover the worst case.
Determining the correct amount of shared memory to reserve for a specific
kernel will likely require a dedicated compiler pass.

Signed-off-by: John Pennycook <[email protected]>

Pennycook · 2020-10-15T21:44:53Z

Apologies to reviewers for the giant merge request. I've split things out into multiple commits to allow the libclc changes to be reviewed separately from the changes to the tests, but there wasn't an obvious way to further subdivide the libclc changes.

All of the SPIR-V functionality is defined in terms of one mega instruction: whether it computes a reduction, inclusive scan or exclusive scan is controlled by one parameter; and whether the operation is performed at sub-group or work-group scope is another parameter. The DPC++ implementation of scan depends on broadcast, and so I ended up having to implement everything at once.
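
To make the parameterisation concrete, here is a rough host-side model of a single entry point whose behaviour is selected by a scope argument and a group-operation argument. The enum values follow the SPIR-V specification; `group_iadd` and its `sub_group_size` parameter are hypothetical and only illustrate the dispatch, not the libclc implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Values follow the SPIR-V Scope and Group Operation enumerations.
enum class Scope { Workgroup = 2, Subgroup = 3 };
enum class GroupOperation { Reduce = 0, InclusiveScan = 1, ExclusiveScan = 2 };

// Host model of an OpGroupIAdd-style instruction (hypothetical helper).
// Returns the value each work-item would observe. Assumes sub_group_size > 0.
std::vector<int> group_iadd(Scope scope, GroupOperation op,
                            const std::vector<int>& x,
                            std::size_t sub_group_size = 32) {
    std::vector<int> result(x.size());
    // The scope decides which work-items cooperate: one sub-group at a time,
    // or the whole work-group.
    const std::size_t span =
        (scope == Scope::Subgroup) ? sub_group_size : x.size();
    for (std::size_t begin = 0; begin < x.size(); begin += span) {
        const std::size_t end = std::min(begin + span, x.size());
        int running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            if (op == GroupOperation::ExclusiveScan) result[i] = running;
            running += x[i];
            if (op == GroupOperation::InclusiveScan) result[i] = running;
        }
        if (op == GroupOperation::Reduce)
            for (std::size_t i = begin; i < end; ++i) result[i] = running;
    }
    return result;
}
```

For example, `group_iadd(Scope::Workgroup, GroupOperation::ExclusiveScan, {1, 2, 3, 4})` yields `{0, 1, 3, 6}` in this model, while the same input with `GroupOperation::Reduce` gives every work-item the total.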

jbrodman · 2020-10-15T21:51:48Z

Very nice!

Are there any corner cases that don't work? Do we need to document what is/is not expected to work?

Pennycook · 2020-10-15T22:09:39Z

> Are there any corner cases that don't work?

Not intentionally. It's passing all the tests, and I tried a few extra cases as a sanity check (e.g. non-power-of-2 sub-groups, weird data types) and things seemed to work. I haven't done exhaustive testing, though, so there might still be some bugs.

> Do we need to document what is/is not expected to work?

Yeah. I guess we should add "CUDA" to the appropriate rows in https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions#extensions?

bader

> > Do we need to document what is/is not expected to work?
>
> Yeah. I guess we should add "CUDA" to the appropriate rows in https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions#extensions?

Let's update the documentation in this PR.

sycl/test/group-algorithm/exclusive_scan.cpp

sycl/test/group-algorithm/inclusive_scan.cpp

sycl/test/group-algorithm/reduce.cpp

libclc/ptx-nvidiacl/libspirv/group/collectives.cl

libclc/ptx-nvidiacl/libspirv/group/collectives_helpers.ll


bader

libclc part looks good to me. Thanks!
@Naghasan, FYI.


Pennycook added 4 commits October 15, 2020 17:28

[SYCL][CUDA] Enable group algorithm tests

7dcf835

Moves isSupportedDevice into support.h and adds check for CUDA.
Increases work-group size for some tests to ensure more than one warp.

Signed-off-by: John Pennycook <[email protected]>

[SYCL][CUDA] Enable reduction tests

701cfee

Reductions only failed previously because of missing group algorithm support.

Signed-off-by: John Pennycook <[email protected]>

[SYCL][CUDA] Add half overloads to libclc

8fbc3eb

Requires additional mangled entry points:
- OpenCL mangles "half" to "h"
- SYCL mangles "half" to "DF16_"

Signed-off-by: John Pennycook <[email protected]>

Pennycook added the enhancement, spec extension, cuda, and libclc labels Oct 15, 2020

Pennycook requested a review from v-klochkov October 15, 2020 21:39

Pennycook requested review from bader and a team as code owners October 15, 2020 21:39

bader reviewed Oct 16, 2020

Pennycook added 3 commits October 16, 2020 10:29

[SYCL][CUDA][Doc] Update extension support docs

d8d2be6

Signed-off-by: John Pennycook <[email protected]>

[SYCL][CUDA] Fix __clc__get_group_scratch_short

3375433

i8 => i16

Signed-off-by: John Pennycook <[email protected]>

[SYCL][CUDA] Replace bit shifts with uint2 cast

3b04495

Signed-off-by: John Pennycook <[email protected]>

Pennycook requested a review from a team as a code owner October 16, 2020 14:51

bader previously approved these changes Oct 16, 2020

[SYCL][CUDA] Add line continuation to comments

a5a6019

Satisfies clang-format and keeps one line RUNx command.

Signed-off-by: John Pennycook <[email protected]>

Pennycook dismissed bader’s stale review via a5a6019 October 16, 2020 15:13

bader approved these changes Oct 18, 2020

v-klochkov approved these changes Oct 19, 2020

jbrodman approved these changes Oct 19, 2020

v-klochkov merged commit 909459b into intel:sycl Oct 19, 2020

Pennycook deleted the cuda-group-algorithms branch October 19, 2020 20:04

Pennycook mentioned this pull request Oct 19, 2020

When can we expect reductions to work with the CUDA backend? #2498

Closed
