-
Notifications
You must be signed in to change notification settings - Fork 788
[SYCL][CUDA] Add group algorithms #2647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adds support for the following SPIR-V instructions to libclc: - OpGroupAll, OpGroupAny - OpGroupBroadcast - OpGroupIAdd, OpGroupFAdd - OpGroupFMin, OpGroupUMin, OpGroupSMin - OpGroupFMax, OpGroupUMax, OpGroupSMax At sub-group scope, these operations employ shuffles and other warp instructions. At work-group scope, partial results from each sub-group are combined via shared memory. The current implementation reserves 512 bytes of shared memory for any kernel using a group algorithm, which is sufficient to cover the worst case. Determining the correct amount of shared memory to reserve for a specific kernel will likely require a dedicated compiler pass. Signed-off-by: John Pennycook <[email protected]>
Moves isSupportedDevice into support.h and adds check for CUDA. Increases work-group size for some tests to ensure more than one warp. Signed-off-by: John Pennycook <[email protected]>
Reductions only failed previously because of missing group algorithm support. Signed-off-by: John Pennycook <[email protected]>
Requires additional mangled entry points: - OpenCL mangles "half" to "h" - SYCL mangles "half" to "DF16_" Signed-off-by: John Pennycook <[email protected]>
Apologies to reviewers for the giant merge request. I've split things out into multiple commits to allow the libclc changes to be reviewed separately from the changes to the tests, but there wasn't an obvious way to further subdivide the libclc changes. All of the SPIR-V functionality is defined in terms of one mega instruction: whether it computes a reduction, inclusive scan or exclusive scan is controlled by one parameter; and whether the operation is performed at sub-group or work-group scope is another parameter. The DPC++ implementation of scan depends on broadcast, and so I ended up having to implement everything at once. |
Very nice! Are there any corner cases that don't work? Do we need to document what is/is not expected to work? |
Not intentionally. It's passing all the tests, and I tried a few extra cases as a sanity check (e.g. non-power-of-2 sub-groups, weird data types) and things seemed to work. I haven't done exhaustive testing, though, so there might still be some bugs.
Yeah. I guess we should add "CUDA" to the appropriate rows in https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions#extensions? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to document what is/is not expected to work?
Yeah. I guess we should add "CUDA" to the appropriate rows in https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions#extensions?
Let's update the documentation in this PR.
Signed-off-by: John Pennycook <[email protected]>
i8 => i16 Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libclc part looks good to me. Thanks!
@Naghasan, FYI.
Satisfies clang-format and keeps one line RUNx command. Signed-off-by: John Pennycook <[email protected]>
Adds support for the following SPIR-V instructions to libclc:
At sub-group scope, these operations employ shuffles and other warp
instructions.
At work-group scope, partial results from each sub-group are combined
via shared memory.
The current implementation reserves 512 bytes of shared memory for any kernel
using a group algorithm, which is sufficient to cover the worst case.
Determining the correct amount of shared memory to reserve for a specific
kernel will likely require a dedicated compiler pass.
Signed-off-by: John Pennycook [email protected]