[SYCL] UR_KERNEL_SUB_GROUP_INFO_SUB_GROUP_SIZE_INTEL on Cuda and HIP #17137
Conversation
Fixes: #14357
UR LGTM
```cpp
  return ReturnValue(0);
  // The only supported value of required sub-group size for CUDA devices is
  // 32.
  return ReturnValue(32);
```
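The two paths being discussed (hard-coded CUDA value versus metadata lookup for HIP) could be sketched roughly like this. This is a hypothetical helper, not the actual adapter code; the function name and map type are assumptions for illustration:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sketch: HIP reads the per-kernel required sub-group size from
// the program's intel_reqd_sub_group_size metadata, while CUDA hard-codes 32,
// the only value the compiler ever emits for that backend.
using SubGroupSizeMD = std::unordered_map<std::string, std::uint32_t>;

std::uint32_t querySubGroupSizeIntel(bool IsCuda, const SubGroupSizeMD &MD,
                                     const std::string &KernelName) {
  if (IsCuda)
    return 32; // CUDA: enforced to 32 at compile time.
  auto It = MD.find(KernelName); // HIP: 32 or 64, taken from metadata.
  return It != MD.end() ? It->second : 0; // 0 when no attribute was used.
}
```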
If I'm understanding correctly, the CUDA binaries will also have the same metadata as HIP, just we're choosing to hard-code 32?
Yes, you are right, decorating a kernel with `sycl::reqd_sub_group_size(SG_SIZE)` results in it having `!intel_reqd_sub_group_size !14` attached. For Cuda, the value of that node can only ever be `32`.
I'm in two minds about hard coding that value, but copy-pasting the code that handles the metadata in question, when it can only ever generate `32`, doesn't seem great either.
Yeah... I lean ever so slightly towards not hard coding it, but it's a toss-up. If we could more programmatically share the `32` between the compiler and UR then that would also be an option. But I'm not a huge fan of individually held assumptions.
I think what would help would be sharing more code between HIP and CUDA. Right now they're often just copy/pastes of each other. There's nothing CUDA or HIP-specific about fetching and interpreting metadata from the program, for instance. And I suppose another argument for not hard-coding it would be that it makes any eventual refactor for code sharing more trivial as the copy/paste becomes obvious: it'd be in the identical metadata code, not in the value retrieval code.
Anyway, these are ultimately just ideas; I don't know if it's worth it.
Yeah, a GPU adapter that would abstract away the duplication would be nice. You make a good point about providing the implementation to make a future refactor easier. Will extend the CUDA path as well.
```diff
@@ -39,6 +39,7 @@ struct ur_program_handle_t_ {
   std::unordered_map<std::string, std::string> GlobalIDMD;
   std::unordered_map<std::string, std::tuple<uint32_t, uint32_t, uint32_t>>
       KernelReqdWorkGroupSizeMD;
+  std::unordered_map<std::string, uint32_t> KernelReqdSubGroupSizeMD;
```
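Populating such a per-kernel map requires stripping the metadata tag off the entry name first. The sketch below assumes metadata names take the form `<kernel-name>@intel_reqd_sub_group_size` (the exact tag format is an assumption here, and `splitMetadataName` is a hypothetical helper):

```cpp
#include <optional>
#include <string>
#include <utility>

// Hypothetical sketch: split a program metadata name at the last '@' so that
// the value can be keyed by kernel name in a map like KernelReqdSubGroupSizeMD.
std::optional<std::pair<std::string, std::string>>
splitMetadataName(const std::string &Name) {
  auto Pos = Name.rfind('@');
  if (Pos == std::string::npos)
    return std::nullopt; // no tag: not kernel-scoped metadata
  return std::make_pair(Name.substr(0, Pos), Name.substr(Pos));
}
```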
Last time I changed the program metadata stuff I was wondering if all these separate maps for each program metadata item are the best idea, in terms of memory usage, cache efficiency, access times, etc. Would `std::unordered_map<std::string, struct KernelMetadata>` bring any benefits, do we think?
An honest answer is that I don't know, and without micro-benchmarking it, it's impossible to say. I'm not too keen on the idea of bundling all possible metadata into one struct and storing that as the per-kernel-name value. Intuitively, I'd say most of the time that struct would be storing `0`-initialised bytes. I'm tempted to leave it as it is.
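For illustration, the single-map alternative under discussion might look roughly like this. It is a hypothetical sketch (field set taken from the maps in the diff above); using `std::optional` members would at least make the "mostly empty" entries explicit rather than zero-initialised:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <tuple>
#include <unordered_map>

// Hypothetical sketch of one consolidated per-kernel metadata entry,
// replacing the separate GlobalIDMD / KernelReqdWorkGroupSizeMD /
// KernelReqdSubGroupSizeMD maps.
struct KernelMetadata {
  std::optional<std::string> GlobalID;
  std::optional<std::tuple<std::uint32_t, std::uint32_t, std::uint32_t>>
      ReqdWorkGroupSize;
  std::optional<std::uint32_t> ReqdSubGroupSize;
};

using KernelMetadataMap = std::unordered_map<std::string, KernelMetadata>;
```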
Lots of food for thought, but it's not essential.
Force-pushed from 301e2c6 to 09a31f0.
Friendly ping: @uditagarwal97 @intel/llvm-reviewers-runtime
Force-pushed from 14aa75c to 5c99321.
For HIP the value of sub group size can either be 32 or 64, it can be retrieved from `intel_reqd_sub_group_size` metadata node. Cuda only supports 32, which is enforced in the compiler, see [SemaSYCL::addIntelReqdSubGroupSizeAttr](https://github.com/intel/llvm/blob/sycl/clang/lib/Sema/SemaSYCLDeclAttr.cpp#L828).
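The compiler-side constraint described here could be sketched as follows. This is a hypothetical helper, not the actual SemaSYCL code; it only mirrors the rule stated above (CUDA accepts only 32, HIP accepts 32 or 64):

```cpp
#include <cstdint>

// Hypothetical sketch of the validity rule enforced around
// SemaSYCL::addIntelReqdSubGroupSizeAttr: CUDA's only legal required
// sub-group size is 32; HIP wavefronts allow 32 or 64.
enum class Backend { Cuda, Hip };

bool isValidReqdSubGroupSize(Backend B, std::uint32_t Size) {
  switch (B) {
  case Backend::Cuda:
    return Size == 32;
  case Backend::Hip:
    return Size == 32 || Size == 64;
  }
  return false;
}
```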
Force-pushed from 5c99321 to 4b4fc14.
Friendly ping @npmiller, @AlexeySachkov.
@AlexeySachkov are you happy with the changes in the patch?
@intel/llvm-gatekeepers I think this should be ready to land.
@jchlanda GitHub UI still says that merge is blocked due to a missing review from @intel/dpcpp-tools-reviewers, so the change to code owners might only take effect for new PRs. @intel/dpcpp-tools-reviewers can we get a quick review on this one?
tools part LGTM.
@intel/llvm-gatekeepers this should be ready to merge now.
CI test results are 2 weeks old. Can you please rebase and re-run the CI?
@maksimsab while you're here, could I kindly ask you to have a look at this patch as well? It's missing a tools review and I didn't have much luck with pinging.
@intel/llvm-gatekeepers the fresh CI run passed, it should be good to merge now.