[SYCL][CUDA][libclc] Add asynchronous barrier #5303
Conversation
This looks like a useful extension. Most of my comments are about making it more obvious that it's CUDA-specific to help set user expectations.
Split barriers are useful, and we (Intel) are interested in exposing them for other hardware. CM already exposes some support for split barriers on Intel GPUs via its cm_sbarrier function; we should think about what a portable split barrier interface would look like.
I did not find anything about that. Can you point me to either the extension doc or the implementation?
Currently this extension exposes all the functionality from CUDA. However, I am not sure we need to require all of this functionality from all implementations. For example, I would not count the no_complete variants of the arrive operation or copy_async_arrive among the most essential parts of the barrier. Getting some input from whoever designed …dependent forward progress
I can't find a good resource, either. This isn't a SYCL extension, but you might be able to glean some details from the CM compiler: https://github.com/intel/cm-compiler/search?q=sbarrier
Personally, I think it would be cleaner to have two extensions to support this case: one CUDA-backend-specific extension that gives access to capabilities only supported by CUDA (but that may be useful to those writing SYCL and tuning some sections of their code for NVIDIA devices); and another, more general extension that exposes split barriers in a portable way. Providing a class that works everywhere but has member functions that only work on some devices has the potential to be confusing.
namespace ext {
namespace oneapi {

class barrier {
Does this really need to have that "rich" interface compared to std::barrier, which it draws similarity with?
I am not sure what you mean by "rich interface".
I mean that std::barrier is happy with just arrive/wait/drop, and no token. Can we simplify this API? It will likely increase the chances that other backends would be able to support it too.
While removing this token would make the interface more similar to std::barrier, I disagree that doing so would make it easier to implement for other backends. If other backends do not need this token, they can pass around a dummy value. Meanwhile, implementing this without the token for CUDA would lead to a more complicated implementation and most likely additional limitations, such as only one barrier being usable at a time.
> If other backends do not need this token, they can pass around a dummy value.

But the token is used as an input in the core wait API; it can't be a dummy.
> Meanwhile, implementing this without the token for CUDA would lead to a more complicated implementation and most likely additional limitations, such as only one barrier being usable at a time.

The CUDA implementation can use a "token" under the hood, of course.
I don't have any practical suggestions, just expressing a desire for a simpler and more standard interface for the feature. I will rely on @Pennycook's review/approval for the feature definition and then review the implementation.
> But the token is used as an input in the core wait API; it can't be a dummy.
I guess I was not clear enough. I meant that an implementation that does not need the token can return a dummy value from arrive and ignore that token in the wait call.
> The CUDA implementation can use a "token" under the hood, of course.
I have a feeling that hiding the token under the hood would, depending on how it is implemented, either limit the functionality or be inefficient or complicated, although I can't provide any concrete arguments off the top of my head. I need to think about this some more.
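For context on the token question debated above, here is a minimal host-side sketch (an illustration only, not the extension's implementation; plain C++20 atomics, nothing CUDA-specific) of what an arrival token typically encodes: the phase a thread arrived in, so that wait() knows which phase completion to block on.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative split barrier: each participating thread calls arrive()
// exactly once per phase and may call wait() with the returned token later.
class split_barrier {
public:
  using arrival_token = std::uint64_t;

  explicit split_barrier(std::uint32_t expected) : expected_(expected) {}

  arrival_token arrive() {
    // The token records the phase this arrival belongs to.
    arrival_token token = phase_.load(std::memory_order_acquire);
    if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == expected_) {
      arrived_.store(0, std::memory_order_relaxed);
      phase_.fetch_add(1, std::memory_order_release); // complete this phase
      phase_.notify_all();
    }
    return token;
  }

  void wait(arrival_token token) {
    // Block until the phase recorded in the token has completed.
    auto current = phase_.load(std::memory_order_acquire);
    while (current == token) {
      phase_.wait(current);
      current = phase_.load(std::memory_order_acquire);
    }
  }

  void arrive_and_wait() { wait(arrive()); }

private:
  const std::uint32_t expected_;
  std::atomic<std::uint32_t> arrived_{0};
  std::atomic<std::uint64_t> phase_{0};
};
```

A backend whose hardware barrier does not need the phase information can still satisfy this interface by returning any value from arrive() and ignoring it in wait(), which is the "dummy token" fallback mentioned above.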
Co-authored-by: John Pennycook <[email protected]>
The SYCL 2020 naming conventions for extensions encourage "sycl_ext_<vendorstring>_<featurename>", where the vendor string is used to avoid collisions. Our vendor strings are "oneapi" and "intel". "ext_oneapi_cuda_" is supposed to convey that it's a oneAPI extension, but specific to the CUDA backend.
I guess now this has all the approvals needed for merge?
I believe we need @pvchupin's approval as well.
@gmlueck, can you review/approve please?
> @gmlueck, can you review/approve please?
@Pennycook has been taking the lead on reviewing the spec for this, and I trust his judgement. I happened to notice a small mistake when I read through, though. I'm happy once that is fixed.
sycl/doc/extensions/experimental/sycl_ext_oneapi_cuda_async_barrier.asciidoc
9478857
@pvchupin @smaslov-intel @Pennycook The last change in this PR was minimal and this was previously approved, so can we get this approved and merged?
@t4c1, please fix the post-commit issues: https://github.com/intel/llvm/runs/6478611698
Fixes warnings introduced in #5303. Co-authored-by: Artur Gainullin <[email protected]>
@t4c1, please also check the LIT failure on Windows: https://github.com/intel/llvm/runs/6478612314?check_suite_focus=true
Fixed an async barrier definition clash with the max macro defined in some headers on Windows by temporarily undefining the macro. The issue was reported here: #5303 (comment)
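For reference, the usual shape of that workaround (a sketch of the technique described in the commit message above, not the exact code in the header) is to save, undefine, and later restore the conflicting macro:

```cpp
// On Windows, <windows.h> without NOMINMAX defines a function-style `max`
// macro that clashes with any declaration using `max` as an identifier.
// Save the macro, remove it around the affected declarations, then restore it.
#pragma push_macro("max")
#undef max

// ... declarations that use `max` as an ordinary identifier go here ...

#pragma pop_macro("max")
```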
Adds the extension proposal and implementation for the asynchronous barrier (for now the implementation is for the CUDA backend, sm_80+ only). Tests for this are here: intel/llvm-test-suite#737
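For readers of this thread, a hypothetical device-side usage sketch follows. The namespace, class name, and member names (initialize, arrive, wait) follow the draft interface quoted in this conversation and may not match the merged spec exactly; it also assumes a CUDA device with sm_80 or newer, as stated above.

```cpp
#include <sycl/sycl.hpp>

namespace async = sycl::ext::oneapi; // assumption: draft extension namespace

void group_split_barrier_example(sycl::queue &q) {
  q.submit([&](sycl::handler &cgh) {
    // One barrier object shared by the whole work-group, placed in local memory.
    sycl::local_accessor<async::barrier, 1> bar(sycl::range<1>(1), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(128), sycl::range<1>(128)),
        [=](sycl::nd_item<1> it) {
          if (it.get_local_linear_id() == 0)
            bar[0].initialize(it.get_local_range(0)); // expected arrival count
          sycl::group_barrier(it.get_group()); // make initialization visible

          // ... write this work-item's contribution to local memory ...
          auto token = bar[0].arrive(); // signal arrival, keep executing
          // ... independent work that does not need the other contributions ...
          bar[0].wait(token); // block until every work-item has arrived

          // ... safely read the data produced by the whole group ...
        });
  });
}
```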