[SYCL][CUDA][libclc] Add asynchronous barrier #5303


Merged: 33 commits, May 17, 2022

Conversation

@t4c1 t4c1 commented Jan 13, 2022

Adds an extension proposal and implementation for an asynchronous barrier (for now, the implementation targets the CUDA backend, sm_80+ only).

Tests for this are here: intel/llvm-test-suite#737

@t4c1 t4c1 requested review from bader and a team as code owners January 13, 2022 14:18
@t4c1 t4c1 requested a review from smaslov-intel January 13, 2022 14:18
@t4c1 t4c1 changed the title Async barrier [SYCL][CUDA][libclc] Add asynchronous barrier Jan 13, 2022
@Pennycook left a comment

This looks like a useful extension. Most of my comments are about making it more obvious that it's CUDA-specific to help set user expectations.

Split barriers are useful, and we (Intel) are interested in exposing them for other hardware. Cm already exposes some support for split barriers on Intel GPUs via its cm_sbarrier function; we should think about what a portable split barrier interface would look like.

t4c1 commented Jan 20, 2022

Cm already exposes some support for split barriers on Intel GPUs via its cm_sbarrier function;

I did not find anything about that. Can you point me to either extension doc or the implementation?

we should think about what a portable split barrier interface would look like.

Currently, this extension exposes all the functionality from CUDA. However, I am not sure we need to require all of this functionality from all implementations. For example, I would not count the no_complete variants of the arrive operation, or copy_async_arrive, among the most essential parts of the barrier.

Getting some input from whoever designed cm_sbarrier would also be nice.

@Pennycook

I did not find anything about that. Can you point me to either extension doc or the implementation?

I can't find a good resource, either. This isn't a SYCL extension, but you might be able to glean some details from the CM compiler: https://github.com/intel/cm-compiler/search?q=sbarrier

Currently this extension exposes all functionality from CUDA. However I am not sure if we need to require all this functionality from all implementations. For example I would not count no_complete variants of the arrive operation or copy_async_arrive as the most essential parts of the barrier.

Personally, I think it would be cleaner to have two extensions to support this case: one CUDA-backend-specific extension that gives access to capabilities only supported by CUDA (but that may be useful to those writing SYCL and tuning some sections of their code for NVIDIA devices); and another, more general extension that exposes split barriers in a portable way.

Providing a class that works everywhere but has member functions that only work on some devices has the potential to be confusing.

namespace ext {
namespace oneapi {

class barrier {
Contributor

Does this really need to have such a rich interface compared to std::barrier, which it draws similarity with?

Contributor Author

I am not sure what you mean by a "rich" interface.

Contributor

I mean that std::barrier is happy with just arrive/wait/drop, and no token. Can we simplify this API? It would likely improve the chances that other backends would be able to support it too.

Contributor Author

While removing this token would make the interface more similar to std::barrier, I disagree that doing so would make it easier to implement for other backends. If other backends do not need this token, they can pass around a dummy value. Meanwhile, implementing this without the token for CUDA would lead to a more complicated implementation and most likely additional limitations, such as only one barrier being usable at a time.

Contributor

If other backends do not need this token they can pass around a dummy value.

But the token is used as an input to the core wait API; it can't be a dummy.

Meanwhile implementing this without the token for CUDA would lead to more complicated implementation and most likely additional limitations, such as only one barrier being usable at once.

CUDA implementation can use "token" under the hood, of course.

I don't have any practical suggestions, just a desire for a simpler, more standard interface for the feature. I will rely on @Pennycook's review/approval for the feature definition and then review the implementation.

Contributor Author

But the token is used as an input in the core wait API, it can't be dummy.

I guess I was not clear enough. I meant that an implementation that does not need the token can return a dummy value from arrive and ignore that token in the wait call.

CUDA implementation can use "token" under the hood, of course.

I have a feeling that hiding the token under the hood would, depending on how it is implemented, either limit the functionality or be inefficient or complicated. I can't provide any concrete arguments off the top of my head, though; I need to think about this some more.

@Pennycook

I don't quite agree with that, but OK. Should the name of the extension be "ext_cuda_async_barrier" instead of "ext_oneapi_cuda_async_barrier" then?

The SYCL 2020 naming conventions for extensions encourage "sycl_ext_<vendorstring>_<featurename>", where the vendor string is used to avoid collisions. Our vendor strings are "oneapi" and "intel".

"ext_oneapi_cuda_" is supposed to convey that it's a oneAPI extension, but specific to the CUDA backend.

smaslov-intel previously approved these changes Apr 8, 2022
t4c1 commented May 3, 2022

I guess now this has all the approvals needed for merge?

@steffenlarsen

I guess now this has all the approvals needed for merge?

I believe we need @pvchupin's approval as well.

@steffenlarsen steffenlarsen requested a review from pvchupin May 4, 2022 09:30

pvchupin commented May 4, 2022

@gmlueck, can you review/approve please?

pvchupin previously approved these changes May 4, 2022
@gmlueck left a comment

@gmlueck, can you review/approve please?

@Pennycook has been taking the lead on reviewing the spec for this, and I trust his judgement. I happened to notice a small mistake when I read through, though. I'm happy once that is fixed.

@t4c1 t4c1 dismissed stale reviews from pvchupin, smaslov-intel, and Pennycook via 9478857 May 10, 2022 06:59
t4c1 commented May 17, 2022

@pvchupin @smaslov-intel @Pennycook The last change in this PR was minimal and this was previously approved, so can we get this approved and merged?

@pvchupin pvchupin merged commit 6770421 into intel:sycl May 17, 2022
pvchupin pushed a commit to intel/llvm-test-suite that referenced this pull request May 17, 2022
@pvchupin

@t4c1, please fix post commit issues: https://github.com/intel/llvm/runs/6478611698

pvchupin pushed a commit that referenced this pull request May 19, 2022
Fixes warnings introduced in #5303.

Co-authored-by: Artur Gainullin <[email protected]>
@pvchupin

@t4c1, please also check the LIT failure on Windows: https://github.com/intel/llvm/runs/6478612314?check_suite_focus=true
It also fails in the post-commit of this PR.

pvchupin pushed a commit that referenced this pull request May 19, 2022
Fixed an async barrier definition clash with the max macro defined in some headers on Windows by temporarily undefining the macro.

The issue was reported here: #5303 (comment)
yinyangsx pushed a commit to yinyangsx/llvm that referenced this pull request May 25, 2022
Adds extension proposal and implementation for asynchronous barrier (for now the implementation is for CUDA backend sm 80+ only).

Tests for this are here: intel/llvm-test-suite#737
yinyangsx pushed a commit to yinyangsx/llvm that referenced this pull request May 25, 2022
Fixes warnings introduced in intel#5303.

Co-authored-by: Artur Gainullin <[email protected]>
yinyangsx pushed a commit to yinyangsx/llvm that referenced this pull request May 25, 2022
Fixed an async barrier definition clash with the max macro defined in some headers on Windows by temporarily undefining the macro.

The issue was reported here: intel#5303 (comment)
@t4c1 t4c1 deleted the async_barrier branch May 27, 2022 06:52
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023