Skip to content

[SYCL] Added ext_oneapi_non_uniform_groups aspect #10902

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Sep 26, 2023

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Aug 21, 2023

The errors that you get if you try to e.g. create a ballot_group in e.g. HIP backend are not instructive. This PR introduces ext_oneapi_non_uniform_groups and uses the [[__sycl_detail__::__uses_aspects__(aspect::foo)]] annotations detailed in:

https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OptionalDeviceFeatures.md#changes-to-dpc-headers

to give a clear runtime error in the case that a programmer tries to execute a kernel that uses non-uniform groups on a device that does not supported the non-uniform group extension.

for Non uniform groups.

Signed-off-by: Jack Kirk <[email protected]>
@JackAKirk JackAKirk requested a review from a team as a code owner August 21, 2023 12:56
Jack Kirk added 2 commits August 21, 2023 09:00
Signed-off-by: Jack Kirk <[email protected]>
Signed-off-by: Jack Kirk <[email protected]`>
@@ -150,6 +151,10 @@ get_ballot_group(Group group, bool predicate) {
} else {
return ballot_group<sycl::sub_group>(~mask, predicate);
}
#else
static_assert(false,
"ballot_group is not currently supported on this platform.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a static assertion our best option here? I worry it may be a little to harsh on users that don't intend to run such kernels on other platforms.

@Pennycook - What are your thoughts here? My thoughts are that ideally if we want to restrict the platforms of these, we would want to have an aspect for it and mark related features as requiring it so we can report failure at runtime instead if the user tried to launch a kernel with these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @steffenlarsen. Furthermore, won't these asserts trigger any time the functions are compiled, regardless of whether they're actually used? I don't think we want to be in a position where we can't compile for a specific backend just because there's a usage of one of these non-uniform groups somewhere in a header.

I think something like an aspect, optional feature, or other device queries makes sense here. We didn't include anything like this in the initial proposal because it wasn't clear what should and shouldn't be allowed... For example, should a device be allowed to support only certain fixed_size_group sizes, or certain scopes? I think we need to get more implementation experience -- across more than just SPIR-V and NVPTX -- before we can properly answer those questions.

I'm not opposed to adding some broad aspects/queries for whether the groups are supported at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can add the aspect then tell programmers it is their responsibility to use the aspect as a check in code that is using non uniform groups?

RE implementations on other backends:

For HIP AMD I'm not sure that that any of these groups can be fully implemented on existing hardware. The issue is that amd cards don't have independent forward progress (as it is defined here https://developer.nvidia.com/blog/inside-volta/), and they don't support __syncwarp(mask);

HIP does support some subsets of these group features, (ballot,any, or taking a mask), such that I think it would be possible to implement basically all of ballot_group and fixed_size_group, minus barrier. Although due to the lack of independent forward progress guarantees I guess could mean that even then certain code might not be portable in the sense it doesn't hang etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can add the aspect then tell programmers it is their responsibility to use the aspect as a check in code that is using non uniform groups?

Exactly.

For HIP AMD I'm not sure that that any of these groups can be fully implemented on existing hardware. The issue is that amd cards don't have independent forward progress (as it is defined here https://developer.nvidia.com/blog/inside-volta/), and they don't support __syncwarp(mask);

Intel GPUs don't support "independent forward progress" either, and the extension doesn't require it. Working around the lack of __syncwarp(mask) probably requires some assumptions and/or compiler smarts... since we only need to be able to synchronize the work-items in the active branch, it might be sufficient to run the equivalent of an unmasked __syncwarp, or to do nothing at all. For SPIR-V targets we currently just issue a memory barrier, which seems to work (see here).

I'm still not comfortable with this and I think we need more test-cases to prove definitively whether things are working as expected. But that's why the extension is experimental. 😄

Although due to the lack of independent forward progress guarantees I guess could mean that even then certain code might not be portable in the sense it doesn't hang etc.

I agree, but this is always going to be true. I don't think the presence of non-uniform groups makes this any worse than it is already. Any code that makes assumptions about forward progress will be non-portable, and that's why we're defining the sycl_ext_oneapi_forward_progress extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can add the aspect then tell programmers it is their responsibility to use the aspect as a check in code that is using non uniform groups?

Exactly.

For HIP AMD I'm not sure that that any of these groups can be fully implemented on existing hardware. The issue is that amd cards don't have independent forward progress (as it is defined here https://developer.nvidia.com/blog/inside-volta/), and they don't support __syncwarp(mask);

Intel GPUs don't support "independent forward progress" either, and the extension doesn't require it. Working around the lack of __syncwarp(mask) probably requires some assumptions and/or compiler smarts... since we only need to be able to synchronize the work-items in the active branch, it might be sufficient to run the equivalent of an unmasked __syncwarp, or to do nothing at all. For SPIR-V targets we currently just issue a memory barrier, which seems to work (see here).

I'm still not comfortable with this and I think we need more test-cases to prove definitively whether things are working as expected. But that's why the extension is experimental. 😄

I don't know for sure how amd devices would behave - we'd just have to experiment and see what happens. I guess that we don't want to support e.g. ballot_group partially in a backend if it can be helped. There is also the point that amd could start supporting these features in the not too distant future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to defer to @gmlueck on this one, because I'm not sure how optional features and aspects actually work in the compiler.

Are you asking me whether all the APIs in "sycl_ext_oneapi_non_uniform_groups" should be "optional kernel features" that are tied to some aspect? Obviously, the extension is easier to use if we can guarantee that it is available for all devices. Do we think that it can be implemented for all devices, and we just haven't completed the implementation on certain backends? Or, do we think these APIs can never be implemented on certain devices?

If it's the former, it would be better not to burden application developers by requiring them to check an aspect. Triggering a static-assert might be reasonable as a short-term solution if we can add the missing support soon.

If it's the later, then we should change the specification to add the aspect and document the APIs as optional device features.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure, and I was hoping we could get more implementation experience before having to answer this question. But my guess is that there will be certain devices that cannot implement all of the functionality in the specification as it's written today (or that there might be certain compilers and/or library-only implementations that choose not to do so).

How hard is it for us to add new optional device features? I ask because it's still not clear to me if we would want one aspect (i.e. for non-uniform groups) or multiple aspects (e.g. for each type of non-uniform group, or each combination of non-uniform group and scope). I suppose what I'm asking is: If we started with one aspect today, how angry would people be if we need to change that later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this extension is still experimental, we can make API breaking changes in the future. For example, the extension could require all devices to support the APIs now, and we can still add an aspect later if necessary. If we decide that DPC++ can support these APIs on all devices, the extension can expose the APIs as "required device features". When it comes time to adopt them into SYCL-Next, we can still adopt them as "optional features" if other vendors think they will be hard to support.

That said, it is not too hard to add support for an optional feature / aspect. There are two main parts to the implementation:

  • The header file needs to add [[__sycl_detail__::__uses_aspects__(aspect::foo)]] to either the declaration of a function or to the definition of a type as described in the design.

  • We need some backend-specific code that can query a device and decide whether the device supports the aspect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JackAKirk Given Greg's response above, would something like an aspect::non_uniform_groups work for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it sounds sensible. I can try adding [[__sycl_detail__::__uses_aspects__(aspect::foo)]] to the header and add the aspect and then check the behavior.

@JackAKirk JackAKirk requested a review from a team as a code owner August 30, 2023 17:26
Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest changes with aspects look good to me. Thanks for doing the extra work here, @JackAKirk!

@JackAKirk
Copy link
Contributor Author

The latest changes with aspects look good to me. Thanks for doing the extra work here, @JackAKirk!

Thanks, give me a minute, I have a few questions!

Signed-off-by: Jack Kirk <[email protected]>
@JackAKirk
Copy link
Contributor Author

JackAKirk commented Aug 30, 2023

The latest changes with aspects look good to me. Thanks for doing the extra work here, @JackAKirk!

So it seems like this statement in the doc is false?

"This extension also depends on the following other SYCL extensions:

Nothing introduced in the extension document introduces anything new to root_group I think? I think that this extension document should be updated to just remove the above, so there is not confusion that the part I've added applies also to root_group: that it requires the new aspect.

The other issue is that I have done something unusual for this aspect. I did not call PI/UR to get the aspect, because this aspect is something specific to the oneapi model rather than the adapters; and also I did not add a device info corresponding to the aspect, since atm device infos currently always map to a UR device info (except for the sycl-fusion aspect which defines the device info in experimental namespace in a different way), so it wouldn't I think be immediately possible to do this (But I don't see that it is required).
I just want to check that the above is OK. It is an implementation detail as far as sycl runtime is concerned and a corresponding aspect could be added to UR in the future if desired. But if people want the new UR device query corresponding to this aspect then I will add it in the proper way via the UR repo. There is an existing case where there is an aspect without a corresponding UR aspect (sycl-fusion), so what I'm doing here isn't completely new.

@@ -546,6 +546,11 @@ bool device_impl::has(aspect Aspect) const {
sizeof(pi_bool), &support, nullptr) == PI_SUCCESS;
return call_successful && support;
}
case aspect::ext_oneapi_non_uniform_groups: {
return (this->getBackend() == backend::ext_oneapi_level_zero) ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pennycook I also wanted to check that this correctly covers all the SPIRV cases: Are

(this->getBackend() == backend::ext_oneapi_level_zero) ||
           (this->getBackend() == backend::opencl)

The only two backends supporting the spirv non uniform groups?

Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are the only two that we've tested, at least. I'm happy to leave this as it is for now -- if we can prove to ourselves later that things should run elsewhere, we can update the aspect.

@Pennycook
Copy link
Contributor

Nothing introduced in the extension document introduces anything new to root_group I think? I think that this extension document should be updated to just remove the above, so there is not confusion that the part I've added applies also to root_group: that it requires the new aspect.

We should clarify this. The dependency is there because this extension defines is_fixed_topology_group for root_group (if it exists). But there isn't any functionality here that directly depends on the root-group extension.

I think we should remove the link from the dependency section, and move it down to the "Group Taxonomy" section, where there's already a reference to the extension that says "root_group (if sycl_ext_oneapi_root_group is supported)".

There is an existing case where there is an aspect without a corresponding UR aspect (sycl-fusion), so what I'm doing here isn't completely new.

I have no preference here, and defer to @gmlueck.

Jack Kirk added 2 commits August 31, 2023 05:30
This aims to make it clearer that root_group is not part of the
extension.

Signed-off-by: Jack Kirk <[email protected]>
Signed-off-by: Jack Kirk <[email protected]>
@JackAKirk
Copy link
Contributor Author

Nothing introduced in the extension document introduces anything new to root_group I think? I think that this extension document should be updated to just remove the above, so there is not confusion that the part I've added applies also to root_group: that it requires the new aspect.

We should clarify this. The dependency is there because this extension defines is_fixed_topology_group for root_group (if it exists). But there isn't any functionality here that directly depends on the root-group extension.

I think we should remove the link from the dependency section, and move it down to the "Group Taxonomy" section, where there's already a reference to the extension that says "root_group (if sycl_ext_oneapi_root_group is supported)".

OK how are the changes I just made?
Should I move this doc to experimental from proposed?
A final question is on the name of the aspect extension: I removed the prepended sycl_ that is in the extension name from ext_... Is this correct? There seems to be some variation in existing aspects: some prepend with sycl_, but most are just ext_

@JackAKirk JackAKirk changed the title [SYCL] Added asserts to non-supporting platforms for Non-uniform groups [SYCL] Added ext_oneapi_non_uniform_groups aspect Aug 31, 2023
@Pennycook
Copy link
Contributor

OK how are the changes I just made?

Changes look great, thanks.

Should I move this doc to experimental from proposed?

You mean from proposed to experimental?

I think if we want to move this to experimental, we need to add a "Backend support status" section (see the template). This should say 1) that developers need to check the aspect before using the feature; and 2) that the implementation for the CUDA backend is currently incomplete (because it doesn't support tangle_group yet).

Alternatively, we could leave this in proposed until we have tangle_group implemented on the CUDA backend.

A final question is on the name of the aspect extension: I removed the prepended sycl_ that is in the extension name from ext_... Is this correct? There seems to be some variation in existing aspects: some prepend with sycl_, but most are just ext_

I missed this. I think ext_ is right, based on my reading of this part of the SYCL specification:

An extension may add new members or member functions to existing SYCL classes or new values to existing SYCL enumeration types. To ensure these extensions do not collide, vendors are encouraged to name them with the prefix ext_<vendorstring>_.

@JackAKirk JackAKirk requested a review from a team as a code owner August 31, 2023 15:15
Copy link
Contributor

@AlexeySachkov AlexeySachkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeviceConfigFile.td changes LGTM. @maarquitos14: FYI

Jack Kirk added 2 commits August 31, 2023 11:44
@maarquitos14
Copy link
Contributor

DeviceConfigFile.td changes look good to me too.

@Pennycook
Copy link
Contributor

Alternatively, we could leave this in proposed until we have tangle_group implemented on the CUDA backend

@JackAKirk: A random thought just occurred to me... Do you think it would be possible to configure some/all SYCL kernels such that they're built with the Pascal scheduling model? NVIDIA's documentation shows how to do this with their toolchain here.

I believe that if we constrain things to the Pascal scheduling model, your current implementation of opportunistic_group could be reused for tangle_group. We could even say that compiling this way is currently required to support the NVIDIA backend, while we continue exploring alternative implementation approaches.

What do you think?

@JackAKirk
Copy link
Contributor Author

JackAKirk commented Sep 1, 2023

Alternatively, we could leave this in proposed until we have tangle_group implemented on the CUDA backend

@JackAKirk: A random thought just occurred to me... Do you think it would be possible to configure some/all SYCL kernels such that they're built with the Pascal scheduling model? NVIDIA's documentation shows how to do this with their toolchain here.

I believe that if we constrain things to the Pascal scheduling model, your current implementation of opportunistic_group could be reused for tangle_group. We could even say that compiling this way is currently required to support the NVIDIA backend, while we continue exploring alternative implementation approaches.

What do you think?

This sounds like an interesting idea that would maybe for tangle_group. I've actually not seen nvcc -code=sm_70 ... before. I know that if you do nvcc -arch=compute_60 ... without the -code it means that sm_60 code is generated by ptx.

I think it would require some testing to see what the behavior is when we get round to doing this.
I did already make a ticket to investigate implementing tangle_group, but it has been on our backlog for a while.

@JackAKirk JackAKirk requested a review from gmlueck September 5, 2023 15:23
@@ -30,7 +31,13 @@ template <typename Group>
inline std::enable_if_t<sycl::is_group_v<std::decay_t<Group>> &&
std::is_same_v<Group, sycl::sub_group>,
ballot_group<Group>>
#ifdef __SYCL_DEVICE_ONLY__
get_ballot_group [[__sycl_detail__::__uses_aspects__(
sycl::aspect::ext_oneapi_non_uniform_groups)]] (Group group,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right location to place an attribute for a function declaration? I thought the attribute went before the return type.

Copy link
Contributor Author

@JackAKirk JackAKirk Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the docs suggest this, but if I put it before the return type : i.e before std::enable_if_t.. I get:

error: '__uses_aspects__' attribute cannot be applied to types
   33 | [[__sycl_detail__::__uses_aspects__(

If I put it before the inline too then it works, but I can't have any variation like this that doesn't then give a warning when building the compiler like:

warning: extra tokens at end of #endif directive

I have tried putting it in a bunch of places, and the only way I can compile and run without any warnings/errors and get it working correctly is like how I've done it in the PR!

I could make it superficially different so it also avoids the #else statement but it looks a mess, so I decided this way was best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to cppreference, the attribute is allowed in either place: before the inline or directly after the function name. So, I think what you have is fine.

(https://en.cppreference.com/w/cpp/language/attributes)

FWIW, I presume you could do this if you wanted to place the attribute before the inline:

template <typename Group>
#ifdef __SYCL_DEVICE_ONLY__
[[__sycl_detail__::__uses_aspects__(sycl::aspect::ext_oneapi_non_uniform_groups)]]
#endif
inline std::enable_if_t<sycl::is_group_v<std::decay_t<Group>> &&
                            std::is_same_v<Group, sycl::sub_group>,
                        ballot_group<Group>>
get_ballot_group(Group group, bool predicate);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right this does work, I thought I tried it and it didn't. It is better like that. I'll change it.

Jack Kirk and others added 5 commits September 8, 2023 08:19
Signed-off-by: JackAKirk <[email protected]>
@JackAKirk JackAKirk force-pushed the non-uniform-hip-errors branch from 51d80fc to 2be3514 Compare September 21, 2023 14:53
@JackAKirk JackAKirk temporarily deployed to WindowsCILock September 21, 2023 17:53 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to WindowsCILock September 21, 2023 19:24 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to WindowsCILock September 22, 2023 19:57 — with GitHub Actions Inactive
@JackAKirk JackAKirk temporarily deployed to WindowsCILock September 22, 2023 21:34 — with GitHub Actions Inactive
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants