[SYCL] [ABI-Break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group #14113

Merged
merged 89 commits into intel:sycl on Jul 8, 2024

Conversation

@AD2605 AD2605 commented Jun 10, 2024

This PR is a partial implementation of sycl_ext_oneapi_cuda_cluster_group, introducing the cluster_size property to launch a kernel with CUDA's thread block clusters.

Only a small part of the extension specification described in #13594 is implemented here. To be specific, everything after the section "Launching a kernel with a cluster_group" is not included in this PR. An important point to note is that this PR still represents a complete, functional use case of NVIDIA's CUDA driver cluster-launch feature for its primary purpose, which is to accelerate cross-work-group collective operations (particularly GEMM), leveraging asynchronous multicasting of distributed shared memory across work-groups.
This is a high-priority feature targeted for the next release.
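
For readers unfamiliar with the extension, a rough sketch of what launching a kernel with the cluster_size property might look like from the user side, using the names from the proposal in #13594; the exact spellings of the namespace, property, and parallel_for overload are assumptions here, not the final API:

#include <sycl/sycl.hpp>

namespace syclexp = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q{sycl::gpu_selector_v};

  // Hypothetical spelling: request clusters of 2x1x1 work-groups
  // (only meaningful on devices that support thread block clusters, sm_90+).
  syclexp::cuda::cluster_size cluster_dims{sycl::range<3>{2, 1, 1}};
  syclexp::properties props{cluster_dims};

  q.parallel_for(
      sycl::nd_range<3>{sycl::range<3>{256, 1, 1}, sycl::range<3>{128, 1, 1}},
      props, [=](sycl::nd_item<3> it) {
        // Work-groups in the same cluster can cooperate via distributed
        // shared memory.
      });
  q.wait();
  return 0;
}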

The other parts of the extension specification described in #13594, primarily the "cluster_group" abstraction, are (user-facing) convenience abstractions that do not need to be exposed in libraries that optimize such collective operations (GEMM). Please therefore focus reviews of this PR on the aspects of the extension that are required for this implementation and its library-based application as described in this message.

@AD2605 AD2605 requested review from a team as code owners June 10, 2024 13:50
@AD2605 AD2605 requested review from hdelan and sergey-semenov June 10, 2024 13:50
Signed-off-by: JackAKirk <[email protected]>

JackAKirk commented Jul 4, 2024

@intel/llvm-gatekeepers

Please merge this asap. I'd be very grateful. Thanks.

This has passed all tests. With 50 changed files, it is a race against time to avoid further conflicts occurring.

This is a top-priority PR that needs to be merged ASAP.


@steffenlarsen steffenlarsen left a comment


Most of my comments are small and could be addressed in a follow-up.

However, I am a little more concerned about the ABI break. Technically speaking we are allowed to make it, although the PR should still make it clear that it is breaking, but changing the handler layout is one of the bigger breaks, and I think we should consider whether we should/need to do it.

In the meantime, it would be good to have some unit tests for this change, if possible.

Comment on lines +309 to +318
if (Dev->getBackend() == backend::ext_oneapi_cuda) {
sycl::detail::pi::PiResult Err =
Dev->getPlugin()->call_nocheck<PiApiKind::piDeviceGetInfo>(
Dev->getHandleRef(),
PiInfoCode<info::device::ext_oneapi_cuda_cluster_group>::value,
sizeof(result), &result, nullptr);
if (Err != PI_SUCCESS) {
return false;
}
}
Contributor


It would be preferable to let the UR adapters tell us this, even though this is a CUDA-specific extension. The check already seems to be written such that a backend returning an error code results in false, so it should just be a matter of removing the backend check.
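
For illustration, a sketch of the backend-agnostic variant being suggested here: the same query as in the snippet above, just without the backend guard, so any adapter that does not recognize the info code simply causes false to be returned.

// Sketch only: same call and error handling as the quoted snippet, minus
// the explicit backend check.
sycl::detail::pi::PiResult Err =
    Dev->getPlugin()->call_nocheck<PiApiKind::piDeviceGetInfo>(
        Dev->getHandleRef(),
        PiInfoCode<info::device::ext_oneapi_cuda_cluster_group>::value,
        sizeof(result), &result, nullptr);
// Backends that do not support the query return a non-success code,
// which reports the aspect as unsupported.
if (Err != PI_SUCCESS) {
  return false;
}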


@JackAKirk JackAKirk Jul 5, 2024


In principle I would agree: notwithstanding the naming of the aspect, ext_oneapi_cuda_cluster_group could in theory return true for backends other than CUDA in the future, given that it is experimental and open to renaming/generalization.

In practice, whilst other vendors may come up with similar features that we could partially amalgamate into a single aspect, I doubt very much that we can avoid having ext_oneapi_cuda_cluster_group represent the CUDA-specific cluster_group, which has various features that are extremely specific to the particular way NVIDIA has chosen to accelerate cross-work-group collectives (in particular GEMM).

Since no other backend is ever expected to support this aspect, I think it is more efficient for the SYCL runtime to return false directly for it.

We could, however, do as you suggest if you really want, and I don't feel very strongly on this point, but with the impending removal of PI, I recommend that we do it in a follow-up after PI is removed.

Contributor


Performance is a good point, but on the other hand it is redundant work when the backend is CUDA, so it really depends on what the hot path is expected to be. The additional overhead should be fairly cheap with either solution though, so I am fine with either, but personally I prefer to keep the runtime as ignorant of the backend as possible.

Contributor


Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.


@JackAKirk JackAKirk Jul 5, 2024


Since you prefer the call to be dealt with entirely in UR, I am happy to change it. Would it be acceptable to do this as follow-on work?
I have been told to concentrate on getting this PR merged so that the feature is in a nightly ASAP (whilst of course ensuring that it is future-proof, considering future requirements, and avoiding major reworking).
In my haste to get something that passed reasonable tests and to encourage reviews so we could get the feedback we needed, I have only implemented this aspect for the CUDA backend in UR. I would have to merge a PR in UR (which is considerably gridlocked with respect to PRs at the moment) in order to support the requested change.
I am happy to do this, but since it is an implementation detail, and given the challenge of merging a feature that inevitably touches 50 files at a time when many PRs are going in with similar paths and creating conflicts, I think it would be wiser for me to make this change in a follow-up.

Contributor


Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.

What I mean is, we have two options: check the backend and fail fast, or skip the backend check and fail slow(er).

For the fail-fast solution, when running on a CUDA device it would actually be faster to not check the backend and go directly to checking whether the feature is supported for the device (sm_90 or later). It is really negligible, but if we expect that users will mainly use this on CUDA devices, I would argue that paying a little extra when the backend is not CUDA should be fine.

Contributor


Sure, it's fine to have it as a follow-up. It's not a big problem.

Contributor


Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.

What I mean is, we have two options: check the backend and fail fast, or skip the backend check and fail slow(er).

For the fail-fast solution, when running on a CUDA device it would actually be faster to not check the backend and go directly to checking whether the feature is supported for the device (sm_90 or later). It is really negligible, but if we expect that users will mainly use this on CUDA devices, I would argue that paying a little extra when the backend is not CUDA should be fine.

I see what you mean, thanks; it is a valid point.

@@ -633,7 +633,7 @@ pi_int32 enqueueImpKernel(
const detail::EventImplPtr &Event,
const std::function<void *(Requirement *Req)> &getMemAllocationFunc,
sycl::detail::pi::PiKernelCacheConfig KernelCacheConfig,
-    bool KernelIsCooperative);
+    bool KernelIsCooperative, const bool KernelUsesClusterLaunch);
Contributor


If we are going to keep adding launch configurations (cooperative, cluster, etc.), we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.
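
As a rough sketch of the kind of packing suggested here (the flag and parameter names are purely illustrative, not part of this PR):

#include <cstdint>

// Hypothetical flags word replacing the growing list of per-launch booleans.
enum KernelLaunchFlag : uint32_t {
  KernelLaunchNone = 0,
  KernelLaunchCooperative = 1u << 0,
  KernelLaunchUsesCluster = 1u << 1,
};

// The existing enqueue entry point would take one flags argument instead of
// a new bool per feature (pi_int32 is the runtime's existing return type).
pi_int32 enqueueImpKernel(/* ...existing parameters..., */
                          uint32_t LaunchFlags);

// Call site:
//   enqueueImpKernel(..., KernelLaunchCooperative | KernelLaunchUsesCluster);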

Contributor Author


we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.

Yes, that definitely needs to be done, otherwise the function arguments will keep expanding, but I believe it can be part of a bigger refactor, as we need this feature ASAP in the nightlies.

Though personally I would say that a bitmask might not be the best for readability: one would have to remember which bit represents which property. I would instead be of the opinion that we could pass the property list itself to this function (forcing the definition of this function to move to a source file, a considerable refactor) and then do something like:

if constexpr (std::tuple_size_v<decltype(properties)> != 0) {
    ....
} else {
    ....
}

But again, I am of the opinion that this can be part of a later PR, mostly because of the urgency associated with this feature, and also because it gives us time to come up with a proper solution; we can then bundle KernelCacheConfig, KernelIsCooperative, KernelUsesClusterLaunch and future properties together.


@JackAKirk JackAKirk Jul 5, 2024


If we are going to keep adding launch configurations (cooperative, cluster, etc.), we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.

Yes, I think it is inevitable that more launch properties will be added later on, and that there should be a long-term solution such as the one you suggest. Whilst cooperative is the only currently implemented property that is supported across cuda/hip/l0, most of the ones we could additionally implement for CUDA are inspired by very backend-agnostic GPGPU problems.
As @AD2605 mentioned, we focused on simplicity in this time-pressured PR, and I agree that an agreed-upon longer-term solution should be made in a follow-up PR.

There is, for example, a launch property which allows the partial overlap of kernels (setup and teardown but not computation) within an in-order queue (stream); it is particularly targeted at small kernels within applications comprising many short kernels (often the case in applications using graphs).
The above-described property is a boolean, and can therefore be easily handled by a generalized solution such as the one you suggest, which would also incorporate the booleans KernelIsCooperative and KernelUsesClusterLaunch.
This is the case for most of the launch properties introduced by CUDA (they are mostly just bools).

A less novel launch property is a convenient way of setting the required local (CUDA shared) memory size prior to kernel launch. This would be a little more complicated to support, since, similarly to the cluster launch, values for the shared memory size need to be passed. However, there are already existing ways to deal with launch-time shared memory; I just use this as an example to describe the more general usage of kernel launch properties.

Contributor


That's alright. It seems like something we can address later, given it is not crossing the library boundary.

sycl::range<3> GlobalSize;
sycl::range<3> LocalSize;
sycl::id<3> GlobalOffset;
/// Number of workgroups, used to record the number of workgroups from the
/// simplest form of parallel_for_work_group. If set, all other fields must be
/// zero
sycl::range<3> NumWorkGroups;
sycl::range<3> ClusterDimensions{1, 1, 1};
Contributor


This is an ABI break, as suggested by layout_handler.cpp and symbol_size_alignment.cpp. If we want to roll with that, this PR should be marked [ABI-break] and the corresponding label should be added.

Alternatively, you could change MKernelUsesClusterLaunch in handler_impl to be a std::optional<sycl::range<3>> and pass that along. We have had some issues with passing std::optional across the library boundary before, but as long as it stays inside the source files, it should not be a problem.

I am of two minds, because on one hand this seems like a fitting place for the new information and ABI-breaks are allowed. On the other hand, changing the layout of handler is exactly what handler_impl is here to prevent. Maybe a better solution is to make another ABI-break by moving NDRDescT out of handler and into handler_impl. @aelovikov-intel - Thoughts?
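
A minimal sketch of the std::optional alternative described above, assuming the member lives entirely inside the source files (the surrounding class contents are elided):

#include <optional>
#include <sycl/sycl.hpp>

class handler_impl {
  // ...existing members...

  // Engaged only when the kernel is submitted with the cluster_size
  // property; because handler_impl stays inside the source files, this
  // does not change the public handler layout.
  std::optional<sycl::range<3>> MKernelUsesClusterLaunch;
};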

Contributor


I think your logic is sensible; moving to handler_impl might be the best option. @aelovikov-intel, I'd also appreciate your input. Thanks.

We focused on getting a reasonable implementation of this feature up in order to collect feedback, particularly on the scheduling/handler details from Intel developers, and this was one of the main points we foresaw could be challenging/contentious.

I think it would be a good idea to focus on this point and get it right the first time since it is an ABI break, and really we only have until the end of next week to resolve this and get it merged, since both I and @AD2605 are on holiday after that and won't be back until the ABI-break window is over.

Contributor


I have made a patch that moves more detail parts into source files and moves some of the handler members into handler_impl: #14460

If that is the way we want to go, I would be okay with merging this as-is and moving the new changes as part of the aforementioned patch.

Contributor


I agree.

@AD2605 AD2605 changed the title [SYCL] Partial implementation of sycl_ext_oneapi_cuda_cluster_group [SYCL] [ABI-break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group Jul 5, 2024
@AD2605 AD2605 changed the title [SYCL] [ABI-break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group [SYCL] [ABI-Break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group Jul 5, 2024

JackAKirk commented Jul 5, 2024

Most of my comments are small and could be addressed in a follow-up.

However, I am a little more concerned about the ABI break. Technically speaking we are allowed to make it, although the PR should still make it clear that it is breaking, but changing the handler layout is one of the bigger breaks, and I think we should consider whether we should/need to do it.

In the meantime, it would be good to have some unit tests for this change, if possible.

Regarding unit tests, for example a check that the scheduler logic correctly calls the PI function: I intentionally did not add one because I know that PI is being removed in a week or so, and correspondingly the unit tests will have to be more or less remade to deal with UR directly, as far as I understand.

If there is already a decided-upon way to deal with unit tests going forward that I can adopt at this point, I would be willing to do so. Alternatively, I can make sure to add such a test in a follow-up.

There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler; whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.

Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
@steffenlarsen

If there is already a decided-upon way to deal with unit tests going forward that I can adopt at this point, I would be willing to do so. Alternatively, I can make sure to add such a test in a follow-up.

If it avoids double-work, I am alright with waiting a bit. I don't know of any current alternative.

There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler; whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.

I still think there is value in unit tests, especially for a feature like this where the scheduler needs to propagate the additional information, take special paths because of it, and call new UR APIs. Even if the E2E tests cover it, a unit test would likely make it a lot easier to isolate the cause of a failure.

@JackAKirk

If there is already a decided-upon way to deal with unit tests going forward that I can adopt at this point, I would be willing to do so. Alternatively, I can make sure to add such a test in a follow-up.

If it avoids double-work, I am alright with waiting a bit. I don't know of any current alternative.

There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler; whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.

I still think there is value in unit tests, especially for a feature like this where the scheduler needs to propagate the additional information, take special paths because of it, and call new UR APIs. Even if the E2E tests cover it, a unit test would likely make it a lot easier to isolate the cause of a failure.

Yeah, I see that it makes sense to add a unit test; I'll add one after PI is removed. If you are otherwise happy to approve, this can be merged now. Thanks.


@steffenlarsen steffenlarsen left a comment


I believe my remaining issues can be addressed in a follow-up.

@steffenlarsen steffenlarsen merged commit 7b3f215 into intel:sycl Jul 8, 2024
14 checks passed
Labels
abi-break — change that's breaking ABI and waiting for the next window to be able to merge