[SYCL] [ABI-Break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group #14113
Conversation
@intel/llvm-gatekeepers Please merge this asap, I'd be very grateful. Thanks. This has passed all tests. With 50 changed files it is a race against time to avoid further merge conflicts occurring. This is a highest-priority PR that needs to be merged asap.
Most of my comments are small and could be addressed in a follow-up.
However, I am a little more concerned about the ABI break. Technically speaking we are allowed to do it, although the PR should still make clear that it is breaking; but changing the handler layout is one of the bigger breaks, and I think we should consider whether we should/need to do it.
In the meantime, it would be good to have some unittests for this change, if possible.
if (Dev->getBackend() == backend::ext_oneapi_cuda) {
  sycl::detail::pi::PiResult Err =
      Dev->getPlugin()->call_nocheck<PiApiKind::piDeviceGetInfo>(
          Dev->getHandleRef(),
          PiInfoCode<info::device::ext_oneapi_cuda_cluster_group>::value,
          sizeof(result), &result, nullptr);
  if (Err != PI_SUCCESS) {
    return false;
  }
}
It would be preferable to let the UR adapters tell us this, even though this is a CUDA-specific extension. The check is already written in a way where a backend returning an error code results in `false` being returned, so it should just be a matter of removing the backend check.
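As a rough illustration of this suggestion, here is a minimal sketch with stub types standing in for the real PI/UR handles (all names here are hypothetical, not the actual runtime API): the aspect query relies solely on the adapter's return code, so any adapter that does not implement the query returns an error, which already maps to `false` without any backend check.

```cpp
#include <cassert>

// Stub types standing in for the real PI/UR API (illustrative only).
enum class QueryResult { Success, Unsupported };

struct Device {
  bool ImplementsQuery;  // does the adapter implement the info query?
  bool SupportsClusters; // e.g. CUDA devices with sm_90 or later
};

// Adapters that do not implement the query return an error code.
QueryResult queryClusterGroup(const Device &Dev, bool &Out) {
  if (!Dev.ImplementsQuery)
    return QueryResult::Unsupported;
  Out = Dev.SupportsClusters;
  return QueryResult::Success;
}

// No backend check needed: an error from the adapter yields `false`.
bool hasClusterGroupAspect(const Device &Dev) {
  bool Result = false;
  if (queryClusterGroup(Dev, Result) != QueryResult::Success)
    return false;
  return Result;
}
```

The runtime path stays backend-agnostic; only adapters that implement the query can ever make the aspect report true.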
In principle I would agree: notwithstanding the naming of the aspect, `ext_oneapi_cuda_cluster_group` could in theory return true for backends other than CUDA in the future, given that it is experimental and open to renaming/generalization.
In practice, whilst other vendors may come up with similar features that we may be able to partially amalgamate into a single aspect, I doubt very much that we can avoid having `ext_oneapi_cuda_cluster_group` represent the CUDA-specific cluster_group, which has various features that are extremely specific to the particular way NVIDIA has chosen to accelerate cross-work-group collectives (in particular GEMM).
Since no other backend is ever expected to support this aspect, I think it is more efficient for the implementation to return false directly in the SYCL runtime for this aspect.
We could however do as you suggest if you really want, and I don't feel very strongly on this point, but with the impending removal of PI I recommend that we do it in a follow-up after PI is removed.
The performance is a good point, but on the other hand the backend check is redundant work when the backend is CUDA, so it really depends on what the hot path is expected to be. The additional overhead should be fairly cheap in either solution, so I am fine with either, but personally I prefer to keep the runtime as ignorant of the backend as possible.
Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.
Since you prefer the call to be dealt with entirely in UR, I am happy to change it. Would it be acceptable to do this in follow-on work?
I have been told to concentrate on getting this PR merged so that the feature is in a nightly asap (whilst of course ensuring that it is future-proof, considering future requirements and avoiding major reworking).
In my haste to get something that passed reasonable tests, and to encourage reviews so that we could get the feedback we needed, I have only implemented this aspect in the CUDA backend in UR. I would have to merge a PR in UR (which is considerably gridlocked wrt PRs atm) in order to support the requested change.
I am happy to do this, but since it is an implementation detail, and given the challenge of merging a feature that inevitably touches 50 files at a time when many PRs are going in with similar paths and creating conflicts, I think it would be wiser for me to make this change in a follow-up.
Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.
What I mean is, we have two options: check for the backend and fail fast, or don't check the backend and fail slow(er).
For the fail-fast solution, when running on a CUDA device it would be faster not to check the backend and to go directly to checking whether the feature is supported for the device (sm_90 or later). It is really negligible, but if we expect that users will mainly use this on CUDA devices, I would argue that paying a little extra when the backend is not CUDA should be fine.
Sure, it's fine to have it a follow-up. It's not a big problem.
Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.
What I mean is, we have two options: check for the backend and fail fast, or don't check the backend and fail slow(er).
For the fail-fast solution, when running on a CUDA device it would be faster not to check the backend and to go directly to checking whether the feature is supported for the device (sm_90 or later). It is really negligible, but if we expect that users will mainly use this on CUDA devices, I would argue that paying a little extra when the backend is not CUDA should be fine.
I see what you mean, thanks; it is a valid point.
@@ -633,7 +633,7 @@ pi_int32 enqueueImpKernel(
     const detail::EventImplPtr &Event,
     const std::function<void *(Requirement *Req)> &getMemAllocationFunc,
     sycl::detail::pi::PiKernelCacheConfig KernelCacheConfig,
-    bool KernelIsCooperative);
+    bool KernelIsCooperative, const bool KernelUsesClusterLaunch);
If we are going to keep adding launch configurations (cooperative, cluster, etc.), we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.
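A minimal sketch of the bitmask idea (all names here are illustrative, not the actual runtime API): the boolean launch configurations are packed into a single word, so a new property adds a flag value rather than a new parameter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag set packing the boolean launch configurations into a
// single word, so new properties don't require new bool parameters.
enum KernelLaunchFlag : std::uint32_t {
  LaunchFlagNone = 0,
  LaunchFlagCooperative = 1u << 0,   // was: bool KernelIsCooperative
  LaunchFlagClusterLaunch = 1u << 1, // was: bool KernelUsesClusterLaunch
};

using KernelLaunchFlags = std::uint32_t;

constexpr bool hasLaunchFlag(KernelLaunchFlags Flags, KernelLaunchFlag F) {
  return (Flags & F) != 0;
}
```

A function such as enqueueImpKernel would then take a single flags argument instead of a growing list of bools.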
we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.
Yes, that definitely needs to be done, otherwise the function arguments will keep expanding, but I believe this can be part of a bigger refactor, as we need this feature ASAP in the nightlies.
Personally, though, I would say that a bitmask might not be the best for readability: one would have to remember which bit represents which property. I would be of the opinion that we could pass the property list itself to this function (forcing the definition of this function into a source file, a considerable refactor) and then do something like:
if constexpr (std::tuple_size_v<decltype(properties)> != 0) {
  ....
} else {
  ....
}
But again, I am of the opinion this can be part of a later PR, mostly because of the urgency associated with this feature, and also because we can then come up with a proper solution that bundles up KernelCacheConfig, KernelIsCooperative, KernelUsesClusterLaunch and future properties together.
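One hedged sketch of the bundling suggested here, with stand-in types and hypothetical names: gather the existing KernelCacheConfig, KernelIsCooperative and KernelUsesClusterLaunch parameters into a single launch-properties struct, so future properties only extend the struct rather than every signature.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for sycl::detail::pi::PiKernelCacheConfig (illustrative only).
enum class KernelCacheConfig : std::uint32_t { Default, LargeSLM, LargeData };

// Hypothetical bundle: one struct instead of a growing list of parameters.
struct KernelLaunchProperties {
  KernelCacheConfig CacheConfig = KernelCacheConfig::Default;
  bool IsCooperative = false;
  bool UsesClusterLaunch = false;
};

// A function like enqueueImpKernel would then take a single
// `const KernelLaunchProperties &` argument.
bool needsSpecialLaunchPath(const KernelLaunchProperties &Props) {
  return Props.IsCooperative || Props.UsesClusterLaunch;
}
```

Adding a future launch property then means adding one field here, with defaults keeping all existing call sites unchanged.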
If we are going to keep adding launch configurations (cooperative, cluster, etc.), we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.
Yes, I think it is inevitable that more launch properties will be added later on, and that there should be a long-term solution such as the one you suggest. Whilst `cooperative` is the only one currently implemented that is supported across cuda/hip/l0, most of the ones that we could additionally implement for CUDA are inspired by very backend-agnostic GPGPU problems.
As @AD2605 mentioned, we focused on simplicity in this time-pressured PR, and I agree that an agreed-upon longer-term solution should be made in a follow-up PR.
There is, for example, a launch property which allows the partial overlap of kernels (setup and teardown, but not computation) within an in-order queue (stream); it is particularly targeted at small kernels within applications comprising many short kernels (often the case in applications using graphs).
Now, the above-described property is a boolean, and therefore can easily be dealt with in a generalized solution such as you suggest, one that also incorporates the booleans `KernelIsCooperative` and `KernelUsesClusterLaunch`.
This is the case for most of the property features introduced by CUDA (they are mostly just bools).
A less novel launch property feature is a convenient way of setting the required local (CUDA shared) memory size prior to kernel launch. This would be a little more complicated to support, since, similarly to the cluster launch, values for the shared memory size need to be passed. However, there are already existing ways to deal with launch-time shared memory; I just use this as an example of the more general usage of kernel launch properties.
That's alright. It seems like something we can address later, given it is not crossing the library boundary.
sycl::range<3> GlobalSize;
sycl::range<3> LocalSize;
sycl::id<3> GlobalOffset;
/// Number of workgroups, used to record the number of workgroups from the
/// simplest form of parallel_for_work_group. If set, all other fields must be
/// zero.
sycl::range<3> NumWorkGroups;
sycl::range<3> ClusterDimensions{1, 1, 1};
This is an ABI break, as suggested by layout_handler.cpp and symbol_size_alignment.cpp. If we want to roll with that, this PR should be marked [ABI-Break] and the corresponding label should be added.
Alternatively, you could change MKernelUsesClusterLaunch in handler_impl to be a std::optional<sycl::range<3>> and pass that along. We have had some issues with passing std::optional across the library boundary before, but as long as it stays inside the source files, it should not be a problem.
I am of two minds: on one hand, this seems like a fitting place for the new information, and ABI breaks are allowed; on the other hand, changing the layout of handler is exactly what handler_impl is here to prevent. Maybe a better solution is to make another ABI break by moving NDRDescT out of handler and into handler_impl. @aelovikov-intel - Thoughts?
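A minimal sketch of the std::optional alternative, with std::array standing in for sycl::range<3> and all names illustrative: the optional lives in the implementation struct behind the library boundary, so the public handler layout is untouched, and an unset value defaults to {1, 1, 1} (no cluster launch) at enqueue time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

// std::array stands in for sycl::range<3> in this sketch.
using Range3 = std::array<std::size_t, 3>;

// Hypothetical handler_impl fragment: the optional stays inside the
// source files, so the ABI-visible handler layout is unchanged.
struct HandlerImplSketch {
  std::optional<Range3> MClusterDimensions;
};

// At enqueue time, an unset optional means "no cluster launch".
Range3 effectiveClusterDims(const HandlerImplSketch &Impl) {
  return Impl.MClusterDimensions.value_or(Range3{1, 1, 1});
}
```

The presence of a value doubles as the KernelUsesClusterLaunch flag, avoiding a separate bool.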
I think your logic is sensible; moving to handler_impl might be the best option. @aelovikov-intel, I'd also appreciate your input. Thanks.
We focused on getting a reasonable implementation of this feature up to collect feedback, particularly on the scheduling/handler details from Intel developers, and this was one of the main points that we foresaw could be challenging/contentious.
I think it would be a good idea to focus on this point and get it right first time, since it is an ABI break. Realistically, we only have until the end of next week to solve this and get it merged, since both I and @AD2605 are on holiday after that and won't be back until the ABI-break window is over.
I have made a patch for moving more detail parts into source files and moving some of the handler members into handler_impl: #14460
If that is the way we want to go, I would be okay with merging this as-is and moving the new changes as part of the aforementioned patch.
I agree.
Regarding unittests, for example a check that the scheduler logic correctly calls the PI function: I intentionally did not add one because I know that PI is being removed in a week or so, and correspondingly the unittests will have to be more or less remade to deal with UR directly, as far as I understand. If there is already a decided-upon way to deal with unittests going forward, I would be willing to add one at this point. Alternatively, I can make sure to add such a test in a follow-up. There is a series of integration tests in test-e2e, added by @AD2605, that exercise the various ways the cluster_dims property can be passed through the scheduler; whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as specifically in isolation.
If it avoids double-work, I am alright with waiting a bit. I don't know of any current alternative.
I still think there is value in unittests, especially for a feature like this where the scheduler needs to propagate the additional information, take special paths because of it, and call new UR APIs. Even if the E2E tests cover it, a unittest would likely make it a lot easier to isolate the cause of a failure.
Yeah, I see it makes sense to add a unittest. I'll add one after PI is removed. If you are otherwise happy to approve, this can be merged now. Thanks.
I believe my remaining issues can be addressed in a follow-up.
This PR is a partial implementation of sycl_ext_oneapi_cuda_cluster_group, introducing the cluster_size property to launch a kernel with CUDA's thread block clusters.
Only a small part of the extension specification described in #13594 is used in this implementation. To be specific, everything after the section "Launching a kernel with a cluster_group" is not included in this PR. A very important point to note is that this PR still fully represents a functional use case of Nvidia's CUDA driver cluster launch feature for its primary purpose, which is to accelerate cross-work-group collective operations (particularly for GEMM), leveraging cross-work-group asynchronous multi-casting of distributed shared memory across work-groups.
This is a high-priority feature that is targeted for the next release.
The other parts of the extension specification described in #13594, primarily the "cluster_group" abstraction, are (user-facing) convenience abstractions that are not required to be exposed in libraries that optimize such collective operations (GEMM). Please therefore focus reviews of this PR on the aspects of the extension that are required for the implementation in this PR and its library-based application, as described in this message.