[SYCL] [ABI-Break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group #14113

Merged
merged 89 commits into intel:sycl on Jul 8, 2024

Conversation

@AD2605 AD2605 commented Jun 10, 2024

This PR is a partial implementation of sycl_ext_oneapi_cuda_cluster_group, introducing the cluster_size property to launch a kernel with CUDA's thread block clusters.

Only a small part of the extension specification described in #13594 is implemented here. To be specific, everything after the section "Launching a kernel with a cluster_group" is not included in this PR. An important point to note is that this PR still represents a complete, functional use case of NVIDIA's CUDA driver cluster-launch feature for its primary purpose, which is to accelerate cross-work-group collective operations (particularly GEMM), leveraging asynchronous multicasting of distributed shared memory across work-groups.
This is a high-priority feature targeted for the next release.
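
For readers unfamiliar with the extension, a rough sketch of what launching a kernel with the cluster_size property might look like from the user side, using the names from the proposal in #13594; the exact spellings of the namespace, property, and parallel_for overload are assumptions here, not the final API:

#include <sycl/sycl.hpp>

namespace syclexp = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q{sycl::gpu_selector_v};

  // Hypothetical spelling: request clusters of 2x1x1 work-groups
  // (only meaningful on devices that support thread block clusters, sm_90+).
  syclexp::cuda::cluster_size cluster_dims{sycl::range<3>{2, 1, 1}};
  syclexp::properties props{cluster_dims};

  q.parallel_for(
      sycl::nd_range<3>{sycl::range<3>{256, 1, 1}, sycl::range<3>{128, 1, 1}},
      props, [=](sycl::nd_item<3> it) {
        // Work-groups in the same cluster can cooperate via distributed
        // shared memory.
      });
  q.wait();
  return 0;
}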

The other parts of the extension specification described in #13594, primarily the "cluster_group" abstraction, are (user-facing) convenience abstractions that do not need to be exposed in libraries that optimize such collective operations (GEMM). Please therefore focus reviews of this PR on the aspects of the extension that are required for this implementation and its library-based application as described in this message.

@AD2605 AD2605 requested review from a team as code owners June 10, 2024 13:50
@AD2605 AD2605 requested review from hdelan and sergey-semenov June 10, 2024 13:50
Signed-off-by: JackAKirk <[email protected]>

JackAKirk commented Jul 4, 2024

@intel/llvm-gatekeepers

Please merge this asap. I'd be very grateful. Thanks.

This has passed all tests. With 50 changed files, it is a race against time to avoid further conflicts occurring.

This is a top-priority PR that needs to be merged ASAP.


@steffenlarsen steffenlarsen left a comment


Most of my comments are small and could be addressed in a follow-up.

However, I am a little more concerned about the ABI break. Technically speaking we are allowed to make it, although the PR should still make it clear that it is breaking, but changing the handler layout is one of the bigger breaks, and I think we should consider whether we should/need to do it.

In the meantime, it would be good to have some unit tests for this change, if possible.

Comment on lines +309 to +318
if (Dev->getBackend() == backend::ext_oneapi_cuda) {
sycl::detail::pi::PiResult Err =
Dev->getPlugin()->call_nocheck<PiApiKind::piDeviceGetInfo>(
Dev->getHandleRef(),
PiInfoCode<info::device::ext_oneapi_cuda_cluster_group>::value,
sizeof(result), &result, nullptr);
if (Err != PI_SUCCESS) {
return false;
}
}
Contributor


It would be preferable to let the UR adapters tell us this, even though this is a CUDA-specific extension. The check already seems to be written such that a backend returning an error code results in false, so it should just be a matter of removing the backend check.
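
For illustration, a sketch of the backend-agnostic variant being suggested here: the same query as in the snippet above, just without the backend guard, so any adapter that does not recognize the info code simply causes false to be returned.

// Sketch only: same call and error handling as the quoted snippet, minus
// the explicit backend check.
sycl::detail::pi::PiResult Err =
    Dev->getPlugin()->call_nocheck<PiApiKind::piDeviceGetInfo>(
        Dev->getHandleRef(),
        PiInfoCode<info::device::ext_oneapi_cuda_cluster_group>::value,
        sizeof(result), &result, nullptr);
// Backends that do not support the query return a non-success code,
// which reports the aspect as unsupported.
if (Err != PI_SUCCESS) {
  return false;
}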


@JackAKirk JackAKirk Jul 5, 2024


In principle I would agree: notwithstanding the naming of the aspect, ext_oneapi_cuda_cluster_group could in theory return true for backends other than CUDA in the future, given that it is experimental and open to renaming/generalization.

In practice, whilst other vendors may come up with similar features that we could partially amalgamate into a single aspect, I doubt very much that we can avoid having ext_oneapi_cuda_cluster_group represent the CUDA-specific cluster_group, which has various features that are extremely specific to the particular way NVIDIA has chosen to accelerate cross-work-group collectives (in particular GEMM).

Since no other backend is ever expected to support this aspect, I think it is more efficient for the SYCL runtime to return false directly for it.

We could, however, do as you suggest if you really want, and I don't feel very strongly on this point, but with the impending removal of PI, I recommend that we do it in a follow-up after PI is removed.

Contributor


Performance is a good point, but on the other hand it is redundant work when the backend is CUDA, so it really depends on what the hot path is expected to be. The additional overhead should be fairly cheap with either solution though, so I am fine with either, but personally I prefer to keep the runtime as ignorant of the backend as possible.

Contributor


Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.


@JackAKirk JackAKirk Jul 5, 2024


Since you prefer the call to be dealt with entirely in UR, I am happy to change it. Would it be acceptable to do this as follow-on work?
I have been told to concentrate on getting this PR merged so that the feature is in a nightly ASAP (whilst of course ensuring that it is future-proof, considering future requirements, and avoiding major reworking).
In my haste to get something that passed reasonable tests and to encourage reviews so we could get the feedback we needed, I have only implemented this aspect for the CUDA backend in UR. I would have to merge a PR in UR (which is considerably gridlocked with respect to PRs at the moment) in order to support the requested change.
I am happy to do this, but since it is an implementation detail, and given the challenge of merging a feature that inevitably touches 50 files at a time when many PRs are going in with similar paths and creating conflicts, I think it would be wiser for me to make this change in a follow-up.

Contributor


Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.

What I mean is, we have two options: check the backend and fail fast, or skip the backend check and fail slow(er).

For the fail-fast solution, when running on a CUDA device it would actually be faster to not check the backend and go directly to checking whether the feature is supported for the device (sm_90 or later). It is really negligible, but if we expect that users will mainly use this on CUDA devices, I would argue that paying a little extra when the backend is not CUDA should be fine.

Contributor


Sure, it's fine to have it as a follow-up. It's not a big problem.

Contributor


Perhaps I misunderstand you, but for the CUDA backend it isn't actually a redundant call: this feature is only supported on sm_90 and later.

What I mean is, we have two options: check the backend and fail fast, or skip the backend check and fail slow(er).

For the fail-fast solution, when running on a CUDA device it would actually be faster to not check the backend and go directly to checking whether the feature is supported for the device (sm_90 or later). It is really negligible, but if we expect that users will mainly use this on CUDA devices, I would argue that paying a little extra when the backend is not CUDA should be fine.

I see what you mean, thanks; it is a valid point.

@@ -633,7 +633,7 @@ pi_int32 enqueueImpKernel(
const detail::EventImplPtr &Event,
const std::function<void *(Requirement *Req)> &getMemAllocationFunc,
sycl::detail::pi::PiKernelCacheConfig KernelCacheConfig,
-    bool KernelIsCooperative);
+    bool KernelIsCooperative, const bool KernelUsesClusterLaunch);
Contributor


If we are going to keep adding launch configurations (cooperative, cluster, etc.), we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.
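
As a rough sketch of the kind of packing suggested here (the flag and parameter names are purely illustrative, not part of this PR):

#include <cstdint>

// Hypothetical flags word replacing the growing list of per-launch booleans.
enum KernelLaunchFlag : uint32_t {
  KernelLaunchNone = 0,
  KernelLaunchCooperative = 1u << 0,
  KernelLaunchUsesCluster = 1u << 1,
};

// The existing enqueue entry point would take one flags argument instead of
// a new bool per feature (pi_int32 is the runtime's existing return type).
pi_int32 enqueueImpKernel(/* ...existing parameters..., */
                          uint32_t LaunchFlags);

// Call site:
//   enqueueImpKernel(..., KernelLaunchCooperative | KernelLaunchUsesCluster);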

Contributor Author


we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.

Yes, that definitely needs to be done, otherwise the function arguments will keep expanding, but I believe it can be part of a bigger refactor, as we need this feature ASAP in the nightlies.

Though personally I would say that a bitmask might not be the best for readability: one would have to remember which bit represents which property. I would instead be of the opinion that we could pass the property list itself to this function (forcing the definition of this function to move to a source file, a considerable refactor) and then do something like:

if constexpr (std::tuple_size_v<decltype(properties)> != 0) {
    ....
} else {
    ....
}

But again, I am of the opinion that this can be part of a later PR, mostly because of the urgency associated with this feature, and also because it gives us time to come up with a proper solution; we can then bundle KernelCacheConfig, KernelIsCooperative, KernelUsesClusterLaunch and future properties together.


@JackAKirk JackAKirk Jul 5, 2024


If we are going to keep adding launch configurations (cooperative, cluster, etc.), we should consider packing them into a bitmask or similar structure, just to avoid having to add a new bool to all these functions.

Yes, I think it is inevitable that more launch properties will be added later on, and that there should be a long-term solution such as the one you suggest. Whilst cooperative is the only currently implemented property that is supported across cuda/hip/l0, most of the ones we could additionally implement for CUDA are inspired by very backend-agnostic GPGPU problems.
As @AD2605 mentioned, we focused on simplicity in this time-pressured PR, and I agree that an agreed-upon longer-term solution should be made in a follow-up PR.

There is, for example, a launch property which allows the partial overlap of kernels (setup and teardown but not computation) within an in-order queue (stream); it is particularly targeted at small kernels within applications comprising many short kernels (often the case in applications using graphs).
The above-described property is a boolean, and can therefore be easily handled by a generalized solution such as the one you suggest, which would also incorporate the booleans KernelIsCooperative and KernelUsesClusterLaunch.
This is the case for most of the launch properties introduced by CUDA (they are mostly just bools).

A less novel launch property is a convenient way of setting the required local (CUDA shared) memory size prior to kernel launch. This would be a little more complicated to support, since, similarly to the cluster launch, values for the shared memory size need to be passed. However, there are already existing ways to deal with launch-time shared memory; I just use this as an example to describe the more general usage of kernel launch properties.

Contributor


That's alright. It seems like something we can address later, given it is not crossing the library boundary.

sycl::range<3> GlobalSize;
sycl::range<3> LocalSize;
sycl::id<3> GlobalOffset;
/// Number of workgroups, used to record the number of workgroups from the
/// simplest form of parallel_for_work_group. If set, all other fields must be
/// zero
sycl::range<3> NumWorkGroups;
sycl::range<3> ClusterDimensions{1, 1, 1};
Contributor


This is an ABI break, as suggested by layout_handler.cpp and symbol_size_alignment.cpp. If we want to roll with that, this PR should be marked [ABI-break] and the corresponding label should be added.

Alternatively, you could change MKernelUsesClusterLaunch in handler_impl to be a std::optional<sycl::range<3>> and pass that along. We have had some issues with passing std::optional across the library boundary before, but as long as it stays inside the source files, it should not be a problem.

I am of two minds, because on one hand this seems like a fitting place for the new information and ABI-breaks are allowed. On the other hand, changing the layout of handler is exactly what handler_impl is here to prevent. Maybe a better solution is to make another ABI-break by moving NDRDescT out of handler and into handler_impl. @aelovikov-intel - Thoughts?
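
A minimal sketch of the std::optional alternative described above, assuming the member lives entirely inside the source files (the surrounding class contents are elided):

#include <optional>
#include <sycl/sycl.hpp>

class handler_impl {
  // ...existing members...

  // Engaged only when the kernel is submitted with the cluster_size
  // property; because handler_impl stays inside the source files, this
  // does not change the public handler layout.
  std::optional<sycl::range<3>> MKernelUsesClusterLaunch;
};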

Contributor


I think your logic is sensible; moving to handler_impl might be the best option. @aelovikov-intel, I'd also appreciate your input. Thanks.

We focused on getting a reasonable implementation of this feature up in order to collect feedback, particularly on the scheduling/handler details from Intel developers, and this was one of the main points we foresaw could be challenging/contentious.

I think it would be a good idea to focus on this point and get it right the first time since it is an ABI break, and really we only have until the end of next week to resolve this and get it merged, since both I and @AD2605 are on holiday after that and won't be back until the ABI-break window is over.

Contributor


I have made a patch that moves more detail parts into source files and moves some of the handler members into handler_impl: #14460

If that is the way we want to go, I would be okay with merging this as-is and moving the new changes as part of the aforementioned patch.

Contributor


I agree.

@AD2605 AD2605 changed the title [SYCL] Partial implementation of sycl_ext_oneapi_cuda_cluster_group [SYCL] [ABI-break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group Jul 5, 2024
@AD2605 AD2605 changed the title [SYCL] [ABI-break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group [SYCL] [ABI-Break] Partial implementation of sycl_ext_oneapi_cuda_cluster_group Jul 5, 2024

JackAKirk commented Jul 5, 2024

Most of my comments are small and could be addressed in a follow-up.

However, I am a little more concerned about the ABI break. Technically speaking we are allowed to make it, although the PR should still make it clear that it is breaking, but changing the handler layout is one of the bigger breaks, and I think we should consider whether we should/need to do it.

In the meantime, it would be good to have some unit tests for this change, if possible.

Regarding unit tests, for example a check that the scheduler logic correctly calls the PI function: I intentionally did not add one because I know that PI is being removed in a week or so, and correspondingly the unit tests will have to be more or less remade to deal with UR directly, as far as I understand.

If there is already a decided-upon way to deal with unit tests going forward that I can adopt at this point, I would be willing to do so. Alternatively, I can make sure to add such a test in a follow-up.

There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler; whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.

Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
@steffenlarsen

If there is already a decided-upon way to deal with unit tests going forward that I can adopt at this point, I would be willing to do so. Alternatively, I can make sure to add such a test in a follow-up.

If it avoids double-work, I am alright with waiting a bit. I don't know of any current alternative.

There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler; whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.

I still think there is value in unit tests, especially for a feature like this where the scheduler needs to propagate the additional information, take special paths because of it, and call new UR APIs. Even if the E2E tests cover it, a unit test would likely make it a lot easier to isolate the cause of a failure.

@JackAKirk

If there is already a decided-upon way to deal with unit tests going forward that I can adopt at this point, I would be willing to do so. Alternatively, I can make sure to add such a test in a follow-up.

If it avoids double-work, I am alright with waiting a bit. I don't know of any current alternative.

There is a series of integration tests in test-e2e that @AD2605 added which test the various ways the cluster_dims property can be passed through the scheduler; whilst not as specific as what I described above, they do at least cover any such failures of the scheduler part, even if they do not pinpoint them as precisely in isolation.

I still think there is value in unit tests, especially for a feature like this where the scheduler needs to propagate the additional information, take special paths because of it, and call new UR APIs. Even if the E2E tests cover it, a unit test would likely make it a lot easier to isolate the cause of a failure.

Yeah, I see that it makes sense to add a unit test; I'll add one after PI is removed. If you are otherwise happy to approve, this can be merged now. Thanks.


@steffenlarsen steffenlarsen left a comment


I believe my remaining issues can be addressed in a follow-up.

@steffenlarsen steffenlarsen merged commit 7b3f215 into intel:sycl Jul 8, 2024
14 checks passed
Labels
abi-break — change that's breaking ABI and waiting for the next window to be able to merge