Skip to content

[CUDA][HIP] Fix host task mem migration and add pi entry point for urEnqueueNativeCommandExp #14353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 1, 2024

Conversation

hdelan
Copy link
Contributor

@hdelan hdelan commented Jun 28, 2024

The SYCL RT assumes that for devices in the same context, no mem
migration needs to occur across devices for a kernel launch or host
task. However, a CUdeviceptr is relevant to a specific device, so mem
migration must occur between devices in a ctx. If this assumption that
the SYCL RT makes about native mems being accessible to all devices in a
context, it must hand off the HT lambda to the plugin, so that the
plugin can handle the necessary mem migration.

This patch uses the new urEnqueueCustomCommandExp to execute the HT
lambda, which takes care of mem migration implicitly in the plugin.

@hdelan hdelan requested review from a team as code owners June 28, 2024 16:38
@hdelan hdelan force-pushed the fix-host-task-mem-migration branch from dae1178 to 380557b Compare June 28, 2024 16:38
@hdelan hdelan force-pushed the fix-host-task-mem-migration branch from 1ca21ea to 48edd61 Compare July 1, 2024 08:37
@hdelan hdelan temporarily deployed to WindowsCILock July 1, 2024 08:37 — with GitHub Actions Inactive
@hdelan hdelan temporarily deployed to WindowsCILock July 1, 2024 09:09 — with GitHub Actions Inactive
@hdelan hdelan temporarily deployed to WindowsCILock July 1, 2024 10:48 — with GitHub Actions Inactive
@hdelan hdelan changed the title [CUDA][HIP] Fix host task mem migration and add pi entry point for urEnqueueNativeCommandExp [ABI break][CUDA][HIP] Fix host task mem migration and add pi entry point for urEnqueueNativeCommandExp Jul 1, 2024
@hdelan hdelan changed the title [ABI break][CUDA][HIP] Fix host task mem migration and add pi entry point for urEnqueueNativeCommandExp [CUDA][HIP] Fix host task mem migration and add pi entry point for urEnqueueNativeCommandExp Jul 1, 2024
Copy link
Contributor

@PietroGhg PietroGhg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Native CPU LGTM, thank you

@hdelan
Copy link
Contributor Author

hdelan commented Jul 1, 2024

Ping @cperkinsintel @intel/unified-runtime-reviewers this is very high priority. Would be great to get reviews ASAP, thanks.

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think it looks good.


HostTask.MHostTask->call(MThisCmd->MEvent->getHostProfilingInfo(), IH);
if (IH.get_backend() == backend::ext_oneapi_cuda ||
IH.get_backend() == backend::ext_oneapi_hip) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have preferred that the UR adapter would tell us if we could use this path, instead of making it the runtime checking the type of backend. That said, I can live with this as a stop-gap solution. Is the plan to have the other backends support this UR functionality as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can query this info from UR actually, but I think it's better only to use this entry point if it's needed for correctness, rather than if the entry point is supported. Although I don't feel too strongly about this either way, I can change this to query whether the device supports the experimental UR entry point.

There are indeed plans for other backends to support this although we are focussing our energy on CUDA and HIP for now, since interop is a big perf limiter for these backends. Interop isn't a big issue for other backends at the moment, but I think this will soon be supported in other backends as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe as an interim solution I'll add a TODO in code, saying should this entry point be used for all the backends that support it?. This is an open question.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with a TODO, but in the end I would still prefer if we could rely on the support query for this. Hope would be that any backend that reports support for it should be able to do it efficiently and correctly, as I assume we don't want different behavior for the API.

Maybe it means we can eventually always use this path and scrap some of the implementation in our library. 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that's a good idea! I will change actually to use the query, if that is the preferable path

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the check to query piDeviceGetInfo to see if the entry point is supported.

Hugh Delaney and others added 7 commits July 1, 2024 17:07
The SYCL RT assumes that for devices in the same context, no mem
migration needs to occur across devices for a kernel launch or host
task. However, a CUdeviceptr is relevant to a specific device, so mem
migration must occur between devices in a ctx. If this assumption that
the SYCL RT makes about native mems being accessible to all devices in a
context, it must hand off the HT lambda to the plugin, so that the
plugin can handle the necessary mem migration.

This patch uses the new urEnqueueCustomCommandExp to execute the HT
lambda, which takes care of mem migration implicitly in the plugin.
Clang format and add pimock symbol and abi symbols.
Don't branch on backend, use UR entry point if it is supported. Also add
equivalent enum to PI.
@hdelan hdelan force-pushed the fix-host-task-mem-migration branch from 985ed2b to e67b566 Compare July 1, 2024 16:17
@hdelan hdelan temporarily deployed to WindowsCILock July 1, 2024 16:21 — with GitHub Actions Inactive
@hdelan hdelan temporarily deployed to WindowsCILock July 1, 2024 18:59 — with GitHub Actions Inactive
@hdelan
Copy link
Contributor Author

hdelan commented Jul 1, 2024

Ping @intel/llvm-gatekeepers this can be merged. Thanks

@ldrumm ldrumm merged commit 2e212e0 into intel:sycl Jul 1, 2024
14 checks passed
steffenlarsen added a commit to steffenlarsen/llvm that referenced this pull request Jul 2, 2024
This should address the post-commit failure introduced in intel#14353

Signed-off-by: Larsen, Steffen <[email protected]>
steffenlarsen added a commit that referenced this pull request Jul 2, 2024
This should address the post-commit failure introduced in
#14353

Signed-off-by: Larsen, Steffen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants