[CUDA][HIP] Fix host task mem migration and add pi entry point for urEnqueueNativeCommandExp #14353
Conversation
Native CPU LGTM, thank you
Ping @cperkinsintel @intel/unified-runtime-reviewers this is very high priority. Would be great to get reviews ASAP, thanks.
Overall I think it looks good.
```cpp
HostTask.MHostTask->call(MThisCmd->MEvent->getHostProfilingInfo(), IH);
if (IH.get_backend() == backend::ext_oneapi_cuda ||
    IH.get_backend() == backend::ext_oneapi_hip) {
```
I would have preferred that the UR adapter tell us whether we can use this path, instead of having the runtime check the backend type. That said, I can live with this as a stop-gap solution. Is the plan for the other backends to support this UR functionality as well?
We can actually query this info from UR, but I think it's better to use this entry point only when it's needed for correctness, rather than whenever it is supported. That said, I don't feel too strongly about this either way; I can change this to query whether the device supports the experimental UR entry point.
There are indeed plans for other backends to support this, although we are focusing our energy on CUDA and HIP for now, since interop is a big perf limiter for these backends. Interop isn't a big issue for the other backends at the moment, but I think this will soon be supported there as well.
Maybe as an interim solution I'll add a TODO in the code asking whether this entry point should be used for all the backends that support it. This is an open question.
I am fine with a TODO, but in the end I would still prefer that we rely on the support query for this. The hope would be that any backend reporting support for it can do it efficiently and correctly, as I assume we don't want different behavior for the API.
Maybe it means we can eventually always use this path and scrap some of the implementation in our library. 😉
Yeah, I think that's a good idea! I will actually change it to use the query, if that is the preferable path.
I have updated the check to query `piDeviceGetInfo` to see if the entry point is supported.
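For context, a minimal sketch of what such a support check might look like in the SYCL RT. The enum name `PI_EXT_ONEAPI_DEVICE_INFO_ENQUEUE_NATIVE_COMMAND_SUPPORT` and the plugin call wrapper below are assumptions for illustration, not necessarily the exact identifiers this patch adds:

```cpp
// Sketch only: the device-info enum name is a placeholder, not
// necessarily the one added by this patch.
pi_bool NativeCommandSupport = false;
Plugin->call<PiApiKind::piDeviceGetInfo>(
    Device, PI_EXT_ONEAPI_DEVICE_INFO_ENQUEUE_NATIVE_COMMAND_SUPPORT,
    sizeof(pi_bool), &NativeCommandSupport, /*param_value_size_ret=*/nullptr);
if (NativeCommandSupport) {
  // Hand the host-task lambda off to the plugin, which performs any
  // required memory migration before invoking it.
}
```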
The SYCL RT assumes that, for devices in the same context, no mem migration needs to occur across devices for a kernel launch or host task. However, a CUdeviceptr is relevant to a specific device, so mem migration must occur between devices in a context. When this assumption the SYCL RT makes, that native mems are accessible to all devices in a context, does not hold, the RT must hand off the HT lambda to the plugin, so that the plugin can handle the necessary mem migration. This patch uses the new urEnqueueNativeCommandExp to execute the HT lambda, which takes care of mem migration implicitly in the plugin.
Co-authored-by: Steffen Larsen <[email protected]>
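To illustrate why the migration matters: a host task that extracts a native pointer from an accessor through the interop handle gets a handle to one specific device's allocation. A minimal sketch, assuming the DPC++ CUDA backend; the function name and queue/buffer setup are illustrative, not from this patch:

```cpp
#include <sycl/sycl.hpp>

void use_native_mem(sycl::queue &Q, sycl::buffer<int, 1> &Buf) {
  Q.submit([&](sycl::handler &CGH) {
    sycl::accessor Acc{Buf, CGH, sycl::read_write};
    CGH.host_task([=](sycl::interop_handle IH) {
      // On the CUDA backend this yields a CUdeviceptr (assumed here),
      // which refers to the allocation on one specific device. If Buf
      // was last used on a different device in the same context, the
      // plugin must migrate the memory first; routing the host task
      // through urEnqueueNativeCommandExp lets it do so implicitly.
      auto Ptr = IH.get_native_mem<sycl::backend::ext_oneapi_cuda>(Acc);
      // ... use Ptr with native CUDA driver APIs ...
      (void)Ptr;
    });
  });
}
```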
Don't branch on backend; use the UR entry point if it is supported. Also add an equivalent enum to PI.
Ping @intel/llvm-gatekeepers this can be merged. Thanks
This should address the post-commit failure introduced in #14353. Signed-off-by: Larsen, Steffen <[email protected]>