[UR][CUDA] Fix race condition in CUDA stream creation #15100

rafbiels · 2024-08-15T17:38:44Z

Fix race condition in CUDA stream creation in the UR CUDA adapter

See oneapi-src/unified-runtime#1984

sycl/cmake/modules/FetchUnifiedRuntime.cmake

omarahmed1111 · 2024-08-19T15:39:11Z

This will include bump for these PRs:

omarahmed1111 · 2024-08-19T17:56:29Z

@intel/llvm-gatekeepers Please merge, Thanks!

The CUDA and HIP adapters both use a nearly identical, complicated queue that builds an out-of-order UR queue from in-order CUDA/HIP streams. This patch extracts all of the queue logic into a separate templated class that can be used by both adapters. Beyond removing a lot of duplicated code, this also makes the queue much easier to maintain.

There were a few functional differences between the queues in the two adapters, mostly due to fixes made in the CUDA adapter that were never ported to the HIP adapter. There might be more, but I found at least one race condition (#15100) and one performance issue (#6333) that weren't fixed in the HIP adapter. This patch uses the CUDA version of the queue as the base for the generic queue, and thus fixes the race condition and performance issue mentioned above for HIP.

This code is quite complex, so this patch also aims to minimize any changes beyond the structural ones needed to share the code. However, it does make the following changes:

`stream_queue.hpp`:
* Remove `urDeviceRetain/Release`: essentially a no-op

CUDA:
* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:
* Fix queue finish so it doesn't fail when no streams need to be synchronized
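To make the description above more concrete, here is a minimal sketch of what a shared, templated stream queue could look like. It is not the actual `stream_queue.hpp` code: `StreamQueueSketch`, `FakeStream`, `getNextStream`, `finish`, and `countStreamCreations` are hypothetical names, and the single mutex is only one assumed way to keep two threads from creating the same stream slot at once. It also shows a finish that succeeds when no stream was ever created.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a native in-order stream handle
// (CUstream or hipStream_t in the real adapters).
struct FakeStream {
  int id;
};

// Hypothetical generic queue: hands out streams round-robin from a fixed
// pool and creates each stream lazily on first use.
template <typename StreamT, typename CreateFn> class StreamQueueSketch {
public:
  StreamQueueSketch(std::size_t poolSize, CreateFn create)
      : streams_(poolSize), created_(poolSize, false), create_(create) {}

  // Returns the stream in the next slot, creating it if needed.
  StreamT getNextStream() {
    const std::size_t idx = nextIdx_.fetch_add(1) % streams_.size();
    // The lock makes "check, then create" a single step, so two threads
    // that land on the same slot cannot both create a stream for it.
    std::lock_guard<std::mutex> guard(mutex_);
    if (!created_[idx]) {
      streams_[idx] = create_(idx);
      created_[idx] = true;
    }
    return streams_[idx];
  }

  // Synchronizes only the streams that exist and returns how many were
  // visited; a queue with no created streams is a successful no-op.
  template <typename SyncFn> std::size_t finish(SyncFn sync) {
    std::lock_guard<std::mutex> guard(mutex_);
    std::size_t synced = 0;
    for (std::size_t i = 0; i < streams_.size(); ++i) {
      if (created_[i]) {
        sync(streams_[i]);
        ++synced;
      }
    }
    return synced;
  }

private:
  std::vector<StreamT> streams_;
  std::vector<bool> created_;
  CreateFn create_;
  std::atomic<std::size_t> nextIdx_{0};
  std::mutex mutex_;
};

// Usage: eight threads share a pool of four slots. Returns how many times
// the creation callback ran.
inline int countStreamCreations() {
  std::atomic<int> creations{0};
  auto create = [&creations](std::size_t idx) {
    ++creations;
    return FakeStream{static_cast<int>(idx)};
  };
  StreamQueueSketch<FakeStream, decltype(create)> queue(4, create);
  std::vector<std::thread> workers;
  for (int t = 0; t < 8; ++t)
    workers.emplace_back([&queue] { queue.getNextStream(); });
  for (auto &w : workers)
    w.join();
  return creations.load();
}
```

Templating on the stream type and the creation callback is what would let one class serve both adapters in this sketch. The real adapters may well use finer-grained locking than a single mutex.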

rafbiels requested a review from a team as a code owner August 15, 2024 17:38

rafbiels mentioned this pull request Aug 15, 2024

Fix race condition in CUDA stream creation oneapi-src/unified-runtime#1984

Merged

rafbiels marked this pull request as draft August 15, 2024 18:27

rodburns reviewed Aug 16, 2024

sycl/cmake/modules/FetchUnifiedRuntime.cmake (outdated, resolved)

omarahmed1111 force-pushed the rafbiels/cuda-stream-race-cond branch from af58982 to 4d07d50 Compare August 19, 2024 13:00

omarahmed1111 had a problem deploying to WindowsCILock August 19, 2024 13:00 — with GitHub Actions Failure

omarahmed1111 temporarily deployed to WindowsCILock August 19, 2024 13:21 — with GitHub Actions Inactive

omarahmed1111 temporarily deployed to WindowsCILock August 19, 2024 13:46 — with GitHub Actions Inactive

omarahmed1111 force-pushed the rafbiels/cuda-stream-race-cond branch from 4d07d50 to 374453c Compare August 19, 2024 15:25

omarahmed1111 had a problem deploying to WindowsCILock August 19, 2024 15:25 — with GitHub Actions Error

[UR][CUDA] Fix race condition in CUDA stream creation

a399862

omarahmed1111 force-pushed the rafbiels/cuda-stream-race-cond branch from 374453c to a399862 Compare August 19, 2024 15:26

omarahmed1111 temporarily deployed to WindowsCILock August 19, 2024 15:26 — with GitHub Actions Inactive

omarahmed1111 marked this pull request as ready for review August 19, 2024 15:31

This was referenced Aug 19, 2024

Test UR loader log output fix #15110

Closed

[Bindless] Fix unintended crash in catch blocks #14966

Closed

[SYCL][UR] Pull in patch to fix bad type assumption in CL adapter. #15074

Closed

omarahmed1111 approved these changes Aug 19, 2024

omarahmed1111 temporarily deployed to WindowsCILock August 19, 2024 15:50 — with GitHub Actions Inactive

steffenlarsen merged commit 2f3919e into intel:sycl Aug 20, 2024
12 checks passed

steffenlarsen mentioned this pull request Aug 20, 2024

Bump UR version to include UMF soversion fix #15116

Closed

npmiller mentioned this pull request Mar 25, 2025

[UR][CUDA][HIP] Unify queue handling between adapters #17641

Merged
