Skip to content

[UR][CUDA] Fix race condition in CUDA stream creation #15100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 20, 2024

Conversation

rafbiels
Copy link
Contributor

Fix race condition in CUDA stream creation in the UR CUDA adapter

See oneapi-src/unified-runtime#1984

@omarahmed1111 omarahmed1111 force-pushed the rafbiels/cuda-stream-race-cond branch from 374453c to a399862 Compare August 19, 2024 15:26
@omarahmed1111 omarahmed1111 marked this pull request as ready for review August 19, 2024 15:31
@omarahmed1111
Copy link
Contributor

@intel/llvm-gatekeepers Please merge, Thanks!

@steffenlarsen steffenlarsen merged commit 2f3919e into intel:sycl Aug 20, 2024
12 checks passed
npmiller added a commit to npmiller/llvm that referenced this pull request Mar 26, 2025
The CUDA and HIP adapters are both using a nearly identical complicated
queue that handles creating an out-of-order UR queue from in-order
CUDA/HIP streams.

This patch extracts all of the queue logic into a separate templated
class that can be used by both adapters. Beyond removing a lot of
duplicated code, it also makes it a lot easier to maintain.

There was a few functional differences between the queues in both
adapters, but mostly due to fixes done in the CUDA adapter that were not
ported to the HIP adapter. There might be more but I found at least one
race condition (intel#15100) and one
performance issue (intel#6333) that weren't
fixed in the HIP adapter.

This patch uses the CUDA version of the queue as a base for the generic
queue, and will thus fix for HIP the race condition and performance
issue mentioned above.

This code is quite complex, so this patch also aimed to minimize any
other changes beyond the structural changes needed to share the code.
However it did do the following changes in the two adapters:

CUDA:

* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:

* Fix queue finish so it doesn't fail when no streams need to be
  synchronized
sarnex pushed a commit that referenced this pull request Apr 2, 2025
The CUDA and HIP adapters are both using a nearly identical complicated
queue that handles creating an out-of-order UR queue from in-order
CUDA/HIP streams.

This patch extracts all of the queue logic into a separate templated
class that can be used by both adapters. Beyond removing a lot of
duplicated code, it also makes it a lot easier to maintain.

There was a few functional differences between the queues in both
adapters, but mostly due to fixes done in the CUDA adapter that were not
ported to the HIP adapter. There might be more but I found at least one
race condition (#15100) and one
performance issue (#6333) that weren't
fixed in the HIP adapter.

This patch uses the CUDA version of the queue as a base for the generic
queue, and will thus fix for HIP the race condition and performance
issue mentioned above.

This code is quite complex, so this patch also aimed to minimize any
other changes beyond the structural changes needed to share the code.
However it did do the following changes in the two adapters:

`stream_queue.hpp`:
* Remove `urDeviceRetain/Release`: essentially a no-op

CUDA:

* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:

* Fix queue finish so it doesn't fail when no streams need to be
synchronized
kbenzie pushed a commit to oneapi-src/unified-runtime that referenced this pull request Apr 3, 2025
The CUDA and HIP adapters are both using a nearly identical complicated
queue that handles creating an out-of-order UR queue from in-order
CUDA/HIP streams.

This patch extracts all of the queue logic into a separate templated
class that can be used by both adapters. Beyond removing a lot of
duplicated code, it also makes it a lot easier to maintain.

There was a few functional differences between the queues in both
adapters, but mostly due to fixes done in the CUDA adapter that were not
ported to the HIP adapter. There might be more but I found at least one
race condition (intel/llvm#15100) and one
performance issue (intel/llvm#6333) that weren't
fixed in the HIP adapter.

This patch uses the CUDA version of the queue as a base for the generic
queue, and will thus fix for HIP the race condition and performance
issue mentioned above.

This code is quite complex, so this patch also aimed to minimize any
other changes beyond the structural changes needed to share the code.
However it did do the following changes in the two adapters:

`stream_queue.hpp`:
* Remove `urDeviceRetain/Release`: essentially a no-op

CUDA:

* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:

* Fix queue finish so it doesn't fail when no streams need to be
synchronized
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants