Skip to content

[SYCL][PI][CUDA] Fix too many streams getting synchronized #6333

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 21, 2022

Conversation

t4c1
Copy link
Contributor

@t4c1 t4c1 commented Jun 21, 2022

Fixed off-by-one error introduced in #6201 that would cause queue synchronization to synchronize all streams when no stream has been used. The code worked correctly, but this can in some cases impact performance.

@t4c1 t4c1 requested a review from a team as a code owner June 21, 2022 08:30
@t4c1 t4c1 requested a review from smaslov-intel June 21, 2022 08:30
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

@steffenlarsen steffenlarsen merged commit 5352b42 into intel:sycl Jun 21, 2022
abagusetty added a commit to abagusetty/llvm that referenced this pull request Jun 21, 2022
abagusetty added a commit to abagusetty/llvm that referenced this pull request Jun 30, 2022
npmiller added a commit to npmiller/llvm that referenced this pull request Mar 26, 2025
The CUDA and HIP adapters are both using a nearly identical complicated
queue that handles creating an out-of-order UR queue from in-order
CUDA/HIP streams.

This patch extracts all of the queue logic into a separate templated
class that can be used by both adapters. Beyond removing a lot of
duplicated code, it also makes it a lot easier to maintain.

There was a few functional differences between the queues in both
adapters, but mostly due to fixes done in the CUDA adapter that were not
ported to the HIP adapter. There might be more but I found at least one
race condition (intel#15100) and one
performance issue (intel#6333) that weren't
fixed in the HIP adapter.

This patch uses the CUDA version of the queue as a base for the generic
queue, and will thus fix for HIP the race condition and performance
issue mentioned above.

This code is quite complex, so this patch also aimed to minimize any
other changes beyond the structural changes needed to share the code.
However it did do the following changes in the two adapters:

CUDA:

* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:

* Fix queue finish so it doesn't fail when no streams need to be
  synchronized
sarnex pushed a commit that referenced this pull request Apr 2, 2025
The CUDA and HIP adapters are both using a nearly identical complicated
queue that handles creating an out-of-order UR queue from in-order
CUDA/HIP streams.

This patch extracts all of the queue logic into a separate templated
class that can be used by both adapters. Beyond removing a lot of
duplicated code, it also makes it a lot easier to maintain.

There was a few functional differences between the queues in both
adapters, but mostly due to fixes done in the CUDA adapter that were not
ported to the HIP adapter. There might be more but I found at least one
race condition (#15100) and one
performance issue (#6333) that weren't
fixed in the HIP adapter.

This patch uses the CUDA version of the queue as a base for the generic
queue, and will thus fix for HIP the race condition and performance
issue mentioned above.

This code is quite complex, so this patch also aimed to minimize any
other changes beyond the structural changes needed to share the code.
However it did do the following changes in the two adapters:

`stream_queue.hpp`:
* Remove `urDeviceRetain/Release`: essentially a no-op

CUDA:

* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:

* Fix queue finish so it doesn't fail when no streams need to be
synchronized
kbenzie pushed a commit to oneapi-src/unified-runtime that referenced this pull request Apr 3, 2025
The CUDA and HIP adapters are both using a nearly identical complicated
queue that handles creating an out-of-order UR queue from in-order
CUDA/HIP streams.

This patch extracts all of the queue logic into a separate templated
class that can be used by both adapters. Beyond removing a lot of
duplicated code, it also makes it a lot easier to maintain.

There was a few functional differences between the queues in both
adapters, but mostly due to fixes done in the CUDA adapter that were not
ported to the HIP adapter. There might be more but I found at least one
race condition (intel/llvm#15100) and one
performance issue (intel/llvm#6333) that weren't
fixed in the HIP adapter.

This patch uses the CUDA version of the queue as a base for the generic
queue, and will thus fix for HIP the race condition and performance
issue mentioned above.

This code is quite complex, so this patch also aimed to minimize any
other changes beyond the structural changes needed to share the code.
However it did do the following changes in the two adapters:

`stream_queue.hpp`:
* Remove `urDeviceRetain/Release`: essentially a no-op

CUDA:

* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:

* Fix queue finish so it doesn't fail when no streams need to be
synchronized
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants