[SYCL][PI][CUDA] Fix too many streams getting synchronized #6333

t4c1 · 2022-06-21T08:30:17Z

Fixes an off-by-one error introduced in #6201 that caused queue synchronization to synchronize all streams when no stream had been used. The results were still correct, but the unnecessary synchronization could hurt performance in some cases.
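To illustrate how an off-by-one of this kind can arise, here is a minimal sketch. It is not the plugin's code: it assumes the streams live in a fixed-size pool treated as a ring buffer, with two monotonically increasing counters marking the last synchronized position and the next stream to hand out. The function name `slotsToSync` and the `fixed` flag are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the CUDA plugin): returns the pool
// slots that need synchronizing for the half-open counter range [start, end).
// `size` is the number of streams in the pool. `fixed` selects between the
// buggy and the corrected comparison so both can be compared side by side.
std::vector<std::size_t> slotsToSync(std::size_t start, std::size_t end,
                                     std::size_t size, bool fixed) {
  assert(end >= start && "counters are assumed to only ever grow");
  std::vector<std::size_t> slots;
  if (size == 0)
    return slots;

  if (end - start >= size) {
    // Every slot was used at least once: the whole pool must be synced.
    for (std::size_t i = 0; i < size; ++i)
      slots.push_back(i);
    return slots;
  }

  start %= size;
  end %= size;

  // The off-by-one: with `<`, the case start == end (nothing was used) is
  // mistaken for a wrap-around, so every slot in the pool gets visited.
  const bool contiguous = fixed ? (start <= end) : (start < end);
  if (contiguous) {
    for (std::size_t i = start; i < end; ++i)
      slots.push_back(i);
  } else {
    // Genuine wrap-around: tail of the pool first, then the head.
    for (std::size_t i = start; i < size; ++i)
      slots.push_back(i);
    for (std::size_t i = 0; i < end; ++i)
      slots.push_back(i);
  }
  return slots;
}
```

With a pool of four streams and both counters at 2, `slotsToSync(2, 2, 4, false)` visits all four slots, while `slotsToSync(2, 2, 4, true)` visits none. In both variants the result is still correct, since synchronizing an idle stream is harmless, which matches the description above: only performance is affected.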

steffenlarsen

Good find!

The CUDA and HIP adapters both use a nearly identical, complicated queue that builds an out-of-order UR queue from in-order CUDA/HIP streams. This patch extracts all of the queue logic into a separate templated class that both adapters can use. Beyond removing a lot of duplicated code, this also makes the queue much easier to maintain.

There were a few functional differences between the queues in the two adapters, mostly due to fixes made in the CUDA adapter that were never ported to the HIP adapter. There might be more, but I found at least one race condition (#15100) and one performance issue (#6333) that weren't fixed in the HIP adapter. This patch uses the CUDA version of the queue as the base for the generic queue, and thus fixes both of these issues for HIP.

This code is quite complex, so this patch also aims to minimize any changes beyond the structural ones needed to share the code. It does, however, make the following changes:

`stream_queue.hpp`:
* Remove `urDeviceRetain/Release`: essentially a no-op

CUDA:
* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:
* Fix queue finish so it doesn't fail when no streams need to be synchronized
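The idea of sharing one queue implementation between two backends can be sketched as a class template parameterized on the native stream type. This is only an illustration under stated assumptions, not the contents of `stream_queue.hpp`: the class name `StreamQueue`, its members, and the round-robin policy are all hypothetical, and plain value types stand in for the real CUDA and HIP stream handles.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of a backend-agnostic queue. StreamT stands in for the
// backend's native stream handle type; nothing here is copied from the
// actual adapters.
template <typename StreamT> class StreamQueue {
public:
  explicit StreamQueue(std::vector<StreamT> Streams)
      : Streams(std::move(Streams)) {
    assert(!this->Streams.empty() && "the pool must hold at least one stream");
  }

  // Hands out the in-order streams round-robin, which is one simple way to
  // spread independent work across them.
  StreamT &getNextStream() {
    StreamT &Stream = Streams[NextIdx % Streams.size()];
    ++NextIdx;
    return Stream;
  }

  // Applies a backend-specific operation (for example a synchronize call)
  // to every stream in the pool.
  template <typename Func> void forEachStream(Func &&F) {
    for (StreamT &Stream : Streams)
      F(Stream);
  }

  std::size_t size() const { return Streams.size(); }

private:
  std::vector<StreamT> Streams;
  std::size_t NextIdx = 0;
};
```

Each adapter would then instantiate the template with its own handle type and pass backend-specific callables to `forEachStream`, so a fix made once in the shared class reaches both backends.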

fixed off-by-one error that would cause all streams to get synchronized when no stream has been used

a898047

t4c1 requested a review from a team as a code owner June 21, 2022 08:30

t4c1 requested a review from smaslov-intel June 21, 2022 08:30

t4c1 mentioned this pull request Jun 21, 2022

[SYCL][HIP][PI] Multiple HIP streams per SYCL queue #6325

Merged

steffenlarsen approved these changes Jun 21, 2022

smaslov-intel approved these changes Jun 21, 2022

steffenlarsen merged commit 5352b42 into intel:sycl Jun 21, 2022

abagusetty added a commit to abagusetty/llvm that referenced this pull request Jun 21, 2022

fix to mimic CUDA plugin intel#6333

f2f7141

abagusetty added a commit to abagusetty/llvm that referenced this pull request Jun 30, 2022

reverts a commit CUDA's intel#6333

a2157fa

npmiller mentioned this pull request Mar 25, 2025

[UR][CUDA][HIP] Unify queue handling between adapters #17641

Merged
