[SYCL][UR][L0 v2] Fix issues with event lifetime #18962


Merged: 2 commits into intel:sycl on Jun 17, 2025

Conversation

@igchor (Member) commented Jun 12, 2025

When releasing the executionEvent in the commandBuffer, we need to also retain the queue. Otherwise, event->release() in the commandBuffer's destructor might attempt to release the event to an event pool that has already been destroyed.

Also, make sure the eventPool is destroyed after the commandListManager to avoid any issues with events owned by the commandListManager, and move the context retain/release from commandListManager to queue/command_buffer.

The implementation of commandListManager marked the move ctor and assignment operator as defaulted, which was wrong: the implementation would have to account for context and device retain/release calls. To avoid this, the logic is moved to queue/commandBuffer, where the move ctors/assignment operators can simply be deleted.
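To make the intended lifetimes concrete, here is a minimal sketch of the pattern (hypothetical, simplified types standing in for the adapter's queue, event pool, and command-list-manager classes; this illustrates the rules above, not the actual adapter code):

```cpp
#include <cstdio>

// Hypothetical, simplified stand-ins for the real v2 adapter types; only the
// destruction ordering and ownership rules described above are illustrated.
struct ur_context_t {
  int refcount = 1;
  void retain() { ++refcount; }
  void release() {
    if (--refcount == 0)
      std::puts("context destroyed");
  }
};

struct event_pool {
  ~event_pool() { std::puts("eventPool destroyed"); }
};

struct command_list_manager {
  // May still own events at destruction time; they must go back to a pool
  // that is still alive.
  ~command_list_manager() { std::puts("commandListManager destroyed"); }
};

// Small RAII holder so the context reference is dropped only after every
// other member of the queue has been destroyed.
struct context_ref {
  ur_context_t *ctx;
  explicit context_ref(ur_context_t *c) : ctx(c) { ctx->retain(); }
  ~context_ref() { ctx->release(); }
  context_ref(context_ref &&) = delete;
  context_ref &operator=(context_ref &&) = delete;
};

struct ur_queue_t {
  // Members are destroyed in reverse declaration order: commandListManager
  // first, then eventPool, then the context reference.
  context_ref context;
  event_pool eventPool;
  command_list_manager commandListManager;

  explicit ur_queue_t(ur_context_t *ctx) : context(ctx) {}

  // Deleting the move operations sidesteps having to account for
  // retain/release calls in move ctors/assignment operators.
  ur_queue_t(ur_queue_t &&) = delete;
  ur_queue_t &operator=(ur_queue_t &&) = delete;
};

int main() {
  ur_context_t ctx;
  {
    ur_queue_t queue(&ctx);
  } // prints: commandListManager, then eventPool; the context stays alive
  ctx.release(); // drop the application's own reference; context destroyed
}
```

Because C++ destroys members in reverse declaration order, any events still held by commandListManager can be returned to a live eventPool, and the context reference is the last thing dropped.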

@pbalcer (Contributor) commented Jun 13, 2025

> When releasing the executionEvent in the commandBuffer, we need to also retain the queue. Otherwise, event->release() in the commandBuffer's destructor might attempt to release the event to an event pool that has already been destroyed.

Event pools are released to the context when the queue is released. So this would mean that the command buffer outlives the context itself, not just the queue.

@pbalcer (Contributor) left a review comment


lgtm, but please add a test that releases a command buffer and queue before using the command buffer for the last time.

@igchor (Member, Author) commented Jun 13, 2025

> When releasing the executionEvent in the commandBuffer, we need to also retain the queue. Otherwise, event->release() in the commandBuffer's destructor might attempt to release the event to an event pool that has already been destroyed.

> Event pools are released to the context when the queue is released. So this would mean that the command buffer outlives the context itself, not just the queue.

Yes, and we could actually solve this problem by retaining the context in the command buffer rather than the queue. However, there is a potential problem with the execution event: if the queue associated with that event is destroyed and we later want to synchronize on the execution event, we might crash, because (as far as I know) counter-based events rely on the associated command list to keep some of their state (and the command list is potentially destroyed/reused as soon as the queue is released). I did solve this problem by retaining the queue in the command buffer, but now that I think about it, it is a more general problem. What if someone wants to pass an event as a dependency to some operation after the associated queue is destroyed?
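For illustration, a minimal sketch of the "retain the queue in the command buffer" approach mentioned above (hypothetical, simplified types; as noted later in the thread, the merged fix retains the context instead):

```cpp
#include <cstdio>

// Hypothetical stand-ins for the v2 adapter types; only the destruction
// ordering matters here.
struct ur_event_t {
  // Returning the event to its pool requires the pool (owned by the queue)
  // to still exist.
  void release() { std::puts("event returned to its pool"); }
};

struct ur_queue_t {
  int refcount = 1;
  void retain() { ++refcount; }
  void release() {
    if (--refcount == 0)
      std::puts("queue destroyed; event pools returned to the context");
  }
};

struct ur_command_buffer_t {
  explicit ur_command_buffer_t(ur_queue_t *q) : queue(q) {
    queue->retain(); // keep the queue, and therefore its event pool, alive
  }

  ~ur_command_buffer_t() {
    // Release the execution event while the queue's event pool still exists...
    if (executionEvent)
      executionEvent->release();
    // ...and only then drop the queue reference.
    queue->release();
  }

  ur_queue_t *queue;
  ur_event_t *executionEvent = nullptr;
};
```

Retaining the context instead keeps the pool's backing storage alive, but, as described above, does not help if a counter-based event still needs its already recycled command list.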

I think we might actually need to start retaining the queue when allocating events (which we have tried to avoid until now). One alternative I'm considering is to add some logic in the event pool. For example, we could create another RAII wrapper that internally stores a cache_borrowed_event_pool*, and if eventPool->events.size() != eventPool->freelist.size(), delay returning the event pool to the context (and destroying the queue) until all events are released (by adding a callback to eventPool::free).
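Roughly, the wrapper idea could look something like this (a sketch only: cache_borrowed_event_pool, its events/freelist members, and eventPool::free come from the comment above, while delayed_pool_return and onAllEventsFreed are made-up names for illustration):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of the idea, not adapter code.
struct ur_event_t {};

struct cache_borrowed_event_pool {
  std::vector<ur_event_t *> events;       // every event handed out by this pool
  std::vector<ur_event_t *> freelist;     // events currently back in the pool
  std::function<void()> onAllEventsFreed; // deferred cleanup, if any

  void free(ur_event_t *ev) {
    freelist.push_back(ev);
    if (freelist.size() == events.size() && onAllEventsFreed) {
      // The last outstanding event just came back: run the deferred cleanup
      // (return the pool to the context, release the queue, ...).
      auto cb = std::move(onAllEventsFreed);
      onAllEventsFreed = nullptr;
      cb();
    }
  }
};

// RAII wrapper sketched in the comment: on destruction it returns the pool
// to the context immediately if nothing is outstanding, otherwise it defers
// that work until the last event is freed.
class delayed_pool_return {
public:
  delayed_pool_return(cache_borrowed_event_pool *pool,
                      std::function<void()> returnPoolToContext)
      : pool(pool), returnPoolToContext(std::move(returnPoolToContext)) {}

  ~delayed_pool_return() {
    if (pool->events.size() == pool->freelist.size())
      returnPoolToContext(); // nothing outstanding
    else
      pool->onAllEventsFreed = std::move(returnPoolToContext); // defer
  }

private:
  cache_borrowed_event_pool *pool;
  std::function<void()> returnPoolToContext;
};
```

The queue's teardown could then hand its pool (and its own destruction) to such a wrapper instead of returning the pool to the context unconditionally.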

@igchor requested a review from a team as a code owner June 13, 2025 16:57
@pbalcer (Contributor) commented Jun 16, 2025

:(

> I think we might actually need to start retaining the queue when allocating events (which we have tried to avoid until now). One alternative I'm considering is to add some logic in the event pool. For example, we could create another RAII wrapper that internally stores a cache_borrowed_event_pool*, and if eventPool->events.size() != eventPool->freelist.size(), delay returning the event pool to the context (and destroying the queue) until all events are released (by adding a callback to eventPool::free).

I think this might be the best option. I'll think on it.

But let's do that as a separate PR.

@igchor force-pushed the v2_lifetime_fixes branch from 6d6a4c7 to c08ea6f on June 16, 2025 14:13
@igchor (Member, Author) commented Jun 16, 2025

> :(
>
> > I think we might actually need to start retaining the queue when allocating events (which we have tried to avoid until now). One alternative I'm considering is to add some logic in the event pool. For example, we could create another RAII wrapper that internally stores a cache_borrowed_event_pool*, and if eventPool->events.size() != eventPool->freelist.size(), delay returning the event pool to the context (and destroying the queue) until all events are released (by adding a callback to eventPool::free).
>
> I think this might be the best option. I'll think on it.
>
> But let's do that as a separate PR.

Yes, for now, I think we can just merge the fix with context retain/release - this is enough for all the current tests to pass (with ooo queue).

@igchor (Member, Author) commented Jun 16, 2025

@intel/sycl-graphs-reviewers could you please take a look?

@igchor (Member, Author) commented Jun 17, 2025

@intel/llvm-gatekeepers this is ready to be merged. The failing test is due to #18996, and the CUDA job failure is unrelated (this PR only changes L0 code).

@aelovikov-intel merged commit 1e2eede into intel:sycl on Jun 17, 2025
32 of 34 checks passed
pbalcer added a commit that referenced this pull request Jun 18, 2025
kbenzie pushed a commit that referenced this pull request Jun 18, 2025
Reverts #18962

This patch likely caused these UR CTS failures:

https://github.com/intel/llvm/actions/runs/15727687366/job/44323147798?pr=19035
> urCommandBufferAppendKernelLaunchExpTest.DuplicateSyncPoint/UR_BACKEND_LEVEL_ZERO__Intel_R__oneAPI_Unified_Runtime_over_Level_Zero_V2__Intel_R__Data_Center_GPU_Max_1100_ID0ID____________________

And is also possibly the root cause of #19034.
github-actions bot pushed a commit to oneapi-src/unified-runtime that referenced this pull request Jun 19, 2025 (same revert message as above)
kbenzie pushed a commit to oneapi-src/unified-runtime that referenced this pull request Jun 19, 2025 (same revert message as above)
steffenlarsen pushed a commit that referenced this pull request Jun 25, 2025
…9122)

This reverts commit 754bb93.

The issue was that #18962 (already reverted) introduced a bug (use of an uninitialized variable).