Skip to content

[SYCL] Keep track of signaled fences to properly recycle associated command-lists. #3215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 18, 2021

Conversation

smaslov-intel
Copy link
Contributor

@smaslov-intel smaslov-intel commented Feb 13, 2021

This fixes the leak of "command-list"s that had their associated fences not yet signaled at the time all events in that list were already completed (sporadic timing issue). In this case the fence remained non-reset and it's command-list is not returned to the command-list cache.

The solution is to track pending fences and reset those still in-use by the time of the queue destruction.

Signed-off-by: Sergey V Maslov [email protected]

@smaslov-intel smaslov-intel requested review from bso-intel and removed request for bso-intel February 13, 2021 23:40
@smaslov-intel smaslov-intel changed the title [SYCL] Wait for fence to signal at event completion to avoid leaking of associated command-list. [SYCL] Keep track of signaled fences to properly recycle associated command-lists. Feb 14, 2021
@bso-intel
Copy link
Contributor

Could you add some description of the current issue and how this PR solved it?

@smaslov-intel
Copy link
Contributor Author

Could you add some description of the current issue and how this PR solved it?

Sure, I've updated the description.

@bso-intel
Copy link
Contributor

This fixes the leak of "command-list"s that had their associated fences not yet signaled at the time all events in that list were already completed (sporadic timing issue).

That's interesting. I am not sure why this ordering issue occur.
Is this a bug in L0 driver?

@smaslov-intel
Copy link
Contributor Author

This fixes the leak of "command-list"s that had their associated fences not yet signaled at the time all events in that list were already completed (sporadic timing issue).

That's interesting. I am not sure why this ordering issue occur.
Is this a bug in L0 driver?

I don't think so, the commands in the command-lists are signaling their events, and the fence is signaled after all commands are completed and thus after all events are signaled. There apparently is some time between all events are signaled but the fence is not signaled yet.

@jandres742 could you please comment on this?

Comment on lines +530 to +531
ZE_CALL(zeFenceReset(MapEntry.second.ZeFence));
ZE_CALL(zeCommandListReset(MapEntry.first));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, here we call zeFenceReset first and then zeCommandListReset.
But, when we are about to release pi_queue, the command list was successfully reset, but fence is still in use (not reset)?
Did I misunderstand this situation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the fence and the command-list it is fencing are not reset. These 2 always come together. The purpose of the fence is to tell if command-list is still in use.

@jandres742
Copy link
Contributor

the commands in the command-lists are signaling their events, and the fence is signaled after all commands are completed and thus after all events are signaled. There apparently is some time between all events are signaled but the fence is not signaled yet.

Correct in the sense that a fence is signaled when all commands in the list have completed, which means when all the events in the list have been signaled. If the problem is that there's could be a delay between the time all events are signaled, and when the fence is signaled, that could be true, but that would be a HW delay, as we only program the signal in the fence, we dont actually signal it.

Now, if the problem is that we need this to be able to recycle the list, then the fence is something more appropriate, as there's no guarantee that all lists have events, and an event is a per command synchronization primitive, whereas a fence is a per list one, so it provides better guarantees that the list is ready for reuse.

Let me know if I'm following the problem statement correctly here.

@smaslov-intel
Copy link
Contributor Author

a delay between the time all events are signaled, and when the fence is signaled, that could be true, but that would be a HW delay

Thanks, that's exactly how I understood the issue. Now the problem with this was that we were polling the fence at events' completion, which sometime (due to mentioned delay) may be a little bit too early. In such case the fence is now remembered to be still in use, and is only polled and released (and command-list recycled) at the next earliest convenience (when a new command-list is needed, or when queue is being finalized).

@bso-intel
Copy link
Contributor

bso-intel commented Feb 16, 2021

Thanks, @jandres742 for clarifying the delay between the command-list and fence.

Now the problem with this was that we were polling the fence at events' completion, which sometime (due to mentioned delay) may be a little bit too early.

Since we know there could be a delay, do we still want this checking for the fence state at event completion?

@smaslov-intel
Copy link
Contributor Author

Since we know there could be a delay, do we still want this checking for the fence state at event completion?

Absolutely, since this is 1) cheap and 2) gets the command-list back to the cache much earlier most of the times.

@bso-intel
Copy link
Contributor

Since we know there could be a delay, do we still want this checking for the fence state at event completion?

Absolutely, since this is 1) cheap and 2) gets the command-list back to the cache much earlier most of the times.

I understand that with your PR the memory leak due to the timing is fixed now.
I was just asking if there is a way to delay the checking for the fence state at a little later time.
It sounds like there is no such way until we really reuse the command-list or destroy the queue.
Thanks.

bso-intel
bso-intel previously approved these changes Feb 16, 2021
Copy link
Contributor

@bso-intel bso-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Your analysis was really a good for this hard problem.
NIT: It would be great to leave a comment about the potential timing issue between command-list reset and fence and that's why we need to keep the live fence list and reclaim at a later time.
Thanks.

@bader
Copy link
Contributor

bader commented Feb 17, 2021

LGTM.
Your analysis was really a good for this hard problem.
NIT: It would be great to leave a comment about the potential timing issue between command-list reset and fence and that's why we need to keep the live fence list and reclaim at a later time.
Thanks.

@smaslov-intel, are you going to apply this suggestion before we merge this PR?

Signed-off-by: Sergey V Maslov <[email protected]>
@smaslov-intel
Copy link
Contributor Author

Added the source code comment clarifying why we need the change.

Copy link
Contributor

@bso-intel bso-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@bader bader merged commit 46e3c64 into intel:sycl Feb 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants