[SYCL] Keep track of signaled fences to properly recycle associated command-lists. #3215

smaslov-intel · 2021-02-13T23:40:19Z

This fixes the leak of command-lists whose associated fences were not yet signaled at the time all events in that list had already completed (a sporadic timing issue). In this case the fence remained non-reset and its command-list was not returned to the command-list cache.

The solution is to track pending fences and reset those still in-use by the time of the queue destruction.

Signed-off-by: Sergey V Maslov [email protected]

bso-intel · 2021-02-16T18:30:32Z

Could you add some description of the current issue and how this PR solved it?

smaslov-intel · 2021-02-16T18:35:27Z

> Could you add some description of the current issue and how this PR solved it?

Sure, I've updated the description.

bso-intel · 2021-02-16T19:03:52Z

> This fixes the leak of "command-list"s that had their associated fences not yet signaled at the time all events in that list were already completed (sporadic timing issue).

That's interesting. I am not sure why this ordering issue occurs.
Is this a bug in the L0 driver?

smaslov-intel · 2021-02-16T19:15:23Z

> > This fixes the leak of "command-list"s that had their associated fences not yet signaled at the time all events in that list were already completed (sporadic timing issue).
>
> That's interesting. I am not sure why this ordering issue occurs.
> Is this a bug in the L0 driver?

I don't think so. The commands in the command-lists signal their events, and the fence is signaled after all commands have completed and thus after all events are signaled. Apparently there is a short window in which all events are signaled but the fence is not signaled yet.

@jandres742 could you please comment on this?

bso-intel · 2021-02-16T19:32:52Z

sycl/plugins/level_zero/pi_level_zero.cpp

+  ZE_CALL(zeFenceReset(MapEntry.second.ZeFence));
+  ZE_CALL(zeCommandListReset(MapEntry.first));


So, here we call zeFenceReset first and then zeCommandListReset.
But when we are about to release the pi_queue, is it the case that the command-list was successfully reset while the fence is still in use (not reset)?
Did I misunderstand the situation?

smaslov-intel · 2021-02-16

Both the fence and the command-list it is fencing are not reset. These two always go together: the purpose of the fence is to tell whether the command-list is still in use.

jandres742 · 2021-02-16T20:04:54Z

> The commands in the command-lists signal their events, and the fence is signaled after all commands have completed and thus after all events are signaled. Apparently there is a short window in which all events are signaled but the fence is not signaled yet.

Correct, in the sense that a fence is signaled when all commands in the list have completed, which means when all the events in the list have been signaled. If the problem is that there could be a delay between the time all events are signaled and the time the fence is signaled, that could be true, but that would be a HW delay, as we only program the signal in the fence; we don't actually signal it.

Now, if the problem is that we need this to be able to recycle the list, then the fence is the more appropriate primitive: there's no guarantee that all lists have events, and an event is a per-command synchronization primitive, whereas a fence is a per-list one, so it provides better guarantees that the list is ready for reuse.

Let me know if I'm following the problem statement correctly here.
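
To make the event/fence distinction above concrete, here is a minimal host-side model. It is only a sketch: `ModelEvent`, `ModelCommandList` and their members are hypothetical names invented for illustration, not Level Zero or plugin types, and the fence signal is modeled as a separate step to mimic the hardware delay described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model types for illustration only; NOT the Level Zero API.
struct ModelEvent {
  bool Signaled = false;
};

struct ModelCommandList {
  // One slot per command; nullptr means the command carries no event.
  std::vector<ModelEvent *> CommandEvents;
  std::size_t Completed = 0;  // number of commands finished so far
  bool FenceSignaled = false; // per-list fence state

  void append(ModelEvent *Event) { CommandEvents.push_back(Event); }

  // The device finishes the next command and signals its event, if any.
  void completeNextCommand() {
    assert(Completed <= CommandEvents.size());
    if (Completed == CommandEvents.size())
      return;
    if (ModelEvent *Event = CommandEvents[Completed])
      Event->Signaled = true;
    ++Completed;
  }

  // Modeled as a separate, later step: the fence can only become signaled
  // once every command in the list has completed.
  void signalFence() {
    if (Completed == CommandEvents.size())
      FenceSignaled = true;
  }

  // True when every event attached to this list has been signaled.
  bool allEventsSignaled() const {
    for (const ModelEvent *Event : CommandEvents)
      if (Event && !Event->Signaled)
        return false;
    return true;
  }
};
```

In this model a list whose last command has no event can report all events signaled while a command is still outstanding, and even a fully completed list has a window before `signalFence()` runs. Both cases are why only the fence state is a safe signal for reuse.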

smaslov-intel · 2021-02-16T20:42:32Z

> a delay between the time all events are signaled and the time the fence is signaled, that could be true, but that would be a HW delay

Thanks, that's exactly how I understood the issue. Now the problem with this was that we were polling the fence at events' completion, which sometimes (due to the mentioned delay) may be a little too early. In such a case the fence is now remembered as still in use, and is only polled and released (and its command-list recycled) at the next earliest convenience: when a new command-list is needed, or when the queue is being finalized.
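
The deferred recycling described above can be sketched roughly as follows. This is a simulation under stated assumptions, not the plugin's code: `SimFence`, `SimCommandList`, `SimQueue` and all member names are hypothetical, the boolean `Signaled` stands in for a fence status query, and the two assignments in `recycle` stand in for the `zeFenceReset` and `zeCommandListReset` calls shown in the diff. It also assumes all submitted work has finished by the time `finalize` runs.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the fence and command-list handles.
struct SimFence {
  bool Signaled = false;
};
struct SimCommandList {
  int Resets = 0;
};

struct SimQueue {
  struct FenceInfo {
    SimFence *Fence;
    bool InUse; // true from submission until the pair is reset
  };

  std::unordered_map<SimCommandList *, FenceInfo> FenceMap; // list -> fence
  std::vector<SimCommandList *> Cache;                      // reusable lists

  // Record that List was submitted and is guarded by Fence.
  void submit(SimCommandList *List, SimFence *Fence) {
    FenceMap[List] = FenceInfo{Fence, true};
  }

  // Reset the fence first, then the list, and return the list to the cache.
  void recycle(SimCommandList *List, FenceInfo &Info) {
    assert(Info.InUse);
    Info.Fence->Signaled = false; // stands in for zeFenceReset
    ++List->Resets;               // stands in for zeCommandListReset
    Info.InUse = false;
    Cache.push_back(List);
  }

  // Cheap, non-blocking poll done when the list's events have completed.
  // If the fence lags behind, the entry simply stays marked as in use.
  bool onEventsCompleted(SimCommandList *List) {
    auto It = FenceMap.find(List);
    if (It == FenceMap.end() || !It->second.InUse ||
        !It->second.Fence->Signaled)
      return false;
    recycle(List, It->second);
    return true;
  }

  // A new command-list is needed: re-poll the pending fences before giving up.
  SimCommandList *acquire() {
    if (Cache.empty())
      for (auto &Entry : FenceMap)
        if (Entry.second.InUse && Entry.second.Fence->Signaled)
          recycle(Entry.first, Entry.second);
    if (Cache.empty())
      return nullptr; // a real queue would create a fresh list here
    SimCommandList *List = Cache.back();
    Cache.pop_back();
    return List;
  }

  // Queue finalization: reclaim every pair that is still marked in use.
  std::size_t finalize() {
    std::size_t Reclaimed = 0;
    for (auto &Entry : FenceMap)
      if (Entry.second.InUse) {
        recycle(Entry.first, Entry.second);
        ++Reclaimed;
      }
    return Reclaimed;
  }
};
```

The early poll in `onEventsCompleted` stays in place because it costs little and usually succeeds. The `InUse` flag is what prevents the leak when it does not.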

bso-intel · 2021-02-16T20:56:49Z

Thanks, @jandres742, for clarifying the delay between the command-list and fence.

> Now the problem with this was that we were polling the fence at events' completion, which sometimes (due to the mentioned delay) may be a little too early.

Since we know there could be a delay, do we still want to check the fence state at event completion?

smaslov-intel · 2021-02-16T21:02:19Z

> Since we know there could be a delay, do we still want to check the fence state at event completion?

Absolutely, since this 1) is cheap and 2) gets the command-list back to the cache much earlier most of the time.

bso-intel · 2021-02-16T21:08:31Z

> > Since we know there could be a delay, do we still want to check the fence state at event completion?
>
> Absolutely, since this 1) is cheap and 2) gets the command-list back to the cache much earlier most of the time.

I understand that with your PR the memory leak due to the timing is now fixed.
I was just asking whether there is a way to delay checking the fence state until a little later.
It sounds like there is no such way until we actually reuse the command-list or destroy the queue.
Thanks.

bso-intel

LGTM.
Your analysis of this hard problem was really good.
NIT: It would be great to leave a comment about the potential timing issue between command-list reset and fence, explaining why we need to keep the live fence list and reclaim at a later time.
Thanks.

bader · 2021-02-17T06:33:14Z

> LGTM.
> Your analysis of this hard problem was really good.
> NIT: It would be great to leave a comment about the potential timing issue between command-list reset and fence, explaining why we need to keep the live fence list and reclaim at a later time.
> Thanks.

@smaslov-intel, are you going to apply this suggestion before we merge this PR?

smaslov-intel · 2021-02-17T17:42:09Z

Added the source code comment clarifying why we need the change.

bso-intel

LGTM.

smaslov-intel requested review from bso-intel and removed request for bso-intel February 13, 2021 23:40

[SYCL] Keep track of signalled fences to properly recycle associated …

a289cfe

…command-lists. Signed-off-by: Sergey V Maslov <[email protected]>

smaslov-intel force-pushed the fence_sync branch from 91b6851 to a289cfe Compare February 14, 2021 22:53

smaslov-intel changed the title ~~[SYCL] Wait for fence to signal at event completion to avoid leaking of associated command-list.~~ [SYCL] Keep track of signaled fences to properly recycle associated command-lists. Feb 14, 2021

smaslov-intel requested a review from bso-intel February 14, 2021 22:56

bso-intel reviewed Feb 16, 2021

bso-intel previously approved these changes Feb 16, 2021

added comment

b1e1cdd

Signed-off-by: Sergey V Maslov <[email protected]>

smaslov-intel dismissed bso-intel’s stale review via b1e1cdd February 17, 2021 17:41

bso-intel approved these changes Feb 17, 2021

bader merged commit 46e3c64 into intel:sycl Feb 18, 2021
