-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Keep track of signaled fences to properly recycle associated command-lists. #3215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…command-lists. Signed-off-by: Sergey V Maslov <[email protected]>
91b6851
to
a289cfe
Compare
Could you add some description of the current issue and how this PR solved it? |
Sure, I've updated the description. |
That's interesting. I am not sure why this ordering issue occur. |
I don't think so, the commands in the command-lists are signaling their events, and the fence is signaled after all commands are completed and thus after all events are signaled. There apparently is some time between all events are signaled but the fence is not signaled yet. @jandres742 could you please comment on this? |
ZE_CALL(zeFenceReset(MapEntry.second.ZeFence)); | ||
ZE_CALL(zeCommandListReset(MapEntry.first)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, here we call zeFenceReset first and then zeCommandListReset.
But, when we are about to release pi_queue, the command list was successfully reset, but fence is still in use (not reset)?
Did I misunderstand this situation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the fence and the command-list it is fencing are not reset. These 2 always come together. The purpose of the fence is to tell if command-list is still in use.
Correct in the sense that a fence is signaled when all commands in the list have completed, which means when all the events in the list have been signaled. If the problem is that there's could be a delay between the time all events are signaled, and when the fence is signaled, that could be true, but that would be a HW delay, as we only program the signal in the fence, we dont actually signal it. Now, if the problem is that we need this to be able to recycle the list, then the fence is something more appropriate, as there's no guarantee that all lists have events, and an event is a per command synchronization primitive, whereas a fence is a per list one, so it provides better guarantees that the list is ready for reuse. Let me know if I'm following the problem statement correctly here. |
Thanks, that's exactly how I understood the issue. Now the problem with this was that we were polling the fence at events' completion, which sometime (due to mentioned delay) may be a little bit too early. In such case the fence is now remembered to be still in use, and is only polled and released (and command-list recycled) at the next earliest convenience (when a new command-list is needed, or when queue is being finalized). |
Thanks, @jandres742 for clarifying the delay between the command-list and fence.
Since we know there could be a delay, do we still want this checking for the fence state at event completion? |
Absolutely, since this is 1) cheap and 2) gets the command-list back to the cache much earlier most of the times. |
I understand that with your PR the memory leak due to the timing is fixed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Your analysis was really a good for this hard problem.
NIT: It would be great to leave a comment about the potential timing issue between command-list reset and fence and that's why we need to keep the live fence list and reclaim at a later time.
Thanks.
@smaslov-intel, are you going to apply this suggestion before we merge this PR? |
Signed-off-by: Sergey V Maslov <[email protected]>
Added the source code comment clarifying why we need the change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This fixes the leak of "command-list"s that had their associated fences not yet signaled at the time all events in that list were already completed (sporadic timing issue). In this case the fence remained non-reset and it's command-list is not returned to the command-list cache.
The solution is to track pending fences and reset those still in-use by the time of the queue destruction.
Signed-off-by: Sergey V Maslov [email protected]