-
Notifications
You must be signed in to change notification settings - Fork 788
[SYCL] Remove unnecessary mutex locks from release and retain functions #6192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Commit removes mutex locks from PI release and retain functions. They were added under impression that 2 threads working simultaneously on releasing resources can both reach ref count 0 and cause double free. But since our ref count is atomic and decrement is an atomic operation only single thread will execute code under if (--(PiObj->RefCount) == 0). There can't be 2 threads which can reach ref count 0. Behavior is undefined if DPCPP runtime calls somehow uses deeleted pi object (with ref count 0) or provides it as an argument to any of the pi functions.
// of the Level Zero plugin. Behavior is undefined if DPCPP runtime calls | ||
// piQueueRelease on a queue with external ref count 0 or provides such queue | ||
// as an argument to any of the pi functions. | ||
if (--(Queue->RefCountExternal) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd personally use fetch_sub
interface as not every reader is aware (I certainly wasn't one of them) that operator--
provides the guarantees and it's more likely they'll "cppreference" the member function than an overloaded operator.
if (RefCountZero) | ||
// Only single thread can reach ref count equal to 0. This means that | ||
// the code under if is executed by single thread. | ||
if (--(Sampler->RefCount) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd do an early return here for consistency of the pattern.
// on the Queue that is passed in. | ||
inline static void piQueueRetainNoLock(pi_queue Queue) { Queue->RefCount++; } | ||
// This helper function increments the internal reference counter of the Queue. | ||
inline static void piQueueRetainInternal(pi_queue Queue) { Queue->RefCount++; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just inline it?
Queue->RefCountExternal++; | ||
piQueueRetainNoLock(Queue); | ||
piQueueRetainInternal(Queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why we have to counters? Any chance that it's because we assumed we needed a whole lock for one but not for another? If so, we might be able to just merge them in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I don't miss anything, we can't merge these ref counts. External ref count equal to 0 means that there are no external references anymore and we synchronize the queue and reset command lists in this queue. It is possible that at this point internal ref count is still not zero. After all internal references to this queue disappear, i.e. internal ref count = 0 then we can delete the queue.
Shortly, external ref count is used to detect the point when we can synchronize and internal ref count is used to detect the point when we can delete.
// of the Level Zero plugin. Behavior is undefined if DPCPP runtime calls | ||
// piQueueRelease on a queue with external ref count 0 or provides such queue | ||
// as an argument to any of the pi functions. | ||
if (--(Queue->RefCountExternal) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early return please.
return PI_SUCCESS; | ||
} | ||
|
||
static pi_result QueueRelease(pi_queue Queue, pi_queue LockedQueue) { | ||
static pi_result piQueueReleaseInternal(pi_queue Queue, pi_queue LockedQueue) { | ||
PI_ASSERT(Queue, PI_INVALID_QUEUE); | ||
PI_ASSERT(Queue->RefCount, PI_INVALID_QUEUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you've removed a similar assert somewhere else in this patch. Why not here?
But SYCL RT does not have visibility of ref-count or when the object is destroyed. It is true, however, that SYCL RT cannot use PI objects that it had released it's own last reference to (there can be other references to it, e.g. from other PI objects).
This sounds right to me, only one thread can get to 0. But is this change just a performance optimization or something more? What if we pretend the ref-count is not atomic and continue to guard it's access, would it be still correct? |
@@ -239,13 +239,46 @@ class pi_shared_mutex : public std::shared_mutex { | |||
} | |||
}; | |||
|
|||
// This wrapper around std::atomic is created to limit operations with reference | |||
// counter and to make allowed operations more transparent in terms of | |||
// thread-safety in the plugin. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain in comment that these operations do not need a mutex guard around them since the underlying data is already atomic (plus extra semantics that is explained for "decrement_and_count")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
return PI_SUCCESS; | ||
} | ||
|
||
static pi_result QueueRelease(pi_queue Queue, pi_queue LockedQueue) { | ||
static pi_result piQueueReleaseInternal(pi_queue Queue, pi_queue LockedQueue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LockedQueue has no user. Consider to delete it or add (void)LockedQueue?
error: unused parameter 'LockedQueue'
intel#6192 removed the need for the LockedQueue parameter in piQueueReleaseInternal. These changes remove the parameter from the function. Signed-off-by: Larsen, Steffen <[email protected]>
Commit removes mutex locks from PI release and retain functions. They
were added under impression that 2 threads working simultaneously on
releasing resources can both reach ref count 0 and cause double free.
But since our ref count is atomic and decrement is an atomic operation
only single thread will execute code under if (--(PiObj->RefCount) == 0).
There can't be 2 threads which can reach ref count 0.
Behavior is undefined if DPCPP runtime somehow uses deleted pi object
(with ref count 0) or provides it as an argument to any of the pi functions.