Skip to content

[SYCL] Remove unnecessary mutex locks from release and retain functions #6192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 31, 2022

Conversation

againull
Copy link
Contributor

@againull againull commented May 25, 2022

Commit removes mutex locks from PI release and retain functions. They
were added under impression that 2 threads working simultaneously on
releasing resources can both reach ref count 0 and cause double free.
But since our ref count is atomic and decrement is an atomic operation
only single thread will execute code under if (--(PiObj->RefCount) == 0).
There can't be 2 threads which can reach ref count 0.

Behavior is undefined if DPCPP runtime somehow uses deleted pi object
(with ref count 0) or provides it as an argument to any of the pi functions.

Commit removes mutex locks from PI release and retain functions. They
were added under impression that 2 threads working simultaneously on
releasing resources can both reach ref count 0 and cause double free.
But since our ref count is atomic and decrement is an atomic operation
only single thread will execute code under if (--(PiObj->RefCount) == 0).
There can't be 2 threads which can reach ref count 0.

Behavior is undefined if DPCPP runtime calls somehow uses deeleted pi object
(with ref count 0) or provides it as an argument to any of the pi functions.
@againull againull requested a review from a team as a code owner May 25, 2022 17:16
// of the Level Zero plugin. Behavior is undefined if DPCPP runtime calls
// piQueueRelease on a queue with external ref count 0 or provides such queue
// as an argument to any of the pi functions.
if (--(Queue->RefCountExternal) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd personally use fetch_sub interface as not every reader is aware (I certainly wasn't one of them) that operator-- provides the guarantees and it's more likely they'll "cppreference" the member function than an overloaded operator.

if (RefCountZero)
// Only single thread can reach ref count equal to 0. This means that
// the code under if is executed by single thread.
if (--(Sampler->RefCount) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd do an early return here for consistency of the pattern.

// on the Queue that is passed in.
inline static void piQueueRetainNoLock(pi_queue Queue) { Queue->RefCount++; }
// This helper function increments the internal reference counter of the Queue.
inline static void piQueueRetainInternal(pi_queue Queue) { Queue->RefCount++; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just inline it?

Queue->RefCountExternal++;
piQueueRetainNoLock(Queue);
piQueueRetainInternal(Queue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why we have to counters? Any chance that it's because we assumed we needed a whole lock for one but not for another? If so, we might be able to just merge them in this PR.

Copy link
Contributor Author

@againull againull May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't miss anything, we can't merge these ref counts. External ref count equal to 0 means that there are no external references anymore and we synchronize the queue and reset command lists in this queue. It is possible that at this point internal ref count is still not zero. After all internal references to this queue disappear, i.e. internal ref count = 0 then we can delete the queue.
Shortly, external ref count is used to detect the point when we can synchronize and internal ref count is used to detect the point when we can delete.

// of the Level Zero plugin. Behavior is undefined if DPCPP runtime calls
// piQueueRelease on a queue with external ref count 0 or provides such queue
// as an argument to any of the pi functions.
if (--(Queue->RefCountExternal) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early return please.

return PI_SUCCESS;
}

static pi_result QueueRelease(pi_queue Queue, pi_queue LockedQueue) {
static pi_result piQueueReleaseInternal(pi_queue Queue, pi_queue LockedQueue) {
PI_ASSERT(Queue, PI_INVALID_QUEUE);
PI_ASSERT(Queue->RefCount, PI_INVALID_QUEUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you've removed a similar assert somewhere else in this patch. Why not here?

@smaslov-intel
Copy link
Contributor

Behavior is undefined if DPCPP runtime somehow uses deleted pi object
(with ref count 0) or provides it as an argument to any of the pi functions.

But SYCL RT does not have visibility of ref-count or when the object is destroyed. It is true, however, that SYCL RT cannot use PI objects that it had released it's own last reference to (there can be other references to it, e.g. from other PI objects).

There can't be 2 threads which can reach ref count 0.

This sounds right to me, only one thread can get to 0. But is this change just a performance optimization or something more? What if we pretend the ref-count is not atomic and continue to guard it's access, would it be still correct?

@@ -239,13 +239,46 @@ class pi_shared_mutex : public std::shared_mutex {
}
};

// This wrapper around std::atomic is created to limit operations with reference
// counter and to make allowed operations more transparent in terms of
// thread-safety in the plugin.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain in comment that these operations do not need a mutex guard around them since the underlying data is already atomic (plus extra semantics that is explained for "decrement_and_count")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment.

@againull againull merged commit b545c60 into intel:sycl May 31, 2022
return PI_SUCCESS;
}

static pi_result QueueRelease(pi_queue Queue, pi_queue LockedQueue) {
static pi_result piQueueReleaseInternal(pi_queue Queue, pi_queue LockedQueue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LockedQueue has no user. Consider to delete it or add (void)LockedQueue?
error: unused parameter 'LockedQueue'

steffenlarsen added a commit to steffenlarsen/llvm that referenced this pull request Jun 1, 2022
intel#6192 removed the need for the
LockedQueue parameter in piQueueReleaseInternal. These changes remove
the parameter from the function.

Signed-off-by: Larsen, Steffen <[email protected]>
@againull againull deleted the remove_lock_from_release branch December 3, 2022 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants