
[SYCL] Fix crash and deadlock in level0 plugin when using multiple threads #2550


Conversation


@alexanderfle alexanderfle commented Sep 28, 2020

This patch fixes sporadic thread-safety bugs on Windows when the Level Zero plugin is used:

  • Sporadic crash in piEventsWait():
    one thread executes EventList[I]->ZeCommandList = nullptr;
    while another thread has locked ZeCommandListFenceMapMutex and, by the time it uses EventList[I]->ZeCommandList, the pointer is already nullptr.
    -- Now the thread checks EventList[I]->ZeCommandList only after locking ZeCommandListFenceMapMutex.
  • Sporadic deadlock in getAvailableCommandList():
    the mutexes were locked in the wrong order in this function, which led to a sporadic hang.
    -- Now the locking order matches the rest of the code: ZeCommandListFenceMapMutex is locked first, then ZeCommandListCacheMutex.
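
A minimal sketch of the two patterns, using simplified stand-in types (the member and mutex names come from the description above; everything else is reduced for illustration and is not the actual plugin code):

```cpp
#include <mutex>

struct _pi_queue; // details omitted in this sketch

struct _pi_event {
  _pi_queue *Queue = nullptr;
  void *ZeCommandList = nullptr; // stands in for ze_command_list_handle_t
};

struct _pi_queue {
  std::mutex ZeCommandListFenceMapMutex;
  std::mutex ZeCommandListCacheMutex;
  // ... fence map and command-list cache omitted ...
};

// Fix 1: acquire the lock first, then test ZeCommandList, so another thread
// cannot reset the pointer between the check and its use.
void waitAndResetSketch(_pi_event *Event) {
  std::lock_guard<std::mutex> Lock(Event->Queue->ZeCommandListFenceMapMutex);
  if (Event->ZeCommandList) {
    // ... look up the fence for this command list and reset/recycle it ...
    Event->ZeCommandList = nullptr;
  }
}

// Fix 2: always take the two mutexes in the same order
// (ZeCommandListFenceMapMutex, then ZeCommandListCacheMutex) so two threads
// entering getAvailableCommandList() cannot deadlock on each other.
void getAvailableCommandListSketch(_pi_queue *Queue) {
  std::lock_guard<std::mutex> FenceLock(Queue->ZeCommandListFenceMapMutex);
  std::lock_guard<std::mutex> CacheLock(Queue->ZeCommandListCacheMutex);
  // ... take a command list from the cache or create a new one ...
}
```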

Signed-off-by: Alexander Flegontov [email protected]

@alexanderfle

My observation about the sporadic crash in piEventsWait():
While debugging, I see one thread setting EventList[I]->ZeCommandList to nullptr while another thread has already passed the check if (EventList[I]->ZeCommandList). When the context switches back to that second thread, it continues as if EventList[I]->ZeCommandList were still non-null and indexes EventList[I]->Queue->ZeCommandListFenceMap with a null pointer.
I printed the pointers in both threads (before one of them sets the pointer to nullptr) and they are the same, which is what leads to the crash.
I guess ZeCommandLists are reused across threads, but it looks like they start to be reused before the other threads have released them.
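
A hypothetical stand-alone reproduction of that interleaving (not plugin code; the names are borrowed from the description, and a thread sanitizer would flag the unsynchronized check-then-use):

```cpp
#include <mutex>
#include <thread>

struct Event {
  std::mutex FenceMapMutex;
  int *ZeCommandList = nullptr; // stands in for the real command-list handle
};

int main() {
  int Dummy = 0;
  Event E;
  E.ZeCommandList = &Dummy;

  // Thread A: the racy pattern -- the pointer is checked before any lock is
  // taken, as in the original piEventsWait.
  std::thread A([&] {
    if (E.ZeCommandList) {      // check without holding FenceMapMutex
      int V = *E.ZeCommandList; // may execute after B has nulled the pointer
      (void)V;
    }
  });

  // Thread B: recycles the command list and clears the pointer under the lock.
  std::thread B([&] {
    std::lock_guard<std::mutex> L(E.FenceMapMutex);
    E.ZeCommandList = nullptr;
  });

  A.join();
  B.join();
  return 0;
}
```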

@smaslov-intel

a thread is setting EventList[I]->ZeCommandList to nullptr, while another thread has already done the check if (EventList[I]->ZeCommandList)

But this means that the same PI event was simultaneously passed here from different threads, right? The ZeCommandList belongs to a PI event and is not shared between different events. The plugin only guards (with mutexes) access to data that is actually shared, not this. It seems to me that this is a bug in the SYCL program (or SYCL RT) and synchronization is needed there.

If I am not right here, then we have a much bigger problem than just this (as I said, we do not guard all of the contents of PI objects against multi-threaded access unless the data is shared between PI objects).


@smaslov-intel smaslov-intel left a comment


Please separate out the NFC change of switching lock/unlock to lock_guard from the real fix. The "fix" in piEventsWait should also go in by itself (if it should go in at all).

@alexanderfle

But this means that the same PI event was simultaneously passed here from different threads, right?

Right, but I will check that.

It seems to me that this is a bug in the SYCL program (or SYCL RT) and synchronization is needed there.

Strange; if that bug were somewhere else, I don't know why the same application works with the OpenCL backend.

Please separate out the NFC change of switching lock/unlock to lock_guard from the real fix.

Do you suggest reverting the lock/unlock to lock_guard changes in the current PR and preparing a separate ticket for them, to make it clear what the real fix is?

@smaslov-intel

Do you suggest reverting the lock/unlock to lock_guard changes in the current PR and preparing a separate ticket for them, to make it clear what the real fix is?

Yes, please.

if that bug were somewhere else, I don't know why the same application works with the OpenCL backend.

Might be a timing issue, data races are sporadic in nature. Or maybe OpenCL wait for events is meant to be non-destructive, and can really be called by multiple threads on the same events. I can't see anything discussing this here: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clWaitForEvents.html

@romanovvlad, what do you think?

@romanovvlad


Might be a timing issue, data races are sporadic in nature. Or maybe OpenCL wait for events is meant to be non-destructive, and can really be called by multiple threads on the same events. I can't see anything discussing this here: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clWaitForEvents.html

@romanovvlad, what do you think?

The OpenCL 2.1 specification says:
All API calls except clSetKernelArg, clSetKernelArgSVMPointer, clSetKernelExecInfo and clCloneKernel are thread-safe.
So, it should be legal and safe to call any API (except those mentioned above) with the same or different arguments from multiple threads. Since the PI API is based on the OpenCL API, I would expect the rule to apply to the PI API as well.

@smaslov-intel

it should be legal and safe to call any API (except those mentioned above) with the same or different arguments from multiple threads.

@romanovvlad : but this effectively means that (almost) every PI API needs to lock for exclusive execution, since there is a lot of internal state in PI objects that changes during a PI call (and thus another PI call on the same handle should wait until the current one is finished). It is especially strange that this has not come up before; apparently applications aren't doing what you think is legal.

@kbobrovs : do you have an opinion here?

FWIW, Level-Zero API is "free-threaded" https://spec.oneapi.com/level-zero/latest/core/INTRO.html#multithreading-and-concurrency:

multiple, **simultaneous threads may operate on independent driver objects** with no implicit thread-locks


romanovvlad commented Sep 30, 2020

it should be legal and safe to call any API (except those mentioned above) with the same or different arguments from multiple threads.

@romanovvlad : but this effectively means that (almost) every PI API needs to lock for exclusive execution, since there is a lot of internal state in PI objects that changes during a PI call (and thus another PI call on the same handle should wait until the current one is finished).

Could you please elaborate on why that is so? Can't only write operations to the internal state be exclusive?
UPD. After reading the link you posted, I see that a more complicated mechanism is needed: one that makes sure only one API call operates on a given handle at a time. So the issue is not only with calling the same API with the same handle, but also with different APIs that take the same handle; for example, calling piWaitForEvent and piGetEventInfo with the same event can cause problems.

It is especially strange that this has not come up before; apparently applications aren't doing what you think is legal.

I believe the reason for this is that most applications are single-threaded.
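
One way to express the per-handle requirement described above, sketched here with made-up names rather than the actual PI entry points: give every PI object its own mutex and have each call that takes the handle hold it for the entire call, so that, for example, a wait and a get-info on the same event cannot interleave.

```cpp
#include <mutex>

struct _pi_event {
  std::mutex Mutex;       // one lock per PI object (illustrative only)
  bool Completed = false;
  // ... the rest of the event state ...
};

// Each entry point that receives the handle holds its lock for the whole
// call, so two different APIs touching the same event are serialized.
void eventWaitSketch(_pi_event *Event) {
  std::lock_guard<std::mutex> Lock(Event->Mutex);
  // ... wait on the underlying Level Zero fence/event and clean up state ...
  Event->Completed = true;
}

void eventGetInfoSketch(_pi_event *Event, bool *IsCompleted) {
  std::lock_guard<std::mutex> Lock(Event->Mutex);
  *IsCompleted = Event->Completed;
}
```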


kbobrovs commented Sep 30, 2020

@kbobrovs : do you have an opinion here?

Originally PI was conceived to match the OpenCL execution model, where most of the APIs are thread-safe. It looks like we should review at least the Level Zero plugin implementation with this in mind.

Could you please elaborate on why that is so? Can't only write operations to the internal state be exclusive?

Not sure if I correctly get the point here, but reads should be synchronized with writes, obviously.

val = ThreadA::read(X)
ThreadB::write(X)
ThreadA::use(val)

will likely cause problems.

@smaslov-intel

Could you please elaborate on why that is so? Can't only write operations to the internal state be exclusive?

This is because the entire contents of each PI object then become shared, since multiple PI APIs may access them simultaneously from different threads. We would then need to lock each PI call for the entire time it works (both reads and writes) with the related PI objects. That sounds like a big synchronization overhead, and I wouldn't jump to doing it until we are absolutely sure.

@smaslov-intel

@alexanderfle : Given the input about needing to be truly thread-safe, we need to rework how we lock. We should no longer lock individual pieces of shared data in the PI objects (the various mutexes in the PI queue in this case), but lock the entire PI object for exclusive access. Let's also add a control (an env var) that skips the locking entirely, so we can evaluate the performance impact of such locking and possibly use it for apps known to be single-threaded.
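
A sketch of what that control could look like; both the environment variable name and the helper class below are hypothetical, not the actual implementation:

```cpp
#include <cstdlib>
#include <mutex>

// Hypothetical control: read once at startup; setting it to 0 disables all
// PI-object locking so the cost of serialization can be measured.
static const bool SerializePiCalls = [] {
  const char *Env = std::getenv("SYCL_PI_LEVEL0_SERIALIZE"); // made-up name
  return Env == nullptr || std::atoi(Env) != 0;              // default: on
}();

// Scoped lock that becomes a no-op when locking is disabled.
class pi_scoped_lock {
  std::unique_lock<std::mutex> Lock;

public:
  explicit pi_scoped_lock(std::mutex &M) {
    if (SerializePiCalls)
      Lock = std::unique_lock<std::mutex>(M);
  }
};

// Usage inside a PI entry point (sketch):
//   pi_scoped_lock Guard(Queue->PiQueueMutex); // serializes the whole call,
//                                              // or does nothing if disabled
```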

kbenzie added a commit to kbenzie/intel-llvm that referenced this pull request Feb 17, 2025: Use PROJECT_SOURCE_DIR instead of CMAKE_SOURCE_DIR
Chenyang-L pushed a commit that referenced this pull request Feb 18, 2025: Use PROJECT_SOURCE_DIR instead of CMAKE_SOURCE_DIR