-
Notifications
You must be signed in to change notification settings - Fork 787
[SYCL] Fix crash and deadlock in level0 plugin when using multiple threads #2550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Fix crash and deadlock in level0 plugin when using multiple threads #2550
Conversation
…reads Signed-off-by: Alexander Flegontov <[email protected]>
My observation about the sporadic crash in |
But this means that the same PI event was was simultaneously passed to here from different threads, right? The If I am not right here, then we must have a much bigger problem than just this (like I said we do not guard for multi-threaded access all of the contents of PI objects unless it is shared between PI objects). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please separate out the NFC change of changing lock/unlock to lock_gurad from the real fix. And also the "fix" in piEventsWAit should go (if it should at all) by itself too.
right, but I will check that.
strange, if that bug is in other places, then I don't know why it works for the same application using OpenCL backend.
Do you suggest reverting the changes lock/unlock to lock_gurad in the current PR and prepare a separate ticket for that? |
Yes, please.
Might be a timing issue, data races are sporadic in nature. Or maybe OpenCL wait for events is meant to be non-destructive, and can really be called by multiple threads on the same events. I can't see anything discussing this here: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clWaitForEvents.html @romanovvlad, what do you think? |
The OpenCL 2.1 specification says: |
…tiple threads" This reverts commit 39b5bde.
Signed-off-by: Alexander Flegontov <[email protected]>
@romanovvlad : but this effectively means that (almost) all PI API need to be completely locking for exclusive execution, since there is many internal state in PI objects that's transforming during a PI call (and thus other PI call on the same handle should wait until current one is finished). It is especially strange that this had not come up before, and apparently applications aren't doing what you think is legal. @kbobrovs : do you have an opinion here? FWIW, Level-Zero API is "free-threaded" https://spec.oneapi.com/level-zero/latest/core/INTRO.html#multithreading-and-concurrency:
|
Could you please elaborate why it is so? Can't only write operations to internal state be exclusive?
I believe the reason for this is that most applications are single threaded. |
Originally PI was conceived to match OpenCL execution model, where most of the APIs are thread-safe. Looks like we should review at least Level Zero plugin implementation for this matter.
Not sure if I correctly get the point here, but reads should be synchronized with writes, obviously.
will likely cause problems. |
This is because the entire contents of each PI objects then becomes shared since multiple PI API may then access them simultaneously from different threads. We then need to lock each PI call for the entirety of working (both read and write) with related PI objects. That sounds like a big synchronization overhead and I wouldn't jump to doing so until we are absolutely sure. |
@alexanderfle : Given the input about needing to be truly truly-safe, we need to rework how we lock. We should no longer lock individual share data in the PI objects (various mutexes in PI queue in this case), but lock the entire PI objects for exclusive access. Let's also add a control (env var) that will skip the locking entirely, so we can evaluate the performance impact of such locking, and possibly use it for apps known to be single-threaded. |
…e-dir Use PROJECT_SOURCE_DIR instead of CMAKE_SOURCE_DIR
Use PROJECT_SOURCE_DIR instead of CMAKE_SOURCE_DIR
This patch fixes sporadic thread-safety bugs on Windows when level0 plugin is used:
piEventsWait()
:one thread executes
EventList[I]->ZeCommandList = nullptr;
while another thread locks
ZeCommandListFenceMapMutex
and when it is time to useEventList[I]->ZeCommandList
it already isnullptr
.-- Now the thread will check
EventList[I]->ZeCommandList
only after lockingZeCommandListFenceMapMutex
getAvailableCommandList()
:wrong order of locking mutexes in the function led to sporadic hang.
-- Now the order is aligned order as elsewhere: first lock
ZeCommandListFenceMapMutex
and thenZeCommandListCacheMutex
.Signed-off-by: Alexander Flegontov [email protected]