Skip to content

[SYCL] fix for leaking commands when exception thrown #16618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
28de9d5
when enqueueing a command and its dependencies, and exception might b…
cperkinsintel Jan 14, 2025
7279592
Merge branch 'sycl' into cperkins-cmd-mem-leak-fix
cperkinsintel Jan 21, 2025
e24a731
fix OTHER memory release path and test both
cperkinsintel Jan 21, 2025
626a833
restoring other CleanUp code which is used by queue memcpy ops
cperkinsintel Jan 23, 2025
9401ae1
interesting and excellent.
cperkinsintel Jan 24, 2025
a03da62
blind fix
cperkinsintel Jan 24, 2025
113927d
checkpoint. cleanup needed
cperkinsintel Jan 28, 2025
59ae241
checkpoint
cperkinsintel Jan 28, 2025
5ac2f12
another checkpoint
cperkinsintel Jan 28, 2025
ea2fe36
ready for more testing. Probably needs clang-format fixes.
cperkinsintel Jan 29, 2025
194b47e
misery loves clang-format
cperkinsintel Jan 29, 2025
0d064b9
improvements. need CI to check GPU side. may need clang-format
cperkinsintel Jan 30, 2025
ce964b9
removing shutdown_win() refs and clang-forrmattery
cperkinsintel Jan 30, 2025
fd0052b
fix of dllMain test. will need clang-format
cperkinsintel Jan 31, 2025
e85ac9b
clang-formation
cperkinsintel Jan 31, 2025
334dfcf
improve comment
cperkinsintel Jan 31, 2025
fa814ca
changes based on timing tests and improve write up
cperkinsintel Feb 4, 2025
724d6e0
XPTI juggling
cperkinsintel Feb 4, 2025
2e61843
safety
cperkinsintel Feb 4, 2025
b396f7f
reformat
cperkinsintel Feb 4, 2025
392a1fa
clang-format
cperkinsintel Feb 4, 2025
1bd79cb
clang-format has a bug, apparently
cperkinsintel Feb 4, 2025
32de860
U is for unit tests and unified runtime
cperkinsintel Feb 5, 2025
db91aba
forgot to add file
cperkinsintel Feb 5, 2025
de45858
you can tell it's late
cperkinsintel Feb 5, 2025
fce2915
Merge branch 'sycl' into cperkins-cmd-mem-leak-fix
cperkinsintel Feb 5, 2025
6969256
Merge branch 'sycl' into cperkins-cmd-mem-leak-fix
cperkinsintel Feb 5, 2025
7d08a9b
clarity
cperkinsintel Feb 6, 2025
d8058c6
test
cperkinsintel Feb 7, 2025
d7730d8
Merge branch 'cperkins-cmd-mem-leak-fix' of https://github.com/cperki…
cperkinsintel Feb 7, 2025
802858e
ci
cperkinsintel Feb 7, 2025
ec4ee66
Merge branch 'cperkins-cmd-mem-leak-fix' of https://github.com/cperki…
cperkinsintel Feb 7, 2025
ea328de
with feathers
cperkinsintel Feb 7, 2025
6193172
not seeing problem locally
cperkinsintel Feb 7, 2025
46490bc
did I say with feathers already?
cperkinsintel Feb 7, 2025
432ecd6
d'oh
cperkinsintel Feb 8, 2025
e61ccfa
hm
cperkinsintel Feb 8, 2025
442d496
Merge branch 'cperkins-cmd-mem-leak-fix' of https://github.com/cperki…
cperkinsintel Feb 8, 2025
a20e93b
adjust unit tests to avoid interactions. might need clang-format
cperkinsintel Feb 10, 2025
3978329
no choice
cperkinsintel Feb 10, 2025
eea07da
clang-format
cperkinsintel Feb 11, 2025
7afd027
more clang-format
cperkinsintel Feb 11, 2025
6fdfa2e
again
cperkinsintel Feb 11, 2025
fac5ffa
moar logging. sad.
cperkinsintel Feb 11, 2025
134e3a9
test
cperkinsintel Feb 11, 2025
9f5556f
transfer
cperkinsintel Feb 11, 2025
c088821
latest win reattempt
cperkinsintel Feb 12, 2025
bde54c6
overlooked. should have been removed
cperkinsintel Feb 12, 2025
f2b9298
relocate
cperkinsintel Feb 12, 2025
a6ae675
another change
cperkinsintel Feb 12, 2025
92c9725
won't work
cperkinsintel Feb 12, 2025
10ac7ca
elegant or kludge? a lady has her secrets
cperkinsintel Feb 12, 2025
cf5d553
improvement
cperkinsintel Feb 12, 2025
f85aa0f
updated comments, temporarily turning off some logging
cperkinsintel Feb 12, 2025
a95284c
restoring
cperkinsintel Feb 13, 2025
6314b63
clang-format
cperkinsintel Feb 13, 2025
f12f495
knew this was coming. Hopefully it won't upset any tests
cperkinsintel Feb 13, 2025
cf8c1b4
cleanup
cperkinsintel Feb 13, 2025
979d6b2
no finishWait on win in threadpool destructor
cperkinsintel Feb 14, 2025
8d111f1
remove stray line. ( Actually, I just want to try a new test run )
cperkinsintel Feb 19, 2025
caedef9
reviewer feedback
cperkinsintel Feb 21, 2025
3b93f64
more reviewer feedback before splitting into two PR
cperkinsintel Feb 21, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
73 changes: 66 additions & 7 deletions sycl/doc/design/GlobalObjectsInRuntime.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,9 +55,65 @@ Deinitialization is platform-specific. Upon application shutdown, the DPC++
runtime frees memory pointed by `GlobalHandler` global pointer, which triggers
destruction of nested `std::unique_ptr`s.

### Shutdown Tasks and Challenges

As the user's app ends, SYCL's primary goal is to release any UR adapters that
have been gotten, and teardown the plugins/adapters themselves. Additionally,
we need to stop deferring any new buffer releases and clean up any memory
whose release was deferred.

To this end, the shutdown occurs in two phases: early and late. The purpose
for eary shutdown is primarily to stop any further deferring of memory release.
This is because the deferred memory release is based on threads and on Windows
the threads will be abandoned. So as soon as possible we want to stop deferring
memory and try to let go any that has been deferred. The purpose for late
shutdown is to hold onto the handles and adapters longer than the user's
application. We don't want to initiate late shutdown until after all the users
static and thread local vars have been destroyed, in case those destructors are
calling SYCL.

In the early shutdown we stop deferring, tell the scheduler to prepare for release, and
try releasing the memory that has been deferred so far. Following this, if
the user has any global or static handles to sycl objects, they'll be destroyed.
Finally, the late shutdown routine is called the last of the UR handles and
adapters are let go, as is the GlobalHandler itself.


#### Threads
The deferred memory marshalling is built on a thread pool, but there is a
challenge here in that on Windows, once the end of the users main() is reached
and their app is shutting down, the Windows OS will abandon all remaining
in-flight threads. These threads can be .join() but they simply return instantly,
the threads are not completed. Furthermore, any thread specific variables
(or thread_local static vars) will NOT have their destructors called. Note
that the standard while-loop-over-condition-var pattern will cause a hang -
we cannot "wait" on abandoned threads.
On Windows, short of adding some user called API to signal this, there is
no way to detect or avoid this. None of the "end-of-library" lifecycle events
occurs before the threads are abandoned. ( not std::atexit(), not globals or
static, or static thread_local var destruction, not DllMain(DLL_PROCESS_DETACH) )
This means that on Windows, once we arrive at shutdown_early we cannot wait on
host events or the thread pool.

For the deferred memory itself, there is no issue here. The Windows OS will
reclaim the memory for us. The issue of which we must be wary is placing UR
handles (and similar) in host threads. The RAII mechanism of unique and
shared pointers will not work in any thread that is abandoned on Windows.

One last note about threads. It is entirely the OS's discretion when to
start or schedule a thread. If the main process is very busy then it is
possible that threads the SYCL library creates (host_tasks/thread_pool)
won't even be started until AFTER the host application main() function is done.
This is not a normal occurrence, but it can happen if there is no call to queue.wait()


### Linux

On Linux DPC++ runtime uses `__attribute__((destructor))` property with low
On Linux, the "early_shutdown()" is begun by the destruction of a static
StaticVarShutdownHandler object, which is initialized by
platform::get_platforms().

late_shutdown() timing uses `__attribute__((destructor))` property with low
priority value 110. This approach does not guarantee, that `GlobalHandler`
destructor is the last thing to run, as user code may contain a similar function
with the same priority value. At the same time, users may specify priorities
Expand All @@ -72,10 +128,14 @@ times, the memory leak may impact code performance.

### Windows

To identify shutdown moment on Windows, DPC++ runtime uses default `DllMain`
function with `DLL_PROCESS_DETACH` reason. This guarantees, that global objects
deinitialization happens right before `sycl.dll` is unloaded from process
address space.
Differing from Linux, on Windows the "early_shutdown()" is begun by
DllMain(PROCESS_DETACH), unless statically linked.

The "late_shutdown()" is begun by the destruction of a
static StaticVarShutdownHandler object, which is initialized by
platform::get_platforms(). (On Linux, this is when we do "early_shutdown()".
Go figure.) This is as late as we can manage, but it is later than any user
application global, static, or thread_local variable destruction.

### Recommendations for DPC++ runtime developers

Expand Down Expand Up @@ -109,8 +169,7 @@ for (adapter in initializedAdapters) {
urLoaderTearDown();
```

Which in turn is called by either `shutdown_late()` or `shutdown_win()`
depending on platform.
Which in turn is called by `shutdown_late()`.

![](images/adapter-lifetime.jpg)

Expand Down
100 changes: 64 additions & 36 deletions sycl/source/detail/global_handler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -37,9 +37,11 @@ using LockGuard = std::lock_guard<SpinLock>;
SpinLock GlobalHandler::MSyclGlobalHandlerProtector{};

// forward decl
void shutdown_win(); // TODO: win variant will go away soon
void shutdown_early();
void shutdown_late();
#ifdef _WIN32
BOOL isLinkedStatically();
#endif

// Utility class to track references on object.
// Used for GlobalHandler now and created as thread_local object on the first
Expand All @@ -60,10 +62,6 @@ class ObjectUsageCounter {

LockGuard Guard(GlobalHandler::MSyclGlobalHandlerProtector);
MCounter--;
GlobalHandler *RTGlobalObjHandler = GlobalHandler::getInstancePtr();
if (RTGlobalObjHandler) {
RTGlobalObjHandler->prepareSchedulerToRelease(!MCounter);
}
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~ObjectUsageCounter", e);
}
Expand Down Expand Up @@ -237,24 +235,37 @@ void GlobalHandler::releaseDefaultContexts() {
MPlatformToDefaultContextCache.Inst.reset(nullptr);
}

struct EarlyShutdownHandler {
~EarlyShutdownHandler() {
// Shutdown is split into two parts. shutdown_early() stops any more
// objects from being deferred and takes an initial pass at freeing them.
// shutdown_late() finishes and releases the adapters and the GlobalHandler.
// For Windows, early shutdown is typically called from DllMain,
// and late shutdown is here.
// For Linux, early shutdown is here, and late shutdown is called from
// a low priority destructor.
struct StaticVarShutdownHandler {

~StaticVarShutdownHandler() {
try {
#ifdef _WIN32
// on Windows we keep to the existing shutdown procedure
GlobalHandler::instance().releaseDefaultContexts();
// If statically linked, DllMain will not be called. So we do its work
// here.
if (isLinkedStatically()) {
shutdown_early();
}

shutdown_late();
#else
shutdown_early();
#endif
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~EarlyShutdownHandler",
e);
__SYCL_REPORT_EXCEPTION_TO_STREAM(
"exception in ~StaticVarShutdownHandler", e);
}
}
};

void GlobalHandler::registerEarlyShutdownHandler() {
static EarlyShutdownHandler handler{};
void GlobalHandler::registerStaticVarShutdownHandler() {
static StaticVarShutdownHandler handler{};
}

bool GlobalHandler::isOkToDefer() const { return OkToDefer; }
Expand Down Expand Up @@ -287,35 +298,29 @@ void GlobalHandler::prepareSchedulerToRelease(bool Blocking) {
#ifndef _WIN32
if (Blocking)
drainThreadPool();
#endif
if (MScheduler.Inst)
MScheduler.Inst->releaseResources(Blocking ? BlockingT::BLOCKING
: BlockingT::NON_BLOCKING);
#endif
}

void GlobalHandler::drainThreadPool() {
if (MHostTaskThreadPool.Inst)
MHostTaskThreadPool.Inst->drain();
}

#ifdef _WIN32
// because of something not-yet-understood on Windows
// threads may be shutdown once the end of main() is reached
// making an orderly shutdown difficult. Fortunately, Windows
// itself is very aggressive about reclaiming memory. Thus,
// we focus solely on unloading the adapters, so as to not
// accidentally retain device handles. etc
void shutdown_win() {
GlobalHandler *&Handler = GlobalHandler::getInstancePtr();
Handler->unloadAdapters();
}
#else
void shutdown_early() {
const LockGuard Lock{GlobalHandler::MSyclGlobalHandlerProtector};
GlobalHandler *&Handler = GlobalHandler::getInstancePtr();
if (!Handler)
return;

#if defined(XPTI_ENABLE_INSTRUMENTATION) && defined(_WIN32)
if (xptiTraceEnabled())
return; // When doing xpti tracing, we can't safely shutdown on Win.
// TODO: figure out why XPTI prevents release.
#endif

// Now that we are shutting down, we will no longer defer MemObj releases.
Handler->endDeferredRelease();

Expand All @@ -337,6 +342,12 @@ void shutdown_late() {
if (!Handler)
return;

#if defined(XPTI_ENABLE_INSTRUMENTATION) && defined(_WIN32)
if (xptiTraceEnabled())
return; // When doing xpti tracing, we can't safely shutdown on Win.
// TODO: figure out why XPTI prevents release.
#endif

// First, release resources, that may access adapters.
Handler->MPlatformCache.Inst.reset(nullptr);
Handler->MScheduler.Inst.reset(nullptr);
Expand All @@ -353,7 +364,6 @@ void shutdown_late() {
delete Handler;
Handler = nullptr;
}
#endif

#ifdef _WIN32
extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
Expand All @@ -374,23 +384,18 @@ extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
if (PrintUrTrace)
std::cout << "---> DLL_PROCESS_DETACH syclx.dll\n" << std::endl;

#ifdef XPTI_ENABLE_INSTRUMENTATION
if (xptiTraceEnabled())
return TRUE; // When doing xpti tracing, we can't safely call shutdown.
// TODO: figure out what XPTI is doing that prevents
// release.
#endif

try {
shutdown_win();
shutdown_early();
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in shutdown_win", e);
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in DLL_PROCESS_DETACH", e);
return FALSE;
}

break;
case DLL_PROCESS_ATTACH:
if (PrintUrTrace)
std::cout << "---> DLL_PROCESS_ATTACH syclx.dll\n" << std::endl;

break;
case DLL_THREAD_ATTACH:
break;
Expand All @@ -399,6 +404,29 @@ extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
}
return TRUE; // Successful DLL_PROCESS_ATTACH.
}
BOOL isLinkedStatically() {
// If the exePath is the same as the dllPath,
// or if the module handle for DllMain is not retrievable,
// then we are linked statically
// Otherwise we are dynamically linked or loaded.
HMODULE hModule = nullptr;
auto LpModuleAddr = reinterpret_cast<LPCSTR>(&DllMain);
if (!GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS, LpModuleAddr,
&hModule)) {
return true; // not retrievable, therefore statically linked
} else {
char dllPath[MAX_PATH];
if (GetModuleFileNameA(hModule, dllPath, MAX_PATH)) {
char exePath[MAX_PATH];
if (GetModuleFileNameA(NULL, exePath, MAX_PATH)) {
if (std::string(dllPath) == std::string(exePath)) {
return true; // paths identical, therefore statically linked
}
}
}
}
return false; // Otherwise dynamically linked or loaded
}
#else
// Setting low priority on destructor ensures it runs after all other global
// destructors. Priorities 0-100 are reserved by the compiler. The priority
Expand Down
3 changes: 1 addition & 2 deletions sycl/source/detail/global_handler.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@ class GlobalHandler {
XPTIRegistry &getXPTIRegistry();
ThreadPool &getHostTaskThreadPool();

static void registerEarlyShutdownHandler();
static void registerStaticVarShutdownHandler();

bool isOkToDefer() const;
void endDeferredRelease();
Expand All @@ -95,7 +95,6 @@ class GlobalHandler {

bool OkToDefer = true;

friend void shutdown_win();
friend void shutdown_early();
friend void shutdown_late();
friend class ObjectUsageCounter;
Expand Down
9 changes: 9 additions & 0 deletions sycl/source/detail/host_task.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
#pragma once

#include <detail/cg.hpp>
#include <detail/global_handler.hpp>
#include <sycl/detail/cg_types.hpp>
#include <sycl/handler.hpp>
#include <sycl/interop_handle.hpp>
Expand All @@ -33,6 +34,10 @@ class HostTask {
bool isInteropTask() const { return !!MInteropTask; }

void call(HostProfilingInfo *HPI) {
if (!GlobalHandler::instance().isOkToDefer()) {
return;
}

if (HPI)
HPI->start();
MHostTask();
Expand All @@ -41,6 +46,10 @@ class HostTask {
}

void call(HostProfilingInfo *HPI, interop_handle handle) {
if (!GlobalHandler::instance().isOkToDefer()) {
return;
}

if (HPI)
HPI->start();
MInteropTask(handle);
Expand Down
2 changes: 1 addition & 1 deletion sycl/source/detail/platform_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,7 @@ std::vector<platform> platform_impl::get_platforms() {

// This initializes a function-local variable whose destructor is invoked as
// the SYCL shared library is first being unloaded.
GlobalHandler::registerEarlyShutdownHandler();
GlobalHandler::registerStaticVarShutdownHandler();

return Platforms;
}
Expand Down
12 changes: 11 additions & 1 deletion sycl/source/detail/queue_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -264,7 +264,17 @@ class queue_impl {
destructorNotification();
#endif
throw_asynchronous();
getAdapter()->call<UrApiKind::urQueueRelease>(MQueues[0]);
auto status =
getAdapter()->call_nocheck<UrApiKind::urQueueRelease>(MQueues[0]);
// If loader is already closed, it'll return a not-initialized status
// which the UR should convert to SUCCESS code. But that isn't always
// working on Windows. This is a temporary workaround until that is fixed.
// TODO: Remove this workaround when UR is fixed, and restore
// ->call<>() instead of ->call_nocheck<>() above.
if (status != UR_RESULT_SUCCESS &&
status != UR_RESULT_ERROR_UNINITIALIZED) {
__SYCL_CHECK_UR_CODE_NO_EXC(status);
}
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~queue_impl", e);
}
Expand Down
3 changes: 3 additions & 0 deletions sycl/source/detail/scheduler/graph_builder.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -486,6 +486,9 @@ Scheduler::GraphBuilder::addCopyBack(Requirement *Req,

std::vector<Command *> ToCleanUp;
for (Command *Dep : Deps) {
if (Dep->MEnqueueStatus == EnqueueResultT::SyclEnqueueFailed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too common change and could happen in the middle of program. I think it is not correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely correct. I'm about to separate this out into its own PR, but what is happening is that we are enqueuing a set of dependencies/requirements and a command, during which an exception is thrown (and caught and rethrown). This sets the result to EnqueuFailed, but only for that particular item, not for things that depend on it (or vice versa). The Command pointers are all loose pointers, not smart pointers. So at the time of the exception some of that set is already in the DAG and then abandoned, and they leak. There is no other way to get the EnqueueFailed status.

So the fix is to be more diligent about checking for EnqueuFailed and avoid the leak that way. And this strategy applies to the buffer destructors as well.

continue;

Command *ConnCmd = MemCpyCmd->addDep(
DepDesc{Dep, MemCpyCmd->getRequirement(), SrcAllocaCmd}, ToCleanUp);
if (ConnCmd)
Expand Down
Loading
Loading