[libc] Add loader option to force serial execution of GPU region #101601

jhuber6 · 2024-08-02T02:21:28Z

Summary:
The loader is used as a test utility to run traditionally CPU-based unit
tests on the GPU. This causes problems when used with something like
llvm-lit because the GPU runtimes have a nasty habit of either running
out of resources or hanging when they are overloaded. To combat this, I
added this option to force each process to perform the GPU part
serially.

This is done right now with a simple file lock on the loader executable
(argv[0]). I was originally thinking about using more complex IPC to
allow N processes to share execution, but that seemed overly complicated
given the large number of failure modes it introduces. File locks are
nice here because if the process crashes or is killed, the lock is
released automatically (at least on Linux). This is in contrast to
something like POSIX shared memory, which sticks around until it's
unlinked, meaning that if someone sent SIGKILL to the program it would
never get cleaned up and other processes might wait on a mutex that is
never released.
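
A minimal sketch of the file-lock pattern described above, assuming Linux `flock` semantics. `ScopedFileLock` and `is_locked_elsewhere` are hypothetical names used only for illustration; they are not part of the loader, which calls `flock` directly in `main`:

```cpp
#include <fcntl.h>    // open, O_RDONLY
#include <sys/file.h> // flock, LOCK_EX, LOCK_NB, LOCK_UN
#include <unistd.h>   // close

#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: holds an exclusive flock() on `path` for the lifetime
// of the object. The kernel drops the lock when the descriptor is closed,
// including when the process is killed.
class ScopedFileLock {
public:
  explicit ScopedFileLock(const std::string &path)
      : fd_(open(path.c_str(), O_RDONLY)) {
    if (fd_ == -1)
      throw std::runtime_error("Failed to open '" + path +
                               "': " + std::strerror(errno));
    // Blocks until no other open file description holds the lock.
    if (flock(fd_, LOCK_EX) == -1) {
      int err = errno;
      close(fd_);
      throw std::runtime_error("Failed to lock '" + path +
                               "': " + std::strerror(err));
    }
  }

  ~ScopedFileLock() {
    // Closing alone would release the lock; the explicit unlock is for clarity.
    flock(fd_, LOCK_UN);
    close(fd_);
  }

  ScopedFileLock(const ScopedFileLock &) = delete;
  ScopedFileLock &operator=(const ScopedFileLock &) = delete;

private:
  int fd_;
};

// Hypothetical probe: returns true if a non-blocking lock attempt on `path`
// is refused because another open file description holds the lock. Returns
// false if the file cannot be opened.
bool is_locked_elsewhere(const std::string &path) {
  int fd = open(path.c_str(), O_RDONLY);
  if (fd == -1)
    return false;
  bool busy = flock(fd, LOCK_EX | LOCK_NB) == -1 && errno == EWOULDBLOCK;
  close(fd); // Also releases the lock if the attempt succeeded.
  return busy;
}
```

Wrapping the GPU region in a scope that owns a `ScopedFileLock` would serialize it across processes, and no cleanup step is needed if one of them dies while holding the lock.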

Restricting this to one process isn't ideal, given that the runtime can
likely handle at least a few separate processes, but this was easy and
it works, so we might as well start here. This will hopefully unblock me
on running libcxx tests, as those ran with so much parallelism that
spurious failures were very common.
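
For illustration only, one way the single-process limit could be relaxed while still relying on file locks is to use several lock files as slots. This is not what the patch does. The names `try_acquire_slot`, `acquire_slot`, `release_slot`, and the `prefix.N` file naming are all assumptions made for this sketch:

```cpp
#include <fcntl.h>    // open, O_CREAT, O_RDONLY
#include <sys/file.h> // flock, LOCK_EX, LOCK_NB
#include <unistd.h>   // close, getpid

#include <string>

// Hypothetical: opens (creating if needed) the lock file for slot `i`.
static int open_slot(const std::string &prefix, int i) {
  std::string path = prefix + "." + std::to_string(i);
  return open(path.c_str(), O_RDONLY | O_CREAT, 0600);
}

// Hypothetical: tries each of `slots` lock files without blocking. Returns a
// descriptor that holds an exclusive lock, or -1 if every slot is busy.
int try_acquire_slot(const std::string &prefix, int slots) {
  for (int i = 0; i < slots; ++i) {
    int fd = open_slot(prefix, i);
    if (fd == -1)
      continue;
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
      return fd;
    close(fd);
  }
  return -1;
}

// Hypothetical blocking variant: if every slot is busy, waits on one slot
// chosen from the PID. Assumes `slots` is greater than zero.
int acquire_slot(const std::string &prefix, int slots) {
  int fd = try_acquire_slot(prefix, slots);
  if (fd != -1)
    return fd;
  fd = open_slot(prefix, static_cast<int>(getpid() % slots));
  if (fd != -1 && flock(fd, LOCK_EX) == -1) {
    close(fd);
    fd = -1;
  }
  return fd;
}

// Closing the descriptor releases the slot's lock.
void release_slot(int fd) {
  if (fd != -1)
    close(fd);
}
```

At most `slots` processes hold a lock at once, and a killed process still frees its slot automatically. The trade-off is that the slot files have to live somewhere and the right slot count depends on the runtime.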

llvmbot · 2024-08-02T02:21:59Z

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/101601.diff

1 Files Affected:

(modified) libc/utils/gpu/loader/Main.cpp (+22)

diff --git a/libc/utils/gpu/loader/Main.cpp b/libc/utils/gpu/loader/Main.cpp
index 44ed8bf58ab87..7037d772ad2bc 100644
--- a/libc/utils/gpu/loader/Main.cpp
+++ b/libc/utils/gpu/loader/Main.cpp
@@ -11,6 +11,8 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include <sys/file.h>
+
 #include "Loader.h"
 
 #include "llvm/BinaryFormat/Magic.h"
@@ -62,6 +64,12 @@ static cl::opt<bool>
                          cl::desc("Output resource usage of launched kernels"),
                          cl::init(false), cl::cat(loader_category));
 
+static cl::opt<bool>
+    no_parallelism("no-parallelism",
+                   cl::desc("Allows only a single process to use the GPU at a "
+                            "time. Useful to suppress out-of-resource errors"),
+                   cl::init(false), cl::cat(loader_category));
+
 static cl::opt<std::string> file(cl::Positional, cl::Required,
                                  cl::desc("<gpu executable>"),
                                  cl::cat(loader_category));
@@ -98,6 +106,15 @@ int main(int argc, const char **argv, const char **envp) {
   llvm::transform(args, std::back_inserter(new_argv),
                   [](const std::string &arg) { return arg.c_str(); });
 
+  // Claim a file lock on the executable so only a single process can enter this
+  // region if requested. This prevents the loader from spurious failures.
+  int fd = -1;
+  if (no_parallelism) {
+    fd = open(argv[0], O_RDONLY);
+    if (flock(fd, LOCK_EX) == -1)
+      report_error(createStringError("Failed to lock '%s'", argv[0]));
+  }
+
   // Drop the loader from the program arguments.
   LaunchParameters params{threads_x, threads_y, threads_z,
                           blocks_x,  blocks_y,  blocks_z};
@@ -105,5 +122,10 @@ int main(int argc, const char **argv, const char **envp) {
                  const_cast<char *>(image.getBufferStart()),
                  image.getBufferSize(), params, print_resource_usage);
 
+  if (no_parallelism) {
+    if (flock(fd, LOCK_UN) == -1)
+      report_error(createStringError("Failed to unlock '%s'", argv[0]));
+  }
+
   return ret;
 }

SchrodingerZhu

LGTM

jhuber6 requested review from arsenm, Artem-B, jdoerfert, JonChesterfield, lntue, michaelrj-google, shiltian and yxsamliu August 2, 2024 02:21

llvmbot added the libc label Aug 2, 2024

arsenm reviewed Aug 2, 2024

jhuber6 force-pushed the serial branch 2 times, most recently from a17f2ac to af5cc55 Compare August 2, 2024 14:18

jhuber6 force-pushed the serial branch from af5cc55 to 5a13fac Compare August 2, 2024 16:09

SchrodingerZhu approved these changes Aug 3, 2024

jhuber6 merged commit d1b2940 into llvm:main Aug 5, 2024
6 checks passed

jhuber6 deleted the serial branch August 5, 2024 19:49
