[libc] Add loader option to force serial execution of GPU region #101601

Merged: 1 commit into llvm:main on Aug 5, 2024

Conversation

jhuber6

@jhuber6 jhuber6 commented Aug 2, 2024

Summary:
The loader is used as a test utility to run traditionally CPU-based unit
tests on the GPU. This has issues when used with something like
llvm-lit because the GPU runtimes have a nasty habit of either running
out of resources or hanging when they are overloaded. To combat this, I
added this option to force each process to perform the GPU part
serially.

This is done right now with a simple file lock on the executing file. I
was originally thinking about using more complex IPC to allow N
processes to share execution, but that seemed overly complicated given
the incredibly large number of failure modes it introduces. File locks
are nice here because if the process crashes or is killed it will
release the lock automatically (at least on Linux). This is in contrast
to something like POSIX shared memory, which will stick around until it's
unlinked, meaning that if someone sent SIGKILL to the program it would
never get cleaned up and other processes might wait on a mutex that is
never released.

Restricting this to one process isn't ideal, given that
the runtime can likely handle at least a few separate processes, but
this was easy and it works, so might as well start here. This will
hopefully unblock me on running libcxx tests, as those ran with so
much parallelism that spurious failures were very common.
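
The claim that a file lock is released when its holder is killed can be checked with a small standalone program. The following is a minimal sketch, not part of this PR: the helper name `lock_released_after_kill` is hypothetical, and it assumes Linux `flock` semantics, where the lock belongs to the open file description and goes away when the process's descriptors are closed on exit.

```cpp
#include <sys/file.h> // flock, LOCK_EX, LOCK_NB, LOCK_UN
#include <sys/wait.h> // waitpid
#include <fcntl.h>    // open, O_RDONLY
#include <unistd.h>   // fork, pipe, read, write, pause, close, _exit
#include <signal.h>   // kill, SIGKILL
#include <cerrno>

// Hypothetical helper: returns true if a lock held by a child process
// becomes available again after that child is killed with SIGKILL.
bool lock_released_after_kill(const char *path) {
  int pipefd[2];
  if (pipe(pipefd) == -1)
    return false;

  pid_t child = fork();
  if (child == -1)
    return false;

  if (child == 0) {
    // Child: take the exclusive lock, tell the parent, then sleep forever.
    close(pipefd[0]);
    int fd = open(path, O_RDONLY);
    if (fd == -1 || flock(fd, LOCK_EX) == -1)
      _exit(1);
    char ready = 1;
    if (write(pipefd[1], &ready, 1) != 1)
      _exit(1);
    for (;;)
      pause();
  }

  // Parent: wait until the child reports that it holds the lock.
  close(pipefd[1]);
  char ready = 0;
  bool ok = read(pipefd[0], &ready, 1) == 1;
  close(pipefd[0]);

  // A separate open() gives the parent its own open file description.
  int fd = open(path, O_RDONLY);

  // While the child is alive, a non-blocking attempt must be refused.
  ok = ok && fd != -1 && flock(fd, LOCK_EX | LOCK_NB) == -1 &&
       errno == EWOULDBLOCK;

  // Kill the child without giving it a chance to unlock, then reap it.
  kill(child, SIGKILL);
  waitpid(child, nullptr, 0);

  // The kernel dropped the lock along with the child's descriptors.
  ok = ok && flock(fd, LOCK_EX | LOCK_NB) == 0;

  if (fd != -1) {
    flock(fd, LOCK_UN);
    close(fd);
  }
  return ok;
}
```

Calling the helper on any existing, readable file should return `true` on Linux. A mutex placed in POSIX shared memory has no equivalent cleanup unless it is set up as a robust mutex, which is the extra complexity the description is avoiding.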

@llvmbot

llvmbot commented Aug 2, 2024

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/101601.diff

1 Files Affected:

  • (modified) libc/utils/gpu/loader/Main.cpp (+22)
diff --git a/libc/utils/gpu/loader/Main.cpp b/libc/utils/gpu/loader/Main.cpp
index 44ed8bf58ab87..7037d772ad2bc 100644
--- a/libc/utils/gpu/loader/Main.cpp
+++ b/libc/utils/gpu/loader/Main.cpp
@@ -11,6 +11,8 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include <sys/file.h>
+
 #include "Loader.h"
 
 #include "llvm/BinaryFormat/Magic.h"
@@ -62,6 +64,12 @@ static cl::opt<bool>
                          cl::desc("Output resource usage of launched kernels"),
                          cl::init(false), cl::cat(loader_category));
 
+static cl::opt<bool>
+    no_parallelism("no-parallelism",
+                   cl::desc("Allows only a single process to use the GPU at a "
+                            "time. Useful to suppress out-of-resource errors"),
+                   cl::init(false), cl::cat(loader_category));
+
 static cl::opt<std::string> file(cl::Positional, cl::Required,
                                  cl::desc("<gpu executable>"),
                                  cl::cat(loader_category));
@@ -98,6 +106,15 @@ int main(int argc, const char **argv, const char **envp) {
   llvm::transform(args, std::back_inserter(new_argv),
                   [](const std::string &arg) { return arg.c_str(); });
 
+  // Claim a file lock on the executable so only a single process can enter this
+  // region if requested. This prevents the loader from spurious failures.
+  int fd = -1;
+  if (no_parallelism) {
+    fd = open(argv[0], O_RDONLY);
+    if (flock(fd, LOCK_EX) == -1)
+      report_error(createStringError("Failed to lock '%s'", argv[0]));
+  }
+
   // Drop the loader from the program arguments.
   LaunchParameters params{threads_x, threads_y, threads_z,
                           blocks_x,  blocks_y,  blocks_z};
@@ -105,5 +122,10 @@ int main(int argc, const char **argv, const char **envp) {
                  const_cast<char *>(image.getBufferStart()),
                  image.getBufferSize(), params, print_resource_usage);
 
+  if (no_parallelism) {
+    if (flock(fd, LOCK_UN) == -1)
+      report_error(createStringError("Failed to unlock '%s'", argv[0]));
+  }
+
   return ret;
 }
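
For readers who want to reuse the locking pattern outside the loader, here is a minimal standalone sketch. It is not the code from this PR: `run_serialized` is a hypothetical helper, and it adds two things the patch does not do, namely checking the result of `open` and closing the descriptor afterwards. It relies on `flock` returning 0 on success and -1 on failure with `errno` set.

```cpp
#include <sys/file.h> // flock, LOCK_EX, LOCK_UN
#include <fcntl.h>    // open, O_RDONLY
#include <unistd.h>   // close
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: runs `body` while holding an exclusive advisory lock
// on `path`. Returns -1 if the file cannot be opened or locked, otherwise
// whatever `body` returns.
template <typename Fn> int run_serialized(const char *path, Fn body) {
  int fd = open(path, O_RDONLY);
  if (fd == -1) {
    std::fprintf(stderr, "Failed to open '%s': %s\n", path,
                 std::strerror(errno));
    return -1;
  }

  // Blocks until no other open file description holds the lock.
  if (flock(fd, LOCK_EX) == -1) {
    std::fprintf(stderr, "Failed to lock '%s': %s\n", path,
                 std::strerror(errno));
    close(fd);
    return -1;
  }

  int ret = body(); // Only one process at a time executes this region.

  // Closing the descriptor would also drop the lock; unlock explicitly so a
  // failure can be reported.
  if (flock(fd, LOCK_UN) == -1)
    std::fprintf(stderr, "Failed to unlock '%s': %s\n", path,
                 std::strerror(errno));
  close(fd);
  return ret;
}
```

A call such as `run_serialized(lock_path, [] { return do_gpu_work(); })`, where `lock_path` and `do_gpu_work` are placeholders, would serialize the region across every process that locks the same file. Because the lock is advisory, processes that never call `flock` on that file are unaffected.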

@jhuber6 jhuber6 force-pushed the serial branch 2 times, most recently from a17f2ac to af5cc55 Compare August 2, 2024 14:18

@SchrodingerZhu SchrodingerZhu left a comment

LGTM

@jhuber6 jhuber6 merged commit d1b2940 into llvm:main Aug 5, 2024
6 checks passed
@jhuber6 jhuber6 deleted the serial branch August 5, 2024 19:49