[Offload] Make only a single thread handle the RPC server thread #126067

jhuber6 · 2025-02-06T13:43:38Z

Summary:
This patch changes the interface so that starting the thread multiple
times is permissible; only the first call actually starts it.
Note that this does not reference-count the thread, so it is up to the
user to make sure that nobody shuts the thread down before everyone is
done using it. That holds today because the shutDown portion is run by
a single thread during the destructor phase.
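A minimal C++ sketch of the start-once idea, assuming a `std::atomic<uint32_t>` flag as in the diff. `OnceStartedThread` and `spawnsAfterConcurrentStarts` are hypothetical names used only for illustration, not the plugin's API, and the worker body is left empty.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for RPCServerTy::ServerThread that models only the
// "start once" logic; the real worker loop is replaced by an empty lambda.
struct OnceStartedThread {
  std::atomic<uint32_t> Running{0};
  std::atomic<int> SpawnCount{0}; // How many workers were actually created.
  std::thread Worker;

  void startThread() {
    // fetch_or returns the value *before* the update, so exactly one caller
    // sees 0 and takes responsibility for creating the worker thread.
    if (!Running.fetch_or(1, std::memory_order_acquire)) {
      SpawnCount.fetch_add(1, std::memory_order_relaxed);
      Worker = std::thread([] { /* service RPC requests here */ });
    }
  }

  ~OnceStartedThread() {
    if (Worker.joinable())
      Worker.join();
  }
};

// Calls startThread() from NumCallers threads at once (e.g. one per device)
// and reports how many worker threads were created.
int spawnsAfterConcurrentStarts(int NumCallers) {
  OnceStartedThread T;
  std::vector<std::thread> Callers;
  for (int I = 0; I < NumCallers; ++I)
    Callers.emplace_back([&T] { T.startThread(); });
  for (std::thread &C : Callers)
    C.join();
  return T.SpawnCount.load();
}
```

Because `fetch_or` is a single atomic read-modify-write, the check and the set cannot interleave between callers; a separate load followed by a store would let two callers both observe 0 and each spawn a worker. Callers that lose the race may return before the winner has finished creating the thread, which this sketch does not wait for.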

Another question is whether we should make this thread truly global
state. Currently it is private to each plugin instance, so if you have
an AMD and an NVIDIA image there will be two threads, and similarly if
you have those inside of a shared library.
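To make the question concrete, here is a hypothetical sketch contrasting per-plugin state (what the summary describes) with a single shared flag held in a function-local static. `PluginSketch`, `globalRunning`, and `startGlobalServer` are illustrative names only. The sketch does not model how the real plugins are built or linked, and that is what would decide whether such a static is actually shared across the AMD and NVIDIA plugins or across shared libraries.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-plugin model: each plugin object (AMD, NVIDIA, ...) owns
// its own flag, so each one starts its own server thread.
struct PluginSketch {
  std::atomic<uint32_t> Running{0};
  // Returns true if this call is the one that started the server.
  bool startServer() { return !Running.fetch_or(1); }
};

// Hypothetical "global" alternative: a function-local static is initialized
// exactly once (thread-safe since C++11), so every caller of this function
// shares one flag and therefore at most one server thread.
std::atomic<uint32_t> &globalRunning() {
  static std::atomic<uint32_t> Running{0};
  return Running;
}

// Returns true only for the first caller, whichever plugin it comes from.
bool startGlobalServer() { return !globalRunning().fetch_or(1); }
```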

llvmbot · 2025-02-06T13:44:12Z

@llvm/pr-subscribers-offload

Author: Joseph Huber (jhuber6)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/126067.diff

3 Files Affected:

(modified) offload/plugins-nextgen/common/include/RPC.h (+1-1)
(modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+4-6)
(modified) offload/plugins-nextgen/common/src/RPC.cpp (+4-7)

diff --git a/offload/plugins-nextgen/common/include/RPC.h b/offload/plugins-nextgen/common/include/RPC.h
index d750ce30e74b05e..7b031083647aafd 100644
--- a/offload/plugins-nextgen/common/include/RPC.h
+++ b/offload/plugins-nextgen/common/include/RPC.h
@@ -80,7 +80,7 @@ struct RPCServerTy {
     std::thread Worker;
 
     /// A boolean indicating whether or not the worker thread should continue.
-    std::atomic<bool> Running;
+    std::atomic<uint32_t> Running;
 
     /// The number of currently executing kernels across all devices that need
     /// the server thread to be running.
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index 76ae0a2dd9c4523..48c9b671c1a91d3 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -1058,9 +1058,8 @@ Error GenericDeviceTy::setupRPCServer(GenericPluginTy &Plugin,
   if (auto Err = Server.initDevice(*this, Plugin.getGlobalHandler(), Image))
     return Err;
 
-  if (!Server.Thread->Running.load(std::memory_order_acquire))
-    if (auto Err = Server.startThread())
-      return Err;
+  if (auto Err = Server.startThread())
+    return Err;
 
   RPCServer = &Server;
   DP("Running an RPC server on device %d\n", getDeviceId());
@@ -1635,12 +1634,11 @@ Error GenericPluginTy::deinit() {
   if (GlobalHandler)
     delete GlobalHandler;
 
-  if (RPCServer && RPCServer->Thread->Running.load(std::memory_order_acquire))
+  if (RPCServer) {
     if (Error Err = RPCServer->shutDown())
       return Err;
-
-  if (RPCServer)
     delete RPCServer;
+  }
 
   if (RecordReplay)
     delete RecordReplay;
diff --git a/offload/plugins-nextgen/common/src/RPC.cpp b/offload/plugins-nextgen/common/src/RPC.cpp
index e6750a540b391d7..4289f920c0e1eb2 100644
--- a/offload/plugins-nextgen/common/src/RPC.cpp
+++ b/offload/plugins-nextgen/common/src/RPC.cpp
@@ -99,18 +99,15 @@ static rpc::Status runServer(plugin::GenericDeviceTy &Device, void *Buffer) {
 }
 
 void RPCServerTy::ServerThread::startThread() {
-  assert(!Running.load(std::memory_order_relaxed) &&
-         "Attempting to start thread that is already running");
-  Running.store(true, std::memory_order_release);
-  Worker = std::thread([this]() { run(); });
+  if (!Running.fetch_or(true, std::memory_order_acquire))
+    Worker = std::thread([this]() { run(); });
 }
 
 void RPCServerTy::ServerThread::shutDown() {
-  assert(Running.load(std::memory_order_relaxed) &&
-         "Attempting to shut down a thread that is not running");
+  if (!Running.fetch_and(false, std::memory_order_release))
+    return;
   {
     std::lock_guard<decltype(Mutex)> Lock(Mutex);
-    Running.store(false, std::memory_order_release);
     CV.notify_all();
   }
   if (Worker.joinable())
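The new `startThread()`/`shutDown()` pair can be modeled with a small self-contained class. This is a sketch under assumptions: `ServerThreadSketch` is a hypothetical name, its `run()` only waits for the stop signal instead of servicing RPC traffic, and its `shutDown()` returns a `bool` (the function in the diff returns `void`) so the behavior is easy to check. `Running` is a `std::atomic<uint32_t>` used as a 0/1 flag, as in the diff.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical model of the start/shutDown pair from the diff.
class ServerThreadSketch {
public:
  void startThread() {
    // Only the caller that observes the previous value 0 creates the worker.
    if (!Running.fetch_or(1, std::memory_order_acquire))
      Worker = std::thread([this] { run(); });
  }

  // Returns true if this call actually stopped the worker.
  bool shutDown() {
    // Only the caller that observes a non-zero previous value proceeds, so
    // repeated or unmatched shutDown() calls become no-ops.
    if (!Running.fetch_and(0, std::memory_order_release))
      return false;
    {
      // Notify while holding the mutex so the wake-up cannot be lost.
      std::lock_guard<std::mutex> Lock(Mutex);
      CV.notify_all();
    }
    if (Worker.joinable())
      Worker.join();
    return true;
  }

  ~ServerThreadSketch() { shutDown(); }

private:
  // Stand-in for the real server loop: block until told to stop.
  void run() {
    std::unique_lock<std::mutex> Lock(Mutex);
    CV.wait(Lock, [this] { return !Running.load(std::memory_order_acquire); });
  }

  std::atomic<uint32_t> Running{0};
  std::mutex Mutex;
  std::condition_variable CV;
  std::thread Worker;
};
```

Although `Running` is cleared outside the lock, taking `Mutex` before `notify_all()` still prevents a lost wake-up: the worker evaluates the predicate and blocks while holding `Mutex`, so when `shutDown()` acquires the lock the worker is either already blocked in `wait()` (and receives the notification) or will evaluate the predicate afterwards and see the cleared flag. As the summary notes, nothing here counts users, so a single `shutDown()` stops the worker for everyone.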

shiltian · 2025-02-06T14:37:48Z

> Another question is if we should make this thread truly global state, because currently it will be private to each plugin instance, so if you have an AMD and NVIDIA image there will be two, similarly if you have those inside of a shared library.

I think we still want to have one for each. There isn’t much to gain from sharing a single instance: it would only matter if an offloading program used AMD and NVIDIA cards on the same system simultaneously, with each relying on host RPC calls. While I wouldn’t say that’s entirely impossible, it’s 99.99% unlikely. One potential benefit of keeping them private to each plugin is the ability to customize the handling of certain RPC calls if needed.

JonChesterfield

Good fix. We might want a different strategy for host threads later but staying with one per device looks good for now.

dhruvachak

LGTM.

jhuber6 requested review from dhruvachak, JonChesterfield, jplehr, ronlieb and shiltian February 6, 2025 13:43

llvmbot added the offload label Feb 6, 2025

JonChesterfield approved these changes Feb 6, 2025

dhruvachak approved these changes Feb 6, 2025

jhuber6 merged commit 5812d0b into llvm:main Feb 6, 2025
8 checks passed

jhuber6 deleted the Thread branch February 6, 2025 17:38
