release/20.x: [Offload] Stop the RPC server failing with more than one GPU (#125982) #125985

Merged
merged 1 commit into from
Feb 8, 2025

Conversation

llvmbot
Member

@llvmbot llvmbot commented Feb 6, 2025

Backport 7a87794

Requested by: @jhuber6

@llvmbot llvmbot added this to the LLVM 20.X Release milestone Feb 6, 2025
@llvmbot
Member Author

llvmbot commented Feb 6, 2025

@ronlieb What do you think about merging this PR to the release branch?

@llvmbot
Member Author

llvmbot commented Feb 6, 2025

@llvm/pr-subscribers-offload

Author: None (llvmbot)

Changes

Backport 7a87794

Requested by: @jhuber6


Full diff: https://github.com/llvm/llvm-project/pull/125985.diff

1 Files Affected:

  • (modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+4-3)
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index 16f510de3ecc5ca..57672b0223bec81 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -1057,8 +1057,9 @@ Error GenericDeviceTy::setupRPCServer(GenericPluginTy &Plugin,
   if (auto Err = Server.initDevice(*this, Plugin.getGlobalHandler(), Image))
     return Err;
 
-  if (auto Err = Server.startThread())
-    return Err;
+  if (!Server.Thread->Running.load(std::memory_order_acquire))
+    if (auto Err = Server.startThread())
+      return Err;
 
   RPCServer = &Server;
   DP("Running an RPC server on device %d\n", getDeviceId());
@@ -1633,7 +1634,7 @@ Error GenericPluginTy::deinit() {
   if (GlobalHandler)
     delete GlobalHandler;
 
-  if (RPCServer && RPCServer->Thread->Running.load(std::memory_order_relaxed))
+  if (RPCServer && RPCServer->Thread->Running.load(std::memory_order_acquire))
     if (Error Err = RPCServer->shutDown())
       return Err;
 

…5982)

Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.

(cherry picked from commit 7a87794)
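
To make the per-device issue concrete, here is a minimal sketch of the guard the diff adds. It is not the LLVM code: `ServerThread`, `setupDevice`, and `failedSetups` are hypothetical names, no real thread is spawned, and the sketch assumes that starting an already-running server counts as an error, which the PR itself does not spell out.

```cpp
#include <atomic>

// Hypothetical stand-in for the shared RPC server thread state.
// No real thread is launched; only the Running flag is modeled.
struct ServerThread {
  std::atomic<bool> Running{false};

  // Assumption for this sketch: a second start while running is an error.
  // Returns true on success, false on a double start.
  bool start() {
    if (Running.load(std::memory_order_acquire))
      return false;
    Running.store(true, std::memory_order_release);
    return true;
  }

  void shutDown() { Running.store(false, std::memory_order_release); }
};

// Models the per-device setup step. With Guarded == true it mirrors the
// patched check: only start the server if it is not already running.
bool setupDevice(ServerThread &Thread, bool Guarded) {
  if (Guarded && Thread.Running.load(std::memory_order_acquire))
    return true; // Server already running; nothing to do for this device.
  return Thread.start();
}

// Runs setup for NumDevices devices that share one server and returns
// how many of those setups reported an error.
int failedSetups(int NumDevices, bool Guarded) {
  ServerThread Thread;
  int Failures = 0;
  for (int I = 0; I < NumDevices; ++I)
    if (!setupDevice(Thread, Guarded))
      ++Failures;
  Thread.shutDown();
  return Failures;
}
```

In this model, `failedSetups(1, false)` reports no errors, which matches the commit message: one device per plugin never triggers the problem. With several devices sharing one server, only the guarded variant sets up every device cleanly.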
@tstellar tstellar merged commit dbb2699 into llvm:release/20.x Feb 8, 2025
8 of 9 checks passed

github-actions bot commented Feb 8, 2025

@jhuber6 (or anyone else): if you would like to add a note about this fix in the release notes (completely optional), please reply to this comment with a one- or two-sentence description of the fix. When you are done, please add the release:note label to this PR.

4 participants