Skip to content

[Offload] Stop the RPC server faiilng with more than one GPU #125982

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 6, 2025

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Feb 6, 2025

Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.

Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.
@llvmbot llvmbot added the offload label Feb 6, 2025
@llvmbot
Copy link
Member

llvmbot commented Feb 6, 2025

@llvm/pr-subscribers-offload

Author: Joseph Huber (jhuber6)

Changes

Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.


Full diff: https://github.com/llvm/llvm-project/pull/125982.diff

1 Files Affected:

  • (modified) offload/plugins-nextgen/common/src/PluginInterface.cpp (+4-3)
diff --git a/offload/plugins-nextgen/common/src/PluginInterface.cpp b/offload/plugins-nextgen/common/src/PluginInterface.cpp
index d2451d8a3422121..76ae0a2dd9c4523 100644
--- a/offload/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/offload/plugins-nextgen/common/src/PluginInterface.cpp
@@ -1058,8 +1058,9 @@ Error GenericDeviceTy::setupRPCServer(GenericPluginTy &Plugin,
   if (auto Err = Server.initDevice(*this, Plugin.getGlobalHandler(), Image))
     return Err;
 
-  if (auto Err = Server.startThread())
-    return Err;
+  if (!Server.Thread->Running.load(std::memory_order_acquire))
+    if (auto Err = Server.startThread())
+      return Err;
 
   RPCServer = &Server;
   DP("Running an RPC server on device %d\n", getDeviceId());
@@ -1634,7 +1635,7 @@ Error GenericPluginTy::deinit() {
   if (GlobalHandler)
     delete GlobalHandler;
 
-  if (RPCServer && RPCServer->Thread->Running.load(std::memory_order_relaxed))
+  if (RPCServer && RPCServer->Thread->Running.load(std::memory_order_acquire))
     if (Error Err = RPCServer->shutDown())
       return Err;
 

@ronlieb ronlieb self-requested a review February 6, 2025 02:50
@jhuber6 jhuber6 merged commit 7a87794 into llvm:main Feb 6, 2025
8 checks passed
@jhuber6 jhuber6 added this to the LLVM 20.X Release milestone Feb 6, 2025
@jhuber6
Copy link
Contributor Author

jhuber6 commented Feb 6, 2025

/cherry-pick 7a87794

@llvmbot
Copy link
Member

llvmbot commented Feb 6, 2025

/pull-request #125985

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Feb 7, 2025
…5982)

Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.
swift-ci pushed a commit to swiftlang/llvm-project that referenced this pull request Feb 8, 2025
…5982)

Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.

(cherry picked from commit 7a87794)
Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
…5982)

Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

3 participants