[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines #71801

jplehr · 2023-11-09T12:18:53Z

This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true when libomptarget is built against a recent ROCm version (5.7 and later).
As of now, requests are distributed in a round-robin fashion across available SDMA engines.

llvmbot · 2023-11-09T12:19:27Z

@llvm/pr-subscribers-backend-amdgpu

Author: Jan Patrick Lehr (jplehr)

Changes

This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true when libomptarget is built against a recent ROCm version (5.7 and later).
As of now, requests are distributed in a round-robin fashion across available SDMA engines.

Full diff: https://github.com/llvm/llvm-project/pull/71801.diff

2 Files Affected:

(modified) openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h (+4-6)
(modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+62-43)

diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h b/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h
index 188dd2600a610c6..2cebd0d35df088f 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h
@@ -151,12 +151,10 @@ typedef struct hsa_amd_pointer_info_s {
   size_t sizeInBytes;
 } hsa_amd_pointer_info_t;
 
-hsa_status_t hsa_amd_pointer_info(const void* ptr,
-                                          hsa_amd_pointer_info_t* info,
-                                          void* (*alloc)(size_t),
-                                          uint32_t* num_agents_accessible,
-                                          hsa_agent_t** accessible);
-
+hsa_status_t hsa_amd_pointer_info(const void *ptr, hsa_amd_pointer_info_t *info,
+                                  void *(*alloc)(size_t),
+                                  uint32_t *num_agents_accessible,
+                                  hsa_agent_t **accessible);
 #ifdef __cplusplus
 }
 #endif
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 399a71390a65abe..9c2db4cdaf3bcd8 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -130,6 +130,40 @@ Error iterateAgentMemoryPools(hsa_agent_t Agent, CallbackTy Cb) {
                        "Error in hsa_amd_agent_iterate_memory_pools: %s");
 }
 
+/// Dispatches an asynchronous memory copy
+/// Enables different SDMA engines for the dispatch in a round-robin fashion.
+Error asyncMemCopy(void *Dst, hsa_agent_t DstAgent, const void *Src,
+                   hsa_agent_t SrcAgent, size_t Size, uint32_t NumDepSignals,
+                   const hsa_signal_t *DepSignals,
+                   hsa_signal_t CompletionSignal) {
+  static BoolEnvar OMPX_UseMultipleSdmaEngines{
+      "LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES", false};
+  if (!OMPX_UseMultipleSdmaEngines) {
+    hsa_status_t S =
+        hsa_amd_memory_async_copy(Dst, DstAgent, Src, SrcAgent, Size,
+                                  NumDepSignals, DepSignals, CompletionSignal);
+    return Plugin::check(S, "Error in hsa_amd_memory_async_copy");
+  }
+
+// This solution is probably not the best
+#if !(HSA_AMD_INTERFACE_VERSION_MAJOR >= 1 &&                                  \
+      HSA_AMD_INTERFACE_VERSION_MINOR >= 2)
+  return Plugin::error("Async copy on selected SDMA requires ROCm 5.7");
+#else
+  static int SdmaEngine = 1;
+
+  // This call is only avail in ROCm >= 5.7
+  hsa_status_t S = hsa_amd_memory_async_copy_on_engine(
+      Dst, DstAgent, Src, SrcAgent, Size, NumDepSignals, DepSignals,
+      CompletionSignal, (hsa_amd_sdma_engine_id_t)SdmaEngine,
+      /*force_copy_on_sdma=*/true);
+  // Increment to use one of three SDMA engines: 0x1, 0x2, 0x4
+  SdmaEngine = (SdmaEngine << 1) % 7;
+
+  return Plugin::check(S, "Error in hsa_amd_memory_async_copy_on_engine");
+#endif
+}
+
 } // namespace utils
 
 /// Utility class representing generic resource references to AMDGPU resources.
@@ -1170,15 +1204,14 @@ struct AMDGPUStreamTy {
       InputSignal = nullptr;
 
     // Issue the async memory copy.
-    hsa_status_t Status;
     if (InputSignal) {
       hsa_signal_t InputSignalRaw = InputSignal->get();
-      Status = hsa_amd_memory_async_copy(Dst, Agent, Src, Agent, CopySize, 1,
-                                         &InputSignalRaw, OutputSignal->get());
-    } else
-      Status = hsa_amd_memory_async_copy(Dst, Agent, Src, Agent, CopySize, 0,
-                                         nullptr, OutputSignal->get());
-    return Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s");
+      return utils::asyncMemCopy(Dst, Agent, Src, Agent, CopySize, 1,
+                                 &InputSignalRaw, OutputSignal->get());
+    }
+
+    return utils::asyncMemCopy(Dst, Agent, Src, Agent, CopySize, 0, nullptr,
+                               OutputSignal->get());
   }
 
   /// Push an asynchronous memory copy device-to-host involving an unpinned
@@ -1214,21 +1247,18 @@ struct AMDGPUStreamTy {
 
     // Issue the first step: device to host transfer. Avoid defining the input
     // dependency if already satisfied.
-    hsa_status_t Status;
     if (InputSignal) {
       hsa_signal_t InputSignalRaw = InputSignal->get();
-      Status =
-          hsa_amd_memory_async_copy(Inter, Agent, Src, Agent, CopySize, 1,
-                                    &InputSignalRaw, OutputSignals[0]->get());
+      if (auto Err =
+              utils::asyncMemCopy(Inter, Agent, Src, Agent, CopySize, 1,
+                                  &InputSignalRaw, OutputSignals[0]->get()))
+        return Err;
     } else {
-      Status = hsa_amd_memory_async_copy(Inter, Agent, Src, Agent, CopySize, 0,
-                                         nullptr, OutputSignals[0]->get());
+      if (auto Err = utils::asyncMemCopy(Inter, Agent, Src, Agent, CopySize, 0,
+                                         nullptr, OutputSignals[0]->get()))
+        return Err;
     }
 
-    if (auto Err =
-            Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
-      return Err;
-
     // Consume another stream slot and compute dependencies.
     std::tie(Curr, InputSignal) = consume(OutputSignals[1]);
     assert(InputSignal && "Invalid input signal");
@@ -1242,7 +1272,7 @@ struct AMDGPUStreamTy {
     std::atomic_thread_fence(std::memory_order_release);
 
     // Issue the second step: host to host transfer.
-    Status = hsa_amd_signal_async_handler(
+    hsa_status_t Status = hsa_amd_signal_async_handler(
         InputSignal->get(), HSA_SIGNAL_CONDITION_EQ, 0, asyncActionCallback,
         (void *)&Slots[Curr]);
 
@@ -1318,16 +1348,13 @@ struct AMDGPUStreamTy {
 
     // Issue the second step: host to device transfer. Avoid defining the input
     // dependency if already satisfied.
-    hsa_status_t Status;
     if (InputSignal && InputSignal->load()) {
       hsa_signal_t InputSignalRaw = InputSignal->get();
-      Status = hsa_amd_memory_async_copy(Dst, Agent, Inter, Agent, CopySize, 1,
-                                         &InputSignalRaw, OutputSignal->get());
-    } else
-      Status = hsa_amd_memory_async_copy(Dst, Agent, Inter, Agent, CopySize, 0,
-                                         nullptr, OutputSignal->get());
-
-    return Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s");
+      return utils::asyncMemCopy(Dst, Agent, Inter, Agent, CopySize, 1,
+                                 &InputSignalRaw, OutputSignal->get());
+    }
+    return utils::asyncMemCopy(Dst, Agent, Inter, Agent, CopySize, 0, nullptr,
+                               OutputSignal->get());
   }
 
   // AMDGPUDeviceTy is incomplete here, passing the underlying agent instead
@@ -1353,17 +1380,13 @@ struct AMDGPUStreamTy {
     // allocated by this runtime or the caller made the appropriate
     // access calls.
 
-    hsa_status_t Status;
     if (InputSignal && InputSignal->load()) {
       hsa_signal_t InputSignalRaw = InputSignal->get();
-      Status =
-          hsa_amd_memory_async_copy(Dst, DstAgent, Src, SrcAgent, CopySize, 1,
-                                    &InputSignalRaw, OutputSignal->get());
-    } else
-      Status = hsa_amd_memory_async_copy(Dst, DstAgent, Src, SrcAgent, CopySize,
-                                         0, nullptr, OutputSignal->get());
-
-    return Plugin::check(Status, "Error in D2D hsa_amd_memory_async_copy: %s");
+      return utils::asyncMemCopy(Dst, DstAgent, Src, SrcAgent, CopySize, 1,
+                                 &InputSignalRaw, OutputSignal->get());
+    }
+    return utils::asyncMemCopy(Dst, DstAgent, Src, SrcAgent, CopySize, 0,
+                               nullptr, OutputSignal->get());
   }
 
   /// Synchronize with the stream. The current thread waits until all operations
@@ -2196,10 +2219,8 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
       if (auto Err = Signal.init())
         return Err;
 
-      Status = hsa_amd_memory_async_copy(TgtPtr, Agent, PinnedPtr, Agent, Size,
-                                         0, nullptr, Signal.get());
-      if (auto Err =
-              Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
+      if (auto Err = utils::asyncMemCopy(TgtPtr, Agent, PinnedPtr, Agent, Size,
+                                         0, nullptr, Signal.get()))
         return Err;
 
       if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
@@ -2257,10 +2278,8 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
       if (auto Err = Signal.init())
         return Err;
 
-      Status = hsa_amd_memory_async_copy(PinnedPtr, Agent, TgtPtr, Agent, Size,
-                                         0, nullptr, Signal.get());
-      if (auto Err =
-              Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
+      if (auto Err = utils::asyncMemCopy(PinnedPtr, Agent, TgtPtr, Agent, Size,
+                                         0, nullptr, Signal.get()))
         return Err;
 
       if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))

openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines. The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously. The new interface can be enabled via the environment variable LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true when libomptarget is built against a recent ROCm version (5.7 and later). As of now, requests are distributed in a round-robin fashion across available SDMA engines.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

jhuber6 · 2023-11-09T21:56:56Z

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

+      HSA_AMD_INTERFACE_VERSION_MINOR >= 2)
+  return Plugin::error("Async copy on selected SDMA requires ROCm 5.7");
+#else
+  static int SdmaEngine = 1;


This is a shared variable. What happens when multiple threads potentially run through here?

That's a good point. In most instances, this function is called from a region that is protected via a lock_guard, so IIRC for those cases this is protected and there should not be two threads here at the same time.
But I have to look more into one case, where this may race.

We could probably use an atomic RMW just to be safe in any case.

So, I looked into it and we do have one code path that is not under a lock. I made this an atomic and do RMW.

With this solution we could still run into a scenario where two threads read the same value and dispatch to the same SDMA engine. While not desirable, it's not a correctness issue, and I think the probability is quite low.

…A engines

…ple SDMA engines

jhuber6

LG, thanks for sticking with it. Shame there's no real way to check for the HSA version.

…MA engines (#71801)" This causes the tests to fail because the bots were not updated in time. Revert until we update the bots to a valid version. This reverts commit e876250.

jplehr · 2023-11-14T19:49:39Z

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

+                   const void *Src, hsa_agent_t SrcAgent, size_t Size,
+                   uint32_t NumDepSignals, const hsa_signal_t *DepSignals,
+                   hsa_signal_t CompletionSignal) {
+  if (UseMultipleSdmaEngines) {


This needs to be negated to maintain the current functionality per default and not rely on ROCm 5.7 functionality.

…A engines (llvm#71801) This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines. The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously. The new interface can be enabled via the environment variable `LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget is built against a recent ROCm version (5.7 and later). As of now, requests are distributed in a round-robin fashion across available SDMA engines.

#72307) …A engines (#71801) This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines. The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously. The new interface can be enabled via the environment variable `LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget is built against a recent ROCm version (5.7 and later). As of now, requests are distributed in a round-robin fashion across available SDMA engines.

…es (llvm#71801) This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines. The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously. The new interface can be enabled via the environment variable `LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget is built against a recent ROCm version (5.7 and later). As of now, requests are distributed in a round-robin fashion across available SDMA engines.

…MA engines (llvm#71801)" This causes the tests to fail because the bots were not updated in time. Revert until we update the bots to a valid version. This reverts commit e876250.

llvm#72307) …A engines (llvm#71801) This enables the AMDGPU plugin to use a new ROCm 5.7 interface to dispatch asynchronous data transfers across SDMA engines. The default functionality stays unchanged, meaning that all data transfers are enqueued into a H2D queue or an D2H queue, depending on transfer direction, via the HSA interface used previously. The new interface can be enabled via the environment variable `LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget is built against a recent ROCm version (5.7 and later). As of now, requests are distributed in a round-robin fashion across available SDMA engines.

jplehr requested review from carlobertolli, jdoerfert, kevinsala, shiltian, JonChesterfield and jhuber6 November 9, 2023 12:18

llvmbot added backend:AMDGPU openmp:libomptarget OpenMP offload runtime labels Nov 9, 2023

arsenm reviewed Nov 9, 2023

View reviewed changes

openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h Outdated Show resolved Hide resolved

jhuber6 reviewed Nov 9, 2023

View reviewed changes

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

jplehr force-pushed the feat/trunk/asyncCopy branch from 2d550f3 to 02fada7 Compare November 9, 2023 20:20

jhuber6 reviewed Nov 9, 2023

View reviewed changes

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Show resolved Hide resolved

jhuber6 reviewed Nov 9, 2023

View reviewed changes

jplehr added 2 commits November 9, 2023 17:11

fixup! [OpenMP][libomptarget] Enable parallel copies via multiple SDM…

9f6224d

…A engines

fixup! fixup! [OpenMP][libomptarget] Enable parallel copies via multi…

76797bc

…ple SDMA engines

jhuber6 approved these changes Nov 14, 2023

View reviewed changes

jplehr merged commit e876250 into llvm:main Nov 14, 2023

jplehr commented Nov 14, 2023

View reviewed changes

jplehr mentioned this pull request Nov 14, 2023

Reland [OpenMP][libomptarget] Enable parallel copies via multiple SDM… #72307

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines #71801

[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines #71801

Uh oh!

jplehr commented Nov 9, 2023

Uh oh!

llvmbot commented Nov 9, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jhuber6 Nov 9, 2023

Uh oh!

jplehr Nov 9, 2023

Uh oh!

jhuber6 Nov 9, 2023

Uh oh!

jplehr Nov 10, 2023

Uh oh!

jhuber6 left a comment

Uh oh!

jplehr Nov 14, 2023

Uh oh!

Uh oh!

[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines #71801

[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines #71801

Uh oh!

Conversation

jplehr commented Nov 9, 2023

Uh oh!

llvmbot commented Nov 9, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jhuber6 Nov 9, 2023

Choose a reason for hiding this comment

Uh oh!

jplehr Nov 9, 2023

Choose a reason for hiding this comment

Uh oh!

jhuber6 Nov 9, 2023

Choose a reason for hiding this comment

Uh oh!

jplehr Nov 10, 2023

Choose a reason for hiding this comment

Uh oh!

jhuber6 left a comment

Choose a reason for hiding this comment

Uh oh!

jplehr Nov 14, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!