[OpenMP] [amdgpu] Added a synchronous version of data exchange. #87032


Status: Merged (2 commits) on Mar 29, 2024

Conversation

@dhruvachak (Contributor) commented Mar 29, 2024

Similar to H2D and D2H, use synchronous mode for large data transfers
beyond a certain size for D2D as well. As with H2D and D2H, this size is
controlled by an env-var.

@llvmbot (Member) commented Mar 29, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: None (dhruvachak)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/87032.diff

2 Files Affected:

  • (modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+21)
  • (added) openmp/libomptarget/test/offloading/d2d_memcpy_sync.c (+67)
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 2dd08dd5d0b4ea..a0fdde951b74a7 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -2402,6 +2402,27 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
                          AsyncInfoWrapperTy &AsyncInfoWrapper) override {
     AMDGPUDeviceTy &DstDevice = static_cast<AMDGPUDeviceTy &>(DstGenericDevice);
 
+    // For large transfers use synchronous behavior.
+    if (Size >= OMPX_MaxAsyncCopyBytes) {
+      if (AsyncInfoWrapper.hasQueue())
+        if (auto Err = synchronize(AsyncInfoWrapper))
+          return Err;
+
+      AMDGPUSignalTy Signal;
+      if (auto Err = Signal.init())
+        return Err;
+
+      if (auto Err = utils::asyncMemCopy(
+              useMultipleSdmaEngines(), DstPtr, DstDevice.getAgent(), SrcPtr,
+              getAgent(), (uint64_t)Size, 0, nullptr, Signal.get()))
+        return Err;
+
+      if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
+        return Err;
+
+      return Signal.deinit();
+    }
+
     AMDGPUStreamTy *Stream = nullptr;
     if (auto Err = getStream(AsyncInfoWrapper, Stream))
       return Err;
diff --git a/openmp/libomptarget/test/offloading/d2d_memcpy_sync.c b/openmp/libomptarget/test/offloading/d2d_memcpy_sync.c
new file mode 100644
index 00000000000000..a768cd1209ac52
--- /dev/null
+++ b/openmp/libomptarget/test/offloading/d2d_memcpy_sync.c
@@ -0,0 +1,67 @@
+// RUN: %libomptarget-compile-generic && \
+// RUN: env LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=0 %libomptarget-run-generic | \
+// RUN: %fcheck-generic -allow-empty
+// REQUIRES: amdgcn-amd-amdhsa
+
+#include <assert.h>
+#include <omp.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+const int magic_num = 7;
+
+int main(int argc, char *argv[]) {
+  const int N = 128;
+  const int num_devices = omp_get_num_devices();
+
+  // No target device, just return
+  if (num_devices == 0) {
+    printf("PASS\n");
+    return 0;
+  }
+
+  const int src_device = 0;
+  int dst_device = num_devices - 1;
+
+  int length = N * sizeof(int);
+  int *src_ptr = omp_target_alloc(length, src_device);
+  int *dst_ptr = omp_target_alloc(length, dst_device);
+
+  assert(src_ptr && "src_ptr is NULL");
+  assert(dst_ptr && "dst_ptr is NULL");
+
+#pragma omp target teams distribute parallel for device(src_device)            \
+    is_device_ptr(src_ptr)
+  for (int i = 0; i < N; ++i) {
+    src_ptr[i] = magic_num;
+  }
+
+  int rc =
+      omp_target_memcpy(dst_ptr, src_ptr, length, 0, 0, dst_device, src_device);
+
+  assert(rc == 0 && "error in omp_target_memcpy");
+
+  int *buffer = malloc(length);
+
+  assert(buffer && "failed to allocate host buffer");
+
+#pragma omp target teams distribute parallel for device(dst_device)            \
+    map(from : buffer[0 : N]) is_device_ptr(dst_ptr)
+  for (int i = 0; i < N; ++i) {
+    buffer[i] = dst_ptr[i] + magic_num;
+  }
+
+  for (int i = 0; i < N; ++i)
+    assert(buffer[i] == 2 * magic_num);
+
+  printf("PASS\n");
+
+  // Free host and device memory
+  free(buffer);
+  omp_target_free(src_ptr, src_device);
+  omp_target_free(dst_ptr, dst_device);
+
+  return 0;
+}
+
+// CHECK: PASS

@jhuber6 (Contributor) commented Mar 29, 2024

Can you update the description to describe why we want synchronous data exchange?

Commit: "Similar to H2D and D2H, use synchronous mode for large data transfers beyond a certain size for D2D as well. As with H2D and D2H, this size is controlled by an env-var."

@dhruvachak (Contributor, Author) commented:

I amended the commit message to include the following:

Similar to H2D and D2H, use synchronous mode for large data transfers
beyond a certain size for D2D as well. As with H2D and D2H, this size is
controlled by an env-var.

Comment on lines 30 to 31:

    assert(src_ptr && "src_ptr is NULL");
    assert(dst_ptr && "dst_ptr is NULL");

Reviewer (Contributor): don't assert to check allocation failure

Author reply: Changed to check and FAIL if required. The earlier change was based on an existing test which asserted.


Comment on the host buffer allocation:

    int *buffer = malloc(length);

    assert(buffer && "failed to allocate host buffer");

Reviewer (Contributor): same

Author reply: done.

Commit: "Changed test to not assert on allocation failure. Instead it checks for that condition and returns a failure status."
@jhuber6 (Contributor) commented Mar 29, 2024

I amended the commit message to include the following:

Similar to H2D and D2H, use synchronous mode for large data transfers beyond a certain size for D2D as well. As with H2D and D2H, this size is controlled by an env-var.

You should do it to the PR message as well. When the PR is merged it will squash the commits and use the PR's description instead.

@dhruvachak (Contributor, Author) commented:

I amended the commit message to include the following:
Similar to H2D and D2H, use synchronous mode for large data transfers beyond a certain size for D2D as well. As with H2D and D2H, this size is controlled by an env-var.

You should do it to the PR message as well. When the PR is merged it will squash the commits and use the PR's description instead.

Yes, I just realized that the PR comment was not updated. Now it should be updated.

@jhuber6 (Contributor) left a review:

LG, thanks

What's the expected behavior when doing a D2D memcpy onto the same device? Just wondering if that's a no-op under the hood for the purposes of the buildbot.

@dhruvachak (Contributor, Author) replied:

LG, thanks

What's the expected behavior when doing a D2D memcpy onto the same device? Just wondering if that's a no-op under the hood for the purposes of the buildbot.

The plugin still uses the same async memcpy. I don't know whether there is some underlying optimization for the same-device case.

@dhruvachak dhruvachak merged commit cc8c6b0 into llvm:main Mar 29, 2024
@dhruvachak dhruvachak deleted the add_async_exchange branch March 29, 2024 20:33
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 10, 2024
…#87032)

Similar to H2D and D2H, use synchronous mode for large data transfers
beyond a certain size for D2D as well. As with H2D and D2H, this size is
controlled by an env-var.

Partial fix for ROCm/aomp#851.

Change-Id: I25e6a9a9620191c16b9312f66369d9bc1840a625