-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[OpenMP] [amdgpu] Added a synchronous version of data exchange. #87032
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-backend-amdgpu Author: None (dhruvachak) ChangesFull diff: https://github.com/llvm/llvm-project/pull/87032.diff 2 Files Affected:
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 2dd08dd5d0b4ea..a0fdde951b74a7 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -2402,6 +2402,27 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
AsyncInfoWrapperTy &AsyncInfoWrapper) override {
AMDGPUDeviceTy &DstDevice = static_cast<AMDGPUDeviceTy &>(DstGenericDevice);
+ // For large transfers use synchronous behavior.
+ if (Size >= OMPX_MaxAsyncCopyBytes) {
+ if (AsyncInfoWrapper.hasQueue())
+ if (auto Err = synchronize(AsyncInfoWrapper))
+ return Err;
+
+ AMDGPUSignalTy Signal;
+ if (auto Err = Signal.init())
+ return Err;
+
+ if (auto Err = utils::asyncMemCopy(
+ useMultipleSdmaEngines(), DstPtr, DstDevice.getAgent(), SrcPtr,
+ getAgent(), (uint64_t)Size, 0, nullptr, Signal.get()))
+ return Err;
+
+ if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
+ return Err;
+
+ return Signal.deinit();
+ }
+
AMDGPUStreamTy *Stream = nullptr;
if (auto Err = getStream(AsyncInfoWrapper, Stream))
return Err;
diff --git a/openmp/libomptarget/test/offloading/d2d_memcpy_sync.c b/openmp/libomptarget/test/offloading/d2d_memcpy_sync.c
new file mode 100644
index 00000000000000..a768cd1209ac52
--- /dev/null
+++ b/openmp/libomptarget/test/offloading/d2d_memcpy_sync.c
@@ -0,0 +1,67 @@
+// RUN: %libomptarget-compile-generic && \
+// RUN: env LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=0 %libomptarget-run-generic | \
+// RUN: %fcheck-generic -allow-empty
+// REQUIRES: amdgcn-amd-amdhsa
+
+#include <assert.h>
+#include <omp.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+const int magic_num = 7;
+
+int main(int argc, char *argv[]) {
+ const int N = 128;
+ const int num_devices = omp_get_num_devices();
+
+ // No target device, just return
+ if (num_devices == 0) {
+ printf("PASS\n");
+ return 0;
+ }
+
+ const int src_device = 0;
+ int dst_device = num_devices - 1;
+
+ int length = N * sizeof(int);
+ int *src_ptr = omp_target_alloc(length, src_device);
+ int *dst_ptr = omp_target_alloc(length, dst_device);
+
+ assert(src_ptr && "src_ptr is NULL");
+ assert(dst_ptr && "dst_ptr is NULL");
+
+#pragma omp target teams distribute parallel for device(src_device) \
+ is_device_ptr(src_ptr)
+ for (int i = 0; i < N; ++i) {
+ src_ptr[i] = magic_num;
+ }
+
+ int rc =
+ omp_target_memcpy(dst_ptr, src_ptr, length, 0, 0, dst_device, src_device);
+
+ assert(rc == 0 && "error in omp_target_memcpy");
+
+ int *buffer = malloc(length);
+
+ assert(buffer && "failed to allocate host buffer");
+
+#pragma omp target teams distribute parallel for device(dst_device) \
+ map(from : buffer[0 : N]) is_device_ptr(dst_ptr)
+ for (int i = 0; i < N; ++i) {
+ buffer[i] = dst_ptr[i] + magic_num;
+ }
+
+ for (int i = 0; i < N; ++i)
+ assert(buffer[i] == 2 * magic_num);
+
+ printf("PASS\n");
+
+ // Free host and device memory
+ free(buffer);
+ omp_target_free(src_ptr, src_device);
+ omp_target_free(dst_ptr, dst_device);
+
+ return 0;
+}
+
+// CHECK: PASS
|
Can you update the description to describe why we want synchronous data exchange? |
Similar to H2D and D2H, use synchronous mode for large data transfers beyond a certain size for D2D as well. As with H2D and D2H, this size is controlled by an env-var.
5f65fb6
to
962bb4c
Compare
I amended the commit message to include the following: Similar to H2D and D2H, use synchronous mode for large data transfers |
assert(src_ptr && "src_ptr is NULL"); | ||
assert(dst_ptr && "dst_ptr is NULL"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't assert to check allocation failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to check and FAIL if required. The earlier change was based on an existing test which asserted.
|
||
int *buffer = malloc(length); | ||
|
||
assert(buffer && "failed to allocate host buffer"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Changed test to not assert on allocation failure. Instead it checks for that condition and returns a failure status.
You should do it to the PR message as well. When the PR is merged it will squash the commits and use the PR's description instead. |
Yes, I just realized that the PR comment was not updated. Now it should be updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG, thanks
What's the expected behavior when doing a D2D memcpy onto the same device? Just wondering if that's a no-op under the hood for the purposes of the buildbot.
The plugin still uses the same async memcpy. I don't know whether there is some underlying optimization for the same-device case. |
…#87032) Similar to H2D and D2H, use synchronous mode for large data transfers beyond a certain size for D2D as well. As with H2D and D2H, this size is controlled by an env-var. Partial fix for ROCm/aomp#851. Change-Id: I25e6a9a9620191c16b9312f66369d9bc1840a625
Similar to H2D and D2H, use synchronous mode for large data transfers
beyond a certain size for D2D as well. As with H2D and D2H, this size is
controlled by an env-var.