[OpenMP] Enable automatic unified shared memory on MI300A. #77512

carlobertolli · 2024-01-09T19:08:17Z

This patch enables applications that did not request OpenMP unified_shared_memory to run with the same zero-copy behavior, where mapped memory does not result in extra memory allocations and memory copies, but CPU-allocated memory is accessed from the device. The name for this behavior is "automatic zero-copy" and it relies on detecting: that the runtime is running on a MI300A, that the user did not select unified_shared_memory in their program, and that XNACK (unified memory support) is enabled in the current GPU configuration. If all these conditions are met, then automatic zero-copy is triggered.

This patch also introduces an environment variable OMPX_APU_MAPS that, if set, triggers automatic zero-copy also on non APU GPUs (e.g., on discrete GPUs).
This patch is still missing support for global variables, which will be provided in a subsequent patch.

llvmbot · 2024-01-09T19:08:44Z

@llvm/pr-subscribers-backend-amdgpu

Author: None (carlobertolli)

Changes

This patch enables applications that did not request OpenMP unified_shared_memory to run with the same zero-copy behavior, where mapped memory does not result in extra memory allocations and memory copies, but CPU-allocated memory is accessed from the device. The name for this behavior is "automatic zero-copy" and it relies on detecting: that the runtime is running on a MI300A, that the user did not select unified_shared_memory in their program, and that XNACK (unified memory support) is enabled in the current GPU configuration. If all these conditions are met, then automatic zero-copy is triggered.

This patch also introduces an environment variable OMPX_APU_MAPS that, if set, triggers automatic zero-copy also on non APU GPUs (e.g., on discrete GPUs).
This patch is still missing support for global variables, which will be provided in a subsequent patch.

Full diff: https://github.com/llvm/llvm-project/pull/77512.diff

13 Files Affected:

(modified) openmp/libomptarget/include/Shared/PluginAPI.h (+3)
(modified) openmp/libomptarget/include/Shared/PluginAPI.inc (+1)
(modified) openmp/libomptarget/include/Shared/Requirements.h (+14-1)
(modified) openmp/libomptarget/include/device.h (+3)
(modified) openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h (+1)
(modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+49)
(modified) openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h (+28)
(modified) openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h (+5)
(modified) openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp (+10)
(modified) openmp/libomptarget/src/OpenMP/Mapping.cpp (+10-3)
(modified) openmp/libomptarget/src/PluginManager.cpp (+13)
(modified) openmp/libomptarget/src/device.cpp (+6)
(added) openmp/libomptarget/test/mapping/auto_zero_copy.cpp (+55)

diff --git a/openmp/libomptarget/include/Shared/PluginAPI.h b/openmp/libomptarget/include/Shared/PluginAPI.h
index c6aacf4ce2124b..aece53d7ee1caa 100644
--- a/openmp/libomptarget/include/Shared/PluginAPI.h
+++ b/openmp/libomptarget/include/Shared/PluginAPI.h
@@ -219,6 +219,9 @@ int32_t __tgt_rtl_initialize_record_replay(int32_t DeviceId, int64_t MemorySize,
                                            void *VAddr, bool isRecord,
                                            bool SaveOutput,
                                            uint64_t &ReqPtrArgOffset);
+
+// Returns true if the device \p DeviceId suggests to use auto zero-copy.
+int32_t __tgt_rtl_use_auto_zero_copy(int32_t DeviceId);
 }
 
 #endif // OMPTARGET_SHARED_PLUGIN_API_H
diff --git a/openmp/libomptarget/include/Shared/PluginAPI.inc b/openmp/libomptarget/include/Shared/PluginAPI.inc
index 25ebe7d437f9d1..b842c6eef1d4fc 100644
--- a/openmp/libomptarget/include/Shared/PluginAPI.inc
+++ b/openmp/libomptarget/include/Shared/PluginAPI.inc
@@ -47,3 +47,4 @@ PLUGIN_API_HANDLE(data_notify_mapped, false);
 PLUGIN_API_HANDLE(data_notify_unmapped, false);
 PLUGIN_API_HANDLE(set_device_offset, false);
 PLUGIN_API_HANDLE(initialize_record_replay, false);
+PLUGIN_API_HANDLE(use_auto_zero_copy, false);
diff --git a/openmp/libomptarget/include/Shared/Requirements.h b/openmp/libomptarget/include/Shared/Requirements.h
index 19d6b8ffca495f..b16a1650f0c403 100644
--- a/openmp/libomptarget/include/Shared/Requirements.h
+++ b/openmp/libomptarget/include/Shared/Requirements.h
@@ -33,7 +33,12 @@ enum OpenMPOffloadingRequiresDirFlags : int64_t {
   /// unified_shared_memory clause.
   OMP_REQ_UNIFIED_SHARED_MEMORY = 0x008,
   /// dynamic_allocators clause.
-  OMP_REQ_DYNAMIC_ALLOCATORS = 0x010
+  OMP_REQ_DYNAMIC_ALLOCATORS = 0x010,
+  /// Auto zero-copy extension:
+  /// when running on an APU, the GPU plugin may decide to
+  /// run in zero-copy even though the user did not program
+  /// their application with unified_shared_memory requirement.
+  OMPX_REQ_AUTO_ZERO_COPY = 0x020
 };
 
 class RequirementCollection {
@@ -65,6 +70,14 @@ class RequirementCollection {
       return;
     }
 
+    // Auto zero-copy is only valid when no other requirement has been set
+    // and it is computed at device initialization time, after the requirement
+    // flag has already been set to OMP_REQ_NONE.
+    if (SetFlags == OMP_REQ_NONE && NewFlags == OMPX_REQ_AUTO_ZERO_COPY) {
+      SetFlags = NewFlags;
+      return;
+    }
+
     // If multiple compilation units are present enforce
     // consistency across all of them for require clauses:
     //  - reverse_offload
diff --git a/openmp/libomptarget/include/device.h b/openmp/libomptarget/include/device.h
index e94f48891dab36..3023fba6cc64db 100644
--- a/openmp/libomptarget/include/device.h
+++ b/openmp/libomptarget/include/device.h
@@ -164,6 +164,9 @@ struct DeviceTy {
   /// Print all offload entries to stderr.
   void dumpOffloadEntries();
 
+  /// Ask the device whether the runtime should use auto zero-copy.
+  bool useAutoZeroCopy();
+
 private:
   /// Deinitialize the device (and plugin).
   void deinit();
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h b/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h
index 9c59d3bf824de3..3117763e35896d 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h
@@ -63,6 +63,7 @@ typedef enum {
 } hsa_amd_memory_pool_access_t;
 
 typedef enum hsa_amd_agent_info_s {
+  HSA_AMD_AGENT_INFO_CHIP_ID = 0xA000,
   HSA_AMD_AGENT_INFO_CACHELINE_SIZE = 0xA001,
   HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT = 0xA002,
   HSA_AMD_AGENT_INFO_MAX_CLOCK_FREQUENCY = 0xA003,
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index b67642e9e1bcb3..f72b741cfd675a 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -1848,6 +1848,7 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
         OMPX_StreamBusyWait("LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT", 2000000),
         OMPX_UseMultipleSdmaEngines(
             "LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES", false),
+        HSAXnackEnv("HSA_XNACK", false), OMPX_ApuMaps("OMPX_APU_MAPS", false),
         AMDGPUStreamManager(*this, Agent), AMDGPUEventManager(*this),
         AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice) {}
 
@@ -1940,6 +1941,10 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     if (auto Err = AMDGPUSignalManager.init(OMPX_InitialNumSignals))
       return Err;
 
+    // detect if device is an APU.
+    if (auto Err = checkIfAPU())
+      return Err;
+
     return Plugin::success();
   }
 
@@ -2631,6 +2636,16 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     return Plugin::success();
   }
 
+  /// Returns true if auto zero-copy the best configuration for the current
+  /// arch.
+  bool useAutoZeroCopyImpl() override {
+    // XNACK can be enabled with with kernel boot parameter or with
+    // environment variable. Automatic zero-copy is used on APUs
+    // and on dGPUs when OMPX_APU_MAPS is set to true.
+    return ((IsAPU || OMPX_ApuMaps) &&
+            (HSAXnackEnv || utils::isXnackEnabledViaKernelParam()));
+  }
+
   /// Getters and setters for stack and heap sizes.
   Error getDeviceStackSize(uint64_t &Value) override {
     Value = StackSize;
@@ -2728,6 +2743,30 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
     return Err;
   }
 
+  /// Detect if current architecture is an APU.
+  Error checkIfAPU() {
+    std::string StrGfxName(ComputeUnitKind);
+    std::transform(std::begin(StrGfxName), std::end(StrGfxName),
+                   std::begin(StrGfxName),
+                   [](char c) { return std::tolower(c); });
+    if (StrGfxName == "gfx940") {
+      IsAPU = true;
+      return Plugin::success();
+    }
+    if (StrGfxName == "gfx942") {
+      // can be MI300A or MI300X
+      uint32_t ChipID = 0;
+      if (auto Err = getDeviceAttr(HSA_AMD_AGENT_INFO_CHIP_ID, ChipID))
+        return Err;
+
+      if (!(ChipID & 0x1)) {
+        IsAPU = true;
+        return Plugin::success();
+      }
+    }
+    return Plugin::success();
+  }
+
   /// Envar for controlling the number of HSA queues per device. High number of
   /// queues may degrade performance.
   UInt32Envar OMPX_NumQueues;
@@ -2764,6 +2803,13 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   /// Use ROCm 5.7 interface for multiple SDMA engines
   BoolEnvar OMPX_UseMultipleSdmaEngines;
 
+  /// Value of HSA_XNACK environment variable.
+  BoolEnvar HSAXnackEnv;
+
+  /// Value of OMPX_APU_MAPS env var used to force
+  /// automatic zero-copy behavior on non-APU GPUs.
+  BoolEnvar OMPX_ApuMaps;
+
   /// Stream manager for AMDGPU streams.
   AMDGPUStreamManagerTy AMDGPUStreamManager;
 
@@ -2794,6 +2840,9 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
   /// The current size of the stack that will be used in cases where it could
   /// not be statically determined.
   uint64_t StackSize = 16 * 1024 /* 16 KB */;
+
+  /// Is the plugin associated with an APU?
+  bool IsAPU{false};
 };
 
 Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h b/openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
index 58a3b5df00fac6..c5a58f82441458 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h
@@ -116,6 +116,34 @@ inline bool isImageCompatibleWithEnv(StringRef ImageArch, uint32_t ImageFlags,
   return true;
 }
 
+inline bool isXnackEnabledViaKernelParam() {
+
+  ErrorOr<std::unique_ptr<MemoryBuffer>> FileOrError =
+      MemoryBuffer::getFileAsStream("/proc/cmdline");
+
+  if (std::error_code ErrorCode = FileOrError.getError()) {
+    FAILURE_MESSAGE("Cannot open /proc/cmdline : %s\n",
+                    ErrorCode.message().c_str());
+    return false;
+  }
+
+  StringRef FileContent = (FileOrError.get())->getBuffer();
+
+  StringRef RefString("amdgpu.noretry=");
+  int SizeOfRefString = RefString.size();
+
+  size_t Pos = FileContent.find_insensitive(RefString);
+  // Is noretry defined?
+  if (Pos != StringRef::npos) {
+    bool NoRetryValue = FileContent[Pos + SizeOfRefString] - '0';
+    // is noretry set to 0
+    if (!NoRetryValue)
+      return true;
+  }
+
+  return false;
+}
+
 struct KernelMetaDataTy {
   uint64_t KernelObject;
   uint32_t GroupSegmentList;
diff --git a/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h b/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
index b85dc146d86d2f..abe85f43c2e726 100644
--- a/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
+++ b/openmp/libomptarget/plugins-nextgen/common/include/PluginInterface.h
@@ -872,6 +872,11 @@ struct GenericDeviceTy : public DeviceAllocatorTy {
 
   virtual Error getDeviceStackSize(uint64_t &V) = 0;
 
+  /// Returns true if current plugin architecture is an APU
+  /// and unified_shared_memory was not requested by the program.
+  bool useAutoZeroCopy();
+  virtual bool useAutoZeroCopyImpl() { return false; }
+
 private:
   /// Register offload entry for global variable.
   Error registerGlobalOffloadEntry(DeviceImageTy &DeviceImage,
diff --git a/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp b/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp
index 9490e58fc669cd..e82c2f7bef14f0 100644
--- a/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp
+++ b/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp
@@ -1561,6 +1561,8 @@ Error GenericDeviceTy::syncEvent(void *EventPtr) {
   return syncEventImpl(EventPtr);
 }
 
+bool GenericDeviceTy::useAutoZeroCopy() { return useAutoZeroCopyImpl(); }
+
 Error GenericPluginTy::init() {
   auto NumDevicesOrErr = initImpl();
   if (!NumDevicesOrErr)
@@ -2073,6 +2075,14 @@ int32_t __tgt_rtl_set_device_offset(int32_t DeviceIdOffset) {
   return OFFLOAD_SUCCESS;
 }
 
+int32_t __tgt_rtl_use_auto_zero_copy(int32_t DeviceId) {
+  // Automatic zero-copy only applies to programs that did
+  // not request unified_shared_memory and are deployed on an
+  // APU with XNACK enabled.
+  if (Plugin::get().getRequiresFlags() & OMP_REQ_UNIFIED_SHARED_MEMORY)
+    return false;
+  return Plugin::get().getDevice(DeviceId).useAutoZeroCopy();
+}
 #ifdef __cplusplus
 }
 #endif
diff --git a/openmp/libomptarget/src/OpenMP/Mapping.cpp b/openmp/libomptarget/src/OpenMP/Mapping.cpp
index a5c24810e0af95..d95513d068afe8 100644
--- a/openmp/libomptarget/src/OpenMP/Mapping.cpp
+++ b/openmp/libomptarget/src/OpenMP/Mapping.cpp
@@ -252,8 +252,10 @@ TargetPointerResultTy MappingInfoTy::getTargetPointer(
       MESSAGE("device mapping required by 'present' map type modifier does not "
               "exist for host address " DPxMOD " (%" PRId64 " bytes)",
               DPxPTR(HstPtrBegin), Size);
-  } else if (PM->getRequirements() & OMP_REQ_UNIFIED_SHARED_MEMORY &&
-             !HasCloseModifier) {
+  } else if ((PM->getRequirements() & OMP_REQ_UNIFIED_SHARED_MEMORY &&
+              !HasCloseModifier) ||
+             (PM->getRequirements() & OMPX_REQ_AUTO_ZERO_COPY)) {
+
     // If unified shared memory is active, implicitly mapped variables that are
     // not privatized use host address. Any explicitly mapped variables also use
     // host address where correctness is not impeded. In all other cases maps
@@ -261,6 +263,10 @@ TargetPointerResultTy MappingInfoTy::getTargetPointer(
     // In addition to the mapping rules above, the close map modifier forces the
     // mapping of the variable to the device.
     if (Size) {
+      INFO(OMP_INFOTYPE_MAPPING_CHANGED, Device.DeviceID,
+           "Return HstPtrBegin " DPxMOD " Size=%" PRId64 " for unified shared "
+           "memory\n",
+           DPxPTR((uintptr_t)HstPtrBegin), Size);
       DP("Return HstPtrBegin " DPxMOD " Size=%" PRId64 " for unified shared "
          "memory\n",
          DPxPTR((uintptr_t)HstPtrBegin), Size);
@@ -415,7 +421,8 @@ TargetPointerResultTy MappingInfoTy::getTgtPtrBegin(
          LR.TPR.getEntry()->dynRefCountToStr().c_str(), DynRefCountAction,
          LR.TPR.getEntry()->holdRefCountToStr().c_str(), HoldRefCountAction);
     LR.TPR.TargetPointer = (void *)TP;
-  } else if (PM->getRequirements() & OMP_REQ_UNIFIED_SHARED_MEMORY) {
+  } else if (PM->getRequirements() & OMP_REQ_UNIFIED_SHARED_MEMORY ||
+             PM->getRequirements() & OMPX_REQ_AUTO_ZERO_COPY) {
     // If the value isn't found in the mapping and unified shared memory
     // is on then it means we have stumbled upon a value which we need to
     // use directly from the host.
diff --git a/openmp/libomptarget/src/PluginManager.cpp b/openmp/libomptarget/src/PluginManager.cpp
index 83bf65f0f0de56..b624b8cff8ab53 100644
--- a/openmp/libomptarget/src/PluginManager.cpp
+++ b/openmp/libomptarget/src/PluginManager.cpp
@@ -144,6 +144,11 @@ void PluginAdaptorTy::initDevices(PluginManager &PM) {
 
   int32_t NumPD = getNumberOfPluginDevices();
   ExclusiveDevicesAccessor->reserve(DeviceOffset + NumPD);
+  // Auto zero-copy is a per-device property. We need to ensure
+  // that all devices are suggesting to use it.
+  bool UseAutoZeroCopy = true;
+  if (NumPD == 0)
+    UseAutoZeroCopy = false;
   for (int32_t PDevI = 0, UserDevId = DeviceOffset; PDevI < NumPD; PDevI++) {
     auto Device = std::make_unique<DeviceTy>(this, UserDevId, PDevI);
     if (auto Err = Device->init()) {
@@ -151,12 +156,20 @@ void PluginAdaptorTy::initDevices(PluginManager &PM) {
          toString(std::move(Err)).c_str());
       continue;
     }
+    UseAutoZeroCopy = UseAutoZeroCopy && Device->useAutoZeroCopy();
 
     ExclusiveDevicesAccessor->push_back(std::move(Device));
     ++NumberOfUserDevices;
     ++UserDevId;
   }
 
+  // Auto Zero-Copy can only be currently triggered when the system is an
+  // homogeneous APU architecture without attached discrete GPUs.
+  // If all devices suggest to use it, change requirment flags to trigger
+  // zero-copy behavior when mapping memory.
+  if (UseAutoZeroCopy)
+    PM.addRequirements(OMPX_REQ_AUTO_ZERO_COPY);
+
   DP("Plugin adaptor " DPxMOD " has index %d, exposes %d out of %d devices!\n",
      DPxPTR(LibraryHandler.get()), DeviceOffset, NumberOfUserDevices,
      NumberOfPluginDevices);
diff --git a/openmp/libomptarget/src/device.cpp b/openmp/libomptarget/src/device.cpp
index fa8932361a51c9..55f46e6bd6374e 100644
--- a/openmp/libomptarget/src/device.cpp
+++ b/openmp/libomptarget/src/device.cpp
@@ -339,3 +339,9 @@ void DeviceTy::dumpOffloadEntries() {
     fprintf(stderr, "  %11s: %s\n", Kind, It.second.getNameAsCStr());
   }
 }
+
+bool DeviceTy::useAutoZeroCopy() {
+  if (RTL->use_auto_zero_copy)
+    return RTL->use_auto_zero_copy(RTLDeviceID);
+  return false;
+}
diff --git a/openmp/libomptarget/test/mapping/auto_zero_copy.cpp b/openmp/libomptarget/test/mapping/auto_zero_copy.cpp
new file mode 100644
index 00000000000000..4201399d489599
--- /dev/null
+++ b/openmp/libomptarget/test/mapping/auto_zero_copy.cpp
@@ -0,0 +1,55 @@
+// clang-format off
+// RUN: %libomptarget-compilexx-generic
+// RUN: env OMPX_APU_MAPS=1 HSA_XNACK=1 LIBOMPTARGET_INFO=30 %libomptarget-run-generic 2>&1 \
+// RUN: | %fcheck-generic -check-prefix=INFO_ZERO -check-prefix=CHECK
+
+// RUN: %libomptarget-compilexx-generic
+// RUN: env HSA_XNACK=0 LIBOMPTARGET_INFO=30 %libomptarget-run-generic 2>&1 \
+// RUN: | %fcheck-generic -check-prefix=INFO_COPY -check-prefix=CHECK
+
+// UNSUPPORTED: aarch64-unknown-linux-gnu
+// UNSUPPORTED: aarch64-unknown-linux-gnu-LTO
+// UNSUPPORTED: nvptx64-nvidia-cuda
+// UNSUPPORTED: nvptx64-nvidia-cuda-LTO
+// UNSUPPORTED: x86_64-pc-linux-gnu
+// UNSUPPORTED: x86_64-pc-linux-gnu-LTO
+
+// clang-format on
+
+#include <cstdio>
+
+int main() {
+  int n = 1024;
+
+  // test various mapping types
+  int *a = new int[n];
+  int k = 3;
+  int b[n];
+
+  for (int i = 0; i < n; i++)
+    b[i] = i;
+
+    // clang-format off
+  // INFO_ZERO: Return HstPtrBegin 0x{{.*}} Size=4096 for unified shared memory
+  // INFO_ZERO: Return HstPtrBegin 0x{{.*}} Size=4096 for unified shared memory
+
+  // INFO_COPY: Creating new map entry with HstPtrBase=0x{{.*}}, HstPtrBegin=0x{{.*}}, TgtAllocBegin=0x{{.*}}, TgtPtrBegin=0x{{.*}}, Size=4096,
+  // INFO_COPY: Creating new map entry with HstPtrBase=0x{{.*}}, HstPtrBegin=0x{{.*}}, TgtAllocBegin=0x{{.*}}, TgtPtrBegin=0x{{.*}}, Size=4096,
+  // INFO_COPY: Mapping exists with HstPtrBegin=0x{{.*}}, TgtPtrBegin=0x{{.*}}, Size=4096, DynRefCount=1 (update suppressed)
+  // INFO_COPY: Mapping exists with HstPtrBegin=0x{{.*}}, TgtPtrBegin=0x{{.*}}, Size=4096, DynRefCount=1 (update suppressed)
+// clang-format on
+#pragma omp target teams distribute parallel for map(tofrom : a[ : n])         \
+    map(to : b[ : n])
+  for (int i = 0; i < n; i++)
+    a[i] = i + b[i] + k;
+
+  int err = 0;
+  for (int i = 0; i < n; i++)
+    if (a[i] != i + b[i] + k)
+      err++;
+
+  // CHECK: PASS
+  if (err == 0)
+    printf("PASS\n");
+  return err;
+}

carlobertolli · 2024-01-09T19:09:05Z

Reland #75999

carlobertolli · 2024-01-10T02:59:14Z

Updated comment to describe when automatic zero-copy is triggered for AMD gpus.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

openmp/libomptarget/plugins-nextgen/amdgpu/dynamic_hsa/hsa_ext_amd.h

openmp/libomptarget/include/Shared/Requirements.h

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h

carlobertolli · 2024-01-11T18:41:45Z

I believe I have addressed all concerns with the latest update. I also added a small refactoring of common code in the amdgpu plugin, used to obtain a string with target triple and features.
Thanks for the comments!

openmp/libomptarget/include/Shared/PluginAPI.h

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

openmp/libomptarget/src/PluginManager.cpp

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

jhuber6 · 2024-01-12T21:25:05Z

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

@@ -2728,6 +2780,34 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
    return Err;
  }

+  /// Detect if current architecture is an APU.
+  Error checkIfAPU() {


I was talking with SuiChi about checking this. He says that HSA_PROFILE_FULL should be set if the pointters are usable https://alt-doc-doc-test.readthedocs.io/en/latest/ROCm_API_References/ROCr-Runtime.html#group__agentinfo_1gacafd4247e2a04cbe0ac0b3998c127532. This is apparently how it's used in rocm-info https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/amd_gpu_agent.cpp#L115. So you need to query agent info with the profile as an argument.

Host-allocated ppointers are usable on any gpu that supports unified memory, including MI200 and others.
This is not what we are determining here. What we are finding is whether the architecture is an APU, not a discrete GPU with unified memory capability. If this is changed to return true on MI200, it will make automatic zero-copy the default behavior on that gpu if XNACK is enabled. That's not what we want. What we want is to enable automatic zero-copy on APUs by default, but not enable it by default on discrete GPUs.
This logic follows an observation that OpenMP applications on discrete GPUs are typically optimized for data movement, and that configuration performs better than driver-level paging.
HIP doesn't have a concept of automatic zero-copy, as far as I know.

Well maybe this function should be renamed then? Because I saw that the linked code above used that method to determine is_apu_node, which seemed identical to what this is trying to determine. So this profile is set based off of is_apu_node in the above source. So is this referring to something different?

The patch (or libomptarget after patching it) doesn't have a "is_apu_node" name in it (either function or variable). I wonder if you are getting confused by successive amend's or if you are referring to downstream?

I'm just confused, because the name suggests that the full profile implies we have USM, which is what's necessary for this behavior. The ROCR code in question uses the same logic to set a variable called isAPU which suggests the same logic t his checkIfAPU is trying to accomplish. I guess the problem is that this is more general than what we're going for here and merely want it for this single MI300 architecture?

Leaving a comment here about investigations into rocr behavior: that code does not apply to MI300A. A new API to be able to detect an APU might be available soon, but in the meantime we need to go with the code that we have. I will modify to use the relevant ROCr API if and when it becomes available.

Alright, maybe put a TODO here to update this function once said feature is available or something.

jhuber6

A few final nits and I'll approve it.

jhuber6 · 2024-01-16T22:17:44Z

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

+    StringRef TargeTripleAndFeatures(*TargeTripleAndFeaturesOrError);
+    if (TargeTripleAndFeatures.contains("xnack+"))


Nit, it's just do ->contains and save the extra var.

jhuber6 · 2024-01-16T22:21:21Z

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

+    llvm::StringRef TripleTarget(ISAName.begin(), Length);
+    if (TripleTarget.consume_front("amdgcn-amd-amdhsa"))
+      Target = TripleTarget.ltrim('-').rtrim('\0').str();
+    return HSA_STATUS_SUCCESS;


I never looked at this much, what's the expected result here if there are multiple ISAs? Isn't there HSA_STATUS_INFO_BREAK to make the iterator stop?

We really loop over a single agent and I checked with rocr team: it can only resolve in a single isa. I am guessing that the HSA API (of which ROCr is only an implementation) leaves room for multiple ISAs associated with each agent, but that doesn't happen in this case.

jhuber6 · 2024-01-16T22:22:17Z

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

@@ -2728,6 +2780,34 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
    return Err;
  }

+  /// Detect if current architecture is an APU.
+  Error checkIfAPU() {


Alright, maybe put a TODO here to update this function once said feature is available or something.

jhuber6 · 2024-01-16T22:22:59Z

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

+                        .Default(false);
+    if (!MayBeAPU)
+      return Plugin::success();
+    else {


What's with this else? If we returned then it shouldn't be necessary.

jhuber6

LG, Thanks

This patch enables applications that did not request OpenMP unified_shared_memory to run with the same zero-copy behavior, where mapped memory does not result in extra memory allocations and memory copies, but CPU-allocated memory is accessed from the device. The name for this behavior is "automatic zero-copy" and it relies on detecting: that the runtime is running on a MI300A, that the user did not select unified_shared_memory in their program, and that XNACK (unified memory support) is enabled in the current GPU configuration. If all these conditions are met, then automatic zero-copy is triggered. This patch also introduces an environment variable OMPX_APU_MAPS that, if set, triggers automatic zero-copy also on non APU GPUs (e.g., on discrete GPUs). This patch is still missing support for global variables, which will be provided in a subsequent patch. Co-authored-by: Thorsten Blass <[email protected]> Co-authored-by: Carlo Bertolli <[email protected]>

lands and reverts: [OpenMP] Enable automatic unified shared memory on MI300A. (llvm#77512) Change-Id: I69d3d09d7aad0b0f1709f30ee0a940ec2b202b83

This patch includes re-organizing some of the functionalities in the AMDGPU plugin in order to re-enable them after trunk moved support for automatic zero-copy from the AMDGPU plugin class to the AMDGPU device class. Change-Id: I2b2df57fddd2d0e82b2a840c362162fb0c64f61f

carlobertolli requested review from jdoerfert, jhuber6 and jplehr January 9, 2024 19:08

llvmbot added backend:AMDGPU openmp:libomptarget OpenMP offload runtime labels Jan 9, 2024

carlobertolli requested review from doru1004, ronlieb and saiislam January 9, 2024 19:08

carlobertolli force-pushed the RelandAutoZeroCopy branch from 449d4b9 to d8654e4 Compare January 10, 2024 02:58

arsenm reviewed Jan 10, 2024

View reviewed changes

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

jhuber6 reviewed Jan 10, 2024

View reviewed changes

carlobertolli force-pushed the RelandAutoZeroCopy branch from d8654e4 to 2c7a46c Compare January 11, 2024 18:40

jhuber6 reviewed Jan 11, 2024

View reviewed changes

openmp/libomptarget/include/Shared/PluginAPI.h Show resolved Hide resolved

jhuber6 reviewed Jan 11, 2024

View reviewed changes

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

carlobertolli force-pushed the RelandAutoZeroCopy branch from 2c7a46c to 5438fe3 Compare January 11, 2024 21:52

jhuber6 reviewed Jan 12, 2024

View reviewed changes

openmp/libomptarget/src/PluginManager.cpp Outdated Show resolved Hide resolved

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp Outdated Show resolved Hide resolved

jhuber6 reviewed Jan 12, 2024

View reviewed changes

carlobertolli force-pushed the RelandAutoZeroCopy branch from 5438fe3 to c4e765d Compare January 16, 2024 16:22

jhuber6 reviewed Jan 16, 2024

View reviewed changes

carlobertolli force-pushed the RelandAutoZeroCopy branch from c4e765d to 10c8559 Compare January 19, 2024 00:16

jhuber6 approved these changes Jan 19, 2024

View reviewed changes

carlobertolli force-pushed the RelandAutoZeroCopy branch from 10c8559 to 3d1093b Compare January 22, 2024 16:06

carlobertolli merged commit ae99966 into llvm:main Jan 22, 2024

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Jan 23, 2024

merge main into amd-staging

7727056

lands and reverts: [OpenMP] Enable automatic unified shared memory on MI300A. (llvm#77512) Change-Id: I69d3d09d7aad0b0f1709f30ee0a940ec2b202b83

		StringRef TargeTripleAndFeatures(*TargeTripleAndFeaturesOrError);
		if (TargeTripleAndFeatures.contains("xnack+"))

[OpenMP] Enable automatic unified shared memory on MI300A. #77512

[OpenMP] Enable automatic unified shared memory on MI300A. #77512

Uh oh!

Conversation

carlobertolli commented Jan 9, 2024

Uh oh!

llvmbot commented Jan 9, 2024

Uh oh!

carlobertolli commented Jan 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

carlobertolli commented Jan 10, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

carlobertolli commented Jan 11, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhuber6 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhuber6 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

carlobertolli commented Jan 9, 2024 •

edited

Loading