[amdgpu-arch] Replace use of HSA with reading sysfs directly #116651
Conversation
Summary: For Linux systems, we currently use the HSA library to determine the installed GPUs. However, this isn't really necessary and adds a dependency on the HSA runtime as well as a lot of overhead. Instead, this patch uses the `sysfs` interface exposed by `amdkfd` to do this directly.
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Full diff: https://github.com/llvm/llvm-project/pull/116651.diff

4 Files Affected:
diff --git a/clang/tools/amdgpu-arch/AMDGPUArch.cpp b/clang/tools/amdgpu-arch/AMDGPUArch.cpp
index 7ae57b7877e1fe..6c10cbc5c46a83 100644
--- a/clang/tools/amdgpu-arch/AMDGPUArch.cpp
+++ b/clang/tools/amdgpu-arch/AMDGPUArch.cpp
@@ -25,7 +25,7 @@ static void PrintVersion(raw_ostream &OS) {
OS << clang::getClangToolFullVersion("amdgpu-arch") << '\n';
}
-int printGPUsByHSA();
+int printGPUsByKFD();
int printGPUsByHIP();
int main(int argc, char *argv[]) {
@@ -45,7 +45,7 @@ int main(int argc, char *argv[]) {
}
#ifndef _WIN32
- if (!printGPUsByHSA())
+ if (!printGPUsByKFD())
return 0;
#endif
diff --git a/clang/tools/amdgpu-arch/AMDGPUArchByHSA.cpp b/clang/tools/amdgpu-arch/AMDGPUArchByHSA.cpp
deleted file mode 100644
index 432f2c414ed244..00000000000000
--- a/clang/tools/amdgpu-arch/AMDGPUArchByHSA.cpp
+++ /dev/null
@@ -1,122 +0,0 @@
-//===- AMDGPUArchByHSA.cpp - list AMDGPU installed ------*- C++ -*---------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-//
-// This file implements a tool for detecting name of AMDGPU installed in system
-// using HSA on Linux. This tool is used by AMDGPU OpenMP and HIP driver.
-//
-//===----------------------------------------------------------------------===//
-
-#include "clang/Basic/Version.h"
-#include "llvm/Support/CommandLine.h"
-#include "llvm/Support/DynamicLibrary.h"
-#include "llvm/Support/Error.h"
-#include "llvm/Support/raw_ostream.h"
-#include <memory>
-#include <string>
-#include <vector>
-
-using namespace llvm;
-
-typedef enum {
- HSA_STATUS_SUCCESS = 0x0,
-} hsa_status_t;
-
-typedef enum {
- HSA_DEVICE_TYPE_CPU = 0,
- HSA_DEVICE_TYPE_GPU = 1,
-} hsa_device_type_t;
-
-typedef enum {
- HSA_AGENT_INFO_NAME = 0,
- HSA_AGENT_INFO_DEVICE = 17,
-} hsa_agent_info_t;
-
-typedef struct hsa_agent_s {
- uint64_t handle;
-} hsa_agent_t;
-
-hsa_status_t (*hsa_init)();
-hsa_status_t (*hsa_shut_down)();
-hsa_status_t (*hsa_agent_get_info)(hsa_agent_t, hsa_agent_info_t, void *);
-hsa_status_t (*hsa_iterate_agents)(hsa_status_t (*)(hsa_agent_t, void *),
- void *);
-
-constexpr const char *DynamicHSAPath = "libhsa-runtime64.so";
-
-llvm::Error loadHSA() {
- std::string ErrMsg;
- auto DynlibHandle = std::make_unique<llvm::sys::DynamicLibrary>(
- llvm::sys::DynamicLibrary::getPermanentLibrary(DynamicHSAPath, &ErrMsg));
- if (!DynlibHandle->isValid()) {
- return llvm::createStringError(llvm::inconvertibleErrorCode(),
- "Failed to 'dlopen' %s", DynamicHSAPath);
- }
-#define DYNAMIC_INIT(SYMBOL) \
- { \
- void *SymbolPtr = DynlibHandle->getAddressOfSymbol(#SYMBOL); \
- if (!SymbolPtr) \
- return llvm::createStringError(llvm::inconvertibleErrorCode(), \
- "Failed to 'dlsym' " #SYMBOL); \
- SYMBOL = reinterpret_cast<decltype(SYMBOL)>(SymbolPtr); \
- }
- DYNAMIC_INIT(hsa_init);
- DYNAMIC_INIT(hsa_shut_down);
- DYNAMIC_INIT(hsa_agent_get_info);
- DYNAMIC_INIT(hsa_iterate_agents);
-#undef DYNAMIC_INIT
- return llvm::Error::success();
-}
-
-static hsa_status_t iterateAgentsCallback(hsa_agent_t Agent, void *Data) {
- hsa_device_type_t DeviceType;
- hsa_status_t Status =
- hsa_agent_get_info(Agent, HSA_AGENT_INFO_DEVICE, &DeviceType);
-
- // continue only if device type if GPU
- if (Status != HSA_STATUS_SUCCESS || DeviceType != HSA_DEVICE_TYPE_GPU) {
- return Status;
- }
-
- std::vector<std::string> *GPUs =
- static_cast<std::vector<std::string> *>(Data);
- char GPUName[64];
- Status = hsa_agent_get_info(Agent, HSA_AGENT_INFO_NAME, GPUName);
- if (Status != HSA_STATUS_SUCCESS) {
- return Status;
- }
- GPUs->push_back(GPUName);
- return HSA_STATUS_SUCCESS;
-}
-
-int printGPUsByHSA() {
- // Attempt to load the HSA runtime.
- if (llvm::Error Err = loadHSA()) {
- logAllUnhandledErrors(std::move(Err), llvm::errs());
- return 1;
- }
-
- hsa_status_t Status = hsa_init();
- if (Status != HSA_STATUS_SUCCESS) {
- return 1;
- }
-
- std::vector<std::string> GPUs;
- Status = hsa_iterate_agents(iterateAgentsCallback, &GPUs);
- if (Status != HSA_STATUS_SUCCESS) {
- return 1;
- }
-
- for (const auto &GPU : GPUs)
- llvm::outs() << GPU << '\n';
-
- if (GPUs.size() < 1)
- return 1;
-
- hsa_shut_down();
- return 0;
-}
diff --git a/clang/tools/amdgpu-arch/AMDGPUArchByKFD.cpp b/clang/tools/amdgpu-arch/AMDGPUArchByKFD.cpp
new file mode 100644
index 00000000000000..c7590572de63d4
--- /dev/null
+++ b/clang/tools/amdgpu-arch/AMDGPUArchByKFD.cpp
@@ -0,0 +1,74 @@
+//===- AMDGPUArchByKFD.cpp - list AMDGPU installed ------*- C++ -*---------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file implements a tool for detecting name of AMD GPUs installed in
+// system using the Linux sysfs interface for the AMD KFD driver.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/LineIterator.h"
+#include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Support/Path.h"
+#include <memory>
+#include <string>
+
+using namespace llvm;
+
+constexpr const char *KFD_SYSFS_NODE_PATH =
+ "/sys/devices/virtual/kfd/kfd/topology/nodes";
+
+constexpr static long getMajor(long Ver) { return (Ver / 10000) % 100; }
+constexpr static long getMinor(long Ver) { return (Ver / 100) % 100; }
+constexpr static long getStep(long Ver) { return Ver % 100; }
+
+int printGPUsByKFD() {
+ SmallVector<std::pair<long, long>> Devices;
+ std::error_code EC;
+ for (sys::fs::directory_iterator Begin(KFD_SYSFS_NODE_PATH, EC), End;
+ Begin != End; Begin.increment(EC)) {
+ if (EC)
+ return 1;
+
+ long Node = 0;
+ if (sys::path::stem(Begin->path()).consumeInteger(10, Node))
+ return 1;
+
+ SmallVector<char> Path(Begin->path().begin(), Begin->path().end());
+ sys::path::append(Path, "properties");
+
+ ErrorOr<std::unique_ptr<MemoryBuffer>> BufferOrErr =
+ MemoryBuffer::getFileOrSTDIN(Path);
+ if (std::error_code EC = BufferOrErr.getError())
+ return 1;
+
+ long GFXVersion = 0;
+ for (line_iterator Lines(**BufferOrErr, false); !Lines.is_at_end();
+ ++Lines) {
+ if (Lines->starts_with("gfx_target_version")) {
+ if (Lines->drop_front(sizeof("gfx_target_version"))
+ .consumeInteger(10, GFXVersion))
+ return 1;
+ break;
+ }
+ }
+
+ // If this is zero the node is a CPU.
+ if (GFXVersion == 0)
+ continue;
+ Devices.emplace_back(Node, GFXVersion);
+ }
+
+ // Sort the devices by their node to make sure it prints in order.
+ llvm::sort(Devices, [](auto &L, auto &R) { return L.first < R.first; });
+ for (const auto &[Node, GFXVersion] : Devices)
+ std::fprintf(stdout, "gfx%ld%ld%lx\n", getMajor(GFXVersion),
+ getMinor(GFXVersion), getStep(GFXVersion));
+
+ return 0;
+}
diff --git a/clang/tools/amdgpu-arch/CMakeLists.txt b/clang/tools/amdgpu-arch/CMakeLists.txt
index 1657c701251308..c4c8de614565a7 100644
--- a/clang/tools/amdgpu-arch/CMakeLists.txt
+++ b/clang/tools/amdgpu-arch/CMakeLists.txt
@@ -8,6 +8,6 @@
set(LLVM_LINK_COMPONENTS Support)
-add_clang_tool(amdgpu-arch AMDGPUArch.cpp AMDGPUArchByHSA.cpp AMDGPUArchByHIP.cpp)
+add_clang_tool(amdgpu-arch AMDGPUArch.cpp AMDGPUArchByKFD.cpp AMDGPUArchByHIP.cpp)
target_link_libraries(amdgpu-arch PRIVATE clangBasic)
for (sys::fs::directory_iterator Begin(KFD_SYSFS_NODE_PATH, EC), End;
     Begin != End; Begin.increment(EC)) {
  if (EC)
    return 1;
Should print some errors when this fails
I was unsure if I should, since we can still fall back to HIP in the current implementation.
@jhuber6 can you comment on "lot of overhead" and if that matters? Also, not sure why the HSA library dependence is a problem. This seems to be exposing amdgpu-arch to more maintenance overhead.
Sometimes the driver will hang, and since this is used inside of
I don't really understand why cluster users are compiling on a system where the GPUs are being stressed, and I still don't see why it's a good idea to break layering for this case. Also, I wasn't aware that the "native" offload arch is supported by ROCm.
I'd like to eliminate a class of failures I've seen with
HIP does, at least upstream; I don't know if that was modified in ROCm. OpenMP supports it in the fork.
They must never ever do this. It is their fault and problem if this breaks.
The issue also happens on machines with one GPU if amdgpu-arch is executed multiple times in a short period, due to some limitation of the driver. --offload-arch=native calls amdgpu-arch to get the actual GPU archs.
I think we could use this approach for Linux. amdgpu-arch can still use the HIP runtime to detect GPUs for both Linux and Windows, so we still have a fallback even if we remove the HSA approach.
I feel like this is a workaround. Can we not fix the "limitation of the driver", and if there is an issue with HSA overhead, shouldn't we file a ticket?
I don't know for sure, but I'd guess that the limitations seen may be related to the limit of 128 or so doorbells I think we have somewhere. I don't see this as a workaround; it deliberately gets the same information without requiring
We shouldn't need to load a driver to query the devices on the system. This exists for this purpose, if there isn't going to be some ioctl to read it for you.
Here's a question, should it respect
I think it depends on whether you consider this as a ROCm toolchain. If not, I'd prefer not to be bound to any ROCm related stuff.
Agreed, maybe in the fork, but it makes sense to leave community LLVM alone.
Maybe add this information into the beginning of the file.
No. That is a transient runtime debug flag. This is not executing code on the device.
It depends on the device and the system. In this case, since Linux provides an alternative, I agree it makes sense to use it, and I now see why this is not a workaround. Thanks.
I think there was already a ticket, but it cannot be fixed easily.
Absolutely yes, awesome. I had a todo to have the kernel export this under sysfs literally years ago and didn't get around to working out their commit structure; delighted to see it is exposed. The unreliable HSA calls have been a consistent nuisance for ages, great to see we don't need that hack any more.
A possible after-commit fix is to replace all the `return 1` statements with different numbers (maybe an enum?) so that we have exit 0 on success and an exit code which indicates what went wrong on failure.
Oh. I now see there was a bunch of discussion about this, will add some context.

The driver has a hard limit on how many processes can open it at a time. clang calls this utility to ask what GPU to compile for by default. Put those together and a parallel build on a vaguely modern desktop immediately blows through that process limit and proceeds to fail, so the user has to deliberately build code slowly, or specify the GPU by hand, to work around our tooling falling over.

The limit on the number of processes is generally reasonable: launching hundreds of processes that all open the GPU, allocate queues, and launch kernels is generally a disaster, and having the kernel return an equivalent to "too many open" is great. However, in the specific case where we are only asking for trivial information, which doesn't need to allocate a queue or do anything whatsoever with the GPU, this is a spurious and annoying limitation.

I suppose it could be "fixed" in the driver, some sort of reference count which is incremented when you do something non-trivial instead of on open, but I'd expect the kernel people to tell us to stop opening hundreds of processes when none of them do any work.

One clumsy outstanding thing is that this should now be some code in a header that clang includes, so that instead of the subprocess shell to handle arch=native, clang just looks up the information directly. @b-sumner does the additional context make the design choice clear?
Would only work for Linux unfortunately, unless some Windows driver developer out there knows if there's some similar win32 magic.
Windows getting subprocess calls until their driver catches up (or someone points out how to do this) seems fine to me. Linux people get a slightly faster clang, and maybe even one that prints useful error messages while it dies derived from the return code.
LLVM Buildbot has detected new failures on several builders. Full details are available at:

- https://lab.llvm.org/buildbot/#/builders/168/builds/5681
- https://lab.llvm.org/buildbot/#/builders/72/builds/5473
- https://lab.llvm.org/buildbot/#/builders/51/builds/6547
- https://lab.llvm.org/buildbot/#/builders/145/builds/3271
- https://lab.llvm.org/buildbot/#/builders/186/builds/4148
- https://lab.llvm.org/buildbot/#/builders/64/builds/1492