Skip to content

[libc] Add Timing Utils for AMDGPU #96828

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 10, 2024
Merged

Conversation

jameshu15869
Copy link
Contributor

@jameshu15869 jameshu15869 commented Jun 26, 2024

PR for adding AMDGPU timing utils for benchmarking.

I was not able to test this code since I do not have an AMD GPU, but I was able to successfully compile this code using -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo -DRUNTIMES_amdgcn_amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a to force the code to compile without having an AMD gpu on my machine.

@jhuber6

@llvmbot
Copy link
Member

llvmbot commented Jun 26, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: None (jameshu15869)

Changes

PR for adding AMDGPU timing utils for benchmarking.

I was not able to test this code since I do not have an AMD GPU, but I was able to successfully compile this code using -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo -DRUNTIMES_amdgcn_amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a to force the code to compile without having an AMD gpu on my machine.


Full diff: https://github.com/llvm/llvm-project/pull/96828.diff

3 Files Affected:

  • (added) libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt (+7)
  • (added) libc/benchmarks/gpu/timing/amdgpu/timing.h (+73)
  • (modified) libc/benchmarks/gpu/timing/timing.h (+1-1)
diff --git a/libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt b/libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt
new file mode 100644
index 0000000000000..179429db9a09a
--- /dev/null
+++ b/libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt
@@ -0,0 +1,7 @@
+add_header_library(
+  amdgpu_timing
+  HDRS
+    timing.h
+  DEPENDS
+    libc.src.__support.common
+)
diff --git a/libc/benchmarks/gpu/timing/amdgpu/timing.h b/libc/benchmarks/gpu/timing/amdgpu/timing.h
new file mode 100644
index 0000000000000..3d13826ffee30
--- /dev/null
+++ b/libc/benchmarks/gpu/timing/amdgpu/timing.h
@@ -0,0 +1,73 @@
+//===------------- AMDGPU implementation of timing utils --------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
+#define LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
+
+#include "src/__support/GPU/utils.h"
+#include "src/__support/common.h"
+#include "src/__support/macros/attributes.h"
+#include "src/__support/macros/config.h"
+
+#include <stdint.h>
+
+namespace LIBC_NAMESPACE {
+
+// Returns the overhead associated with calling the profiling region. This
+// allows us to substract the constant-time overhead from the latency to
+// obtain a true result. This can vary with system load.
+[[gnu::noinline]] static LIBC_INLINE uint64_t overhead() {
+  gpu::memory_fence();
+  uint64_t start = gpu::processor_clock();
+  uint32_t result = 0.0;
+  asm("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);
+  asm("" ::"s"(start));
+  uint64_t stop = gpu::processor_clock();
+  return stop - start;
+}
+
+// Profile a simple function and obtain its latency in clock cycles on the
+// system. This function cannot be inlined or else it will disturb the very
+// delicate balance of hard-coded dependencies.
+template <typename F, typename T>
+[[gnu::noinline]] static LIBC_INLINE uint64_t latency(F f, T t) {
+  // We need to store the input somewhere to guarantee that the compiler will
+  // not constant propagate it and remove the profiling region.
+  volatile uint32_t storage = t;
+  float arg = storage;
+  asm("" ::"s"(arg));
+
+  // The AMDGPU architecture needs to wait on pending results.
+  gpu::memory_fence();
+  // Get the current timestamp from the clock.
+  uint64_t start = gpu::processor_clock();
+
+  // This forces the compiler to load the input argument and run the clock cycle
+  // counter before the profiling region.
+  asm("" ::"s"(arg), "s"(start));
+
+  // Run the function under test and return its value.
+  auto result = f(arg);
+
+  // This inline assembly performs a no-op which forces the result to both be
+  // used and prevents us from exiting this region before it's complete.
+  asm("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);
+
+  // Obtain the current timestamp after running the calculation and force
+  // ordering.
+  uint64_t stop = gpu::processor_clock();
+  asm("" ::"s"(stop));
+  gpu::memory_fence();
+
+  // Return the time elapsed.
+  return stop - start;
+}
+
+} // namespace LIBC_NAMESPACE
+
+#endif // LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
diff --git a/libc/benchmarks/gpu/timing/timing.h b/libc/benchmarks/gpu/timing/timing.h
index 180ea77954ae5..2e098feb4b3a5 100644
--- a/libc/benchmarks/gpu/timing/timing.h
+++ b/libc/benchmarks/gpu/timing/timing.h
@@ -12,7 +12,7 @@
 #include "src/__support/macros/properties/architectures.h"
 
 #if defined(LIBC_TARGET_ARCH_IS_AMDGPU)
-#error "amdgpu not yet supported"
+#include "amdgpu/timing.h"
 #elif defined(LIBC_TARGET_ARCH_IS_NVPTX)
 #include "nvptx/timing.h"
 #else

// We need to store the input somewhere to guarantee that the compiler will
// not constant propagate it and remove the profiling region.
volatile uint32_t storage = t;
float arg = storage;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be T, not float.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we need versions with more than one input, right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many do you think we need? I think I remember you mentioned 6 inputs at max - is that still correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a guess, for now at least two, we can add more as needed. Maybe I just don't know enough C++ magic to automate it since the inputs can be potentially different types. (Plus all that C++ magic needs to exist in the libc project)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there problems with trying to link when it's generic? I didn't notice before but trying to use T causes lld to throw errors such as error: couldn't allocate input reg for constraint 'r'(And for the s asm input constraint) when I switched from uint64_t to T

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific constraint that generics need to use, I tried a couple like v and Sg but lld still threw errors - am I missing a step that we need to work with generics since the size isn't known for sure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never use the r constraint. We should just error on it.

GCC went and added a bunch of constraints on their own. We don't support all of the ones there.

What do you mean by generics? The constraints don't care what the type is (other than maybe for the weird immediate cases), they just need to be scalar or vector. The SIMD analog doesn't actually work, these shouldn't be thought of as SIMD vectors vs. scalars

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the short answer is 99% of the time you want "v". There's definitely a lot of weirdness so it's not exactly, VGPRs are more like a big block of resource that can be assigned. I think you can even unassign a VGPR during execution? I should probably find a concise way to explain the magic of SIMT and hardware.

Copy link
Contributor Author

@jameshu15869 jameshu15869 Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by generics? The constraints don't care what the type is (other than maybe for the weird immediate cases), they just need to be scalar or vector. The SIMD analog doesn't actually work, these shouldn't be thought of as SIMD vectors vs. scalars

Ah, I was mixing this up with something else. I think I meant to ask if there are certain conditions that need to be met for these assembly constraints when using templates. lld throws an error that it can't allocate the input register for the v constraint for my templated local. Is it because the compiler doesn't know how much space the variable could take up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be a legal, handled type which is about it

@jhuber6 jhuber6 merged commit eb66e31 into llvm:main Jul 10, 2024
6 checks passed
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
PR for adding AMDGPU timing utils for benchmarking.

I was not able to test this code since I do not have an AMD GPU, but I
was able to successfully compile this code using
-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a
-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo
-DRUNTIMES_amdgcn_amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a to
force the code to compile without having an AMD gpu on my machine.

@jhuber6
@jameshu15869 jameshu15869 deleted the amdgpu-profiling branch July 14, 2024 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants