-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[libc] Add Timing Utils for AMDGPU #96828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-libc Author: None (jameshu15869) ChangesPR for adding AMDGPU timing utils for benchmarking. I was not able to test this code since I do not have an AMD GPU, but I was able to successfully compile this code using -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo -DRUNTIMES_amdgcn_amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a to force the code to compile without having an AMD gpu on my machine. Full diff: https://github.com/llvm/llvm-project/pull/96828.diff 3 Files Affected:
diff --git a/libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt b/libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt
new file mode 100644
index 0000000000000..179429db9a09a
--- /dev/null
+++ b/libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt
@@ -0,0 +1,7 @@
+add_header_library(
+ amdgpu_timing
+ HDRS
+ timing.h
+ DEPENDS
+ libc.src.__support.common
+)
diff --git a/libc/benchmarks/gpu/timing/amdgpu/timing.h b/libc/benchmarks/gpu/timing/amdgpu/timing.h
new file mode 100644
index 0000000000000..3d13826ffee30
--- /dev/null
+++ b/libc/benchmarks/gpu/timing/amdgpu/timing.h
@@ -0,0 +1,73 @@
+//===------------- AMDGPU implementation of timing utils --------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
+#define LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
+
+#include "src/__support/GPU/utils.h"
+#include "src/__support/common.h"
+#include "src/__support/macros/attributes.h"
+#include "src/__support/macros/config.h"
+
+#include <stdint.h>
+
+namespace LIBC_NAMESPACE {
+
+// Returns the overhead associated with calling the profiling region. This
+// allows us to substract the constant-time overhead from the latency to
+// obtain a true result. This can vary with system load.
+[[gnu::noinline]] static LIBC_INLINE uint64_t overhead() {
+ gpu::memory_fence();
+ uint64_t start = gpu::processor_clock();
+ uint32_t result = 0.0;
+ asm("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);
+ asm("" ::"s"(start));
+ uint64_t stop = gpu::processor_clock();
+ return stop - start;
+}
+
+// Profile a simple function and obtain its latency in clock cycles on the
+// system. This function cannot be inlined or else it will disturb the very
+// delicate balance of hard-coded dependencies.
+template <typename F, typename T>
+[[gnu::noinline]] static LIBC_INLINE uint64_t latency(F f, T t) {
+ // We need to store the input somewhere to guarantee that the compiler will
+ // not constant propagate it and remove the profiling region.
+ volatile uint32_t storage = t;
+ float arg = storage;
+ asm("" ::"s"(arg));
+
+ // The AMDGPU architecture needs to wait on pending results.
+ gpu::memory_fence();
+ // Get the current timestamp from the clock.
+ uint64_t start = gpu::processor_clock();
+
+ // This forces the compiler to load the input argument and run the clock cycle
+ // counter before the profiling region.
+ asm("" ::"s"(arg), "s"(start));
+
+ // Run the function under test and return its value.
+ auto result = f(arg);
+
+ // This inline assembly performs a no-op which forces the result to both be
+ // used and prevents us from exiting this region before it's complete.
+ asm("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);
+
+ // Obtain the current timestamp after running the calculation and force
+ // ordering.
+ uint64_t stop = gpu::processor_clock();
+ asm("" ::"s"(stop));
+ gpu::memory_fence();
+
+ // Return the time elapsed.
+ return stop - start;
+}
+
+} // namespace LIBC_NAMESPACE
+
+#endif // LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
diff --git a/libc/benchmarks/gpu/timing/timing.h b/libc/benchmarks/gpu/timing/timing.h
index 180ea77954ae5..2e098feb4b3a5 100644
--- a/libc/benchmarks/gpu/timing/timing.h
+++ b/libc/benchmarks/gpu/timing/timing.h
@@ -12,7 +12,7 @@
#include "src/__support/macros/properties/architectures.h"
#if defined(LIBC_TARGET_ARCH_IS_AMDGPU)
-#error "amdgpu not yet supported"
+#include "amdgpu/timing.h"
#elif defined(LIBC_TARGET_ARCH_IS_NVPTX)
#include "nvptx/timing.h"
#else
|
// We need to store the input somewhere to guarantee that the compiler will | ||
// not constant propagate it and remove the profiling region. | ||
volatile uint32_t storage = t; | ||
float arg = storage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be T
, not float
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we need versions with more than one input, right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many do you think we need? I think I remember you mentioned 6 inputs at max - is that still correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a guess, for now at least two, we can add more as needed. Maybe I just don't know enough C++ magic to automate it since the inputs can be potentially different types. (Plus all that C++ magic needs to exist in the libc
project)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there problems with trying to link when it's generic? I didn't notice before but trying to use T
causes lld
to throw errors such as error: couldn't allocate input reg for constraint 'r'
(And for the s
asm input constraint) when I switched from uint64_t
to T
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific constraint that generics need to use, I tried a couple like v
and Sg
but lld
still threw errors - am I missing a step that we need to work with generics since the size isn't known for sure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never use the r constraint. We should just error on it.
GCC went and added a bunch of constraints on their own. We don't support all of the ones there.
What do you mean by generics? The constraints don't care what the type is (other than maybe for the weird immediate cases), they just need to be scalar or vector. The SIMD analog doesn't actually work, these shouldn't be thought of as SIMD vectors vs. scalars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the short answer is 99% of the time you want "v"
. There's definitely a lot of weirdness so it's not exactly, VGPRs are more like a big block of resource that can be assigned. I think you can even unassign a VGPR during execution? I should probably find a concise way to explain the magic of SIMT and hardware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by generics? The constraints don't care what the type is (other than maybe for the weird immediate cases), they just need to be scalar or vector. The SIMD analog doesn't actually work, these shouldn't be thought of as SIMD vectors vs. scalars
Ah, I was mixing this up with something else. I think I meant to ask if there are certain conditions that need to be met for these assembly constraints when using templates. lld
throws an error that it can't allocate the input register for the v
constraint for my templated local. Is it because the compiler doesn't know how much space the variable could take up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be a legal, handled type which is about it
PR for adding AMDGPU timing utils for benchmarking. I was not able to test this code since I do not have an AMD GPU, but I was able to successfully compile this code using -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo -DRUNTIMES_amdgcn_amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a to force the code to compile without having an AMD gpu on my machine. @jhuber6
PR for adding AMDGPU timing utils for benchmarking.
I was not able to test this code since I do not have an AMD GPU, but I was able to successfully compile this code using
-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a -DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo -DRUNTIMES_amdgcn_amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a
to force the code to compile without having an AMD gpu on my machine.@jhuber6