Skip to content

[libc] Add Timing Utils for AMDGPU #96828

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 10, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions libc/benchmarks/gpu/timing/amdgpu/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
add_header_library(
amdgpu_timing
HDRS
timing.h
DEPENDS
libc.src.__support.common
)
73 changes: 73 additions & 0 deletions libc/benchmarks/gpu/timing/amdgpu/timing.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
//===------------- AMDGPU implementation of timing utils --------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
#define LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU

#include "src/__support/GPU/utils.h"
#include "src/__support/common.h"
#include "src/__support/macros/attributes.h"
#include "src/__support/macros/config.h"

#include <stdint.h>

namespace LIBC_NAMESPACE {

// Returns the overhead associated with calling the profiling region. This
// allows us to substract the constant-time overhead from the latency to
// obtain a true result. This can vary with system load.
[[gnu::noinline]] static LIBC_INLINE uint64_t overhead() {
gpu::memory_fence();
uint64_t start = gpu::processor_clock();
uint32_t result = 0.0;
asm("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);
asm("" ::"s"(start));
uint64_t stop = gpu::processor_clock();
return stop - start;
}

// Profile a simple function and obtain its latency in clock cycles on the
// system. This function cannot be inlined or else it will disturb the very
// delicate balance of hard-coded dependencies.
template <typename F, typename T>
[[gnu::noinline]] static LIBC_INLINE uint64_t latency(F f, T t) {
// We need to store the input somewhere to guarantee that the compiler will
// not constant propagate it and remove the profiling region.
volatile uint32_t storage = t;
float arg = storage;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be T, not float.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we need versions with more than one input, right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many do you think we need? I think I remember you mentioned 6 inputs at max - is that still correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a guess, for now at least two, we can add more as needed. Maybe I just don't know enough C++ magic to automate it since the inputs can be potentially different types. (Plus all that C++ magic needs to exist in the libc project)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there problems with trying to link when it's generic? I didn't notice before but trying to use T causes lld to throw errors such as error: couldn't allocate input reg for constraint 'r'(And for the s asm input constraint) when I switched from uint64_t to T

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific constraint that generics need to use, I tried a couple like v and Sg but lld still threw errors - am I missing a step that we need to work with generics since the size isn't known for sure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never use the r constraint. We should just error on it.

GCC went and added a bunch of constraints on their own. We don't support all of the ones there.

What do you mean by generics? The constraints don't care what the type is (other than maybe for the weird immediate cases), they just need to be scalar or vector. The SIMD analog doesn't actually work, these shouldn't be thought of as SIMD vectors vs. scalars

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the short answer is 99% of the time you want "v". There's definitely a lot of weirdness so it's not exactly, VGPRs are more like a big block of resource that can be assigned. I think you can even unassign a VGPR during execution? I should probably find a concise way to explain the magic of SIMT and hardware.

Copy link
Contributor Author

@jameshu15869 jameshu15869 Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by generics? The constraints don't care what the type is (other than maybe for the weird immediate cases), they just need to be scalar or vector. The SIMD analog doesn't actually work, these shouldn't be thought of as SIMD vectors vs. scalars

Ah, I was mixing this up with something else. I think I meant to ask if there are certain conditions that need to be met for these assembly constraints when using templates. lld throws an error that it can't allocate the input register for the v constraint for my templated local. Is it because the compiler doesn't know how much space the variable could take up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be a legal, handled type which is about it

asm("" ::"s"(arg));

// The AMDGPU architecture needs to wait on pending results.
gpu::memory_fence();
// Get the current timestamp from the clock.
uint64_t start = gpu::processor_clock();

// This forces the compiler to load the input argument and run the clock cycle
// counter before the profiling region.
asm("" ::"s"(arg), "s"(start));

// Run the function under test and return its value.
auto result = f(arg);

// This inline assembly performs a no-op which forces the result to both be
// used and prevents us from exiting this region before it's complete.
asm("v_or_b32 %[v_reg], 0, %[v_reg]\n" ::[v_reg] "v"(result) :);

// Obtain the current timestamp after running the calculation and force
// ordering.
uint64_t stop = gpu::processor_clock();
asm("" ::"s"(stop));
gpu::memory_fence();

// Return the time elapsed.
return stop - start;
}

} // namespace LIBC_NAMESPACE

#endif // LLVM_LIBC_UTILS_GPU_TIMING_AMDGPU
2 changes: 1 addition & 1 deletion libc/benchmarks/gpu/timing/timing.h
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
#include "src/__support/macros/properties/architectures.h"

#if defined(LIBC_TARGET_ARCH_IS_AMDGPU)
#error "amdgpu not yet supported"
#include "amdgpu/timing.h"
#elif defined(LIBC_TARGET_ARCH_IS_NVPTX)
#include "nvptx/timing.h"
#else
Expand Down
Loading