Enable .ptr .global .align attributes for kernel attributes for CUDA #114874
Conversation
Clang formatting reverted.
Change the asmprinter so that it will not force .const or .shared pointers to be changed to .global ones if they are explicitly annotated with those address-spaces (even though they are not expected to be present). Unify the code-path for printing addrspace annotations for both CUDA and CL, and only coerce generic pointers into .global pointers on CUDA. Emit alignment info for both CL and CUDA, but omit it on CUDA if it is not explicitly supplied. Update tests to have both aligned and unaligned pointers in all relevant addrspaces.
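For illustration, a rough sketch (not one of the tests in the patch; the function and parameter names are invented, and a 64-bit target is assumed) of how explicitly aligned kernel pointer args in different address spaces should map to PTX param annotations with this change:

```llvm
; Sketch only: explicitly aligned kernel pointer params in several address spaces.
define ptx_kernel void @example(ptr align 8 %gen,
                                ptr addrspace(1) align 16 %glob,
                                ptr addrspace(3) align 4 %shmem) {
  ret void
}
; Expected param annotations, roughly:
;   .param .u64 .ptr .align 8 example_param_0            (generic pointer)
;   .param .u64 .ptr .global .align 16 example_param_1   (addrspace(1))
;   .param .u64 .ptr .shared .align 4 example_param_2    (addrspace(3), honoured rather than coerced to .global)
```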
✅ With the latest revision this PR passed the C/C++ code formatter.
Adding @jholewinski and @AlexMaclean as reviewers, as I've altered the original patch from #79646 quite a bit to address @Artem-B 's comments there to handle pointers explicitly annotated as shared/const more gracefully and add tests for them (even though CUDA assumes they do not occur). In the original patch for CUDA, only explicitly aligned kernel ptr args got ".ptr .global .align N" added, and all others were ignored (no additional attrs, just emit a .u64 or u32 param). In the version I've posted here:
Force-pushed from 07eb38e to aaade68.
Align ParamAlign = I->getParamAlign().valueOrOne();
if (ParamAlign != 1 || !IsCUDA)
If the alignment is explicitly set to 1, wouldn't we still want to emit the directive?
If an explicit alignment is specified in the IR, it should be propagated to PTX. The default alignment in PTX is 4 bytes; we need to preserve 1- and 2-byte alignment to maintain correctness.
I agree that makes more sense. I wasn't sure whether there was a good reason for avoiding emitting .align 1 in the original patch, so I kept that behaviour intact.
I've changed the code to emit all explicitly specified alignments on CUDA now.
Do we still need to preserve the CUDA vs OpenCL behavioural difference (emitting .align 1 by default for CL, and emitting no .align specifier by default for CUDA when no alignment is specified), or can both be changed to simply not emit the alignment when it is unspecified?
Discussion of emitting 1 by default for CL is on the review here: https://reviews.llvm.org/D118894
Clang should emit explicit alignment anyway, so this mostly affects other frontends such as XLA or Julia. It may make sense to keep that behaviour for CL for maximum compatibility, but to choose the potentially faster default alignment on CUDA.
+1 for emitting alignment when it's known. If it turns out to be not optimal, that would be a separate issue to fix.
The latest version is:
const bool IsCUDA =
    static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() == NVPTX::CUDA;
MaybeAlign ParamAlign = I->getParamAlign();
if (ParamAlign.has_value() || !IsCUDA)
  O << ".align " << ParamAlign.valueOrOne().value() << " ";
So, it always emits the alignment when known.
When unknown, OpenCL defaults to 1, and CUDA defaults to "do not emit align info" (implicitly assume 4 according to the PTX spec: https://docs.nvidia.com/cuda/parallel-thread-execution/#kernel-function-parameter-attributes ).
I'm a little unsure about why it is safe to implicitly assume 4 for CUDA. What are the semantics for LLVM IR when no alignment is specified? Shouldn't we faithfully translate these into PTX?
IMO, we should play it safe and always emit the alignment. Even when it's not specified in LLVM IR, it defaults to the ABI alignment. I don't see a reason why we should deviate from that. If CL needs something different, it should be handled in the front-end with explicit alignment annotations.
Ok, I've changed to always emit .align 1 if no explicit alignment is specified (for both CUDA and OpenCL).
O << " .align " << I->getParamAlign().valueOrOne().value();
The closest thing I've found in the langref to explaining the default behaviour is here: https://llvm.org/docs/LangRef.html#paramattrs
Note that align 1 has no effect on non-byval, non-preallocated arguments.
As far as I can tell, if "align 1 has no effect", that sounds like no-alignment specified == align 1 == no assumptions made about alignment.
// See:
// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parameter-state-space
if (IsCUDA)
  O << " .global ";
This is likely just reflective of my own ignorance, but I'm still a bit confused as to why we should add ".global" to generic pointers in CUDA?
I'm inclined to say we should always honor the LLVM address space. If the front-end wants to assume a pointer is in the global space, it can use addrspace(1) on the function argument.
The original patch which gave us perf improvements did so by assuming all pointers used as kernel args are .global.
However, I have been struggling to reproduce that case locally myself, and as far as I can tell from the original discussions, it might only be the missing .align info that was needed anyway (as the address space can be inferred from things like cvta.to.global.u64 and ld.global.nc already).
I agree that it is much cleaner to just honour the LLVM address space, rather than making undocumented and potentially invalid assumptions. If there turn out to be strong perf reasons for assuming .global for generic pointers later on, we should probably document this implicit assumption better somewhere.
I've now also added support for emitting .local here, as the PTX spec says this is technically allowed here too: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr
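For completeness, a sketch of the two remaining spaces (invented names; addrspace numbering per the NVPTX convention, where 4 is const and 5 is local; not one of the tests in the patch):

```llvm
; Sketch only: const and local address-space kernel pointer params.
define ptx_kernel void @spaces(ptr addrspace(4) align 8 %c,
                               ptr addrspace(5) align 4 %l) {
  ret void
}
; Expected param annotations, roughly:
;   .param .u64 .ptr .const .align 8 spaces_param_0
;   .param .u64 .ptr .local .align 4 spaces_param_1
```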
Allow explicitly defined kernel .ptr arg alignments of 1 to be specified. Avoid forcing .global annotations for generic pointers on CUDA. Allow .local pointers, since the PTX specification says they are legal here. Avoid outputting unnecessary extra spaces in PTX between .ptr and the memory space. Improve test-cases by removing unnecessary function bodies, and adding more varied alignments + address spaces.
Refactor to use emitPTXAddressSpace instead of duplicating the code in a separate switch statement for kernel .ptr args.
Change to explicitly emit .align 1 for both CL and CUDA if alignment is unspecified. In most cases, the frontend should already specify alignment, so this should ideally not change much in practice.
With the latest changes, this looks good to me. I'm not familiar enough with the semantics of llvm IR to be sure that the default alignment is 1, but that seems like a reasonable and conservative way to go. Please wait for an additional approval before landing.
Hello,

#include <cstdio>
#include <cuda.h>

#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    if (const CUresult res = call; res != CUDA_SUCCESS) {                 \
      const char *err;                                                    \
      cuGetErrorName(res, &err);                                          \
      std::fprintf(stderr, "Error %s:%d: %s\n", __FILE__, __LINE__, err); \
      std::exit(res);                                                     \
    }                                                                     \
  } while (0)

int main() {
  CUDA_CHECK(cuInit(0));
  CUdevice dev;
  CUDA_CHECK(cuDeviceGet(&dev, 0));
  CUcontext ctx;
  CUDA_CHECK(cuCtxCreate(&ctx, 0, dev));
  CUmodule m;
  CUfunction f;
  const char *ptx = R"PTX(
.version 7.8
.target sm_89
.address_size 64
.visible .entry kernel(
  .param .u64 .ptr .align 1 kernel_param_0,
  .param .u64 .ptr .global .align 1 kernel_param_2,
  .param .u64 .ptr .shared .align 1 kernel_param_3
){
  ret;
}
)PTX";
  CUDA_CHECK(cuModuleLoadData(&m, ptx));
  CUDA_CHECK(cuModuleGetFunction(&f, m, "kernel"));
  return 0;
}
If we take out the […]
I can't reproduce it. AFAICT, […]
ptxas also worked on my end. Can you try the reproducer and see if it loads the module at runtime?
This appears to be a limitation of the CUDA API. In CUDA, only […]. I should add some error-checking to the code here to make this clearer, and update the tests.
Here's a patch to add the extra error checking: #138706
The current issue is that PTX does not vectorise loads and stores that could be vectorized.
We noticed that we were missing vectorization for sin, cos, and power operations from LLVM, resulting in a smaller speedup. The reason is that we currently do not generate any .ptr and .align attributes for kernel parameters in CUDA, so the required alignment information is missing. This results in missed vectorization opportunities.
The change adds the .align attribute for alignment information and the .ptr attribute for pointer kernel parameters, under the assumption that all kernel parameter pointers point to the global memory space. This enables vectorization and boosts the speedup by ~2x.
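As a rough illustration of the motivating case (invented names, 64-bit target assumed; not code from the patch itself), a kernel whose pointer parameter carries an explicit alignment would get that information surfaced in its PTX signature:

```llvm
; Sketch only: a pointer kernel param with an explicit 16-byte alignment.
define ptx_kernel void @scale(ptr align 16 %data, float %a) {
  %v = load float, ptr %data, align 4
  %r = fmul float %v, %a
  store float %r, ptr %data, align 4
  ret void
}
; Previously the param was emitted as a plain ".param .u64 scale_param_0".
; With the described change it would instead be annotated, roughly, as:
;   .param .u64 .ptr .global .align 16 scale_param_0
; making the alignment (and the assumed global address space) visible so that
; wider, vectorized loads and stores become possible downstream.
```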
.align is emitted only when the pointer has an explicit alignment specifier.
PTX ISA doc - https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr
This is a rework of the original patch proposed in #79646