Enable .ptr .global .align attributes for kernel attributes for CUDA #114874

Merged Nov 15, 2024 (18 commits). Diff below shows changes from 15 commits.
48 changes: 28 additions & 20 deletions llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp
@@ -1600,29 +1600,37 @@ void NVPTXAsmPrinter::emitFunctionParamList(const Function *F, raw_ostream &O) {

if (isKernelFunc) {
if (PTy) {
// Special handling for pointer arguments to kernel
O << "\t.param .u" << PTySizeInBits << " ";

if (static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() !=
NVPTX::CUDA) {
int addrSpace = PTy->getAddressSpace();
switch (addrSpace) {
default:
O << ".ptr ";
break;
case ADDRESS_SPACE_CONST:
O << ".ptr .const ";
break;
case ADDRESS_SPACE_SHARED:
O << ".ptr .shared ";
break;
case ADDRESS_SPACE_GLOBAL:
O << ".ptr .global ";
break;
}
Align ParamAlign = I->getParamAlign().valueOrOne();
O << ".align " << ParamAlign.value() << " ";
int addrSpace = PTy->getAddressSpace();
const bool IsCUDA =
static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() ==
NVPTX::CUDA;

O << ".ptr ";
switch (addrSpace) {
default:
// Special handling for pointer arguments to kernel
// CUDA kernels assume that pointers are in global address space
// See:
// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parameter-state-space
if (IsCUDA)
O << " .global ";
Member:
This is likely just reflective of my own ignorance, but I'm still a bit confused as to why we should add ".global" to generic pointers in CUDA?

Contributor:
I'm inclined to say we should always honor the LLVM address space. If the front-end wants to assume a pointer is in the global space, it can use addrspace(1) on the function argument.
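
(Editor's illustration of that suggestion: a minimal IR sketch with a hypothetical kernel name and alignment; the expected .param line follows the pattern of the tests added in this patch.)

    ; Hypothetical kernel: the front-end commits to the global space by
    ; declaring the argument as ptr addrspace(1) rather than a generic ptr.
    target triple = "nvptx64-nvidia-cuda"

    ; Expected parameter declaration, following the tests in this patch:
    ; CHECK: .param .u64 .ptr .global .align 8 global_arg_param_0
    define void @global_arg(ptr addrspace(1) align 8 %p) {
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = !{ptr @global_arg, !"kernel", i32 1}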

Contributor Author:
The original patch which gave us perf improvements did so by assuming all pointers used as kernel args are .global.

However, I have been struggling to reproduce that case locally myself, and as far as I can tell from the original discussions, it might only be the missing .align info that was needed anyway (as the address space can be inferred from things like cvta.to.global.u64 and ld.global.nc already).

I agree that it is much cleaner to just honour the LLVM address space, rather than making undocumented and potentially invalid assumptions. If there turn out to be strong perf reasons for assuming .global for generic pointers later on, we should probably document this implicit assumption better somewhere.

I've now also added support for emitting .local here, as the PTX spec says this is technically allowed here too: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr
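
(Editor's illustration of the .local case mentioned above; a hypothetical test, not part of the diff shown at this commit, which predates that change. addrspace(5) is NVPTX's local address space.)

    ; Hypothetical sketch of the .local handling described in the comment
    ; above; the switch in the diff at this commit does not handle it yet.
    target triple = "nvptx64-nvidia-cuda"

    ; Expected, along the lines of the other state spaces:
    ; CHECK: .param .u64 .ptr .local .align 4 local_arg_param_0
    define void @local_arg(ptr addrspace(5) align 4 %p) {
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = !{ptr @local_arg, !"kernel", i32 1}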

break;
case ADDRESS_SPACE_CONST:
O << " .const ";
break;
case ADDRESS_SPACE_SHARED:
O << " .shared ";
break;
case ADDRESS_SPACE_GLOBAL:
O << " .global ";
break;
}

Align ParamAlign = I->getParamAlign().valueOrOne();
if (ParamAlign != 1 || !IsCUDA)
Member:
If the alignment is explicitly set to 1 wouldn't we still want to emit the directive?

Contributor:
If an explicit alignment is specified in the IR, it should be propagated to PTX. The default alignment in PTX is 4 bytes; we need to preserve 1- and 2-byte alignment to maintain correctness.
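
(Editor's illustration of why under-alignment must be preserved; the kernel below is hypothetical.)

    ; Hypothetical 2-byte-aligned argument: dropping .align here would let
    ; PTX fall back to its 4-byte default, which over-states the alignment.
    target triple = "nvptx64-nvidia-cuda"

    ; Expected: the explicit IR alignment is propagated to the .param line:
    ; CHECK: .param .u64 .ptr .global .align 2 store_u16_param_0
    define void @store_u16(ptr addrspace(1) align 2 %dst, i16 %v) {
      store i16 %v, ptr addrspace(1) %dst, align 2
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = !{ptr @store_u16, !"kernel", i32 1}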

Contributor Author:
I agree that makes more sense. I wasn't sure whether there was a good reason for avoiding emitting .align 1 in the original patch, so I kept that behaviour intact.

I've changed the code to emit all explicitly specified alignments on CUDA now.

Do we still need to preserve the CUDA vs OpenCL behavioural difference (emitting .align 1 by default for CL, and omitting the .align specifier by default for CUDA when no alignment is specified), or can both be changed to simply not emit the alignment when it is not specified?

Contributor Author:
Discussion of emitting 1 by default for CL is on the review here: https://reviews.llvm.org/D118894
Clang should emit explicit alignment anyway, so this mostly only affects other front-ends like XLA or (potentially) Julia. It may make sense to keep that behaviour for CL for maximum compatibility, but choose the potentially faster default alignment on CUDA.

Member:
+1 for emitting alignment when it's known. If it turns out to be not optimal, that would be a separate issue to fix.

Contributor Author (@LewisCrawford), Nov 7, 2024:
The latest version is:

      const bool IsCUDA =
          static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() ==
          NVPTX::CUDA;

      MaybeAlign ParamAlign = I->getParamAlign();
      if (ParamAlign.has_value() || !IsCUDA)
        O << ".align " << ParamAlign.valueOrOne().value() << " ";

So, it always emits the alignment when known.
When unknown, OpenCL defaults to 1, and CUDA defaults to "do not emit align info" (implicitly assume 4 according to the PTX spec: https://docs.nvidia.com/cuda/parallel-thread-execution/#kernel-function-parameter-attributes ).
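
(Editor's sketch of the behaviour described in this comment, i.e. the intermediate version that is revised again further down the thread; the kernel is hypothetical, and the two expectations contrast the driver interfaces.)

    ; Hypothetical unannotated pointer argument under the intermediate
    ; behaviour described above (no IR alignment given):
    ;   CUDA   (nvptx64-nvidia-cuda): .param .u64 .ptr .global          no_hint_param_0
    ;   OpenCL (nvptx64-nvidia-nvcl): .param .u64 .ptr .global .align 1 no_hint_param_0
    define void @no_hint(ptr addrspace(1) %p) {
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = !{ptr @no_hint, !"kernel", i32 1}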

Member:
I'm a little unsure about why it is safe to implicitly assume 4 for CUDA. What are the semantics for LLVM IR when no alignment is specified? Shouldn't we faithfully translate these into PTX?

Contributor:
IMO, we should play it safe and always emit the alignment. Even when it's not specified in LLVM IR, it defaults to the ABI alignment. I don't see a reason why we should deviate from that. If CL needs something different, it should be handled in the front-end with explicit alignment annotations.

Contributor Author (@LewisCrawford), Nov 8, 2024:
Ok, I've changed to always emit .align 1 if no explicit alignment is specified (for both CUDA and OpenCL).

O << " .align " << I->getParamAlign().valueOrOne().value();

The closest thing I've found in the langref to explaining the default behaviour is here: https://llvm.org/docs/LangRef.html#paramattrs

Note that align 1 has no effect on non-byval, non-preallocated arguments.

As far as I can tell, if "align 1 has no effect", that sounds like no-alignment specified == align 1 == no assumptions made about alignment.
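
(Editor's sketch of that final behaviour, per the comment above; the kernel is hypothetical. With no explicit alignment in the IR, both CUDA and OpenCL now get an explicit .align 1.)

    ; Hypothetical unannotated pointer argument under the final behaviour:
    ; no alignment in the IR now results in an explicit .align 1.
    target triple = "nvptx64-nvidia-cuda"

    ; Expected along the lines of:
    ; CHECK: .param .u64 .ptr .global .align 1 default_align_param_0
    define void @default_align(ptr addrspace(1) %p) {
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = !{ptr @default_align, !"kernel", i32 1}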

O << ".align " << ParamAlign.value() << " ";
O << TLI->getParamName(F, paramIndex);
continue;
}
2 changes: 1 addition & 1 deletion llvm/test/CodeGen/NVPTX/i1-param.ll
@@ -8,7 +8,7 @@ target triple = "nvptx-nvidia-cuda"

; CHECK: .entry foo
; CHECK: .param .u8 foo_param_0
; CHECK: .param .u64 foo_param_1
; CHECK: .param .u64 .ptr .global foo_param_1
define void @foo(i1 %p, ptr %out) {
%val = zext i1 %p to i32
store i32 %val, ptr %out
48 changes: 48 additions & 0 deletions llvm/test/CodeGen/NVPTX/kernel-param-align.ll
@@ -0,0 +1,48 @@
; RUN: llc < %s -march=nvptx64 -mcpu=sm_60 | FileCheck %s
; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_60 | %ptxas-verify %}

%struct.Large = type { [16 x double] }

; CHECK-LABEL: .entry func_align(
; CHECK: .param .u64 .ptr .global .align 16 func_align_param_0
; CHECK: .param .u64 .ptr .global .align 16 func_align_param_1
; CHECK: .param .u64 .ptr .global .align 16 func_align_param_2
; CHECK: .param .u64 .ptr .shared .align 16 func_align_param_3
; CHECK: .param .u64 .ptr .const .align 16 func_align_param_4
define void @func_align(ptr nocapture readonly align 16 %input,
ptr nocapture align 16 %out,
ptr addrspace(1) align 16 %global,
ptr addrspace(3) align 16 %shared,
ptr addrspace(4) align 16 %const) {
entry:
%0 = addrspacecast ptr %out to ptr addrspace(1)
%1 = addrspacecast ptr %input to ptr addrspace(1)
%getElem = getelementptr inbounds %struct.Large, ptr addrspace(1) %1, i64 0, i32 0, i64 5
%tmp2 = load i32, ptr addrspace(1) %getElem, align 8
store i32 %tmp2, ptr addrspace(1) %0, align 4
ret void
}

; CHECK-LABEL: .entry func_noalign(
; CHECK: .param .u64 .ptr .global func_noalign_param_0
; CHECK: .param .u64 .ptr .global func_noalign_param_1
; CHECK: .param .u64 .ptr .global func_noalign_param_2
; CHECK: .param .u64 .ptr .shared func_noalign_param_3
; CHECK: .param .u64 .ptr .const func_noalign_param_4
define void @func_noalign(ptr nocapture readonly %input,
ptr nocapture %out,
ptr addrspace(1) %global,
ptr addrspace(3) %shared,
ptr addrspace(4) %const) {
entry:
%0 = addrspacecast ptr %out to ptr addrspace(1)
%1 = addrspacecast ptr %input to ptr addrspace(1)
%getElem = getelementptr inbounds %struct.Large, ptr addrspace(1) %1, i64 0, i32 0, i64 5
%tmp2 = load i32, ptr addrspace(1) %getElem, align 8
store i32 %tmp2, ptr addrspace(1) %0, align 4
ret void
}

!nvvm.annotations = !{!0, !1}
!0 = !{ptr @func_align, !"kernel", i32 1}
!1 = !{ptr @func_noalign, !"kernel", i32 1}