
[NVPTX] Add Intrinsics for applypriority.* #127989

Merged · 7 commits · Feb 27, 2025

24 changes: 24 additions & 0 deletions llvm/docs/NVPTXUsage.rst
@@ -630,6 +630,30 @@ uses and eviction priority which can be accessed by the '``.level::eviction_priority``'
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.

'``llvm.nvvm.applypriority.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.applypriority.global.L2.evict.normal(ptr addrspace(1) %global_ptr, i64 %size)
declare void @llvm.nvvm.applypriority.L2.evict.normal(ptr %ptr, i64 %size)

Overview:
"""""""""

The '``@llvm.nvvm.applypriority.*``' intrinsics apply the cache eviction priority specified by the
``.level::eviction_priority`` qualifier to the address range ``[a..a+size)`` in the specified cache
level. If no state space is specified, Generic Addressing is used. If the specified address does
not fall within the address window of the ``.global`` state space, the behavior is undefined.
The ``size`` operand is an integer constant that specifies the amount of data, in bytes, in the
specified cache level on which the priority is applied. The only supported value for the ``size``
operand is 128.
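
For example (a minimal illustrative sketch; ``%gptr`` and ``%p`` are placeholder
values), the eviction priority can be applied to 128 bytes starting at a pointer:

.. code-block:: llvm

  call void @llvm.nvvm.applypriority.global.L2.evict.normal(ptr addrspace(1) %gptr, i64 128)
  call void @llvm.nvvm.applypriority.L2.evict.normal(ptr %p, i64 128)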

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-applypriority>`_.

'``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

10 changes: 10 additions & 0 deletions llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5043,6 +5043,16 @@ def int_nvvm_prefetch_global_L2_evict_last: Intrinsic<[], [llvm_global_ptr_ty],
def int_nvvm_prefetchu_L1 : Intrinsic<[], [llvm_ptr_ty],
[IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>]>;

def int_nvvm_applypriority_global_L2_evict_normal
: Intrinsic<[], [llvm_global_ptr_ty, llvm_i64_ty],
[IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>,
ImmArg<ArgIndex<1>>]>;

def int_nvvm_applypriority_L2_evict_normal
: Intrinsic<[], [llvm_ptr_ty, llvm_i64_ty],
[IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>,
ImmArg<ArgIndex<1>>]>;
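// Note: the size operand is marked ImmArg because, per the PTX ISA, applypriority
// only accepts an integer constant size, and 128 is the only supported value.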


// Intrinsics for Bulk Copy using TMA (non-tensor)
// From Global to Shared Cluster
11 changes: 11 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -789,6 +789,17 @@ def PREFETCH_GLOBAL_L2_EVICT_LAST : NVPTXInst<(outs), (ins Int64Regs:$addr),

def PREFETCHU_L1 : PREFETCH_INTRS<"prefetchu.L1">;

// Applypriority intrinsics
class APPLYPRIORITY_L2_INTRS<string InstName> :
NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$size),
InstName # " [$addr], $size;",
[(!cast<Intrinsic>("int_nvvm_" # !subst("::", "_", !subst(".", "_", InstName)))
i64:$addr, i64:$size)]>,
Requires<[hasPTX<74>, hasSM<80>]>;

def APPLYPRIORITY_L2_EVICT_NORMAL : APPLYPRIORITY_L2_INTRS<"applypriority.L2::evict_normal">;
def APPLYPRIORITY_GLOBAL_L2_EVICT_NORMAL : APPLYPRIORITY_L2_INTRS<"applypriority.global.L2::evict_normal">;
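// The !subst chain above rewrites the instruction string into the matching
// intrinsic name, e.g. "applypriority.global.L2::evict_normal" selects
// int_nvvm_applypriority_global_L2_evict_normal.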

//-----------------------------------
// MBarrier Functions
//-----------------------------------
34 changes: 34 additions & 0 deletions llvm/test/CodeGen/NVPTX/applypriority.ll
@@ -0,0 +1,34 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx74 | FileCheck --check-prefixes=CHECK-PTX64 %s
; RUN: %if ptxas-11.4 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx74 | %ptxas-verify -arch=sm_80 %}

target triple = "nvptx64-nvidia-cuda"

declare void @llvm.nvvm.applypriority.global.L2.evict.normal(ptr addrspace(1) %global_ptr, i64 %size)
declare void @llvm.nvvm.applypriority.L2.evict.normal(ptr %ptr, i64 %size)

define void @applypriority_global_L2(ptr addrspace(1) %global_ptr, i64 %size) {
; CHECK-PTX64-LABEL: applypriority_global_L2(
; CHECK-PTX64: {
; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
; CHECK-PTX64-EMPTY:
; CHECK-PTX64-NEXT: // %bb.0:
; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [applypriority_global_L2_param_0];
; CHECK-PTX64-NEXT: applypriority.global.L2::evict_normal [%rd1], 128;
; CHECK-PTX64-NEXT: ret;
tail call void @llvm.nvvm.applypriority.global.L2.evict.normal(ptr addrspace(1) %global_ptr, i64 128)
ret void
}

define void @applypriority_L2(ptr %ptr, i64 %size) {
; CHECK-PTX64-LABEL: applypriority_L2(
; CHECK-PTX64: {
; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
; CHECK-PTX64-EMPTY:
; CHECK-PTX64-NEXT: // %bb.0:
; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [applypriority_L2_param_0];
; CHECK-PTX64-NEXT: applypriority.L2::evict_normal [%rd1], 128;
; CHECK-PTX64-NEXT: ret;
tail call void @llvm.nvvm.applypriority.L2.evict.normal(ptr %ptr, i64 128)
ret void
}