[NVPTX] Add intrinsics for prefetch.* #125887

abhilash1910 · 2025-02-05T17:16:38Z

[NVPTX] Add Prefetch intrinsics

This PR adds prefetch intrinsics with the relevant eviction priorities.

Lit tests are added as part of prefetch.ll
The generated PTX is verified with a 12.3 ptxas executable.
Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>_.

llvmbot · 2025-02-05T17:17:29Z

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Abhilash Majumder (abhilash1910)

Changes

[NVPTX] Add Prefetch intrinsics

This PR adds prefetch intrinsics with the relevant eviction priorities.

Lit tests are added as part of prefetch.ll
The generated PTX is verified with a 12.3 ptxas executable.
Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>_.

Full diff: https://github.com/llvm/llvm-project/pull/125887.diff

4 Files Affected:

(modified) llvm/docs/NVPTXUsage.rst (+41)
(modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+25)
(modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+25)
(added) llvm/test/CodeGen/NVPTX/prefetch.ll (+82)

diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index dec6ad4e541152a..31602a8c4981157 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -581,6 +581,47 @@ prefetched in terms of bytes and it must be a multiple of 16.
 For more information, refer PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch>`_.
 
+'``llvm.nvvm.prefetch.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void  @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+  declare void  @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+  
+  declare void  @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+  declare void  @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+  declare void  @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+  declare void  @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+  
+  declare void  @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+  declare void  @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+  
+  declare void  @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.prefetch.*``' and '``@llvm.nvvm.prefetchu.*``' intrinsic
+correspond to the '``prefetch.*``;' and '``prefetchu.*``' family of PTX instructions. 
+The '``prefetch.*``' instructions bring the cache line containing the
+specified address in global or local memory address space into the 
+specified cache level (L1 or L2). The '`prefetchu.*``' instruction brings the cache line 
+containing the specified generic address into the specified uniform cache level.
+If no address space is specified, it is assumed to be generic address. The intrinsic 
+uses and eviction priority which can be accessed by the '``.level::eviction_priority``' modifier. 
+
+
+* A prefetch to a shared memory location performs no operation.
+* A prefetch into the uniform cache requires a generic address, 
+  and no operation occurs if the address maps to a const, local, or shared memory location.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.
+
 '``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index abbe25bf0040a6f..e9504b0ef5a2f20 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -48,6 +48,7 @@
 
 def llvm_global_ptr_ty  : LLVMQualPointerType<1>;  // (global)ptr
 def llvm_shared_ptr_ty  : LLVMQualPointerType<3>;  // (shared)ptr
+def llvm_local_ptr_ty   : LLVMQualPointerType<5>;  // (local)ptr
 def llvm_tmem_ptr_ty    : LLVMQualPointerType<6>;  // (tensor memory)ptr
 
 //
@@ -4999,6 +5000,30 @@ foreach dim = [1, 2, 3, 4, 5] in {
   }
 }
 
+// Intrinsics for Prefetch and Prefetchu
+foreach addr = ["global", "local", ""] in {
+  foreach evict = !if(!eq(addr, "global"),
+                  ["evictlast", "evictnormal"],
+                  ["evictnormal"]) in {
+    foreach level = ["L1", "L2"] in {
+      def int_nvvm_prefetch_ # !if(!eq(addr, ""), "", addr # "_") 
+          # level # "_" # evict : Intrinsic<[], 
+                                            !cond(
+                                            !eq(addr, "global") : [llvm_global_ptr_ty],
+                                            !eq(addr, "local"): [llvm_local_ptr_ty],
+                                            !eq(addr, ""): [llvm_ptr_ty]),
+                                            [IntrArgMemOnly,ReadOnly<ArgIndex<0>>,
+                                             NoCapture<ArgIndex<0>>],
+                                            "llvm.nvvm.prefetch." # !if(!eq(addr, ""), "", addr                                                                                                     # ".")# level # "." # evict>;   
+    }
+  }
+}
+
+def int_nvvm_prefetchu_L1_evictnormal : Intrinsic<[], [llvm_ptr_ty],
+  [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>],
+  "llvm.nvvm.prefetchu.L1.evictnormal">;
+
+
 // Intrinsics for Bulk Copy using TMA (non-tensor)
 // From Global to Shared Cluster
 def int_nvvm_cp_async_bulk_global_to_shared_cluster
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index a0d00e4aac560a5..69f9cb3ff0c87b8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -740,6 +740,31 @@ foreach dim = [1, 2, 3, 4, 5] in {
   }
 }
 
+//Prefetch and Prefetchu 
+class PREFETCH_INTRS<string InstName, Intrinsic Intrin> :
+          NVPTXInst<(outs), (ins Int64Regs:$addr),
+            !strconcat(InstName, " [$addr];"),
+            [(Intrin i64:$addr)]>,
+          Requires<[hasPTX<80>, hasSM<90>]>;
+   
+
+// Only global supports evictlast and evictnormal.
+// Other variants (local and default) only support evictnormal
+foreach addr = ["global", "local", ""] in {
+  foreach evict = !if(!eq(addr, "global"),
+                          ["evictlast", "evictnormal"],
+                          ["evictnormal"]) in {
+    foreach level = ["L1", "L2"] in {
+      def PREFETCH_# addr # level # "_" # evict : PREFETCH_INTRS<
+                                                 "prefetch." # !if(!eq(addr, ""), "", addr # ".") # level # "." # evict,
+                                                 !cast<Intrinsic>
+                                                 ("int_nvvm_prefetch_"# !if(!eq(addr, ""), "", addr # "_") # level # "_" # evict)>;
+    }
+  }
+}
+
+def PREFETCHU_L1_EVICTNORMAL : PREFETCH_INTRS<"prefetchu.L1.evictnormal", !cast<Intrinsic>("int_nvvm_prefetchu_L1_evictnormal")>;
+
 //-----------------------------------
 // MBarrier Functions
 //-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/prefetch.ll b/llvm/test/CodeGen/NVPTX/prefetch.ll
new file mode 100644
index 000000000000000..0576a737d69a242
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/prefetch.ll
@@ -0,0 +1,82 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: %if ptxas-12.3 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-12.3 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void  @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+declare void  @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+
+declare void  @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+declare void  @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+declare void  @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+declare void  @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+
+declare void  @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+declare void  @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+
+declare void  @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+
+define void @prefetch_local(ptr addrspace(5) %localPtr) {
+; CHECK-PTX64-LABEL: prefetch_local(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetch_local_param_0];
+; CHECK-PTX64-NEXT:    prefetch.local.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.local.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+  tail call void @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+  ret void
+}
+
+define void @prefetch_global(ptr addrspace(1) %globalPtr) {
+; CHECK-PTX64-LABEL: prefetch_global(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetch_global_param_0];
+; CHECK-PTX64-NEXT:    prefetch.global.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.global.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.global.L1.evictlast [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.global.L2.evictlast [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+  tail call void @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+  tail call void @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+  tail call void @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+  ret void
+}
+
+
+define void @prefetch_(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetch_(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetch__param_0];
+; CHECK-PTX64-NEXT:    prefetch.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+  tail call void @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+  ret void
+}
+
+define void @prefetchu_l1(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetchu_l1(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetchu_l1_param_0];
+; CHECK-PTX64-NEXT:    prefetchu.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+  ret void
+}
\ No newline at end of file

llvm/include/llvm/IR/IntrinsicsNVVM.td

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

llvm/docs/NVPTXUsage.rst

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

llvm/test/CodeGen/NVPTX/prefetch.ll

durga4github

The latest revision LGTM

llvm/include/llvm/IR/IntrinsicsNVVM.td

durga4github · 2025-02-11T08:54:32Z

Merging it as per offline request,

\[NVPTX\] Add Prefetch intrinsics This PR adds prefetch intrinsics with the relevant eviction priorities. * Lit tests are added as part of prefetch.ll * The generated PTX is verified with a 12.3 ptxas executable. * Added docs for these intrinsics in NVPTXUsage.rst. For more information, refer PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_. --------- Co-authored-by: abmajumder <[email protected]>

metaflow · 2025-02-12T07:05:39Z

Hi @abhilash1910, the new test fails for me as

ptxas warning : incompatible redefinition for option 'gpu-name', the last value of this option was used
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 19; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 20; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 33; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 34; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 35; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 36; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 49; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 50; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 63; error   : Unknown modifier '.evictnormal'
ptxas fatal   : Ptx assembly aborted due to errors

ptxas --version is

ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:14:54_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

I have attached prefetch.txt generated by llc

Running as cat prefetch.txt | ptxas -arch=sm_60 -c - -arch=sm_90

What might be the issue?

durga4github · 2025-02-12T09:13:58Z

Hi @abhilash1910, the new test fails for me as

ptxas warning : incompatible redefinition for option 'gpu-name', the last value of this option was used
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 19; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 20; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 33; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 34; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 35; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 36; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 49; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 50; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 63; error   : Unknown modifier '.evictnormal'
ptxas fatal   : Ptx assembly aborted due to errors

ptxas --version is

ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:14:54_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

I have attached prefetch.txt generated by llc

Running as cat prefetch.txt | ptxas -arch=sm_60 -c - -arch=sm_90

What might be the issue?

Yes, I could repro the issue locally. Thank you for reporting!

At first sight, all the evict qualifiers need an underscore '_' to match the Spec.
will investigate further and update..

abhilash1910 · 2025-02-12T10:04:54Z

Yes , we are working on currently fixing this ASAP. Thanks for raising the issue.

Fix for this issue is addressed in this PR : #126899

This is follow-up PR from #125887 which fixes the intrinsic failures . --------- Co-authored-by: abmajumder <[email protected]>

\[NVPTX\] Add Prefetch intrinsics This PR adds prefetch intrinsics with the relevant eviction priorities. * Lit tests are added as part of prefetch.ll * The generated PTX is verified with a 12.3 ptxas executable. * Added docs for these intrinsics in NVPTXUsage.rst. For more information, refer PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_. --------- Co-authored-by: abmajumder <[email protected]>

This is follow-up PR from llvm#125887 which fixes the intrinsic failures . --------- Co-authored-by: abmajumder <[email protected]>

\[NVPTX\] Add Prefetch intrinsics This PR adds prefetch intrinsics with the relevant eviction priorities. * Lit tests are added as part of prefetch.ll * The generated PTX is verified with a 12.3 ptxas executable. * Added docs for these intrinsics in NVPTXUsage.rst. For more information, refer PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_. --------- Co-authored-by: abmajumder <[email protected]>

This is follow-up PR from llvm#125887 which fixes the intrinsic failures . --------- Co-authored-by: abmajumder <[email protected]>

add prefetch intrinsics

77420f6

llvmbot added backend:NVPTX llvm:ir labels Feb 5, 2025