Skip to content

[NVPTX] Add intrinsics for prefetch.* #125887

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 11, 2025
Merged

Conversation

abhilash1910
Copy link
Contributor

[NVPTX] Add Prefetch intrinsics

This PR adds prefetch intrinsics with the relevant eviction priorities.

  • Lit tests are added as part of prefetch.ll
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>_.

@llvmbot
Copy link
Member

llvmbot commented Feb 5, 2025

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Abhilash Majumder (abhilash1910)

Changes

[NVPTX] Add Prefetch intrinsics

This PR adds prefetch intrinsics with the relevant eviction priorities.

  • Lit tests are added as part of prefetch.ll
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer PTX ISA &lt;https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu&gt;_.


Full diff: https://github.com/llvm/llvm-project/pull/125887.diff

4 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+41)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+25)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+25)
  • (added) llvm/test/CodeGen/NVPTX/prefetch.ll (+82)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index dec6ad4e541152a..31602a8c4981157 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -581,6 +581,47 @@ prefetched in terms of bytes and it must be a multiple of 16.
 For more information, refer PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch>`_.
 
+'``llvm.nvvm.prefetch.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void  @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+  declare void  @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+  
+  declare void  @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+  declare void  @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+  declare void  @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+  declare void  @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+  
+  declare void  @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+  declare void  @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+  
+  declare void  @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.prefetch.*``' and '``@llvm.nvvm.prefetchu.*``' intrinsic
+correspond to the '``prefetch.*``;' and '``prefetchu.*``' family of PTX instructions. 
+The '``prefetch.*``' instructions bring the cache line containing the
+specified address in global or local memory address space into the 
+specified cache level (L1 or L2). The '`prefetchu.*``' instruction brings the cache line 
+containing the specified generic address into the specified uniform cache level.
+If no address space is specified, it is assumed to be generic address. The intrinsic 
+uses and eviction priority which can be accessed by the '``.level::eviction_priority``' modifier. 
+
+
+* A prefetch to a shared memory location performs no operation.
+* A prefetch into the uniform cache requires a generic address, 
+  and no operation occurs if the address maps to a const, local, or shared memory location.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.
+
 '``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index abbe25bf0040a6f..e9504b0ef5a2f20 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -48,6 +48,7 @@
 
 def llvm_global_ptr_ty  : LLVMQualPointerType<1>;  // (global)ptr
 def llvm_shared_ptr_ty  : LLVMQualPointerType<3>;  // (shared)ptr
+def llvm_local_ptr_ty   : LLVMQualPointerType<5>;  // (local)ptr
 def llvm_tmem_ptr_ty    : LLVMQualPointerType<6>;  // (tensor memory)ptr
 
 //
@@ -4999,6 +5000,30 @@ foreach dim = [1, 2, 3, 4, 5] in {
   }
 }
 
+// Intrinsics for Prefetch and Prefetchu
+foreach addr = ["global", "local", ""] in {
+  foreach evict = !if(!eq(addr, "global"),
+                  ["evictlast", "evictnormal"],
+                  ["evictnormal"]) in {
+    foreach level = ["L1", "L2"] in {
+      def int_nvvm_prefetch_ # !if(!eq(addr, ""), "", addr # "_") 
+          # level # "_" # evict : Intrinsic<[], 
+                                            !cond(
+                                            !eq(addr, "global") : [llvm_global_ptr_ty],
+                                            !eq(addr, "local"): [llvm_local_ptr_ty],
+                                            !eq(addr, ""): [llvm_ptr_ty]),
+                                            [IntrArgMemOnly,ReadOnly<ArgIndex<0>>,
+                                             NoCapture<ArgIndex<0>>],
+                                            "llvm.nvvm.prefetch." # !if(!eq(addr, ""), "", addr                                                                                                     # ".")# level # "." # evict>;   
+    }
+  }
+}
+
+def int_nvvm_prefetchu_L1_evictnormal : Intrinsic<[], [llvm_ptr_ty],
+  [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>],
+  "llvm.nvvm.prefetchu.L1.evictnormal">;
+
+
 // Intrinsics for Bulk Copy using TMA (non-tensor)
 // From Global to Shared Cluster
 def int_nvvm_cp_async_bulk_global_to_shared_cluster
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index a0d00e4aac560a5..69f9cb3ff0c87b8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -740,6 +740,31 @@ foreach dim = [1, 2, 3, 4, 5] in {
   }
 }
 
+//Prefetch and Prefetchu 
+class PREFETCH_INTRS<string InstName, Intrinsic Intrin> :
+          NVPTXInst<(outs), (ins Int64Regs:$addr),
+            !strconcat(InstName, " [$addr];"),
+            [(Intrin i64:$addr)]>,
+          Requires<[hasPTX<80>, hasSM<90>]>;
+   
+
+// Only global supports evictlast and evictnormal.
+// Other variants (local and default) only support evictnormal
+foreach addr = ["global", "local", ""] in {
+  foreach evict = !if(!eq(addr, "global"),
+                          ["evictlast", "evictnormal"],
+                          ["evictnormal"]) in {
+    foreach level = ["L1", "L2"] in {
+      def PREFETCH_# addr # level # "_" # evict : PREFETCH_INTRS<
+                                                 "prefetch." # !if(!eq(addr, ""), "", addr # ".") # level # "." # evict,
+                                                 !cast<Intrinsic>
+                                                 ("int_nvvm_prefetch_"# !if(!eq(addr, ""), "", addr # "_") # level # "_" # evict)>;
+    }
+  }
+}
+
+def PREFETCHU_L1_EVICTNORMAL : PREFETCH_INTRS<"prefetchu.L1.evictnormal", !cast<Intrinsic>("int_nvvm_prefetchu_L1_evictnormal")>;
+
 //-----------------------------------
 // MBarrier Functions
 //-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/prefetch.ll b/llvm/test/CodeGen/NVPTX/prefetch.ll
new file mode 100644
index 000000000000000..0576a737d69a242
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/prefetch.ll
@@ -0,0 +1,82 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: %if ptxas-12.3 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-12.3 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void  @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+declare void  @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+
+declare void  @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+declare void  @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+declare void  @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+declare void  @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+
+declare void  @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+declare void  @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+
+declare void  @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+
+define void @prefetch_local(ptr addrspace(5) %localPtr) {
+; CHECK-PTX64-LABEL: prefetch_local(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetch_local_param_0];
+; CHECK-PTX64-NEXT:    prefetch.local.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.local.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+  tail call void @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+  ret void
+}
+
+define void @prefetch_global(ptr addrspace(1) %globalPtr) {
+; CHECK-PTX64-LABEL: prefetch_global(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetch_global_param_0];
+; CHECK-PTX64-NEXT:    prefetch.global.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.global.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.global.L1.evictlast [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.global.L2.evictlast [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+  tail call void @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+  tail call void @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+  tail call void @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+  ret void
+}
+
+
+define void @prefetch_(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetch_(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetch__param_0];
+; CHECK-PTX64-NEXT:    prefetch.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    prefetch.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+  tail call void @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+  ret void
+}
+
+define void @prefetchu_l1(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetchu_l1(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [prefetchu_l1_param_0];
+; CHECK-PTX64-NEXT:    prefetchu.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+  ret void
+}
\ No newline at end of file

Copy link
Contributor

@durga4github durga4github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest revision LGTM

@durga4github durga4github requested a review from Artem-B February 7, 2025 18:03
@durga4github
Copy link
Contributor

Merging it as per offline request,

@durga4github durga4github merged commit 6a961dc into llvm:main Feb 11, 2025
9 checks passed
Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
\[NVPTX\] Add Prefetch intrinsics

This PR adds prefetch intrinsics with the relevant eviction priorities.
* Lit tests are added as part of prefetch.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.

---------

Co-authored-by: abmajumder <[email protected]>
@metaflow
Copy link
Contributor

metaflow commented Feb 12, 2025

Hi @abhilash1910, the new test fails for me as

ptxas warning : incompatible redefinition for option 'gpu-name', the last value of this option was used
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 19; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 20; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 33; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 34; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 35; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 36; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 49; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 50; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 63; error   : Unknown modifier '.evictnormal'
ptxas fatal   : Ptx assembly aborted due to errors

ptxas --version is

ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:14:54_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

I have attached prefetch.txt generated by llc

Running as cat prefetch.txt | ptxas -arch=sm_60 -c - -arch=sm_90

What might be the issue?

@durga4github
Copy link
Contributor

Hi @abhilash1910, the new test fails for me as

ptxas warning : incompatible redefinition for option 'gpu-name', the last value of this option was used
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 19; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 20; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 33; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 34; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 35; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 36; error   : Unknown modifier '.evictlast'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 49; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 50; error   : Unknown modifier '.evictnormal'
ptxas /tmp/tmpxft_000a5e18_00000000-0_stdin, line 63; error   : Unknown modifier '.evictnormal'
ptxas fatal   : Ptx assembly aborted due to errors

ptxas --version is

ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:14:54_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

I have attached prefetch.txt generated by llc

Running as cat prefetch.txt | ptxas -arch=sm_60 -c - -arch=sm_90

What might be the issue?

Yes, I could repro the issue locally. Thank you for reporting!

At first sight, all the evict qualifiers need an underscore '_' to match the Spec.
will investigate further and update..

@abhilash1910
Copy link
Contributor Author

abhilash1910 commented Feb 12, 2025

Yes , we are working on currently fixing this ASAP. Thanks for raising the issue.

Fix for this issue is addressed in this PR : #126899

metaflow pushed a commit that referenced this pull request Feb 13, 2025
This is follow-up PR from #125887  which fixes the intrinsic failures .

---------

Co-authored-by: abmajumder <[email protected]>
joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025
\[NVPTX\] Add Prefetch intrinsics

This PR adds prefetch intrinsics with the relevant eviction priorities.
* Lit tests are added as part of prefetch.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.

---------

Co-authored-by: abmajumder <[email protected]>
joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025
This is follow-up PR from llvm#125887  which fixes the intrinsic failures .

---------

Co-authored-by: abmajumder <[email protected]>
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025
\[NVPTX\] Add Prefetch intrinsics

This PR adds prefetch intrinsics with the relevant eviction priorities.
* Lit tests are added as part of prefetch.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.

---------

Co-authored-by: abmajumder <[email protected]>
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025
This is follow-up PR from llvm#125887  which fixes the intrinsic failures .

---------

Co-authored-by: abmajumder <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants