-
Notifications
You must be signed in to change notification settings - Fork 14.2k
[NVPTX] Add intrinsics for prefetch.* #125887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-nvptx Author: Abhilash Majumder (abhilash1910) Changes[NVPTX] Add Prefetch intrinsics This PR adds prefetch intrinsics with the relevant eviction priorities.
For more information, refer PTX ISA Full diff: https://github.com/llvm/llvm-project/pull/125887.diff 4 Files Affected:
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index dec6ad4e541152a..31602a8c4981157 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -581,6 +581,47 @@ prefetched in terms of bytes and it must be a multiple of 16.
For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch>`_.
+'``llvm.nvvm.prefetch.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+ declare void @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+
+ declare void @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+ declare void @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+ declare void @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+ declare void @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+
+ declare void @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+ declare void @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+
+ declare void @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.prefetch.*``' and '``@llvm.nvvm.prefetchu.*``' intrinsic
+correspond to the '``prefetch.*``;' and '``prefetchu.*``' family of PTX instructions.
+The '``prefetch.*``' instructions bring the cache line containing the
+specified address in global or local memory address space into the
+specified cache level (L1 or L2). The '`prefetchu.*``' instruction brings the cache line
+containing the specified generic address into the specified uniform cache level.
+If no address space is specified, it is assumed to be generic address. The intrinsic
+uses and eviction priority which can be accessed by the '``.level::eviction_priority``' modifier.
+
+
+* A prefetch to a shared memory location performs no operation.
+* A prefetch into the uniform cache requires a generic address,
+ and no operation occurs if the address maps to a const, local, or shared memory location.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.
+
'``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index abbe25bf0040a6f..e9504b0ef5a2f20 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -48,6 +48,7 @@
def llvm_global_ptr_ty : LLVMQualPointerType<1>; // (global)ptr
def llvm_shared_ptr_ty : LLVMQualPointerType<3>; // (shared)ptr
+def llvm_local_ptr_ty : LLVMQualPointerType<5>; // (local)ptr
def llvm_tmem_ptr_ty : LLVMQualPointerType<6>; // (tensor memory)ptr
//
@@ -4999,6 +5000,30 @@ foreach dim = [1, 2, 3, 4, 5] in {
}
}
+// Intrinsics for Prefetch and Prefetchu
+foreach addr = ["global", "local", ""] in {
+ foreach evict = !if(!eq(addr, "global"),
+ ["evictlast", "evictnormal"],
+ ["evictnormal"]) in {
+ foreach level = ["L1", "L2"] in {
+ def int_nvvm_prefetch_ # !if(!eq(addr, ""), "", addr # "_")
+ # level # "_" # evict : Intrinsic<[],
+ !cond(
+ !eq(addr, "global") : [llvm_global_ptr_ty],
+ !eq(addr, "local"): [llvm_local_ptr_ty],
+ !eq(addr, ""): [llvm_ptr_ty]),
+ [IntrArgMemOnly,ReadOnly<ArgIndex<0>>,
+ NoCapture<ArgIndex<0>>],
+ "llvm.nvvm.prefetch." # !if(!eq(addr, ""), "", addr # ".")# level # "." # evict>;
+ }
+ }
+}
+
+def int_nvvm_prefetchu_L1_evictnormal : Intrinsic<[], [llvm_ptr_ty],
+ [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>],
+ "llvm.nvvm.prefetchu.L1.evictnormal">;
+
+
// Intrinsics for Bulk Copy using TMA (non-tensor)
// From Global to Shared Cluster
def int_nvvm_cp_async_bulk_global_to_shared_cluster
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index a0d00e4aac560a5..69f9cb3ff0c87b8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -740,6 +740,31 @@ foreach dim = [1, 2, 3, 4, 5] in {
}
}
+//Prefetch and Prefetchu
+class PREFETCH_INTRS<string InstName, Intrinsic Intrin> :
+ NVPTXInst<(outs), (ins Int64Regs:$addr),
+ !strconcat(InstName, " [$addr];"),
+ [(Intrin i64:$addr)]>,
+ Requires<[hasPTX<80>, hasSM<90>]>;
+
+
+// Only global supports evictlast and evictnormal.
+// Other variants (local and default) only support evictnormal
+foreach addr = ["global", "local", ""] in {
+ foreach evict = !if(!eq(addr, "global"),
+ ["evictlast", "evictnormal"],
+ ["evictnormal"]) in {
+ foreach level = ["L1", "L2"] in {
+ def PREFETCH_# addr # level # "_" # evict : PREFETCH_INTRS<
+ "prefetch." # !if(!eq(addr, ""), "", addr # ".") # level # "." # evict,
+ !cast<Intrinsic>
+ ("int_nvvm_prefetch_"# !if(!eq(addr, ""), "", addr # "_") # level # "_" # evict)>;
+ }
+ }
+}
+
+def PREFETCHU_L1_EVICTNORMAL : PREFETCH_INTRS<"prefetchu.L1.evictnormal", !cast<Intrinsic>("int_nvvm_prefetchu_L1_evictnormal")>;
+
//-----------------------------------
// MBarrier Functions
//-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/prefetch.ll b/llvm/test/CodeGen/NVPTX/prefetch.ll
new file mode 100644
index 000000000000000..0576a737d69a242
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/prefetch.ll
@@ -0,0 +1,82 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: %if ptxas-12.3 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-12.3 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+declare void @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+
+declare void @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+declare void @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+declare void @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+declare void @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+
+declare void @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+declare void @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+
+declare void @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+
+define void @prefetch_local(ptr addrspace(5) %localPtr) {
+; CHECK-PTX64-LABEL: prefetch_local(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [prefetch_local_param_0];
+; CHECK-PTX64-NEXT: prefetch.local.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT: prefetch.local.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT: ret;
+ tail call void @llvm.nvvm.prefetch.local.L1.evictnormal(ptr addrspace(5) %localPtr)
+ tail call void @llvm.nvvm.prefetch.local.L2.evictnormal(ptr addrspace(5) %localPtr)
+ ret void
+}
+
+define void @prefetch_global(ptr addrspace(1) %globalPtr) {
+; CHECK-PTX64-LABEL: prefetch_global(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [prefetch_global_param_0];
+; CHECK-PTX64-NEXT: prefetch.global.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT: prefetch.global.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT: prefetch.global.L1.evictlast [%rd1];
+; CHECK-PTX64-NEXT: prefetch.global.L2.evictlast [%rd1];
+; CHECK-PTX64-NEXT: ret;
+ tail call void @llvm.nvvm.prefetch.global.L1.evictnormal(ptr addrspace(1) %globalPtr)
+ tail call void @llvm.nvvm.prefetch.global.L2.evictnormal(ptr addrspace(1) %globalPtr)
+ tail call void @llvm.nvvm.prefetch.global.L1.evictlast(ptr addrspace(1) %globalPtr)
+ tail call void @llvm.nvvm.prefetch.global.L2.evictlast(ptr addrspace(1) %globalPtr)
+ ret void
+}
+
+
+define void @prefetch_(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetch_(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [prefetch__param_0];
+; CHECK-PTX64-NEXT: prefetch.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT: prefetch.L2.evictnormal [%rd1];
+; CHECK-PTX64-NEXT: ret;
+ tail call void @llvm.nvvm.prefetch.L1.evictnormal(ptr %ptr)
+ tail call void @llvm.nvvm.prefetch.L2.evictnormal(ptr %ptr)
+ ret void
+}
+
+define void @prefetchu_l1(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetchu_l1(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [prefetchu_l1_param_0];
+; CHECK-PTX64-NEXT: prefetchu.L1.evictnormal [%rd1];
+; CHECK-PTX64-NEXT: ret;
+ tail call void @llvm.nvvm.prefetchu.L1.evictnormal(ptr %ptr)
+ ret void
+}
\ No newline at end of file
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest revision LGTM
Merging it as per offline request, |
\[NVPTX\] Add Prefetch intrinsics This PR adds prefetch intrinsics with the relevant eviction priorities. * Lit tests are added as part of prefetch.ll * The generated PTX is verified with a 12.3 ptxas executable. * Added docs for these intrinsics in NVPTXUsage.rst. For more information, refer PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_. --------- Co-authored-by: abmajumder <[email protected]>
Hi @abhilash1910, the new test fails for me as
ptxas --version is
I have attached prefetch.txt generated by llc Running as What might be the issue? |
Yes, I could repro the issue locally. Thank you for reporting! At first sight, all the evict qualifiers need an underscore '_' to match the Spec. |
Yes , we are working on currently fixing this ASAP. Thanks for raising the issue. Fix for this issue is addressed in this PR : #126899 |
This is follow-up PR from #125887 which fixes the intrinsic failures . --------- Co-authored-by: abmajumder <[email protected]>
\[NVPTX\] Add Prefetch intrinsics This PR adds prefetch intrinsics with the relevant eviction priorities. * Lit tests are added as part of prefetch.ll * The generated PTX is verified with a 12.3 ptxas executable. * Added docs for these intrinsics in NVPTXUsage.rst. For more information, refer PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_. --------- Co-authored-by: abmajumder <[email protected]>
This is follow-up PR from llvm#125887 which fixes the intrinsic failures . --------- Co-authored-by: abmajumder <[email protected]>
\[NVPTX\] Add Prefetch intrinsics This PR adds prefetch intrinsics with the relevant eviction priorities. * Lit tests are added as part of prefetch.ll * The generated PTX is verified with a 12.3 ptxas executable. * Added docs for these intrinsics in NVPTXUsage.rst. For more information, refer PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_. --------- Co-authored-by: abmajumder <[email protected]>
This is follow-up PR from llvm#125887 which fixes the intrinsic failures . --------- Co-authored-by: abmajumder <[email protected]>
[NVPTX] Add Prefetch intrinsics
This PR adds prefetch intrinsics with the relevant eviction priorities.
For more information, refer PTX ISA
<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>
_.