[LLVM][NVPTX] Add NVPTX codegen support for fence.proxy.tensormap #100748

Conversation

schwarzschild-radius
Contributor

This commit adds LLVM intrinsics and NVPTX codegen support for fence.proxy.tensormap, with lit tests under fence-proxy-tensormap.ll. It also adds intrinsics documentation to NVPTXUsage.rst.

@llvmbot
Member

llvmbot commented Jul 26, 2024

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Pradeep Kumar (schwarzschild-radius)

Changes

This commit adds LLVM intrinsics and NVPTX codegen support for fence.proxy.tensormap, with lit tests under fence-proxy-tensormap.ll. It also adds intrinsics documentation to NVPTXUsage.rst.


Full diff: https://github.com/llvm/llvm-project/pull/100748.diff

4 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+28)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+14)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+42)
  • (added) llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll (+36)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 4385cba3ada0d..2b87985c78bbd 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -251,6 +251,34 @@ Overview:
 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
 instruction, equivalent to the ``__syncthreads()`` call in CUDA.
 
+Membar/Fences
+-------------
+
+
+'``llvm.nvvm.fence.proxy.tensormap.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.cta()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.cluster()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.gpu()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.sys()
+
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cluster(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.sys(ptr %addr, i32 %size)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
+
+The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
 
 Other Intrinsics
 ----------------
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 1e7fdb53059e2..cf4e7ee1d991e 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1418,6 +1418,20 @@ let TargetPrefix = "nvvm" in {
   def int_nvvm_fence_sc_cluster:
       Intrinsic<[], [], [IntrNoCallback]>;
 
+// Proxy fence (uni-directional)
+foreach scope = ["cta", "cluster", "gpu", "sys"] in {
+
+  def int_nvvm_fence_proxy_tensormap_release_ # scope:
+        Intrinsic<[], [], [IntrNoCallback],
+                  "llvm.nvvm.fence.proxy.tensormap.release." # scope>;
+
+  def int_nvvm_fence_proxy_tensormap_acquire_ # scope:
+        Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
+                  [IntrNoCallback, ImmArg<ArgIndex<1>>],
+                  "llvm.nvvm.fence.proxy.tensormap.acquire." # scope>;
+
+}
+
 // Async Copy
 def int_nvvm_cp_async_mbarrier_arrive :
     ClangBuiltin<"__nvvm_cp_async_mbarrier_arrive">,
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index c81dfa68e4bd4..80ad91137689f 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -335,6 +335,48 @@ def INT_FENCE_SC_CLUSTER:
        MEMBAR<"fence.sc.cluster;", int_nvvm_fence_sc_cluster>,
        Requires<[hasPTX<78>, hasSM<90>]>;
 
+// Proxy fence (uni-directional)
+// fence.proxy.tensormap.release variants
+
+class FENCE_PROXY_TENSORMAP_RELEASE<string Scope, Intrinsic Intr> :
+        NVPTXInst<(outs), (ins),
+                  "fence.proxy.tensormap::generic.release." # Scope # ";", [(Intr)]>,
+        Requires<[hasPTX<83>, hasSM<90>]>;
+
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_CTA:
+      FENCE_PROXY_TENSORMAP_RELEASE<"cta",
+        int_nvvm_fence_proxy_tensormap_release_cta>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_CLUSTER:
+      FENCE_PROXY_TENSORMAP_RELEASE<"cluster",
+        int_nvvm_fence_proxy_tensormap_release_cluster>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_GPU:
+      FENCE_PROXY_TENSORMAP_RELEASE<"gpu",
+        int_nvvm_fence_proxy_tensormap_release_gpu>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_SYS:
+      FENCE_PROXY_TENSORMAP_RELEASE<"sys",
+        int_nvvm_fence_proxy_tensormap_release_sys>;
+
+// fence.proxy.tensormap.acquire variants
+
+class FENCE_PROXY_TENSORMAP_ACQUIRE<string Scope, Intrinsic Intr> :
+        NVPTXInst<(outs), (ins Int64Regs:$addr, i32imm:$size),
+                  "fence.proxy.tensormap::generic.acquire." # Scope # " [$addr], $size;",
+                  [(Intr Int64Regs:$addr, timm:$size)]>,
+        Requires<[hasPTX<83>, hasSM<90>]>;
+
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_CTA :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"cta",
+        int_nvvm_fence_proxy_tensormap_acquire_cta>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_CLUSTER :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"cluster",
+        int_nvvm_fence_proxy_tensormap_acquire_cluster>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_GPU :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"gpu",
+        int_nvvm_fence_proxy_tensormap_acquire_gpu>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_SYS :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"sys",
+        int_nvvm_fence_proxy_tensormap_acquire_sys>;
+
 //-----------------------------------
 // Async Copy Functions
 //-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll b/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll
new file mode 100644
index 0000000000000..72f82aad840a5
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll
@@ -0,0 +1,36 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-12.5 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %}
+
+; CHECK-LABEL: test_fence_proxy_tensormap_release
+define void @test_fence_proxy_tensormap_release() {
+  ; CHECK: fence.proxy.tensormap::generic.release.cta;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.cta();
+
+  ; CHECK: fence.proxy.tensormap::generic.release.cluster;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.cluster();
+
+  ; CHECK: fence.proxy.tensormap::generic.release.gpu;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.gpu();
+
+  ; CHECK: fence.proxy.tensormap::generic.release.sys;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.sys();
+
+  ret void
+}
+
+; CHECK-LABEL: test_fence_proxy_tensormap_acquire
+define void @test_fence_proxy_tensormap_acquire(ptr addrspace(0) %addr) {
+  ; CHECK: fence.proxy.tensormap::generic.acquire.cta [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr addrspace(0) %addr, i32 128);
+
+  ; CHECK: fence.proxy.tensormap::generic.acquire.cluster [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.cluster(ptr addrspace(0) %addr, i32 128);
+
+  ; CHECK: fence.proxy.tensormap::generic.acquire.gpu [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr addrspace(0) %addr, i32 128);
+
+  ; CHECK: fence.proxy.tensormap::generic.acquire.sys [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.sys(ptr addrspace(0) %addr, i32 128);
+
+  ret void
+}

@durga4github durga4github requested a review from Artem-B July 26, 2024 14:17
@durga4github
Contributor

Changes look good to me. Requested a review from Artem.

@durga4github
Contributor

@Artem-B, could you please review this change?


def int_nvvm_fence_proxy_tensormap_acquire_ # scope:
Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
[IntrNoCallback, ImmArg<ArgIndex<1>>],
Member

Intrinsic constraints may be underspecified here.

I'm not familiar enough with the new synchronization instructions, so I can't tell you what the right way to specify their behavior in LLVM is, but it's not uncommon for various barrier instructions to come with IntrHasSideEffects to make sure that LLVM would not move them around.

These instruction variants seem to be narrower in scope and apply only to the area specified by the pointer and size, which suggests that we may give LLVM a bit more freedom and may get by with IntrArgMemOnly, IntrReadMem, IntrWriteMem.

Also, are there any concerns regarding replicating/merging these instructions across different control flow paths? Do they need IntrConvergent ?

Contributor Author

I'm not familiar enough with the new synchronization instructions, so I can't tell you what is the right way to specify their behavior in LLVM, but it's not uncommon for various barrier instructions to come with a IntrHasSideEffects to make sure that LLVM would not move them around.

By default, intrinsics are assumed to be side-effecting unless specified otherwise (see the comment in Intrinsics.td: "// intrinsic properties. By default, intrinsics are assumed to have side ..."). Added the IntrArgMemOnly attribute to relax the constraints for the acquire variants.
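
For reference, here is a rough TableGen sketch of what the relaxed acquire definition could look like with IntrArgMemOnly added; this is illustrative only and may not match the exact attribute set that eventually landed:

  // Sketch: the acquire variants only touch the memory pointed to by their
  // pointer argument, so IntrArgMemOnly can sit alongside the existing
  // properties. The release variants take no pointer and stay fully
  // side-effecting by default.
  foreach scope = ["cta", "cluster", "gpu", "sys"] in {
    def int_nvvm_fence_proxy_tensormap_acquire_ # scope:
          Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
                    [IntrNoCallback, IntrArgMemOnly, ImmArg<ArgIndex<1>>],
                    "llvm.nvvm.fence.proxy.tensormap.acquire." # scope>;
  }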

Also, are there any concerns regarding replicating/merging these instructions across different control flow paths? Do they need IntrConvergent ?

fence.* variants don't require IntrConvergent as they can be executed divergently as well

@@ -0,0 +1,36 @@
; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | FileCheck --check-prefixes=CHECK %s
; RUN: %if ptxas-12.5 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %}
Member

The instructions require PTX 8.3, which suggests that the required ptxas version should be 12.3.

Contributor Author

Fixed the ptxas version in the latest revision
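
Presumably the updated RUN line then looks something like this (a sketch; the exact lit feature name depends on what the NVPTX lit configuration defines):

  ; RUN: %if ptxas-12.3 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %}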

; CHECK-LABEL: test_fence_proxy_tensormap_acquire
define void @test_fence_proxy_tensormap_acquire(ptr addrspace(0) %addr) {
; CHECK: fence.proxy.tensormap::generic.acquire.cta [%rd{{[0-9]+}}], 128;
call void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr addrspace(0) %addr, i32 128);
Member

What is supposed to happen when a user passes a constant other than 128?

AFAICT we'll happily pass it along into the generated PTX, and ptxas will fail to compile it. That's not ideal. I suppose in this case it may be OK, as it's the PTX spec that specifies size as a parameter but allows only one specific value at the moment. We could diagnose the wrong value a bit earlier, but I think it could be argued that it's not LLVM's job in this case.

One thing we could do is remove the pattern for automatically matching the intrinsic, and instead use a separate pattern that matches the intrinsic with a constant integer argument of 128. Attempts to use the intrinsic with any other argument would then fail in LLVM with a failure to lower the intrinsic, which would allow us to diagnose such errors earlier in the compilation pipeline.

Contributor Author

Updated the ISel pattern to match only a constant integer argument of 128, and added a verifier check to emit diagnostics for other values.
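
Roughly, such a pattern might look like the following sketch (illustrative only; the pattern and class names that actually landed may differ):

  // Sketch: hard-code the only legal size (128) in both the pattern and the
  // assembly string, so any other immediate fails instruction selection.
  class FENCE_PROXY_TENSORMAP_ACQUIRE<string Scope, Intrinsic Intr> :
          NVPTXInst<(outs), (ins Int64Regs:$addr),
                    "fence.proxy.tensormap::generic.acquire." # Scope # " [$addr], 128;",
                    [(Intr Int64Regs:$addr, (i32 128))]>,
          Requires<[hasPTX<83>, hasSM<90>]>;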

Contributor

What is supposed to happen when a user passes a constant other than 128?

Since it is a precondition of the LLVM intrinsic, I think it is LLVM's responsibility to emit an error (it is also a precondition of the PTX instruction, and ptxas emits an error for it as well).

@schwarzschild-radius schwarzschild-radius force-pushed the fence_proxy_tensormap_nvptx_support branch from 5fa07db to f9d9284 Compare August 1, 2024 10:13
-------------


'``llvm.nvvm.fence.proxy.tensormap.*``'
Contributor

Why is generic dropped from PTX (tensormap::generic) ?

How does the user of the intrinsic know whether it is fencing "from generic to tensormap", or in the other direction "from tensormap to generic"?

How will this be disambiguated if the opposite direction needs to be added?

Contributor Author

Added generic explicitly in the intrinsic to mark the direction of the fencing

Contributor Author

Replaced tensormap.generic with tensormap_generic to explicitly state to_proxy::from_proxy and also to differentiate between the generic address space and the generic proxy.

Overview:
"""""""""

The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
Contributor

I think this needs to mention the direction of the cross-proxy uni-directional fences: from which proxy to which proxy. In this case, these are cross-proxy fences from the generic proxy to the tensormap proxy.

For documenting which intrinsics map to which PTX instructions, a table would be clearer.
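
For illustration, such a mapping table might look roughly like this (an informal sketch based on the intrinsic names in this revision, not exact reST markup):

  NVVM Intrinsic                                       PTX Instruction
  llvm.nvvm.fence.proxy.tensormap.release.<scope>      fence.proxy.tensormap::generic.release.<scope>;
  llvm.nvvm.fence.proxy.tensormap.acquire.<scope>      fence.proxy.tensormap::generic.acquire.<scope> [addr], size;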

Contributor Author

Added documentation to explicitly mention the proxy direction from generic to tensormap

Overview:
"""""""""

The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
Contributor

nit: missing a colon

Suggested change
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence.

Contributor Author

Added a colon in the latest revision


The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence

The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
Contributor

@gonzalobg gonzalobg Aug 1, 2024

nit: half-open range?
nit: typo

Suggested change
The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which ordering across the proxies is provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.

Contributor Author

Fixed the half open range and typo


The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence

The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
Contributor

@gonzalobg gonzalobg Aug 1, 2024

I think this should call out that addr must either be a global pointer or a generic pointer that points to global memory.

Do we have a way to annotate an intrinsic with that, to help addrspace propagation? That is, such that if someone uses this intrinsic on a pointer, we can propagate the fact that the pointer, and any pointer it was derived from or that is derived from it, must point to global memory?

Contributor Author

Updated the documentation to specify that addr must either be a global pointer or a generic pointer pointing to global memory.

Do we have a way to annotate an intrinsic with that, to help addrspace propagation? That is, such that if someone uses this intrinsic on a pointer, we propagate throughout that the pointer, and any pointer it was derived from / derived from it, must point to global?

I don't think we have an annotation for addrspace propagation. The infer-address-spaces pass takes care of addrspace inference, but in this case the instruction itself takes a generic address and does not have a specialized version for the global address space, so we can use generic address-space pointers to avoid redundancy.
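
As an illustration of that last point, a hypothetical IR usage could take a tensormap in the global address space and addrspacecast it to a generic pointer before calling the intrinsic (intrinsic name as in this revision of the patch):

  ; Hypothetical sketch: %tmap lives in the global address space; the
  ; intrinsic takes a generic pointer, so it is addrspacecast first.
  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr, i32)

  define void @acquire_tensormap(ptr addrspace(1) %tmap) {
    %p = addrspacecast ptr addrspace(1) %tmap to ptr
    call void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr %p, i32 128)
    ret void
  }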

@schwarzschild-radius schwarzschild-radius force-pushed the fence_proxy_tensormap_nvptx_support branch from f9d9284 to 72bfac3 Compare August 6, 2024 10:37
This commit adds LLVM Intrinsics and NVPTX codegen support for
`fence.proxy.tensormap` with lit tests under fence-proxy-tensormap.ll.
Also, added Intrinsics documentation in NVPTXUsage.rst
@schwarzschild-radius schwarzschild-radius force-pushed the fence_proxy_tensormap_nvptx_support branch from 72bfac3 to f1ee403 Compare August 6, 2024 19:26
@gonzalobg
Contributor

Thanks Pradeep, this LGTM.

@schwarzschild-radius
Contributor Author

@Artem-B I have addressed all of your comments, please take a look when you get a chance

Member

@Artem-B Artem-B left a comment

LGTM with a doc nit.

Overview:
"""""""""

The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy. ``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table describes the mapping between LLVM Intrinsic and the PTX instruction:
Member

performed via the generic proxy and a subsequent memory access performed via the tensormap proxy

It would be great to define what a 'proxy' is in this context and/or add a pointer to the documentation.

Member

The change landed with no definition of what a proxy is, and @gonzalobg's suggestion was resolved w/o applying it.

I'm not sure if that was intentional or an oversight.

Contributor Author

I did not add a definition of proxy because Gonzalo explained that the PTX definition is evolving and suggested that we just add a link to the PTX proxy definition, and I have included Gonzalo's suggestion in the commit:

The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic `proxy<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>_` and a subsequent memory access performed via the tensormap proxy. ``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table describes the mapping between LLVM Intrinsic and the PTX instruction:

Please let me know if I am missing something.

Contributor

@gonzalobg gonzalobg Aug 7, 2024

I think this should have waited until Artem finished their review before merging it, since they had started reviewing this change but hadn't approved it yet. EDIT: while they had approved it conditionally with this change, they still hadn't reviewed the change. It seems that when my suggestion was committed, the thread was resolved by GitHub, and then another person saw the two reviews and merged it; it happens.

Now it's too late for that, but things like this happen and can be fixed: @Artem-B, we can continue the review here, and once it's finished, do the changes as a follow-up. How does that sound?

Member

@schwarzschild-radius My apologies, I do see the link to the proxy docs you've added. I must've been looking at an older revision of the patch. GitHub's review UI is still surprising me at the least convenient moments. :-/

@durga4github durga4github merged commit bfef7cc into llvm:main Aug 7, 2024
8 checks passed
@Artem-B
Member

Artem-B commented Aug 7, 2024

@durga4github -- please hold off on merging the changes while the review conversation is still in progress.

@durga4github
Contributor

Sure, I saw Pradeep had added the links and you had approved, so I clicked the merge button.
I will wait for the author/reviewer to resolve all conversations before merging.
