[LLVM][NVPTX] Add NVPTX codegen support for fence.proxy.tensormap #100748

Conversation

schwarzschild-radius
Contributor

This commit adds LLVM intrinsics and NVPTX codegen support for fence.proxy.tensormap, with lit tests under fence-proxy-tensormap.ll. It also adds intrinsics documentation to NVPTXUsage.rst.

@llvmbot
Member

llvmbot commented Jul 26, 2024

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Pradeep Kumar (schwarzschild-radius)

Changes

This commit adds LLVM intrinsics and NVPTX codegen support for fence.proxy.tensormap, with lit tests under fence-proxy-tensormap.ll. It also adds intrinsics documentation to NVPTXUsage.rst.


Full diff: https://github.com/llvm/llvm-project/pull/100748.diff

4 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+28)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+14)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+42)
  • (added) llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll (+36)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 4385cba3ada0d..2b87985c78bbd 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -251,6 +251,34 @@ Overview:
 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
 instruction, equivalent to the ``__syncthreads()`` call in CUDA.
 
+Membar/Fences
+-------------
+
+
+'``llvm.nvvm.fence.proxy.tensormap.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.cta()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.cluster()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.gpu()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.sys()
+
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cluster(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.sys(ptr %addr, i32 %size)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
+
+The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
 
 Other Intrinsics
 ----------------
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 1e7fdb53059e2..cf4e7ee1d991e 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1418,6 +1418,20 @@ let TargetPrefix = "nvvm" in {
   def int_nvvm_fence_sc_cluster:
       Intrinsic<[], [], [IntrNoCallback]>;
 
+// Proxy fence (uni-directional)
+foreach scope = ["cta", "cluster", "gpu", "sys"] in {
+
+  def int_nvvm_fence_proxy_tensormap_release_ # scope:
+        Intrinsic<[], [], [IntrNoCallback],
+                  "llvm.nvvm.fence.proxy.tensormap.release." # scope>;
+
+  def int_nvvm_fence_proxy_tensormap_acquire_ # scope:
+        Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
+                  [IntrNoCallback, ImmArg<ArgIndex<1>>],
+                  "llvm.nvvm.fence.proxy.tensormap.acquire." # scope>;
+
+}
+
 // Async Copy
 def int_nvvm_cp_async_mbarrier_arrive :
     ClangBuiltin<"__nvvm_cp_async_mbarrier_arrive">,
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index c81dfa68e4bd4..80ad91137689f 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -335,6 +335,48 @@ def INT_FENCE_SC_CLUSTER:
        MEMBAR<"fence.sc.cluster;", int_nvvm_fence_sc_cluster>,
        Requires<[hasPTX<78>, hasSM<90>]>;
 
+// Proxy fence (uni-directional)
+// fence.proxy.tensormap.release variants
+
+class FENCE_PROXY_TENSORMAP_RELEASE<string Scope, Intrinsic Intr> :
+        NVPTXInst<(outs), (ins),
+                  "fence.proxy.tensormap::generic.release." # Scope # ";", [(Intr)]>,
+        Requires<[hasPTX<83>, hasSM<90>]>;
+
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_CTA:
+      FENCE_PROXY_TENSORMAP_RELEASE<"cta",
+        int_nvvm_fence_proxy_tensormap_release_cta>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_CLUSTER:
+      FENCE_PROXY_TENSORMAP_RELEASE<"cluster",
+        int_nvvm_fence_proxy_tensormap_release_cluster>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_GPU:
+      FENCE_PROXY_TENSORMAP_RELEASE<"gpu",
+        int_nvvm_fence_proxy_tensormap_release_gpu>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_SYS:
+      FENCE_PROXY_TENSORMAP_RELEASE<"sys",
+        int_nvvm_fence_proxy_tensormap_release_sys>;
+
+// fence.proxy.tensormap.acquire variants
+
+class FENCE_PROXY_TENSORMAP_ACQUIRE<string Scope, Intrinsic Intr> :
+        NVPTXInst<(outs), (ins Int64Regs:$addr, i32imm:$size),
+                  "fence.proxy.tensormap::generic.acquire." # Scope # " [$addr], $size;",
+                  [(Intr Int64Regs:$addr, timm:$size)]>,
+        Requires<[hasPTX<83>, hasSM<90>]>;
+
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_CTA :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"cta",
+        int_nvvm_fence_proxy_tensormap_acquire_cta>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_CLUSTER :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"cluster",
+        int_nvvm_fence_proxy_tensormap_acquire_cluster>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_GPU :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"gpu",
+        int_nvvm_fence_proxy_tensormap_acquire_gpu>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_SYS :
+      FENCE_PROXY_TENSORMAP_ACQUIRE<"sys",
+        int_nvvm_fence_proxy_tensormap_acquire_sys>;
+
 //-----------------------------------
 // Async Copy Functions
 //-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll b/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll
new file mode 100644
index 0000000000000..72f82aad840a5
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll
@@ -0,0 +1,36 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-12.5 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %}
+
+; CHECK-LABEL: test_fence_proxy_tensormap_release
+define void @test_fence_proxy_tensormap_release() {
+  ; CHECK: fence.proxy.tensormap::generic.release.cta;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.cta();
+
+  ; CHECK: fence.proxy.tensormap::generic.release.cluster;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.cluster();
+
+  ; CHECK: fence.proxy.tensormap::generic.release.gpu;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.gpu();
+
+  ; CHECK: fence.proxy.tensormap::generic.release.sys;
+  call void @llvm.nvvm.fence.proxy.tensormap.release.sys();
+
+  ret void
+}
+
+; CHECK-LABEL: test_fence_proxy_tensormap_acquire
+define void @test_fence_proxy_tensormap_acquire(ptr addrspace(0) %addr) {
+  ; CHECK: fence.proxy.tensormap::generic.acquire.cta [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr addrspace(0) %addr, i32 128);
+
+  ; CHECK: fence.proxy.tensormap::generic.acquire.cluster [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.cluster(ptr addrspace(0) %addr, i32 128);
+
+  ; CHECK: fence.proxy.tensormap::generic.acquire.gpu [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr addrspace(0) %addr, i32 128);
+
+  ; CHECK: fence.proxy.tensormap::generic.acquire.sys [%rd{{[0-9]+}}], 128;
+  call void @llvm.nvvm.fence.proxy.tensormap.acquire.sys(ptr addrspace(0) %addr, i32 128);
+
+  ret void
+}

@durga4github durga4github requested a review from Artem-B July 26, 2024 14:17
@durga4github
Contributor

Changes look good to me. Requested a review from Artem.

@durga4github
Contributor

@Artem-B, could you please review this change?


def int_nvvm_fence_proxy_tensormap_acquire_ # scope:
Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
[IntrNoCallback, ImmArg<ArgIndex<1>>],
Member

Intrinsic constraints may be underspecified here.

I'm not familiar enough with the new synchronization instructions, so I can't tell you what the right way to specify their behavior in LLVM is, but it's not uncommon for various barrier instructions to come with IntrHasSideEffects to make sure that LLVM would not move them around.

These instruction variants seem to be narrower in scope and apply only to the area specified by the pointer and size, which suggests that we may give LLVM a bit more freedom and may get by with IntrArgMemOnly, IntrReadMem, IntrWriteMem.

Also, are there any concerns regarding replicating/merging these instructions across different control flow paths? Do they need IntrConvergent ?

Contributor Author

I'm not familiar enough with the new synchronization instructions, so I can't tell you what is the right way to specify their behavior in LLVM, but it's not uncommon for various barrier instructions to come with a IntrHasSideEffects to make sure that LLVM would not move them around.

By default, intrinsics are assumed to be side-effecting unless specified otherwise (see the comment in Intrinsics.td: "// intrinsic properties. By default, intrinsics are assumed to have side ..."). Added the IntrArgMemOnly attribute to relax the constraints for the acquire variants.
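
For reference, here is a rough TableGen sketch of what the relaxed acquire definition could look like with IntrArgMemOnly added; this is illustrative only and may not match the exact attribute set that eventually landed:

  // Sketch: the acquire variants only touch the memory pointed to by their
  // pointer argument, so IntrArgMemOnly can sit alongside the existing
  // properties. The release variants take no pointer and stay fully
  // side-effecting by default.
  foreach scope = ["cta", "cluster", "gpu", "sys"] in {
    def int_nvvm_fence_proxy_tensormap_acquire_ # scope:
          Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
                    [IntrNoCallback, IntrArgMemOnly, ImmArg<ArgIndex<1>>],
                    "llvm.nvvm.fence.proxy.tensormap.acquire." # scope>;
  }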

Also, are there any concerns regarding replicating/merging these instructions across different control flow paths? Do they need IntrConvergent ?

fence.* variants don't require IntrConvergent as they can be executed divergently as well

@@ -0,0 +1,36 @@
; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | FileCheck --check-prefixes=CHECK %s
; RUN: %if ptxas-12.5 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %}
Member

The instructions require PTX 8.3, which suggests that the required ptxas version should be 12.3.

Contributor Author

Fixed the ptxas version in the latest revision
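
Presumably the updated RUN line then looks something like this (a sketch; the exact lit feature name depends on what the NVPTX lit configuration defines):

  ; RUN: %if ptxas-12.3 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %}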

; CHECK-LABEL: test_fence_proxy_tensormap_acquire
define void @test_fence_proxy_tensormap_acquire(ptr addrspace(0) %addr) {
; CHECK: fence.proxy.tensormap::generic.acquire.cta [%rd{{[0-9]+}}], 128;
call void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr addrspace(0) %addr, i32 128);
Member

What is supposed to happen when a user passes a constant other than 128?

AFAICT we'll happily pass it along into the generated PTX, and ptxas will fail to compile it. That's not ideal. I suppose in this case it may be OK, as it's the PTX spec that specifies size as a parameter but allows only one specific value at the moment. We could diagnose the wrong value a bit earlier, but I think it could be argued that it's not LLVM's job in this case.

One thing we could do is remove the pattern for automatically matching the intrinsic, and instead use a separate pattern that matches the intrinsic with a constant integer argument of 128. Attempts to use the intrinsic with any other argument would then fail in LLVM with a failure to lower the intrinsic, which would allow us to diagnose such errors earlier in the compilation pipeline.

Contributor Author

Updated the ISel pattern to match only a constant integer argument of 128, and added a verifier check to emit diagnostics for other values.
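
Roughly, such a pattern might look like the following sketch (illustrative only; the pattern and class names that actually landed may differ):

  // Sketch: hard-code the only legal size (128) in both the pattern and the
  // assembly string, so any other immediate fails instruction selection.
  class FENCE_PROXY_TENSORMAP_ACQUIRE<string Scope, Intrinsic Intr> :
          NVPTXInst<(outs), (ins Int64Regs:$addr),
                    "fence.proxy.tensormap::generic.acquire." # Scope # " [$addr], 128;",
                    [(Intr Int64Regs:$addr, (i32 128))]>,
          Requires<[hasPTX<83>, hasSM<90>]>;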

Contributor

What is supposed to happen when a user passes a constant other than 128?

Since it is a precondition of the LLVM intrinsic, I think it is LLVM's responsibility to emit an error (it is also a precondition of the PTX instruction, and ptxas emits an error for it as well).

@schwarzschild-radius schwarzschild-radius force-pushed the fence_proxy_tensormap_nvptx_support branch from 5fa07db to f9d9284 Compare August 1, 2024 10:13
-------------


'``llvm.nvvm.fence.proxy.tensormap.*``'
Contributor

Why is generic dropped from PTX (tensormap::generic) ?

How does the user of the intrinsic know whether it is fencing "from generic to tensormap", or in the other direction "from tensormap to generic"?

How will this be disambiguated if the opposite direction needs to be added?

Contributor Author

Added generic explicitly in the intrinsic to mark the direction of the fencing

Contributor Author

Replaced tensormap.generic with tensormap_generic to explicitly state to_proxy::from_proxy and also to differentiate between the generic address space and the generic proxy.

Overview:
"""""""""

The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
Contributor

I think this needs to mention the direction of the cross-proxy uni-directional fences: from which proxy to which proxy. In this case, these are cross-proxy fences from the generic proxy to the tensormap proxy.

For documenting which intrinsics map to which PTX instructions, a table would be clearer.
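
For illustration, such a mapping table might look roughly like this (an informal sketch based on the intrinsic names in this revision, not exact reST markup):

  NVVM Intrinsic                                       PTX Instruction
  llvm.nvvm.fence.proxy.tensormap.release.<scope>      fence.proxy.tensormap::generic.release.<scope>;
  llvm.nvvm.fence.proxy.tensormap.acquire.<scope>      fence.proxy.tensormap::generic.acquire.<scope> [addr], size;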

Contributor Author

Added documentation to explicitly mention the proxy direction from generic to tensormap

Overview:
"""""""""

The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
Contributor

nit: missing a colon

Suggested change
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence.

Contributor Author

Added a colon in the latest revision


The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence

The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
Contributor

@gonzalobg gonzalobg Aug 1, 2024

nit: half-open range?
nit: typo

Suggested change
The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which ordering across the proxies is provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.

Contributor Author

Fixed the half open range and typo


The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence

The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
Contributor

@gonzalobg gonzalobg Aug 1, 2024

I think this should call out that addr must either be a global pointer or a generic pointer that points to global memory.

Do we have a way to annotate an intrinsic with that, to help addrspace propagation? That is, such that if someone uses this intrinsic on a pointer, we can propagate the fact that the pointer, and any pointer it was derived from or that is derived from it, must point to global memory?

Contributor Author

Updated the documentation to specify that addr must either be a global pointer or a generic pointer pointing to global memory.

Do we have a way to annotate an intrinsic with that, to help addrspace propagation? That is, such that if someone uses this intrinsic on a pointer, we propagate throughout that the pointer, and any pointer it was derived from / derived from it, must point to global?

I don't think we have an annotation for addrspace propagation. The infer-address-spaces pass takes care of addrspace inference, but in this case the instruction itself takes a generic address and does not have a specialized version for the global address space, so we can use generic address-space pointers to avoid redundancy.
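
As an illustration of that last point, a hypothetical IR usage could take a tensormap in the global address space and addrspacecast it to a generic pointer before calling the intrinsic (intrinsic name as in this revision of the patch):

  ; Hypothetical sketch: %tmap lives in the global address space; the
  ; intrinsic takes a generic pointer, so it is addrspacecast first.
  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr, i32)

  define void @acquire_tensormap(ptr addrspace(1) %tmap) {
    %p = addrspacecast ptr addrspace(1) %tmap to ptr
    call void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr %p, i32 128)
    ret void
  }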

@schwarzschild-radius schwarzschild-radius force-pushed the fence_proxy_tensormap_nvptx_support branch from f9d9284 to 72bfac3 Compare August 6, 2024 10:37
This commit adds LLVM Intrinsics and NVPTX codegen support for
`fence.proxy.tensormap` with lit tests under fence-proxy-tensormap.ll.
Also, added Intrinsics documentation in NVPTXUsage.rst
@schwarzschild-radius schwarzschild-radius force-pushed the fence_proxy_tensormap_nvptx_support branch from 72bfac3 to f1ee403 Compare August 6, 2024 19:26
@gonzalobg
Contributor

Thanks Pradeep, this LGTM.

@schwarzschild-radius
Contributor Author

@Artem-B I have addressed all of your comments, please take a look when you get a chance

Member

@Artem-B Artem-B left a comment

LGTM with a doc nit.

Overview:
"""""""""

The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy. ``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table describes the mapping between LLVM Intrinsic and the PTX instruction:
Member

performed via the generic proxy and a subsequent memory access performed via the tensormap proxy

It would be great to define what a 'proxy' is in this context and/or add a pointer to the documentation.

Member

The change landed with no definition of what a proxy is, and @gonzalobg's suggestion was resolved w/o applying it.

I'm not sure if that was intentional or an oversight.

Contributor Author

I did not add a definition of proxy because Gonzalo explained that the PTX definition is evolving and suggested that we just add a link to the PTX proxy definition, and I have included Gonzalo's suggestion in the commit:

The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic `proxy<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>_` and a subsequent memory access performed via the tensormap proxy. ``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table describes the mapping between LLVM Intrinsic and the PTX instruction:

Please let me know if I am missing something.

Contributor

@gonzalobg gonzalobg Aug 7, 2024

I think this should have waited until Artem finished their review before merging it, since they had started reviewing this change but hadn't approved it yet. EDIT: while they had approved it conditionally with this change, they still hadn't reviewed the change. It seems that when my suggestion was committed, the thread was resolved by GitHub, and then another person saw the two reviews and merged it; it happens.

Now it's too late for that, but things like this happen and can be fixed: @Artem-B, we can continue the review here, and once it's finished, do the changes as a follow-up. How does that sound?

Member

@schwarzschild-radius My apologies, I do see the link to the proxy docs you've added. I must've been looking at an older revision of the patch. GitHub's review UI is still surprising me at the least convenient moments. :-/

@durga4github durga4github merged commit bfef7cc into llvm:main Aug 7, 2024
8 checks passed
@Artem-B
Member

Artem-B commented Aug 7, 2024

@durga4github -- please hold off on merging the changes while the review conversation is still in progress.

@durga4github
Contributor

Sure, I saw Pradeep had added the links and you had approved, so I clicked the merge button.
I will wait for the author/reviewer to resolve all conversations before merging.
