-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[LLVM][NVPTX] Add NVPTX codegen support for fence.proxy.tensormap #100748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LLVM][NVPTX] Add NVPTX codegen support for fence.proxy.tensormap #100748
Conversation
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-nvptx Author: Pradeep Kumar (schwarzschild-radius) ChangesThis commit adds LLVM Intrinsics and NVPTX codegen support for Full diff: https://github.com/llvm/llvm-project/pull/100748.diff 4 Files Affected:
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 4385cba3ada0d..2b87985c78bbd 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -251,6 +251,34 @@ Overview:
The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
+Membar/Fences
+-------------
+
+
+'``llvm.nvvm.fence.proxy.tensormap.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.fence.proxy.tensormap.release.cta()
+ declare void @llvm.nvvm.fence.proxy.tensormap.release.cluster()
+ declare void @llvm.nvvm.fence.proxy.tensormap.release.gpu()
+ declare void @llvm.nvvm.fence.proxy.tensormap.release.sys()
+
+ declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr %addr, i32 %size)
+ declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cluster(ptr %addr, i32 %size)
+ declare void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr %addr, i32 %size)
+ declare void @llvm.nvvm.fence.proxy.tensormap.acquire.sys(ptr %addr, i32 %size)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
+
+The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
Other Intrinsics
----------------
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 1e7fdb53059e2..cf4e7ee1d991e 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1418,6 +1418,20 @@ let TargetPrefix = "nvvm" in {
def int_nvvm_fence_sc_cluster:
Intrinsic<[], [], [IntrNoCallback]>;
+// Proxy fence (uni-directional)
+foreach scope = ["cta", "cluster", "gpu", "sys"] in {
+
+ def int_nvvm_fence_proxy_tensormap_release_ # scope:
+ Intrinsic<[], [], [IntrNoCallback],
+ "llvm.nvvm.fence.proxy.tensormap.release." # scope>;
+
+ def int_nvvm_fence_proxy_tensormap_acquire_ # scope:
+ Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
+ [IntrNoCallback, ImmArg<ArgIndex<1>>],
+ "llvm.nvvm.fence.proxy.tensormap.acquire." # scope>;
+
+}
+
// Async Copy
def int_nvvm_cp_async_mbarrier_arrive :
ClangBuiltin<"__nvvm_cp_async_mbarrier_arrive">,
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index c81dfa68e4bd4..80ad91137689f 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -335,6 +335,48 @@ def INT_FENCE_SC_CLUSTER:
MEMBAR<"fence.sc.cluster;", int_nvvm_fence_sc_cluster>,
Requires<[hasPTX<78>, hasSM<90>]>;
+// Proxy fence (uni-directional)
+// fence.proxy.tensormap.release variants
+
+class FENCE_PROXY_TENSORMAP_RELEASE<string Scope, Intrinsic Intr> :
+ NVPTXInst<(outs), (ins),
+ "fence.proxy.tensormap::generic.release." # Scope # ";", [(Intr)]>,
+ Requires<[hasPTX<83>, hasSM<90>]>;
+
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_CTA:
+ FENCE_PROXY_TENSORMAP_RELEASE<"cta",
+ int_nvvm_fence_proxy_tensormap_release_cta>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_CLUSTER:
+ FENCE_PROXY_TENSORMAP_RELEASE<"cluster",
+ int_nvvm_fence_proxy_tensormap_release_cluster>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_GPU:
+ FENCE_PROXY_TENSORMAP_RELEASE<"gpu",
+ int_nvvm_fence_proxy_tensormap_release_gpu>;
+def INT_FENCE_PROXY_TENSORMAP_RELEASE_SYS:
+ FENCE_PROXY_TENSORMAP_RELEASE<"sys",
+ int_nvvm_fence_proxy_tensormap_release_sys>;
+
+// fence.proxy.tensormap.acquire variants
+
+class FENCE_PROXY_TENSORMAP_ACQUIRE<string Scope, Intrinsic Intr> :
+ NVPTXInst<(outs), (ins Int64Regs:$addr, i32imm:$size),
+ "fence.proxy.tensormap::generic.acquire." # Scope # " [$addr], $size;",
+ [(Intr Int64Regs:$addr, timm:$size)]>,
+ Requires<[hasPTX<83>, hasSM<90>]>;
+
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_CTA :
+ FENCE_PROXY_TENSORMAP_ACQUIRE<"cta",
+ int_nvvm_fence_proxy_tensormap_acquire_cta>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_CLUSTER :
+ FENCE_PROXY_TENSORMAP_ACQUIRE<"cluster",
+ int_nvvm_fence_proxy_tensormap_acquire_cluster>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_GPU :
+ FENCE_PROXY_TENSORMAP_ACQUIRE<"gpu",
+ int_nvvm_fence_proxy_tensormap_acquire_gpu>;
+def INT_FENCE_PROXY_TENSORMAP_ACQUIRE_SYS :
+ FENCE_PROXY_TENSORMAP_ACQUIRE<"sys",
+ int_nvvm_fence_proxy_tensormap_acquire_sys>;
+
//-----------------------------------
// Async Copy Functions
//-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll b/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll
new file mode 100644
index 0000000000000..72f82aad840a5
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fence-proxy-tensormap.ll
@@ -0,0 +1,36 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | FileCheck --check-prefixes=CHECK %s
+; RUN: %if ptxas-12.5 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %}
+
+; CHECK-LABEL: test_fence_proxy_tensormap_release
+define void @test_fence_proxy_tensormap_release() {
+ ; CHECK: fence.proxy.tensormap::generic.release.cta;
+ call void @llvm.nvvm.fence.proxy.tensormap.release.cta();
+
+ ; CHECK: fence.proxy.tensormap::generic.release.cluster;
+ call void @llvm.nvvm.fence.proxy.tensormap.release.cluster();
+
+ ; CHECK: fence.proxy.tensormap::generic.release.gpu;
+ call void @llvm.nvvm.fence.proxy.tensormap.release.gpu();
+
+ ; CHECK: fence.proxy.tensormap::generic.release.sys;
+ call void @llvm.nvvm.fence.proxy.tensormap.release.sys();
+
+ ret void
+}
+
+; CHECK-LABEL: test_fence_proxy_tensormap_acquire
+define void @test_fence_proxy_tensormap_acquire(ptr addrspace(0) %addr) {
+ ; CHECK: fence.proxy.tensormap::generic.acquire.cta [%rd{{[0-9]+}}], 128;
+ call void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr addrspace(0) %addr, i32 128);
+
+ ; CHECK: fence.proxy.tensormap::generic.acquire.cluster [%rd{{[0-9]+}}], 128;
+ call void @llvm.nvvm.fence.proxy.tensormap.acquire.cluster(ptr addrspace(0) %addr, i32 128);
+
+ ; CHECK: fence.proxy.tensormap::generic.acquire.gpu [%rd{{[0-9]+}}], 128;
+ call void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr addrspace(0) %addr, i32 128);
+
+ ; CHECK: fence.proxy.tensormap::generic.acquire.sys [%rd{{[0-9]+}}], 128;
+ call void @llvm.nvvm.fence.proxy.tensormap.acquire.sys(ptr addrspace(0) %addr, i32 128);
+
+ ret void
+}
|
Changes look good to me. Requested review from Artem, |
@Artem-B , Could you please review this change? |
|
||
def int_nvvm_fence_proxy_tensormap_acquire_ # scope: | ||
Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty], | ||
[IntrNoCallback, ImmArg<ArgIndex<1>>], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intrinsic constraints may be underspecified here.
I'm not familiar enough with the new synchronization instructions, so I can't tell you what is the right way to specify their behavior in LLVM, but it's not uncommon for various barrier instructions to come with a IntrHasSideEffects
to make sure that LLVM would not move them around.
These instruction variants seem to be narrower in scops and apply only to the area specified by the pointer and size, which suggests that we may give LLVM a bit more freedom and may get by with IntrArgMemOnly
, IntrReadMem
, IntrWriteMem
.
Also, are there any concerns regarding replicating/merging these instructions across different control flow paths? Do they need IntrConvergent
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar enough with the new synchronization instructions, so I can't tell you what is the right way to specify their behavior in LLVM, but it's not uncommon for various barrier instructions to come with a IntrHasSideEffects to make sure that LLVM would not move them around.
By default intrinsics are assumed to be side-effecting unless specified otherwise (
// intrinsic properties. By default, intrinsics are assumed to have side |
IntrArgMemOnly
attribute to relax constraints for the acquire variants
Also, are there any concerns regarding replicating/merging these instructions across different control flow paths? Do they need IntrConvergent ?
fence.* variants don't require IntrConvergent
as they can be executed divergently as well
@@ -0,0 +1,36 @@ | |||
; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | FileCheck --check-prefixes=CHECK %s | |||
; RUN: %if ptxas-12.5 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx83 | %ptxas-verify -arch=sm_90 %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instructions require PTX8.3, which suggests that the required ptxas version should be 12.3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the ptxas version in the latest revision
; CHECK-LABEL: test_fence_proxy_tensormap_acquire | ||
define void @test_fence_proxy_tensormap_acquire(ptr addrspace(0) %addr) { | ||
; CHECK: fence.proxy.tensormap::generic.acquire.cta [%rd{{[0-9]+}}], 128; | ||
call void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr addrspace(0) %addr, i32 128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is supposed to happen when a user passes a constant other than 128?
AFAICT we'll happily pass it along into generated PTX, and ptxas will fail to compile it. That's not ideal. I suppose, in this case it may be OK, as it's the PTX spec which specifies size as a parameter, but allows only one specific value at the moment. We could diagnose the wrong value a bit earlier, but I think it could be argued that it's not LLVM's job in this case.
One thing we could do is to remove the pattern for automatically matching the the intrinsic, and instead use a separate pattern that would match the intrinsic with a constant integer argument 128. Attempts to use the intrinsic with any other argument will fail in LLVM with a failure to lower the intrinsic, which would allow us to diagnose such errors earlier in the compilation pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the ISel pattern to just match constant integer 128 argument and added a verifier check to emit diagnostics for other values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is supposed to happen when a user passes a constant other than 128?
Since it is a precondition of the LLVM intrinsic, then I think it is LLVM's responsibility to emit an error (it is also a precondition of the PTX instruction, and ptxas does also emit an error for it).
5fa07db
to
f9d9284
Compare
llvm/docs/NVPTXUsage.rst
Outdated
------------- | ||
|
||
|
||
'``llvm.nvvm.fence.proxy.tensormap.*``' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is generic
dropped from PTX (tensormap::generic
) ?
How does the user of the intrinsic know whether it is fencing "from generic to tensormap", or in the other direction "from tensormap to generic"?
How will this be disambiguated if the opposite direction needs to be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added generic
explicitly in the intrinsic to mark the direction of the fencing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace tensormap.generic
with tensormap_generic
to explicitly state to_proxy::from_proxy
and also differentiate generic-addrspace and generic-proxy
llvm/docs/NVPTXUsage.rst
Outdated
Overview: | ||
""""""""" | ||
|
||
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to mention the direction of the cross-proxy uni-directional fences: from which proxy to which proxy. In this case, these cross-proxy-fence from generic-proxy to tensormap-proxy.
For documenting which intrinsics map to which PTX instructions, a table would be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added documentation to explicitly mention the proxy direction from generic to tensormap
llvm/docs/NVPTXUsage.rst
Outdated
Overview: | ||
""""""""" | ||
|
||
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing a colon
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence | |
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a colon
in the latest revision
llvm/docs/NVPTXUsage.rst
Outdated
|
||
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence | ||
|
||
The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: half-open range?
nit: typo
The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_. | |
The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which ordering across the proxies is provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the half open
range and typo
llvm/docs/NVPTXUsage.rst
Outdated
|
||
The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence | ||
|
||
The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should call out that addr
must either be global
or generic
pointing to global.
Do we have a way to annotate an intrinsic with that, to help addrspace propagation? That is, such that if someone uses this intrinsic on a pointer, we propagate throughout that the pointer, and any pointer it was derived from / derived from it, must point to global?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the documentation to specify that addr must either be global or generic point to global
Do we have a way to annotate an intrinsic with that, to help addrspace propagation? That is, such that if someone uses this intrinsic on a pointer, we propagate throughout that the pointer, and any pointer it was derived from / derived from it, must point to global?
I don't think we have an annotation for addrspace propagation. The inferAddrspace
pass takes cares of addrspace inference but in this case, the instruction itself takes a generic address and does not have a specialized version for global addrspace, so we can use generic addrspace pointers to avoid redundancy
f9d9284
to
72bfac3
Compare
This commit adds LLVM Intrinsics and NVPTX codegen support for `fence.proxy.tensormap` with lit tests under fence-proxy-tensormap.ll. Also, added Intrinsics documentation in NVPTXUsage.rst
72bfac3
to
f1ee403
Compare
Thanks Pradeep, this LGTM. |
@Artem-B I have addressed all of your comments, please take a look when you get a chance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a doc nit.
llvm/docs/NVPTXUsage.rst
Outdated
Overview: | ||
""""""""" | ||
|
||
The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy. ``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table describes the mapping between LLVM Intrinsic and the PTX instruction: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
performed via the generic proxy and a subsequent memory access performed via the tensormap proxy
It would be great to define what a 'proxy' is in this context and/or add a pointer to the documentation.
- https://dl.acm.org/doi/epdf/10.1145/3470496.3533045 -- "Mixed-Proxy Extensions for the NVIDIA PTX MemoryConsistency Model"
- https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change landed with no definition of what a proxy is, and @gonzalobg's suggestion was resolved w/o applying it.
I'm not sure if that was intentional or an oversight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add a definiton to proxy because Gonzalo explained that the PTX definition is evolving and he suggested us to just add a link to the PTX's proxy defintion and I have included Gonzalo's suggestion in the commit (
llvm-project/llvm/docs/NVPTXUsage.rst
Line 279 in cf416e0
The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic `proxy<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>_` and a subsequent memory access performed via the tensormap proxy. ``nvvm.fence.proxy.tensormap_generic.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap_generic.acquire`` proxy fence. The following table describes the mapping between LLVM Intrinsic and the PTX instruction: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should have waited until Artem finished their review before merging it, since they had started reviewing this change but hadn't approved it yet. EDIT: while they had approved it conditionally with this change, they still hadn't reviewed the change. It seems that when my suggestion was commited, the thread was resolved by githug, and then another person saw the 2 reviews and merged it; it happens.
Now its too late too late for that, but things like this happen, and can be fixed: @Artem-B we can continue the review here, and once its finished, do the changes as a follow up. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schwarzschild-radius My apologies, I do see the link to the proxy docs you've added. I must've been looking at the older revision of the patch. Github's review UI is still surprising me at the least convenient moments. :-/
Co-authored-by: gonzalobg <[email protected]>
@durga4github -- please hold off on merging the changes while the review conversation is still in progress. |
Sure, I saw Pradeep had added the links and you had approved; so clicked the merge button. |
This commit adds LLVM Intrinsics and NVPTX codegen support for
fence.proxy.tensormap
with lit tests under fence-proxy-tensormap.ll. Also, added Intrinsics documentation in NVPTXUsage.rst