
Commit bf50e0a
[NVPTX] Add TMA Bulk Copy intrinsics
PR #96083 added intrinsics for async copy of 'tensor' data using TMA. This PR adds intrinsics for async copy of bulk data (non-tensor variants) through TMA, following a similar design.

* These intrinsics optionally support multicast and cache_hints, as indicated by the boolean arguments at the end of the intrinsics.
* The backend looks through these flag arguments and lowers to the appropriate PTX instruction.
* Lit tests are added for all combinations of these intrinsics in cp-async-bulk.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in the NVPTXUsage.rst file.

PTX Spec reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk

Signed-off-by: Durgadoss R <[email protected]>
1 parent 81ae668 commit bf50e0a

File tree

6 files changed: +393 -3 lines changed

llvm/docs/NVPTXUsage.rst

Lines changed: 88 additions & 0 deletions
@@ -465,6 +465,94 @@ least-significant bit position. 0xffffffff is returned if no 1 bit is found.
TMA family of Intrinsics
------------------------

'``llvm.nvvm.cp.async.bulk.global.to.shared.cluster``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(1) %src, i32 %size, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.global.to.shared.cluster``' intrinsic
corresponds to the ``cp.async.bulk.shared::cluster.global.*`` family
of PTX instructions. These instructions initiate an asynchronous
copy of bulk data from global memory to shared::cluster memory.
The 32-bit operand ``%size`` specifies the amount of memory to be
copied, and it must be a multiple of 16.

* The last two arguments to these intrinsics are boolean flags
  indicating support for the cache_hint and multicast modifiers.
  These flag arguments must be compile-time constants. The backend
  looks through these flags and lowers the intrinsics appropriately.

* The last argument (denoted by ``i1 %flag_ch``), when set, indicates
  a valid cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
  variant of the PTX instruction.

* The second-to-last argument (denoted by ``i1 %flag_mc``), when set,
  indicates the presence of a multicast mask (``i16 %mc``) and generates
  the PTX instruction with the ``.multicast::cluster`` modifier.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
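As an illustrative usage sketch (the function and value names below are hypothetical, not part of the commit), a call that enables both optional modifiers passes a multicast mask, a cache-hint value, and sets both trailing flags:

```llvm
; Hypothetical example: copy 256 bytes from global memory to
; shared::cluster memory with both optional modifiers enabled.
; %mc carries the CTA multicast mask and %ch the cache-hint value;
; the two trailing i1 flags must be compile-time constants.
define void @g2s_mc_ch(ptr addrspace(3) %dst, ptr addrspace(3) %mbar,
                       ptr addrspace(1) %src, i16 %mc, i64 %ch) {
  call void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(
      ptr addrspace(3) %dst, ptr addrspace(3) %mbar,
      ptr addrspace(1) %src, i32 256, i16 %mc, i64 %ch, i1 1, i1 1)
  ret void
}

declare void @llvm.nvvm.cp.async.bulk.global.to.shared.cluster(ptr addrspace(3), ptr addrspace(3), ptr addrspace(1), i32, i16, i64, i1, i1)
```

With both flags set, this should select the ``.multicast::cluster.L2::cache_hint`` variant of the instruction; clearing either flag drops the corresponding operand and modifier.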
'``llvm.nvvm.cp.async.bulk.shared.cta.to.global``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %size, i64 %ch, i1 %flag_ch)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.global``' intrinsic
corresponds to the ``cp.async.bulk.global.shared::cta.*`` set of PTX
instructions. These instructions initiate an asynchronous copy from
shared::cta to global memory. The 32-bit operand ``%size`` specifies
the amount of memory to be copied, and it must be a multiple of 16.

* The last argument to these intrinsics is a boolean flag
  indicating support for cache_hint. This flag argument must
  be a compile-time constant. When set, it indicates a valid
  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
  variant of the PTX instruction.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
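A minimal usage sketch (hypothetical function name and copy size, not from the commit), with the cache-hint flag cleared so the ``i64`` operand is ignored:

```llvm
; Hypothetical example: copy 128 bytes from shared::cta to global
; memory without a cache hint. The i1 flag is 0, so the i64 operand
; is unused and passed as 0 here.
define void @s2g_plain(ptr addrspace(1) %dst, ptr addrspace(3) %src) {
  call void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(
      ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 128, i64 0, i1 0)
  ret void
}

declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.global(ptr addrspace(1), ptr addrspace(3), i32, i64, i1)
```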
'``llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.cluster(ptr addrspace(3) %dst, ptr addrspace(3) %mbar, ptr addrspace(3) %src, i32 %size)

Overview:
"""""""""

The '``@llvm.nvvm.cp.async.bulk.shared.cta.to.cluster``' intrinsic
corresponds to the ``cp.async.bulk.shared::cluster.shared::cta.*``
PTX instruction. This instruction initiates an asynchronous copy from
shared::cta to shared::cluster memory. The destination has to be in
the shared memory of a different CTA within the cluster. The 32-bit
operand ``%size`` specifies the amount of memory to be copied, and
it must be a multiple of 16.

For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk>`_.
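A usage sketch under the constraint stated above (hypothetical names and size; ``%dst`` is assumed to point into a peer CTA's shared memory):

```llvm
; Hypothetical example: copy 64 bytes from this CTA's shared memory
; to the shared memory of a different CTA in the same cluster.
; Completion is tracked through the mbarrier object at %mbar.
define void @c2c(ptr addrspace(3) %dst, ptr addrspace(3) %mbar,
                 ptr addrspace(3) %src) {
  call void @llvm.nvvm.cp.async.bulk.shared.cta.to.cluster(
      ptr addrspace(3) %dst, ptr addrspace(3) %mbar,
      ptr addrspace(3) %src, i32 64)
  ret void
}

declare void @llvm.nvvm.cp.async.bulk.shared.cta.to.cluster(ptr addrspace(3), ptr addrspace(3), ptr addrspace(3), i32)
```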
'``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

llvm/include/llvm/IR/IntrinsicsNVVM.td

Lines changed: 43 additions & 0 deletions
@@ -4980,4 +4980,47 @@ foreach dim = [1, 2, 3, 4, 5] in {
  }
}

// Intrinsics for Bulk Copy using TMA (non-tensor)
// From Global to Shared Cluster
def int_nvvm_cp_async_bulk_global_to_shared_cluster
  : DefaultAttrsIntrinsic<[],
      [llvm_shared_ptr_ty,  // dst_smem_ptr
       llvm_shared_ptr_ty,  // mbarrier_ptr
       llvm_global_ptr_ty,  // src_gmem_ptr
       llvm_i32_ty,         // copy_size
       llvm_i16_ty,         // cta_mask
       llvm_i64_ty,         // cache_hint
       llvm_i1_ty,          // Flag for cta_mask
       llvm_i1_ty],         // Flag for cache_hint
      [IntrConvergent, IntrArgMemOnly,
       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>,
       NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
       NoCapture<ArgIndex<2>>, ImmArg<ArgIndex<6>>,
       ImmArg<ArgIndex<7>>]>;

// From Shared CTA to Shared Cluster
def int_nvvm_cp_async_bulk_shared_cta_to_cluster
  : DefaultAttrsIntrinsic<[],
      [llvm_shared_ptr_ty,  // dst_smem_ptr
       llvm_shared_ptr_ty,  // mbarrier_ptr
       llvm_shared_ptr_ty,  // src_smem_ptr
       llvm_i32_ty],        // copy_size
      [IntrConvergent, IntrArgMemOnly,
       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>,
       NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
       NoCapture<ArgIndex<2>>]>;

// From Shared CTA to Global memory
def int_nvvm_cp_async_bulk_shared_cta_to_global
  : DefaultAttrsIntrinsic<[],
      [llvm_global_ptr_ty,  // dst_gmem_ptr
       llvm_shared_ptr_ty,  // src_smem_ptr
       llvm_i32_ty,         // copy_size
       llvm_i64_ty,         // cache_hint
       llvm_i1_ty],         // Flag for cache_hint
      [IntrConvergent, IntrArgMemOnly,
       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
       NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
       ImmArg<ArgIndex<4>>]>;

} // let TargetPrefix = "nvvm"

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp

Lines changed: 77 additions & 0 deletions
@@ -3024,13 +3024,90 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorReduceCommon(SDNode *N,
  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}

void NVPTXDAGToDAGISel::SelectCpAsyncBulkS2G(SDNode *N) {
  // We have {Chain, Intrinsic-ID} followed by the actual intrinsic args:
  // dst, src, size, cache_hint, cache_hint_flag
  // NumOperands = {Chain, IID} + {Actual intrinsic args}
  //             = {2} + {5}
  size_t NumOps = N->getNumOperands();
  bool IsCacheHint = N->getConstantOperandVal(NumOps - 1) == 1;
  size_t NumArgs = IsCacheHint ? 4 : 3; // dst, src, size, cache_hint

  SDLoc DL(N);
  SmallVector<SDValue, 8> Ops(N->ops().slice(2, NumArgs));
  Ops.push_back(N->getOperand(0)); // Chain operand

  unsigned Opcode;
  bool IsShared32 =
      CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
  if (IsCacheHint) {
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_S2G_SHARED32_CH
                        : NVPTX::CP_ASYNC_BULK_S2G_CH;
  } else {
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_S2G_SHARED32
                        : NVPTX::CP_ASYNC_BULK_S2G;
  }
  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}

void NVPTXDAGToDAGISel::SelectCpAsyncBulkG2S(SDNode *N) {
  // We have {Chain, Intrinsic-ID} followed by the actual intrinsic args:
  // {dst, mbar, src, size, multicast, cache_hint,
  //  multicast_flag, cache_hint_flag}
  // NumOperands = {Chain, IID} + {Actual intrinsic args}
  //             = {2} + {8}
  size_t NumOps = N->getNumOperands();
  bool IsCacheHint = N->getConstantOperandVal(NumOps - 1) == 1;
  bool IsMultiCast = N->getConstantOperandVal(NumOps - 2) == 1;
  size_t NumBaseArgs = 4;                // dst, mbar, src, size
  size_t MultiCastIdx = NumBaseArgs + 2; // for Chain and IID

  SDLoc DL(N);
  SmallVector<SDValue, 8> Ops(N->ops().slice(2, NumBaseArgs));

  // Push MultiCast operand, if available
  if (IsMultiCast)
    Ops.push_back(N->getOperand(MultiCastIdx));

  // Push CacheHint operand, if available
  if (IsCacheHint)
    Ops.push_back(N->getOperand(MultiCastIdx + 1));

  // Finally, the chain operand
  Ops.push_back(N->getOperand(0));

  unsigned Opcode;
  bool IsShared32 =
      CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
  if (IsMultiCast && IsCacheHint) {
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32_MC_CH
                        : NVPTX::CP_ASYNC_BULK_G2S_MC_CH;
  } else if (IsMultiCast) {
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32_MC
                        : NVPTX::CP_ASYNC_BULK_G2S_MC;
  } else if (IsCacheHint) {
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32_CH
                        : NVPTX::CP_ASYNC_BULK_G2S_CH;
  } else {
    Opcode = IsShared32 ? NVPTX::CP_ASYNC_BULK_G2S_SHARED32
                        : NVPTX::CP_ASYNC_BULK_G2S;
  }
  ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}

bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
  unsigned IID = N->getConstantOperandVal(1);
  using TMARedTy = llvm::nvvm::TMAReductionOp;
  auto CastTy = [](TMARedTy Op) { return static_cast<unsigned>(Op); };
  switch (IID) {
  default:
    return false;
  case Intrinsic::nvvm_cp_async_bulk_global_to_shared_cluster:
    SelectCpAsyncBulkG2S(N);
    return true;
  case Intrinsic::nvvm_cp_async_bulk_shared_cta_to_global:
    SelectCpAsyncBulkS2G(N);
    return true;
  case Intrinsic::nvvm_cp_async_bulk_tensor_s2g_tile_1d:
  case Intrinsic::nvvm_cp_async_bulk_tensor_s2g_tile_2d:
  case Intrinsic::nvvm_cp_async_bulk_tensor_s2g_tile_3d:

llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h

Lines changed: 2 additions & 0 deletions
@@ -90,6 +90,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
  bool tryEXTRACT_VECTOR_ELEMENT(SDNode *N);
  void SelectV2I64toI128(SDNode *N);
  void SelectI128toV2I64(SDNode *N);
  void SelectCpAsyncBulkG2S(SDNode *N);
  void SelectCpAsyncBulkS2G(SDNode *N);
  void SelectCpAsyncBulkTensorG2SCommon(SDNode *N, bool IsIm2Col = false);
  void SelectCpAsyncBulkTensorS2GCommon(SDNode *N, bool IsIm2Col = false);
  void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

Lines changed: 65 additions & 3 deletions
@@ -498,9 +498,71 @@ def CP_ASYNC_BULK_WAIT_GROUP_READ :
    [(int_nvvm_cp_async_bulk_wait_group_read timm:$n)]>,
  Requires<[hasPTX<80>, hasSM<90>]>;

//------------------------------
// TMA Async Bulk Copy Functions
//------------------------------

class CpAsyncBulkStr<bit mc, bit ch> {
  // Shared to Global memory
  string S2G = "cp.async.bulk.global.shared::cta.bulk_group"
               # !if(ch, ".L2::cache_hint", "");

  // Global to Shared cluster memory
  string G2S = "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
               # !if(mc, ".multicast::cluster", "")
               # !if(ch, ".L2::cache_hint", "");

  // Shared CTA to Cluster memory
  string C2C = "cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes";
}

multiclass CP_ASYNC_BULK_S2G<NVPTXRegClass rc> {
  def NAME: NVPTXInst<(outs),
            (ins Int64Regs:$dst, rc:$src, Int32Regs:$size),
            !strconcat(CpAsyncBulkStr<0, 0>.S2G, " [$dst], [$src], $size;"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _CH: NVPTXInst<(outs),
            (ins Int64Regs:$dst, rc:$src, Int32Regs:$size, Int64Regs:$ch),
            !strconcat(CpAsyncBulkStr<0, 1>.S2G, " [$dst], [$src], $size, $ch;"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
}
defm CP_ASYNC_BULK_S2G : CP_ASYNC_BULK_S2G<Int64Regs>;
defm CP_ASYNC_BULK_S2G_SHARED32 : CP_ASYNC_BULK_S2G<Int32Regs>;

multiclass CP_ASYNC_BULK_G2S<NVPTXRegClass rc> {
  def NAME: NVPTXInst<(outs),
            (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size),
            !strconcat(CpAsyncBulkStr<0, 0>.G2S, " [$dst], [$src], $size, [$mbar];"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _MC: NVPTXInst<(outs),
            (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size, Int16Regs:$mc),
            !strconcat(CpAsyncBulkStr<1, 0>.G2S, " [$dst], [$src], $size, [$mbar], $mc;"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _CH: NVPTXInst<(outs),
            (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size, Int64Regs:$ch),
            !strconcat(CpAsyncBulkStr<0, 1>.G2S, " [$dst], [$src], $size, [$mbar], $ch;"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
  def NAME # _MC_CH: NVPTXInst<(outs),
            (ins rc:$dst, rc:$mbar, Int64Regs:$src, Int32Regs:$size, Int16Regs:$mc, Int64Regs:$ch),
            !strconcat(CpAsyncBulkStr<1, 1>.G2S, " [$dst], [$src], $size, [$mbar], $mc, $ch;"), []>,
            Requires<[hasPTX<80>, hasSM<90>]>;
}
defm CP_ASYNC_BULK_G2S : CP_ASYNC_BULK_G2S<Int64Regs>;
defm CP_ASYNC_BULK_G2S_SHARED32 : CP_ASYNC_BULK_G2S<Int32Regs>;

multiclass CP_ASYNC_BULK_CTA_TO_CLUSTER<NVPTXRegClass rc> {
  def NAME: NVPTXInst<(outs),
            (ins rc:$dst, rc:$mbar, rc:$src, Int32Regs:$size),
            !strconcat(CpAsyncBulkStr<0, 0>.C2C, " [$dst], [$src], $size, [$mbar];"),
            [(int_nvvm_cp_async_bulk_shared_cta_to_cluster rc:$dst, rc:$mbar, rc:$src, Int32Regs:$size)]>,
            Requires<[hasPTX<80>, hasSM<90>]>;
}
defm CP_ASYNC_BULK_CTA_TO_CLUSTER : CP_ASYNC_BULK_CTA_TO_CLUSTER<Int64Regs>;
defm CP_ASYNC_BULK_CTA_TO_CLUSTER_SHARED32 : CP_ASYNC_BULK_CTA_TO_CLUSTER<Int32Regs>;

//-------------------------------------
// TMA Async Bulk Tensor Copy Functions
//-------------------------------------

// From Global to Shared memory (G2S)
class G2S_STRINGS<int dim, string mode, bit mc, bit ch, bit is_shared32 = 0> {
