[NVPTX] Add intrinsics for st.bulk instruction #128856

Merged
32 changes: 32 additions & 0 deletions llvm/docs/NVPTXUsage.rst
@@ -1407,6 +1407,38 @@ The last argument `i1 %unpack` is a compile-time constant which when set, indica
For more information, refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-st>`__.

Store Intrinsics
----------------

'``llvm.nvvm.st.bulk.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.st.bulk(ptr addrspace(1) %dst, i64 %size, i64 immarg %initval)
declare void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3) %dst, i64 %size, i64 immarg %initval)

Overview:
"""""""""

The '``@llvm.nvvm.st.bulk.*``' intrinsics initialize a region of shared memory
starting from the location specified by the destination address operand `%dst`.

The integer operand `%size` specifies the amount of memory to be initialized,
in bytes, and must be a multiple of 8; otherwise, the behavior is undefined.

The integer immediate operand `%initval` specifies the initialization value for
the memory locations. The only numeric value allowed is 0.

The ``@llvm.nvvm.st.bulk.shared.cta`` and ``@llvm.nvvm.st.bulk`` intrinsics are
similar, except that the latter uses generic addressing (see `Generic Addressing <https://docs.nvidia.com/cuda/parallel-thread-execution/#generic-addressing>`__).

For more information, refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-st-bulk>`__.
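
As a minimal illustration (a hypothetical example, not part of this patch; the
``@buf`` and ``@zero_fill_shared`` names are made up), the following IR
zero-fills a 256-byte statically allocated CTA-shared buffer:

.. code-block:: llvm

  declare void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3), i64, i64)

  @buf = internal addrspace(3) global [256 x i8] undef, align 8

  define void @zero_fill_shared() {
    ; %size is 256 bytes (a multiple of 8); %initval must be the immediate 0.
    call void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3) @buf, i64 256, i64 0)
    ret void
  }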

Other Intrinsics
----------------

14 changes: 14 additions & 0 deletions llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5261,4 +5261,18 @@ foreach shape = ["16x64b", "16x128b", "16x256b", "32x32b", "16x32bx2"] in {
}
}

//
// Bulk store intrinsics
//
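// Both intrinsics write memory only through their destination pointer
// (IntrArgMemOnly, IntrWriteMem, WriteOnly/NoCapture on argument 0), and the
// initialization value (argument 2) must be a compile-time constant (ImmArg).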

def int_nvvm_st_bulk: DefaultAttrsIntrinsic<[],
[llvm_global_ptr_ty, llvm_i64_ty, llvm_i64_ty],
[IntrArgMemOnly, IntrWriteMem,
WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>]>;

def int_nvvm_st_bulk_shared_cta : DefaultAttrsIntrinsic<[],
[llvm_shared_ptr_ty, llvm_i64_ty, llvm_i64_ty],
[IntrArgMemOnly, IntrWriteMem,
WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>]>;

} // let TargetPrefix = "nvvm"
14 changes: 14 additions & 0 deletions llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -7766,3 +7766,17 @@ foreach shape = ["16x64b", "16x128b", "16x256b", "32x32b", "16x32bx2"] in {
}

} // isConvergent

// Bulk store instructions
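// The patterns below only match calls whose initval immediate is 0 (the only
// value PTX allows); both instructions require sm_100 and PTX ISA 8.6.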

def INT_NVVM_ST_BULK_GENERIC :
NVPTXInst<(outs), (ins ADDR:$dest_addr, Int64Regs:$size),
"st.bulk [$dest_addr], $size, 0;",
[(int_nvvm_st_bulk addr:$dest_addr, i64:$size, (i64 0))]>,
Requires<[hasSM<100>, hasPTX<86>]>;

def INT_NVVM_ST_BULK_SHARED_CTA:
NVPTXInst<(outs), (ins ADDR:$dest_addr, Int64Regs:$size),
"st.bulk.shared::cta [$dest_addr], $size, 0;",
[(int_nvvm_st_bulk_shared_cta addr:$dest_addr, i64:$size, (i64 0))]>,
Requires<[hasSM<100>, hasPTX<86>]>;
46 changes: 46 additions & 0 deletions llvm/test/CodeGen/NVPTX/st_bulk.ll
@@ -0,0 +1,46 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 | FileCheck --check-prefixes=CHECK,CHECK-PTX64 %s
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 --nvptx-short-ptr | FileCheck --check-prefixes=CHECK,CHECK-PTX-SHARED32 %s
; RUN: %if ptxas-12.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 | %ptxas-verify -arch=sm_100 %}
; RUN: %if ptxas-12.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_100 -mattr=+ptx86 --nvptx-short-ptr | %ptxas-verify -arch=sm_100 %}

declare void @llvm.nvvm.st.bulk(ptr addrspace(1), i64, i64)
define void @st_bulk(ptr addrspace(1) %dest_addr, i64 %size) {
; CHECK-LABEL: st_bulk(
; CHECK: {
; CHECK-NEXT: .reg .b64 %rd<3>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: ld.param.u64 %rd1, [st_bulk_param_0];
; CHECK-NEXT: ld.param.u64 %rd2, [st_bulk_param_1];
; CHECK-NEXT: st.bulk [%rd1], %rd2, 0;
; CHECK-NEXT: ret;
call void @llvm.nvvm.st.bulk(ptr addrspace(1) %dest_addr, i64 %size, i64 0)
ret void
}

declare void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3), i64, i64)
define void @st_bulk_shared_cta(ptr addrspace(3) %dest_addr, i64 %size) {
; CHECK-PTX64-LABEL: st_bulk_shared_cta(
; CHECK-PTX64: {
; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
; CHECK-PTX64-EMPTY:
; CHECK-PTX64-NEXT: // %bb.0:
; CHECK-PTX64-NEXT: ld.param.u64 %rd1, [st_bulk_shared_cta_param_0];
; CHECK-PTX64-NEXT: ld.param.u64 %rd2, [st_bulk_shared_cta_param_1];
; CHECK-PTX64-NEXT: st.bulk.shared::cta [%rd1], %rd2, 0;
; CHECK-PTX64-NEXT: ret;
;
; CHECK-PTX-SHARED32-LABEL: st_bulk_shared_cta(
; CHECK-PTX-SHARED32: {
; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<2>;
; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
; CHECK-PTX-SHARED32-EMPTY:
; CHECK-PTX-SHARED32-NEXT: // %bb.0:
; CHECK-PTX-SHARED32-NEXT: ld.param.u32 %r1, [st_bulk_shared_cta_param_0];
; CHECK-PTX-SHARED32-NEXT: ld.param.u64 %rd1, [st_bulk_shared_cta_param_1];
; CHECK-PTX-SHARED32-NEXT: st.bulk.shared::cta [%r1], %rd1, 0;
; CHECK-PTX-SHARED32-NEXT: ret;
call void @llvm.nvvm.st.bulk.shared.cta(ptr addrspace(3) %dest_addr, i64 %size, i64 0)
ret void
}