[mlir] im2col & l2cache on `cp.async.bulk.tensor.shared.cluster.global` #72967

Merged Nov 22, 2023 · 3 commits
66 changes: 45 additions & 21 deletions mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -1404,20 +1404,34 @@ def NVVM_CpAsyncBulkTensorGlobalToSharedClusterOp :
AttrSizedOperandSegments]>,
Arguments<(ins LLVM_PointerShared:$dstMem,
LLVM_AnyPointer:$tmaDescriptor,
LLVM_PointerShared:$mbar,
Optional<I16>:$multicastMask,
Variadic<I32>:$coordinates,
LLVM_PointerShared:$mbar,
Variadic<I16>:$im2colOffsets,
Contributor:
I thought this 'im2colOffsets' would be an Optional operand too? No?

Member Author:
Yes, you are right, `im2colOffsets` is optional. We use `Variadic<I16>`, which can be 0-sized.

For example, the following Op is valid and executes in tiled mode:

nvvm.cp.async.bulk.tensor.shared.cluster.global %0, %1, %2, 
                box[%d0,%d1,%d2] : !llvm.ptr<3>, !llvm.ptr

im2col mode is activated when offsets are present, as below:

nvvm.cp.async.bulk.tensor.shared.cluster.global %0, %1, %2, 
                box[%d0,%d1,%d2] im2col[%off0] : !llvm.ptr<3>, !llvm.ptr

Contributor:
Looking for your thoughts on this one. Otherwise, it looks good to me.

Contributor:
OK, I was thinking we may need to provide an empty-vector-like construct to take care of the 0-sized im2col case.
Not needing to provide that is even better!

+1. This change looks good to me.

Optional<I16>:$multicastMask,
Optional<I64>:$l2CacheHint,
PtxPredicate:$predicate)> {
let description = [{
Initiates an asynchronous copy operation on the tensor data from global
memory to shared memory.

The Op has two load modes:
1) Tiled Mode: It's the default mode. The source multi-dimensional tensor
layout is preserved at the destination.

2) Im2col Mode: This mode is used when `im2colOffsets` operands are present.
The elements in the Bounding Box of the source tensor are rearranged into
columns at the destination. In this mode, the tensor has to be at least
3-dimensional.

The `multicastMask` operand is optional. When it is present, the Op copies
data from global memory to the shared memory of multiple CTAs in the cluster.
Operand `multicastMask` specifies the destination CTAs in the cluster such
that each bit position in the 16-bit `multicastMask` operand corresponds to
the `nvvm.read.ptx.sreg.ctaid` of the destination CTA. For example, a mask
value of 0x5 selects the CTAs whose ctaid is 0 or 2.

The `l2CacheHint` operand is optional. It specifies the cache eviction
policy that may be applied during the memory access.

[For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor)
}];
@@ -1426,32 +1440,42 @@ def NVVM_CpAsyncBulkTensorGlobalToSharedClusterOp :
$dstMem `,`
$tmaDescriptor `,`
$mbar `,`
( `multicast_mask` `=` $multicastMask^ `,` )?
`box` `[`$coordinates `]`
(`,` `predicate` `=` $predicate^)?
attr-dict `:` type(operands)
`box` `[`$coordinates `]`
(`im2col` `[` $im2colOffsets^ `]` )?
(`multicast_mask` `=` $multicastMask^ )?
(`l2_cache_hint` `=` $l2CacheHint^ )?
(`predicate` `=` $predicate^)?
attr-dict `:` type($dstMem) `,` type($tmaDescriptor)
}];
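
As a quick sketch of the new format (hypothetical SSA names; operand types
follow the Arguments list above), an op exercising the im2col, multicast,
and cache-hint clauses together would be written as:

nvvm.cp.async.bulk.tensor.shared.cluster.global %dst, %tmaDesc, %mbar,
                box[%d0,%d1,%d2,%d3] im2col[%off0,%off1]
                multicast_mask = %mask l2_cache_hint = %hint
                : !llvm.ptr<3>, !llvm.ptr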

let extraClassDefinition = [{
std::string $cppClass::getPtx() {
int im2colDim = getIm2colOffsets().size();
int dim = getCoordinates().size();
std::string ptx = "cp.async.bulk.tensor.";
ptx += std::to_string(dim) + "d.";
ptx += "shared::cluster.global.mbarrier::complete_tx::bytes";
if(getMulticastMask()) {
ptx += ".multicast::cluster";
if(dim == 1) ptx += " [%0], [%1, {%4} ], [%2], %3;";
if(dim == 2) ptx += " [%0], [%1, {%4, %5} ], [%2], %3;";
if(dim == 3) ptx += " [%0], [%1, {%4, %5, %6} ], [%2], %3;";
if(dim == 4) ptx += " [%0], [%1, {%4, %5, %6, %7} ], [%2], %3;";
if(dim == 5) ptx += " [%0], [%1, {%4, %5, %6, %7, %8} ], [%2], %3;";
} else {
if(dim == 1) ptx += " [%0], [%1, {%3} ], [%2];";
if(dim == 2) ptx += " [%0], [%1, {%3, %4} ], [%2];";
if(dim == 3) ptx += " [%0], [%1, {%3, %4, %5} ], [%2];";
if(dim == 4) ptx += " [%0], [%1, {%3, %4, %5, %6} ], [%2];";
if(dim == 5) ptx += " [%0], [%1, {%3, %4, %5, %6, %7} ], [%2];";
ptx += "shared::cluster.global.mbarrier::complete_tx::bytes";
if(im2colDim) ptx += ".im2col";
if(getMulticastMask()) ptx += ".multicast::cluster";
if(getL2CacheHint()) ptx += ".L2::cache_hint";

auto preg = [](int r) { return "%" + std::to_string(r); };

// Build Registers
ptx += " [%0], [%1, {";
int r = 2;
for(int i = 0; i < dim; i++) ptx += preg(r+i) + ",";
ptx.pop_back(); r += dim;
ptx += "} ], [%" + std::to_string(r++) + "]";
if(im2colDim) {
ptx += ",{";
for(int i = 0; i < im2colDim; i++) ptx += preg(r+i) + ",";
ptx.pop_back(); r += im2colDim;
ptx += "}";
}
if(getMulticastMask()) ptx += ", " + preg(r++);
if(getL2CacheHint()) ptx += ", " + preg(r++);
ptx += ";";
return ptx;
}
}];
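
To see the register numbering at work, consider a hypothetical 3-D load with
one im2col offset and an L2 cache hint but no multicast (SSA names made up):

nvvm.cp.async.bulk.tensor.shared.cluster.global %dst, %tmaDesc, %mbar,
                box[%d0,%d1,%d2] im2col[%off0] l2_cache_hint = %hint
                : !llvm.ptr<3>, !llvm.ptr

Here dim = 3 and im2colDim = 1, so the PTX operands are numbered
%0 = dstMem, %1 = tmaDescriptor, %2-%4 = coordinates, %5 = mbar,
%6 = im2col offset, %7 = cache hint, and getPtx() returns:

cp.async.bulk.tensor.3d.shared::cluster.global.mbarrier::complete_tx::bytes.im2col.L2::cache_hint [%0], [%1, {%2,%3,%4} ], [%5],{%6}, %7;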
5 changes: 2 additions & 3 deletions mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
@@ -973,10 +973,9 @@ struct NVGPUTmaAsyncLoadOpLowering
for (auto [index, value] : llvm::enumerate(coords)) {
coords[index] = truncToI32(b, value);
}

rewriter.replaceOpWithNewOp<NVVM::CpAsyncBulkTensorGlobalToSharedClusterOp>(
op, dest, adaptor.getTensorMapDescriptor(), barrier, Value(), coords,
adaptor.getPredicate());
op, dest, adaptor.getTensorMapDescriptor(), coords, barrier,
ValueRange{}, Value{}, Value{}, adaptor.getPredicate());
return success();
}
};
14 changes: 12 additions & 2 deletions mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
@@ -76,8 +76,18 @@ ParseResult VoteBallotOp::parse(OpAsmParser &parser, OperationState &result) {
void VoteBallotOp::print(OpAsmPrinter &p) { printNVVMIntrinsicOp(p, *this); }

LogicalResult CpAsyncBulkTensorGlobalToSharedClusterOp::verify() {
if (getCoordinates().size() > 5)
return emitError("Maximum 5 coordinates and dimension is supported.");
if (getCoordinates().empty() || getCoordinates().size() > 5)
return emitError("expects coordinates between 1 to 5 dimension");

// Check for im2col mode
if (!getIm2colOffsets().empty()) {
if (getCoordinates().size() < 3)
return emitError(
"to use im2col mode, the tensor has to be at least 3-dimensional");
if (getCoordinates().size() != (getIm2colOffsets().size() + 2))
return emitError(
"im2col offsets must be 2 less than number of coordinates");
}
return success();
}
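
As a negative example (a hypothetical sketch with made-up SSA names), the
verifier above rejects a 2-D load that supplies im2col offsets, since im2col
mode requires at least a 3-dimensional tensor:

nvvm.cp.async.bulk.tensor.shared.cluster.global %dst, %tmaDesc, %mbar,
                box[%d0,%d1] im2col[%off0] : !llvm.ptr<3>, !llvm.ptr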

24 changes: 12 additions & 12 deletions mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
@@ -653,15 +653,15 @@ func.func @async_tma_load(%tensorMap1d: !tensorMap1d, %tensorMap2d: !tensorMap2d
%c0 = arith.constant 0 : index
%crd0 = arith.constant 0 : index
%crd1 = arith.constant 0 : index
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}]
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}]
nvgpu.tma.async.load %tensorMap1d[%crd0], %mbarrier[%c0] to %buffer1d : !tensorMap1d, !mbarrier -> memref<128xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}]
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}]
nvgpu.tma.async.load %tensorMap2d[%crd0, %crd1], %mbarrier[%c0] to %buffer2d : !tensorMap2d, !mbarrier -> memref<32x32xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}, %{{.*}}]
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}]
nvgpu.tma.async.load %tensorMap3d[%crd0, %crd1, %crd0], %mbarrier[%c0] to %buffer3d : !tensorMap3d, !mbarrier -> memref<2x32x32xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}]
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}]
nvgpu.tma.async.load %tensorMap4d[%crd0, %crd1, %crd1, %crd0], %mbarrier[%c0] to %buffer4d : !tensorMap4d, !mbarrier -> memref<2x2x32x32xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}]
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}]
nvgpu.tma.async.load %tensorMap5d[%crd0, %crd1, %crd1, %crd0, %crd0], %mbarrier[%c0] to %buffer5d : !tensorMap5d, !mbarrier -> memref<2x2x2x32x32xf32,3>
func.return
}
@@ -678,15 +678,15 @@ func.func @async_tma_load_pred(%tensorMap1d: !tensorMap1d, %tensorMap2d: !tensor
%c0 = arith.constant 0 : index
%crd0 = arith.constant 0 : index
%crd1 = arith.constant 0 : index
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}], predicate = %{{.*}}
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}] predicate = %{{.*}}
nvgpu.tma.async.load %tensorMap1d[%crd0], %mbarrier[%c0] to %buffer1d, predicate = %p : !tensorMap1d, !mbarrier -> memref<128xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}], predicate = %{{.*}}
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}] predicate = %{{.*}}
nvgpu.tma.async.load %tensorMap2d[%crd0, %crd1], %mbarrier[%c0] to %buffer2d, predicate = %p : !tensorMap2d, !mbarrier -> memref<32x32xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}, %{{.*}}], predicate = %{{.*}}
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}] predicate = %{{.*}}
nvgpu.tma.async.load %tensorMap3d[%crd0, %crd1, %crd0], %mbarrier[%c0] to %buffer3d, predicate = %p : !tensorMap3d, !mbarrier -> memref<2x32x32xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}], predicate = %{{.*}}
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}] predicate = %{{.*}}
nvgpu.tma.async.load %tensorMap4d[%crd0, %crd1, %crd1, %crd0], %mbarrier[%c0] to %buffer4d, predicate = %p : !tensorMap4d, !mbarrier -> memref<2x2x32x32xf32,3>
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}], predicate = %{{.*}}
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}] predicate = %{{.*}}
nvgpu.tma.async.load %tensorMap5d[%crd0, %crd1, %crd1, %crd0, %crd0], %mbarrier[%c0] to %buffer5d, predicate = %p : !tensorMap5d, !mbarrier -> memref<2x2x2x32x32xf32,3>
func.return
}
@@ -737,8 +737,8 @@ module @mymodule {
nvgpu.tma.async.load %lhsTensorMap[%c0, %c0], %mbarrier[%c0] to %lhsShmem : !lhsTensorMap, !barrierType -> !shmemlhs
// CHECK: %[[desc:.+]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<2 x i64>, array<2 x i64>)>
// CHECK: %[[c8192:.+]] = llvm.mlir.constant(8192 : index) : i64
// CHECK: %[[shmemOfset:.+]] = llvm.getelementptr %[[desc]][%[[c8192]]] : (!llvm.ptr<3>, i64) -> !llvm.ptr<3>, f16
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %[[shmemOfset]], %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}] : !llvm.ptr<3>, !llvm.ptr, !llvm.ptr<3>, i32, i32
// CHECK: %[[shmemOfset:.+]] = llvm.getelementptr %[[desc]][%[[c8192]]] : (!llvm.ptr<3>, i64)
// CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %[[shmemOfset]], %{{.*}}, %{{.*}}, box[%{{.*}}, %{{.*}}]
nvgpu.tma.async.load %rhsTensorMap[%c0, %c0], %mbarrier[%c0] to %rhsShmem : !rhsTensorMap, !barrierType -> !shmemrhs
return
}