[mlir][nvgpu] Mark TMA descriptor as MemWriteAt in tma.async.store
#79427
Conversation
The canonicalizer finds the `nvgpu.tma.async.store` Op trivially dead because it lacks any memory side effects. This PR addresses the issue by adding a `MemWriteAt` effect to the TMA descriptor operand.

The Op copies data from shared memory to global memory, but it does so asynchronously, so the fix might not be optimal, because the memory is not mutated right away. The asynchronous behavior is controlled by the two NVVM Ops below:
- `nvvm.cp.async.bulk.commit.group`: groups all the `nvgpu.tma.async.store` Ops together and commits the group.
- `nvvm.cp.async.bulk.wait_group 1`: waits for the completion of the 1st group.

Here's a simplified representation of the code:
```
gpu.func ... {
  // Write something to shared memory
  %shmem = ...

  // Perform asynchronous store from shared memory to global memory
  nvgpu.tma.async.store %shmem to %arg0[%c0, %c0], predicate = %1
      : memref<128x32xf32, #gpu.address_space<workgroup>>
      -> <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none,
          oob = zero, interleave = none>

  // Control asynchronous execution
  nvvm.cp.async.bulk.commit.group
  nvvm.cp.async.bulk.wait_group 1
}
```
@llvm/pr-subscribers-mlir-gpu @llvm/pr-subscribers-mlir-nvgpu

Author: Guray Ozen (grypp)
Full diff: https://github.com/llvm/llvm-project/pull/79427.diff
2 Files Affected:
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
index 239a5f1e2bc2985..a0c0d4cfd8714ba 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
@@ -671,7 +671,7 @@ def NVGPU_TmaAsyncStoreOp : NVGPU_Op<"tma.async.store", [AttrSizedOperandSegment
tile shape. The descriptor is created by `nvgpu.tma.create.descriptor`
}];
let arguments = (ins Arg<AnyMemRef, "", [MemReadAt<0, FullEffect>]>:$src,
- NVGPU_TensorMapDescriptor:$tensorMapDescriptor,
+ Arg<NVGPU_TensorMapDescriptor, "", [MemWriteAt<0, FullEffect>]>:$tensorMapDescriptor,
Variadic<Index>:$coordinates,
Optional<I1>:$predicate);
let assemblyFormat = [{
diff --git a/mlir/test/Dialect/NVGPU/canonicalization.mlir b/mlir/test/Dialect/NVGPU/canonicalization.mlir
new file mode 100644
index 000000000000000..a7fbfd80673957c
--- /dev/null
+++ b/mlir/test/Dialect/NVGPU/canonicalization.mlir
@@ -0,0 +1,30 @@
+// RUN: mlir-opt %s | mlir-opt -canonicalize -cse | FileCheck %s
+
+gpu.module @main_kernel {
+
+// CHECK-LABEL: @main_kernel(
+// CHECK-SAME: %[[arg0:.*]]: !nvgpu.tensormap.descriptor
+ gpu.func @main_kernel(%arg0: !nvgpu.tensormap.descriptor<
+ tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none,
+ oob = zero, interleave = none>) kernel attributes
+ { gpu.known_block_size = array<i32: 128, 1, 1>,
+ gpu.known_grid_size = array<i32: 1, 1, 1>
+ }
+ {
+ // CHECK: %[[c0:.+]] = arith.constant 0 : index
+ // CHECK: %[[S0:.+]] = gpu.thread_id x
+ // CHECK: %[[S1:.+]] = arith.cmpi eq, %[[S0]], %[[c0]] : index
+ // CHECK: %[[S2:.+]] = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
+ // CHECK: %[[S3:.+]] = memref.view %[[S2]][%[[c0]]][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<128x32xf32, #gpu.address_space<workgroup>>
+ // CHECK: nvgpu.tma.async.store %[[S3]] to %[[arg0]][%[[c0]], %[[c0]]], predicate = %[[S1]] : memref<128x32xf32, #gpu.address_space<workgroup>> -> <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>
+ %c0 = arith.constant 0 : index
+ %0 = gpu.thread_id x
+ %1 = arith.cmpi eq, %0, %c0 : index
+ %2 = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
+ %view = memref.view %2[%c0][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<128x32xf32, #gpu.address_space<workgroup>>
+ nvgpu.tma.async.store %view to %arg0[%c0, %c0], predicate = %1 : memref<128x32xf32, #gpu.address_space<workgroup>> -> <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>
+ nvvm.cp.async.bulk.commit.group
+ nvvm.cp.async.bulk.wait_group 0
+ gpu.return
+ }
+}
\ No newline at end of file
I have some questions about the TMA descriptor type: in Cuda, what does the descriptor map to? I am afraid that MLIR is sometimes using it inconsistently with the `CUtensorMap` model.
It maps to a pointer to a CUtensorMap object. But this is a great question. I was initially unsure whether passing the object or the pointer is better, so I ran experiments to find the fastest way to pass the descriptor to the kernel.
This is the 2nd way: `nvgpu.tma.create.descriptor` builds the `CUtensorMap` on the host and copies it to device memory, so the kernel receives a device pointer to it. I believe I have implemented the functionality as described.
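For context, here is a minimal CUDA sketch (not from the PR or this thread) of the device-pointer scheme described above: the `CUtensorMap` is built on the host, copied into device global memory, and the kernel only receives a pointer. The kernel name, launch configuration, and the zero-initialized descriptor are placeholders; a real descriptor would be filled by `cuTensorMapEncodeTiled`, and actual TMA use assumes a CUDA 12+ toolkit and an sm_90 device.

```
// Sketch: pass the TMA descriptor to the kernel as a device pointer.
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void useDescriptor(const CUtensorMap *desc) {
  // A real kernel would hand `desc` to the TMA (cp.async.bulk.tensor) path.
  (void)desc;
}

int main() {
  CUtensorMap hostDesc = {};   // placeholder; normally filled by cuTensorMapEncodeTiled
  CUtensorMap *deviceDesc = nullptr;

  // The implicit host-to-device copy that, per the discussion above,
  // nvgpu.tma.create.descriptor currently performs.
  cudaMalloc(reinterpret_cast<void **>(&deviceDesc), sizeof(CUtensorMap));
  cudaMemcpy(deviceDesc, &hostDesc, sizeof(CUtensorMap), cudaMemcpyHostToDevice);

  useDescriptor<<<1, 128>>>(deviceDesc);
  cudaDeviceSynchronize();

  cudaFree(deviceDesc);        // without this, the descriptor copy leaks
  return 0;
}
```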
@joker-eph let me know if you have further questions about the TMA descriptor. Also, what do you think about this PR? TMA store is not usable currently, as the op is found trivially dead.
Seems to me that this document is misleading: https://github.com/llvm/llvm-project/blob/b40d5b1b08564d23d5e0769892ebbc32447b2987/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td#L689C1-L691C1
Taken as-is, it would mean this initializes a host-side struct. Can you follow up with another PR to document the type as a device pointer, and check all these ops as well? Thanks :)
Also: the model of having `nvgpu.tma.create.descriptor` do both the creation of the descriptor and the memcpy to the device will prevent us from adopting the grid-constant method, so we won't be able to take advantage of the perf gain.
I totally agree with that. I implemented it this way because we don't have grid_constant support; let me come up with a follow-up PR. Also, the implicit memcpy could leak anyway if you don't call cudaFree. Thanks for bringing this to my attention.
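For comparison, a minimal sketch (again not from the PR) of the grid-constant method mentioned above: the descriptor is passed by value as a `__grid_constant__` kernel parameter, so no device allocation, memcpy, or cudaFree is needed. The kernel name and launch configuration are placeholders, the zero-initialized descriptor stands in for a real `cuTensorMapEncodeTiled` result, and a recent CUDA toolkit plus an sm_90 target are assumed.

```
// Sketch: pass the TMA descriptor by value as a grid-constant parameter.
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void useDescriptor(const __grid_constant__ CUtensorMap desc) {
  // TMA instructions take the address of the grid-constant parameter.
  (void)&desc;
}

int main() {
  CUtensorMap hostDesc = {};            // placeholder; normally filled by cuTensorMapEncodeTiled
  useDescriptor<<<1, 128>>>(hostDesc);  // passed by value: no cudaMalloc/cudaMemcpy/cudaFree
  cudaDeviceSynchronize();
  return 0;
}
```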