[mlir][nvgpu] Mark TMA descriptor as MemWriteAt in tma.async.store #79427

Merged: merged 1 commit into llvm:main on Jan 30, 2024

Conversation

@grypp (Member) commented Jan 25, 2024

The canonicalizer finds the `nvgpu.tma.async.store` op trivially dead because it declares no observable memory side effects. This PR addresses the issue by adding a `MemWriteAt` effect to the TMA descriptor operand.

This op copies data from shared memory to global memory, but it does so asynchronously, so the fix might not be optimal: the memory is not mutated right away.

The asynchronous behavior is controlled by the two NVVM ops below:
- `nvvm.cp.async.bulk.commit.group`: groups all the preceding `nvgpu.tma.async.store` ops together and commits the group.
- `nvvm.cp.async.bulk.wait_group 1`: waits until at most one bulk async group remains pending.

Here's a simplified representation of the code:
```
gpu.func ...  {
  // Write something to shared memory
  %shmem = ...

  // Perform asynchronous store from shared memory to global memory
  nvgpu.tma.async.store %shmem to %arg0[%c0, %c0], predicate = %1
    : memref<128x32xf32, #gpu.address_space<workgroup>> ->
      <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>

  // Control asynchronous execution
  nvvm.cp.async.bulk.commit.group
  nvvm.cp.async.bulk.wait_group 1
}
```
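
For context, here is a minimal C++ sketch, not taken from this PR, of the check that makes the op disappear: MLIR's dead-code elimination erases an op whose results are unused and whose declared effects contain no writes or frees (a read effect alone does not keep an op alive); `mlir::isOpTriviallyDead` is the helper MLIR provides for that decision. Declaring `MemWriteAt` on the descriptor operand is what flips the answer for `nvgpu.tma.async.store`.

```
// Sketch only (illustration, not the canonicalizer's source): the decision
// that removed nvgpu.tma.async.store before this patch.
#include "mlir/IR/Operation.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

static bool wouldBeErasedByDce(mlir::Operation *op) {
  // True when all results are unused and the op's declared memory effects
  // contain no writes, frees, or unknown effects. The old op only declared
  // MemReadAt on %src, and a read alone does not keep an op alive, so the
  // store was erased. With MemWriteAt on the descriptor operand the op now
  // reports a write and is preserved.
  return mlir::isOpTriviallyDead(op);
}
```
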
@llvmbot (Member) commented Jan 25, 2024

@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-nvgpu

Author: Guray Ozen (grypp)

Changes

The canonicalizer finds the `nvgpu.tma.async.store` op trivially dead because it declares no observable memory side effects. This PR addresses the issue by adding a `MemWriteAt` effect to the TMA descriptor operand.

This op copies data from shared memory to global memory asynchronously, so the fix might not be optimal: the memory is not mutated right away.

The asynchronous behavior is controlled by the two NVVM ops below:
- `nvvm.cp.async.bulk.commit.group`: groups all the preceding `nvgpu.tma.async.store` ops together and commits the group.
- `nvvm.cp.async.bulk.wait_group 1`: waits until at most one bulk async group remains pending.

Here's a simplified representation of the code:

```
gpu.func ...  {
  // Write something to shared memory
  %shmem = ...

  // Perform asynchronous store from shared memory to global memory
  nvgpu.tma.async.store %shmem to %arg0[%c0, %c0], predicate = %1
    : memref<128x32xf32, #gpu.address_space<workgroup>> ->
      <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>

  // Control asynchronous execution
  nvvm.cp.async.bulk.commit.group
  nvvm.cp.async.bulk.wait_group 1
}
```

Full diff: https://github.com/llvm/llvm-project/pull/79427.diff

2 Files Affected:

  • (modified) mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td (+1-1)
  • (added) mlir/test/Dialect/NVGPU/canonicalization.mlir (+30)
```
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
index 239a5f1e2bc2985..a0c0d4cfd8714ba 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
@@ -671,7 +671,7 @@ def NVGPU_TmaAsyncStoreOp : NVGPU_Op<"tma.async.store", [AttrSizedOperandSegment
     tile shape. The descriptor is created by `nvgpu.tma.create.descriptor`
   }];  
   let arguments = (ins  Arg<AnyMemRef, "", [MemReadAt<0, FullEffect>]>:$src,
-                        NVGPU_TensorMapDescriptor:$tensorMapDescriptor,
+                        Arg<NVGPU_TensorMapDescriptor, "", [MemWriteAt<0, FullEffect>]>:$tensorMapDescriptor,
                         Variadic<Index>:$coordinates, 
                         Optional<I1>:$predicate);
   let assemblyFormat = [{
diff --git a/mlir/test/Dialect/NVGPU/canonicalization.mlir b/mlir/test/Dialect/NVGPU/canonicalization.mlir
new file mode 100644
index 000000000000000..a7fbfd80673957c
--- /dev/null
+++ b/mlir/test/Dialect/NVGPU/canonicalization.mlir
@@ -0,0 +1,30 @@
+// RUN: mlir-opt %s | mlir-opt -canonicalize -cse | FileCheck %s
+
+gpu.module @main_kernel {
+
+// CHECK-LABEL: @main_kernel(
+//  CHECK-SAME: %[[arg0:.*]]: !nvgpu.tensormap.descriptor
+  gpu.func @main_kernel(%arg0: !nvgpu.tensormap.descriptor<
+        tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, 
+        oob = zero, interleave = none>) kernel attributes 
+        { gpu.known_block_size = array<i32: 128, 1, 1>, 
+          gpu.known_grid_size = array<i32: 1, 1, 1>
+        } 
+  {
+    // CHECK: %[[c0:.+]] = arith.constant 0 : index 
+    // CHECK: %[[S0:.+]] = gpu.thread_id  x
+    // CHECK: %[[S1:.+]] = arith.cmpi eq, %[[S0]], %[[c0]] : index
+    // CHECK: %[[S2:.+]] = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
+    // CHECK: %[[S3:.+]] = memref.view %[[S2]][%[[c0]]][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<128x32xf32, #gpu.address_space<workgroup>>
+    // CHECK: nvgpu.tma.async.store %[[S3]] to %[[arg0]][%[[c0]], %[[c0]]], predicate = %[[S1]] : memref<128x32xf32, #gpu.address_space<workgroup>> -> <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>
+    %c0 = arith.constant 0 : index
+    %0 = gpu.thread_id  x
+    %1 = arith.cmpi eq, %0, %c0 : index
+    %2 = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
+    %view = memref.view %2[%c0][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<128x32xf32, #gpu.address_space<workgroup>>
+    nvgpu.tma.async.store %view to %arg0[%c0, %c0], predicate = %1 : memref<128x32xf32, #gpu.address_space<workgroup>> -> <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>
+    nvvm.cp.async.bulk.commit.group
+    nvvm.cp.async.bulk.wait_group 0
+    gpu.return
+  }
+}
\ No newline at end of file
```

@joker-eph (Collaborator)

I have some questions about the NVGPU_TensorMapDescriptor type in MLIR. Does it map to a CUtensorMap object, or to a pointer to a CUtensorMap object?

In CUDA, the CUtensorMap object is set up on the host and then copied to GPU memory. Inside the kernel, the PTX instructions use a CUtensorMap *.

I am afraid that MLIR is sometimes using NVGPU_TensorMapDescriptor as a CUtensorMap and other times as a CUtensorMap *.

@grypp (Member, Author) commented Jan 26, 2024

> I have some questions about the NVGPU_TensorMapDescriptor type in MLIR. Does it map to a CUtensorMap object, or to a pointer to a CUtensorMap object?

It maps to a pointer to a CUtensorMap object.

But this is a great question. I was initially unsure whether using the object or the pointer is better. I ran experiments to find the fastest way to pass a CUtensorMap to the device and found two approaches:

  1. Pass the CUtensorMap object as a kernel parameter (not as a pointer). I was told this requires marking the kernel parameter as __grid_constant__, which currently lacks support in LLVM.
  2. Copy the CUtensorMap object to device memory, pass its device pointer, and use the prefetch.tensor PTX instruction to make sure the descriptor is in the cache (a host-side sketch follows below).
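
For illustration only, a hedged host-side sketch of approach 2 using the CUDA driver API; the encode arguments are elided and `uploadTensorMap` is a hypothetical name, not the MLIR runtime wrapper.

```
// Hedged sketch of approach 2: build the CUtensorMap on the host, copy it to
// device memory, and hand the kernel the resulting device pointer.
// Assumes cuInit has been called and a CUDA context is current.
#include <cuda.h>

CUdeviceptr uploadTensorMap(/* tile/shape/swizzle parameters omitted */) {
  CUtensorMap hostDesc = {};
  // The real code must fill the opaque descriptor with cuTensorMapEncodeTiled;
  // its full argument list (data type, rank, global address, dims, strides,
  // box sizes, interleave, swizzle, L2 promotion, OOB fill) is omitted here.
  // cuTensorMapEncodeTiled(&hostDesc, ...);

  CUdeviceptr deviceDesc;
  cuMemAlloc(&deviceDesc, sizeof(CUtensorMap));             // descriptor copy lives in global memory
  cuMemcpyHtoD(deviceDesc, &hostDesc, sizeof(CUtensorMap)); // host -> device copy
  return deviceDesc;  // kernel-side PTX consumes this as a CUtensorMap*
}
```

The device pointer returned here plays the same role as `%tensorDesc` in the lowered IR further down.
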

> In CUDA, the CUtensorMap object is set up on the host and then copied to GPU memory. Inside the kernel, the PTX instructions use a CUtensorMap *.

This is the 2nd approach; I believe I have implemented the functionality as described. Let me explain the CUtensorMap creation and copy steps. Take the IR below: nvgpu.tma.create.descriptor creates the CUtensorMap object and memcpys it to the device. Let's run the convert-nvgpu-to-nvvm pass:

```
!mbarDesc = !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 1>
!tmaDesc = !nvgpu.tensormap.descriptor<tensor = memref<64x64xf16, 3>, swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>

func.func @main() {
  // ... (omitting some lines for brevity)

  // Call TMA descriptor and memcpy to the device 
  %2 = nvgpu.tma.create.descriptor %cast box[%c64, %c64] : memref<*xf16> -> !tmaDesc

  gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c64, %arg10 = %c1, %arg11 = %c1) dynamic_shared_memory_size %c8192 {
    // ... (omitting some lines for brevity)

    // Call to initialize mbarrier
    %5 = nvgpu.mbarrier.create -> !mbarDesc
    nvgpu.mbarrier.init %5[%c0], %c1 : !mbarDesc

    // Asynchronously load the tile described by the TMA descriptor into shared memory
    nvgpu.tma.async.load %2[%c0, %c0], %5[%c0] to %shmem, predicate = %tidx0 : !tmaDesc, !mbarDesc -> memref<64x64xf16, #gpu.address_space<workgroup>>

    gpu.terminator
  }
  return
}
```

The %tensorDesc is a device pointer, because the mgpuTensorMapEncodeTiledMemref runtime function creates a CUtensorMap object and memcpys it to the device:

```
%tensorDesc = llvm.call @mgpuTensorMapEncodeTiledMemref(%9, %10, %3, %1, %0, %1, %1, !llvm.ptr) : (i64, !llvm.ptr, i64, i64, i64, i64, i64, !llvm.ptr) -> !llvm.ptr
gpu.launch () ...
{
  // ... (omitting some lines for brevity)

  // Asynchronously perform a bulk tensor operation with shared memory
  nvvm.cp.async.bulk.tensor.shared.cluster.global %23, %tensorDesc, %21, box[%24, %24] predicate = %15 : <3>, !llvm.ptr 
}
```

@grypp (Member, Author) commented Jan 30, 2024

@joker-eph let me know if you have further questions about the TMA descriptor.

Also, what do you think about this PR? The TMA store is currently not usable because the op is found trivially dead.

@joker-eph (Collaborator)

Seems to me that this documentation is misleading: https://github.com/llvm/llvm-project/blob/b40d5b1b08564d23d5e0769892ebbc32447b2987/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td#L689C1-L691C1

> The Op creates a tensor map descriptor object representing tiled memory region. To do that it calls CUDA Driver's cuTensorMapEncodeTiled.

Taken as-is, it would mean this initializes a host-side struct.

Can you follow up with another PR to document the type as a device pointer, and check all these ops as well? Thanks :)

@joker-eph (Collaborator)

Also: the model of having `nvgpu.tma.create.descriptor` do both the creation of the descriptor and the memcpy to the device will prevent us from adopting the grid-constant method, so we won't be able to take advantage of the perf gain.
We likely should revamp this to align more with how it works in CUDA?
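
To make the contrast concrete, here is a hedged host-side sketch with hypothetical names (not existing MLIR or runtime code) of the two launch models: passing the descriptor by value in the kernel parameter buffer (the grid-constant route, where the device code would declare the parameter as `const __grid_constant__ CUtensorMap`) versus passing a device pointer to a previously copied descriptor (the current model).

```
// Hedged sketch of the two ways a TMA descriptor can reach the kernel.
#include <cuda.h>

// Grid-constant model: the CUtensorMap itself goes into the kernel parameter
// buffer; the kernel would annotate the parameter with __grid_constant__.
// No device allocation or explicit memcpy is needed.
void launchWithGridConstant(CUfunction kernel, CUtensorMap desc, CUstream stream) {
  void *params[] = {&desc};
  cuLaunchKernel(kernel, /*grid=*/1, 1, 1, /*block=*/128, 1, 1,
                 /*sharedMemBytes=*/0, stream, params, /*extra=*/nullptr);
}

// Current model: the descriptor was copied to global memory beforehand and the
// kernel receives only the device pointer.
void launchWithDevicePointer(CUfunction kernel, CUdeviceptr descOnDevice, CUstream stream) {
  void *params[] = {&descOnDevice};
  cuLaunchKernel(kernel, 1, 1, 1, 128, 1, 1, 0, stream, params, nullptr);
}
```
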

@grypp (Member, Author) commented Jan 30, 2024

> Also: the model of having `nvgpu.tma.create.descriptor` do both the creation of the descriptor and the memcpy to the device will prevent us from adopting the grid-constant method, so we won't be able to take advantage of the perf gain.
>
> We likely should revamp this to align more with how it works in CUDA?

I totally agree with that. I implemented it this way because we don't have grid_constant support. Let me come up with a follow-up PR.

Also, the implicit memcpy could leak memory anyway if you never call cudaFree.

Thanks for bringing this to my attention.

@grypp grypp merged commit 3477bcf into llvm:main Jan 30, 2024