[mlir][nvgpu] Mark TMA descriptor as MemWriteAt in tma.async.store #79427

Merged: merged 1 commit into llvm:main on Jan 30, 2024

Conversation

@grypp (Member) commented Jan 25, 2024

The canonicalizer finds the `nvgpu.tma.async.store` op trivially dead because it declares no observable memory side effects. This PR addresses the issue by adding a `MemWriteAt` effect to the TMA descriptor operand.

This op copies data from shared memory to global memory, but it does so asynchronously, so the fix might not be optimal: the memory is not mutated right away.

The asynchronous behavior is controlled by the two NVVM ops below:
- `nvvm.cp.async.bulk.commit.group`: groups all the preceding `nvgpu.tma.async.store` ops together and commits the group.
- `nvvm.cp.async.bulk.wait_group 1`: waits until at most one bulk async group remains pending.

Here's a simplified representation of the code:
```
gpu.func ...  {
  // Write something to shared memory
  %shmem = ...

  // Perform asynchronous store from shared memory to global memory
  nvgpu.tma.async.store %shmem to %arg0[%c0, %c0], predicate = %1
    : memref<128x32xf32, #gpu.address_space<workgroup>> ->
      <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>

  // Control asynchronous execution
  nvvm.cp.async.bulk.commit.group
  nvvm.cp.async.bulk.wait_group 1
}
```
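
For context, here is a minimal C++ sketch, not taken from this PR, of the check that makes the op disappear: MLIR's dead-code elimination erases an op whose results are unused and whose declared effects contain no writes or frees (a read effect alone does not keep an op alive); `mlir::isOpTriviallyDead` is the helper MLIR provides for that decision. Declaring `MemWriteAt` on the descriptor operand is what flips the answer for `nvgpu.tma.async.store`.

```
// Sketch only (illustration, not the canonicalizer's source): the decision
// that removed nvgpu.tma.async.store before this patch.
#include "mlir/IR/Operation.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

static bool wouldBeErasedByDce(mlir::Operation *op) {
  // True when all results are unused and the op's declared memory effects
  // contain no writes, frees, or unknown effects. The old op only declared
  // MemReadAt on %src, and a read alone does not keep an op alive, so the
  // store was erased. With MemWriteAt on the descriptor operand the op now
  // reports a write and is preserved.
  return mlir::isOpTriviallyDead(op);
}
```
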
@llvmbot (Member) commented Jan 25, 2024

@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-nvgpu

Author: Guray Ozen (grypp)

Changes

The canonicalizer finds the `nvgpu.tma.async.store` op trivially dead because it declares no observable memory side effects. This PR addresses the issue by adding a `MemWriteAt` effect to the TMA descriptor operand.

This op copies data from shared memory to global memory asynchronously, so the fix might not be optimal: the memory is not mutated right away.

The asynchronous behavior is controlled by the two NVVM ops below:
- `nvvm.cp.async.bulk.commit.group`: groups all the preceding `nvgpu.tma.async.store` ops together and commits the group.
- `nvvm.cp.async.bulk.wait_group 1`: waits until at most one bulk async group remains pending.

Here's a simplified representation of the code:

```
gpu.func ...  {
  // Write something to shared memory
  %shmem = ...

  // Perform asynchronous store from shared memory to global memory
  nvgpu.tma.async.store %shmem to %arg0[%c0, %c0], predicate = %1
    : memref<128x32xf32, #gpu.address_space<workgroup>> ->
      <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>

  // Control asynchronous execution
  nvvm.cp.async.bulk.commit.group
  nvvm.cp.async.bulk.wait_group 1
}
```

Full diff: https://github.com/llvm/llvm-project/pull/79427.diff

2 Files Affected:

  • (modified) mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td (+1-1)
  • (added) mlir/test/Dialect/NVGPU/canonicalization.mlir (+30)
```
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
index 239a5f1e2bc2985..a0c0d4cfd8714ba 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
@@ -671,7 +671,7 @@ def NVGPU_TmaAsyncStoreOp : NVGPU_Op<"tma.async.store", [AttrSizedOperandSegment
     tile shape. The descriptor is created by `nvgpu.tma.create.descriptor`
   }];  
   let arguments = (ins  Arg<AnyMemRef, "", [MemReadAt<0, FullEffect>]>:$src,
-                        NVGPU_TensorMapDescriptor:$tensorMapDescriptor,
+                        Arg<NVGPU_TensorMapDescriptor, "", [MemWriteAt<0, FullEffect>]>:$tensorMapDescriptor,
                         Variadic<Index>:$coordinates, 
                         Optional<I1>:$predicate);
   let assemblyFormat = [{
diff --git a/mlir/test/Dialect/NVGPU/canonicalization.mlir b/mlir/test/Dialect/NVGPU/canonicalization.mlir
new file mode 100644
index 000000000000000..a7fbfd80673957c
--- /dev/null
+++ b/mlir/test/Dialect/NVGPU/canonicalization.mlir
@@ -0,0 +1,30 @@
+// RUN: mlir-opt %s | mlir-opt -canonicalize -cse | FileCheck %s
+
+gpu.module @main_kernel {
+
+// CHECK-LABEL: @main_kernel(
+//  CHECK-SAME: %[[arg0:.*]]: !nvgpu.tensormap.descriptor
+  gpu.func @main_kernel(%arg0: !nvgpu.tensormap.descriptor<
+        tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, 
+        oob = zero, interleave = none>) kernel attributes 
+        { gpu.known_block_size = array<i32: 128, 1, 1>, 
+          gpu.known_grid_size = array<i32: 1, 1, 1>
+        } 
+  {
+    // CHECK: %[[c0:.+]] = arith.constant 0 : index 
+    // CHECK: %[[S0:.+]] = gpu.thread_id  x
+    // CHECK: %[[S1:.+]] = arith.cmpi eq, %[[S0]], %[[c0]] : index
+    // CHECK: %[[S2:.+]] = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
+    // CHECK: %[[S3:.+]] = memref.view %[[S2]][%[[c0]]][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<128x32xf32, #gpu.address_space<workgroup>>
+    // CHECK: nvgpu.tma.async.store %[[S3]] to %[[arg0]][%[[c0]], %[[c0]]], predicate = %[[S1]] : memref<128x32xf32, #gpu.address_space<workgroup>> -> <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>
+    %c0 = arith.constant 0 : index
+    %0 = gpu.thread_id  x
+    %1 = arith.cmpi eq, %0, %c0 : index
+    %2 = gpu.dynamic_shared_memory : memref<?xi8, #gpu.address_space<workgroup>>
+    %view = memref.view %2[%c0][] : memref<?xi8, #gpu.address_space<workgroup>> to memref<128x32xf32, #gpu.address_space<workgroup>>
+    nvgpu.tma.async.store %view to %arg0[%c0, %c0], predicate = %1 : memref<128x32xf32, #gpu.address_space<workgroup>> -> <tensor = memref<128x32xf32, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>
+    nvvm.cp.async.bulk.commit.group
+    nvvm.cp.async.bulk.wait_group 0
+    gpu.return
+  }
+}
\ No newline at end of file
```

@joker-eph (Collaborator)

I have some questions about the NVGPU_TensorMapDescriptor type in MLIR. Does it map to a CUtensorMap object, or to a pointer to a CUtensorMap object?

In CUDA, the CUtensorMap object is set up on the host and then copied to GPU memory. Inside the kernel, the PTX instructions use a CUtensorMap *.

I am afraid that MLIR is sometimes using NVGPU_TensorMapDescriptor as a CUtensorMap and other times as a CUtensorMap *.

@grypp (Member, Author) commented Jan 26, 2024

> I have some questions about the NVGPU_TensorMapDescriptor type in MLIR. Does it map to a CUtensorMap object, or to a pointer to a CUtensorMap object?

It maps to a pointer to a CUtensorMap object.

But this is a great question. I was initially unsure whether using the object or the pointer is better. I ran experiments to find the fastest way to pass a CUtensorMap to the device and found two approaches:

  1. Pass the CUtensorMap object as a kernel parameter (not as a pointer). I was told this requires marking the kernel parameter as __grid_constant__, which currently lacks support in LLVM.
  2. Copy the CUtensorMap object to device memory, pass its device pointer, and use the prefetch.tensor PTX instruction to make sure the descriptor is in the cache (a host-side sketch follows below).
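
For illustration only, a hedged host-side sketch of approach 2 using the CUDA driver API; the encode arguments are elided and `uploadTensorMap` is a hypothetical name, not the MLIR runtime wrapper.

```
// Hedged sketch of approach 2: build the CUtensorMap on the host, copy it to
// device memory, and hand the kernel the resulting device pointer.
// Assumes cuInit has been called and a CUDA context is current.
#include <cuda.h>

CUdeviceptr uploadTensorMap(/* tile/shape/swizzle parameters omitted */) {
  CUtensorMap hostDesc = {};
  // The real code must fill the opaque descriptor with cuTensorMapEncodeTiled;
  // its full argument list (data type, rank, global address, dims, strides,
  // box sizes, interleave, swizzle, L2 promotion, OOB fill) is omitted here.
  // cuTensorMapEncodeTiled(&hostDesc, ...);

  CUdeviceptr deviceDesc;
  cuMemAlloc(&deviceDesc, sizeof(CUtensorMap));             // descriptor copy lives in global memory
  cuMemcpyHtoD(deviceDesc, &hostDesc, sizeof(CUtensorMap)); // host -> device copy
  return deviceDesc;  // kernel-side PTX consumes this as a CUtensorMap*
}
```

The device pointer returned here plays the same role as `%tensorDesc` in the lowered IR further down.
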

> In CUDA, the CUtensorMap object is set up on the host and then copied to GPU memory. Inside the kernel, the PTX instructions use a CUtensorMap *.

This is the 2nd approach; I believe I have implemented the functionality as described. Let me explain the CUtensorMap creation and copy steps. Take the IR below: nvgpu.tma.create.descriptor creates the CUtensorMap object and memcpys it to the device. Let's run the convert-nvgpu-to-nvvm pass:

```
!mbarDesc = !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 1>
!tmaDesc = !nvgpu.tensormap.descriptor<tensor = memref<64x64xf16, 3>, swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>

func.func @main() {
  // ... (omitting some lines for brevity)

  // Call TMA descriptor and memcpy to the device 
  %2 = nvgpu.tma.create.descriptor %cast box[%c64, %c64] : memref<*xf16> -> !tmaDesc

  gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c64, %arg10 = %c1, %arg11 = %c1) dynamic_shared_memory_size %c8192 {
    // ... (omitting some lines for brevity)

    // Call to initialize mbarrier
    %5 = nvgpu.mbarrier.create -> !mbarDesc
    nvgpu.mbarrier.init %5[%c0], %c1 : !mbarDesc

    // Asynchronously load the tile described by the TMA descriptor into shared memory
    nvgpu.tma.async.load %2[%c0, %c0], %5[%c0] to %shmem, predicate = %tidx0 : !tmaDesc, !mbarDesc -> memref<64x64xf16, #gpu.address_space<workgroup>>

    gpu.terminator
  }
  return
}
```

The %tensorDesc is a device pointer, because the mgpuTensorMapEncodeTiledMemref runtime function creates a CUtensorMap object and memcpys it to the device:

```
%tensorDesc = llvm.call @mgpuTensorMapEncodeTiledMemref(%9, %10, %3, %1, %0, %1, %1, !llvm.ptr) : (i64, !llvm.ptr, i64, i64, i64, i64, i64, !llvm.ptr) -> !llvm.ptr
gpu.launch () ...
{
  // ... (omitting some lines for brevity)

  // Asynchronously perform a bulk tensor operation with shared memory
  nvvm.cp.async.bulk.tensor.shared.cluster.global %23, %tensorDesc, %21, box[%24, %24] predicate = %15 : <3>, !llvm.ptr 
}
```

@grypp (Member, Author) commented Jan 30, 2024

@joker-eph let me know if you have further questions about the TMA descriptor.

Also, what do you think about this PR? The TMA store is currently not usable because the op is found trivially dead.

@joker-eph (Collaborator)

Seems to me that this documentation is misleading: https://github.com/llvm/llvm-project/blob/b40d5b1b08564d23d5e0769892ebbc32447b2987/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td#L689C1-L691C1

> The Op creates a tensor map descriptor object representing tiled memory region. To do that it calls CUDA Driver's cuTensorMapEncodeTiled.

Taken as-is, it would mean this initializes a host-side struct.

Can you follow up with another PR to document the type as a device pointer, and check all these ops as well? Thanks :)

@joker-eph (Collaborator)

Also: the model of having `nvgpu.tma.create.descriptor` do both the creation of the descriptor and the memcpy to the device will prevent us from adopting the grid-constant method, so we won't be able to take advantage of the perf gain.
We likely should revamp this to align more with how it works in CUDA?
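
To make the contrast concrete, here is a hedged host-side sketch with hypothetical names (not existing MLIR or runtime code) of the two launch models: passing the descriptor by value in the kernel parameter buffer (the grid-constant route, where the device code would declare the parameter as `const __grid_constant__ CUtensorMap`) versus passing a device pointer to a previously copied descriptor (the current model).

```
// Hedged sketch of the two ways a TMA descriptor can reach the kernel.
#include <cuda.h>

// Grid-constant model: the CUtensorMap itself goes into the kernel parameter
// buffer; the kernel would annotate the parameter with __grid_constant__.
// No device allocation or explicit memcpy is needed.
void launchWithGridConstant(CUfunction kernel, CUtensorMap desc, CUstream stream) {
  void *params[] = {&desc};
  cuLaunchKernel(kernel, /*grid=*/1, 1, 1, /*block=*/128, 1, 1,
                 /*sharedMemBytes=*/0, stream, params, /*extra=*/nullptr);
}

// Current model: the descriptor was copied to global memory beforehand and the
// kernel receives only the device pointer.
void launchWithDevicePointer(CUfunction kernel, CUdeviceptr descOnDevice, CUstream stream) {
  void *params[] = {&descOnDevice};
  cuLaunchKernel(kernel, 1, 1, 1, 128, 1, 1, 0, stream, params, nullptr);
}
```
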

@grypp (Member, Author) commented Jan 30, 2024

> Also: the model of having `nvgpu.tma.create.descriptor` do both the creation of the descriptor and the memcpy to the device will prevent us from adopting the grid-constant method, so we won't be able to take advantage of the perf gain.
>
> We likely should revamp this to align more with how it works in CUDA?

I totally agree with that. I implemented it this way because we don't have grid_constant support. Let me come up with a follow-up PR.

Also, the implicit memcpy could leak memory anyway if you never call cudaFree.

Thanks for bringing this to my attention.

@grypp grypp merged commit 3477bcf into llvm:main Jan 30, 2024