[mlir][GPU] Improve handling of GPU bounds #95166


Merged
merged 8 commits into llvm:main on Jun 18, 2024

Conversation

@krzysz00 (Contributor) commented Jun 11, 2024

This change reworks how range information for GPU dispatch IDs (block IDs, thread IDs, and so on) is handled.

  1. known_block_size and known_grid_size become inherent attributes of GPU functions. This makes them less clunky to work with. As a consequence, the gpu.func lowering patterns now only look at the inherent attributes when setting target-specific attributes on the llvm.func that they lower to.
  2. At the same time, gpu.known_block_size and gpu.known_grid_size are made official dialect-level discardable attributes which can be placed on arbitrary functions. This allows for progressive lowerings (without this, a lowering for gpu.thread_id couldn't know about the bounds if it had already been moved from a gpu.func to an llvm.func) and allows for range information to be provided even when gpu.*_{id,dim} are being used outside of a gpu.func context.
  3. All of these index operations have gained an optional upper_bound attribute, allowing for an alternate mode of operation where the bounds are specified locally and not inherited from the operation's context. These also allow handling of cases where the precise launch sizes aren't known, but can be bounded more precisely than the maximum of what any platform's API allows. (I'd like to thank @benvanik for pointing out that this could be useful.)

When inferring bounds (either for range inference or for setting range metadata during lowering), these sources of information are consulted in order of specificity: upper_bound > inherent attribute > discardable attribute, except that dimension-size operations check the known_*_size attributes to see if they can be constant-folded before checking their upper_bound.

This patch also updates the documentation about the bounds and inference behavior to clarify what these attributes do when set and the consequences of setting them up incorrectly.
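For concreteness, here is a small sketch (fragments, not a complete module) of the three sources of bounds information, in decreasing order of precedence. The syntax is approximated from this PR's tests, and names like @kernel and @device_helper are illustrative:

// 1. A local upper_bound attribute on the index operation itself:
%tid = gpu.thread_id x upper_bound 64

// 2. Inherent known_block_size / known_grid_size attributes on a gpu.func:
gpu.func @kernel() kernel attributes {known_block_size = array<i32: 64, 2, 1>} {
  %tx = gpu.thread_id x   // inferred to lie in [0, 64)
  %bd = gpu.block_dim x   // constant-folds to 64 from known_block_size
  gpu.return
}

// 3. Discardable gpu.known_block_size / gpu.known_grid_size attributes on
//    any other function op:
func.func @device_helper() attributes {gpu.known_block_size = array<i32: 64, 2, 1>} {
  %tx = gpu.thread_id x   // inferred to lie in [0, 64)
  return
}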

This change removes the requirement that the known block or grid sizes
be stored on a gpu.func, instead allowing them on any function
implementing the FunctionOpInterface. This allows, for instance,
non-kernel functions that live in a func.func, or downstream use cases
that don't use gpu.func.
@llvmbot (Member) commented Jun 11, 2024

@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Krzysztof Drewniak (krzysz00)

Changes

This change removes the requirement that the known block or grid sizes be stored on a gpu.func, instead allowing them on any function implementing the FunctionOpInterface. This allows, for instance, non-kernel functions that live in a func.func, or downstream use cases that don't use gpu.func.


Full diff: https://github.com/llvm/llvm-project/pull/95166.diff

3 Files Affected:

  • (modified) mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h (+1-5)
  • (modified) mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp (+19-3)
  • (modified) mlir/test/Dialect/GPU/int-range-interface.mlir (+33)
diff --git a/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h b/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
index d067c70a90ea4..0f74768207205 100644
--- a/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
+++ b/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
@@ -57,11 +57,7 @@ struct GPUIndexIntrinsicOpLowering : public ConvertOpToLLVMPattern<Op> {
       break;
     }
 
-    Operation *function;
-    if (auto gpuFunc = op->template getParentOfType<gpu::GPUFuncOp>())
-      function = gpuFunc;
-    if (auto llvmFunc = op->template getParentOfType<LLVM::LLVMFuncOp>())
-      function = llvmFunc;
+    Operation *function = op->template getParentOfType<FunctionOpInterface>();
     if (!boundsAttrName.empty() && function) {
       if (auto attr = function->template getAttrOfType<DenseI32ArrayAttr>(
               boundsAttrName)) {
diff --git a/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp b/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
index 69017efb9a0e6..152884e23b929 100644
--- a/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
+++ b/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
@@ -8,6 +8,7 @@
 
 #include "mlir/Dialect/GPU/IR/GPUDialect.h"
 #include "mlir/IR/Matchers.h"
+#include "mlir/Interfaces/FunctionInterfaces.h"
 #include "mlir/Interfaces/InferIntRangeInterface.h"
 #include "llvm/ADT/STLForwardCompat.h"
 #include "llvm/Support/ErrorHandling.h"
@@ -54,6 +55,17 @@ static Value valueByDim(KernelDim3 dims, Dimension dim) {
 
 static uint64_t zext(uint32_t arg) { return static_cast<uint64_t>(arg); }
 
+static std::optional<uint32_t> getKnownLaunchAttr(FunctionOpInterface func,
+                                                  StringRef attrName,
+                                                  Dimension dim) {
+  auto bounds = func.getOperation()->getAttrOfType<DenseI32ArrayAttr>(attrName);
+  if (!bounds)
+    return std::nullopt;
+  if (bounds.size() <= static_cast<uint32_t>(dim))
+    return std::nullopt;
+  return bounds[static_cast<uint32_t>(dim)];
+}
+
 template <typename Op>
 static std::optional<uint64_t> getKnownLaunchDim(Op op, LaunchDims type) {
   Dimension dim = op.getDimension();
@@ -73,12 +85,16 @@ static std::optional<uint64_t> getKnownLaunchDim(Op op, LaunchDims type) {
       return value.getZExtValue();
   }
 
-  if (auto func = op->template getParentOfType<GPUFuncOp>()) {
+  if (auto func = op->template getParentOfType<FunctionOpInterface>()) {
     switch (type) {
     case LaunchDims::Block:
-      return llvm::transformOptional(func.getKnownBlockSize(dim), zext);
+      return llvm::transformOptional(
+          getKnownLaunchAttr(func, GPUFuncOp::getKnownBlockSizeAttrName(), dim),
+          zext);
     case LaunchDims::Grid:
-      return llvm::transformOptional(func.getKnownGridSize(dim), zext);
+      return llvm::transformOptional(
+          getKnownLaunchAttr(func, GPUFuncOp::getKnownGridSizeAttrName(), dim),
+          zext);
     }
   }
   return std::nullopt;
diff --git a/mlir/test/Dialect/GPU/int-range-interface.mlir b/mlir/test/Dialect/GPU/int-range-interface.mlir
index a0917a2fdf110..a6c74fec6e824 100644
--- a/mlir/test/Dialect/GPU/int-range-interface.mlir
+++ b/mlir/test/Dialect/GPU/int-range-interface.mlir
@@ -215,3 +215,36 @@ module attributes {gpu.container_module} {
   }
 }
 
+// -----
+
+// CHECK-LABEL: func @annotated_kernel
+module {
+  func.func @annotated_kernel()
+    attributes {gpu.known_block_size = array<i32: 8, 12, 16>,
+        gpu.known_grid_size = array<i32: 20, 24, 28>} {
+
+    %block_id_x = gpu.block_id x
+    %block_id_y = gpu.block_id y
+    %block_id_z = gpu.block_id z
+
+    // CHECK: test.reflect_bounds {smax = 19 : index, smin = 0 : index, umax = 19 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 23 : index, smin = 0 : index, umax = 23 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 27 : index, smin = 0 : index, umax = 27 : index, umin = 0 : index}
+    %block_id_x0 = test.reflect_bounds %block_id_x : index
+    %block_id_y0 = test.reflect_bounds %block_id_y : index
+    %block_id_z0 = test.reflect_bounds %block_id_z : index
+
+    %thread_id_x = gpu.thread_id x
+    %thread_id_y = gpu.thread_id y
+    %thread_id_z = gpu.thread_id z
+
+    // CHECK: test.reflect_bounds {smax = 7 : index, smin = 0 : index, umax = 7 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 11 : index, smin = 0 : index, umax = 11 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 15 : index, smin = 0 : index, umax = 15 : index, umin = 0 : index}
+    %thread_id_x0 = test.reflect_bounds %thread_id_x : index
+    %thread_id_y0 = test.reflect_bounds %thread_id_y : index
+    %thread_id_z0 = test.reflect_bounds %thread_id_z : index
+
+    return
+  }
+}

@joker-eph (Collaborator):

I'm missing the motivation here. Shouldn't we just have all GPU code be inside gpu.func anyway?

@krzysz00 (Contributor, Author):

  1. I remember there being a note on Discourse recently saying that non-kernel GPU functions could be func.funcs
  2. IREE keeps its GPU code in a func.func and I want those InferIntRangeInterface implementations to work there

@krzysz00 (Contributor, Author):

Oh, and 3: this'll make it work for spirv.func, if that's a thing.

@joker-eph (Collaborator):

I remember there being a note on Discourse recently saying that non-kernel GPU functions could be func.funcs

Yes, but as far as I remember, the only reason for this is that we wanted to allow functions that work on both CPU and GPU: that is, if a function does not make use of GPU-specific features, it can be a func.func.

Oh, and, 3, this'll make it work for spirv.func if that's a thing

But wouldn't a spirv.func be the result of this conversion? That is, you perform dialect conversion from the GPU dialect to SPIR-V and get a spirv.func with the SPIR-V operation matching this GPU operation?

-      return llvm::transformOptional(func.getKnownGridSize(dim), zext);
+      return llvm::transformOptional(
+          getKnownLaunchAttr(func, GPUFuncOp::getKnownGridSizeAttrName(), dim),
+          zext);
Collaborator:

This seems iffy: it dispatches to discardable attributes instead of inherent attrs/properties.

Contributor (author):

It's already discardable on GPU functions

Collaborator:

I checked the implementation, and indeed getKnownBlockSize() is just a wrapper around the discardable attribute: but that seems like a bug to me? It should be inherent (and optional), I think.

Contributor (author):

It's house style for gpu.func (see also the kernel attribute, last I checked)

And I found my cite for "GPU code doesn't have to be in a GPU func": https://discourse.llvm.org/t/mlir-gpu-calling-gpu-func-that-is-not-a-kernel-a-k-a-device-function/79285.

And so there could be a pass that propagates these annotations from kernels to the func.funcs they call, which is an argument against making them inherent properties of GPU functions only.

So I'd argue that the known block size annotations are dialect-level metadata and that the fix here is to move the name to GPUDialect (and probably to go to a new-style attribute helper).
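(For illustration, a rough sketch of what such a propagation pass could produce; the module and function names here are invented, and the attribute forms follow this PR's final design:)

gpu.module @m {
  gpu.func @kernel() kernel attributes {known_block_size = array<i32: 64, 2, 1>} {
    func.call @helper() : () -> ()
    gpu.return
  }
  // After propagation, the dialect-level discardable attribute carries the
  // bound to the callee, so range inference works there too:
  func.func @helper() attributes {gpu.known_block_size = array<i32: 64, 2, 1>} {
    %tx = gpu.thread_id x   // now inferable as [0, 64)
    return
  }
}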

@joker-eph (Collaborator), Jun 11, 2024:

And so there could be a pass that propagates these annotations from kernels to the func.funcs they call, which is an argument against making them inherent properties of GPU functions only.

I would rather see a process where such func.funcs get turned into proper gpu.funcs instead.
Discardable attributes are supposed to be ... discardable :) So this seems like a design issue to me!

@joker-eph (Collaborator), Jun 11, 2024:

Yeah, honestly, I don't think gpu.module and gpu.func should exist.
I think gpu.module should be builtin.module {target = #gpu.{nvvm,rocdl,...}_target<...>} and that gpu.func should be func.func {gpu.kernel}.

That's a possibility, but I'm concerned about this being overly loose :(
(you know, it has a taste of the "json of IRs")

Contributor (author):

My thoughts on that target thing would be to implement those same interfaces for rocdl.target and the like.

And ... having looked around some more, it seems like gpu.module doesn't have the requirement that it live inside a gpu.container_module anymore (unless you want to use gpu.launch_func).

Contributor (author):

Though I think we just drifted from functions to modules.

As far as I'm concerned, the only difference between

gpu.func @f(...) known_block_sizes = [64, 2, 1] {
  ...
}

and

func.func @f(...) attributes {gpu.known_block_sizes = [64, 2, 1]} {
  ...
}

(or even my_custom.func in the above) is the somewhat higher risk of discardability on the optimization hint.

This PR's trying to generalize some already general handling for this particular hint so that we're not locking downstream into needing a very particular infrastructure (that might not compose with other compilation infrastructures because there's an extra level of module involved) just to get ThreadIdOp::inferIntRanges() working.

Member:

My claim is that operations like gpu.thread_id and gpu.shuffle are in a third class: abstractions around platform-specific GPU intrinsics. That is, these are operations meant to abstract across what are almost inevitably platform-specific intrinsics, allowing people to write code generation schemes that can target "a GPU" (though somewhere in their context they're likely to know which).

+1, this has always been my understanding and I'm not aware of any intended limitations as to where these have to live gpu.func/func.func/spirv.func/etc. None seem to be documented at the moment either AFAICT.

Collaborator:

None seem to be documented at the moment either AFAICT.

Right, hence why I commented above about the need to start with documenting our intent here. If we agree on the destination, it'll be easier to make fast progress on each step!

@krzysz00 (Contributor, Author):

During the conversion, the surrounding context could be either the source function (say, gpu.func) or the target function (say, llvm.func), depending on whether the surrounding function or its body gets rewritten first; hence the original conversion code had cases for both GPU's and LLVM's func ops, which this change generalizes.

The known sizes aren't inherent attributes of gpu.func either - see https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td#L434, which resolves to a discardable attribute.

-      function = gpuFunc;
-    if (auto llvmFunc = op->template getParentOfType<LLVM::LLVMFuncOp>())
-      function = llvmFunc;
+    Operation *function = op->template getParentOfType<FunctionOpInterface>();
Member:

Generalizing this seems fine to me until we have a non-discardable attribute on gpu.func

@kuhar (Member) commented Jun 12, 2024

I'd be leaning towards making this an inherent attribute on gpu.func but also allowing it as a discardable attribute on other function ops. This matches the existing support on llvm.func. I don't have any strong opinions wrt the sequencing here.

@joker-eph (Collaborator):

The APIs to access discardable and non-discardable attributes are supposed to be completely segregated, though: hence why I mentioned the need for an interface.

krzysz00 added 2 commits June 14, 2024 02:15
Make known_block_size and known_grid_size inherent attributes on gpu.func.

Also make them visible discardable attributes on the GPU dialect.

Remove those weird attribute name getters from GPUFuncOp

Also add upperBound attributes to all the index operations so that people
who want to make this information local and not context-dependent
can do so.

Haven't updated tests yet, but I hope this round of generalizations will
address some of the concerns.
@krzysz00 krzysz00 changed the title [mlir] Let GPU ID bounds work on any FunctionOpInterfaces [mlir][GPU] Improve handling of GPU bounds Jun 14, 2024
@krzysz00 (Contributor, Author):

PR description has been updated to reflect the radical change in approach that came out of discussion

@joker-eph joker-eph requested a review from grypp June 14, 2024 23:37
@krzysz00 krzysz00 requested review from kuhar and joker-eph June 16, 2024 05:57
Comment on lines 374 to 376
ROCDL::ROCDLDialect::KernelAttrHelper(&converter.getContext()).getName(),
ROCDL::ROCDLDialect::ReqdWorkGroupSizeAttrHelper(&converter.getContext())
.getName());
Collaborator:

Can you get a handle to the dialect first and then use it to get the various names?

      auto *rocdlDialect = converter.getContext().getLoadedDialect<ROCDL::ROCDLDialect>();

@joker-eph (Collaborator) left a comment:

That seems OK, but can we have a PR to document the GPU dialect layering? It would be great to not lose all the discussion we had here; it can be handy to have this encoded for future reference.

@krzysz00 (Contributor, Author):

@joker-eph The documentation updates are at #95812

@@ -309,8 +309,24 @@ void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {
       .Default([](Type) { llvm_unreachable("unexpected 'gpu' type kind"); });
 }
 
+static LogicalResult verifyKnownLaunchSizeAttr(Operation *op,
+                                               NamedAttribute attr) {
+  auto array = llvm::dyn_cast<DenseI32ArrayAttr>(attr.getValue());
Member:

nit: do we need the llvm:: here?

@krzysz00 krzysz00 merged commit 43fd4c4 into llvm:main Jun 18, 2024
5 of 6 checks passed
@grypp (Member) commented Jun 18, 2024

I apologize for joining the discussion a bit late. I understand the rationale behind introducing the known_block_size attribute.

However, I have some concerns regarding the following code snippet (it's in the example). As per CUDA (and other GPU languages), these attributes are restricted to GPU kernels. It appears, based on my understanding, that the proposed changes in this PR extend these attributes to func.func.

Could you please clarify the intended behavior of the following code? Specifically, should it be treated as a GPU kernel, a GPU function, or a host function?

  func.func @annotated_kernel()
    attributes {gpu.known_block_size = array<i32: 8, 12, 16>,
        gpu.known_grid_size = array<i32: 20, 24, 28>} {

If the goal is to permit GPU kernel attributes on func.func, then this PR needs a verifier that allows known_block_size/known_grid_size only on GPU kernels.

@krzysz00 (Contributor, Author):

@grypp The intent of that code is that the func.func could be modelling a GPU kernel when you're using a non-upstream lowering flow

@krzysz00 (Contributor, Author):

That is, the func.func is either a host or device function depending on the pipeline's context / if it's in a GPU module / ...

@grypp (Member) commented Jun 21, 2024

That is, the func.func is either a host or device function depending on the pipeline's context / if it's in a GPU module / ...

I understand your intention. However, I'm asking if we can use a verifier to ensure that the block size attribute cannot be present unless the func.func is within a gpu.module.

@krzysz00 (Contributor, Author):

I understand your intention. However, I'm asking if we can use a verifier to ensure that the block size attribute cannot be present unless the func.func is within a gpu.module.

The specific intent of this change is that that is not a requirement.

You are allowed to do

some.other.thing target(SomeGPUTargetAttr) {
  func.func {gpu.known_block_size = array<i32: 64, 2, 1>} {
    ...
    %N = gpu.thread_id x
    ...
  }
}

@krzysz00 (Contributor, Author):

If you set this attribute on a host function, the lowering for host functions will ignore it. If you had a gpu.thread_id x in a function that ends up being a host function ... lowering failure, since there's no XToLLVM for x86_64.
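(A sketch of that situation; the function name is illustrative:)

// The discardable attribute on a host function is simply ignored by host
// lowerings, but a stray gpu.thread_id has no host lowering pattern:
func.func @host_fn() attributes {gpu.known_block_size = array<i32: 64, 2, 1>} {
  %tx = gpu.thread_id x   // would fail to lower on an x86_64 (host) path
  return
}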
