[mlir][GPU] Improve handling of GPU bounds #95166


Merged
merged 8 commits into llvm:main on Jun 18, 2024

Conversation

@krzysz00 (Contributor) commented Jun 11, 2024

This change reworks how range information for GPU dispatch IDs (block IDs, thread IDs, and so on) is handled.

  1. known_block_size and known_grid_size become inherent attributes of GPU functions. This makes them less clunky to work with. As a consequence, the gpu.func lowering patterns now only look at the inherent attributes when setting target-specific attributes on the llvm.func that they lower to.
  2. At the same time, gpu.known_block_size and gpu.known_grid_size are made official dialect-level discardable attributes which can be placed on arbitrary functions. This allows for progressive lowerings (without this, a lowering for gpu.thread_id couldn't know about the bounds if it had already been moved from a gpu.func to an llvm.func) and allows for range information to be provided even when gpu.*_{id,dim} are being used outside of a gpu.func context.
  3. All of these index operations have gained an optional upper_bound attribute, allowing for an alternate mode of operation where the bounds are specified locally and not inherited from the operation's context. These also allow handling of cases where the precise launch sizes aren't known, but can be bounded more precisely than the maximum of what any platform's API allows. (I'd like to thank @benvanik for pointing out that this could be useful.)

When inferring bounds (either for range inference or for setting range metadata during lowering), these sources of information are consulted in order of specificity: upper_bound > inherent attribute > discardable attribute, except that dimension-size operations check the known_*_size attributes to see if they can be constant-folded before checking their upper_bound.

This patch also updates the documentation about the bounds and inference behavior to clarify what these attributes do when set and the consequences of setting them up incorrectly.
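For concreteness, here is a small sketch (fragments, not a complete module) of the three sources of bounds information, in decreasing order of precedence. The syntax is approximated from this PR's tests, and names like @kernel and @device_helper are illustrative:

// 1. A local upper_bound attribute on the index operation itself:
%tid = gpu.thread_id x upper_bound 64

// 2. Inherent known_block_size / known_grid_size attributes on a gpu.func:
gpu.func @kernel() kernel attributes {known_block_size = array<i32: 64, 2, 1>} {
  %tx = gpu.thread_id x   // inferred to lie in [0, 64)
  %bd = gpu.block_dim x   // constant-folds to 64 from known_block_size
  gpu.return
}

// 3. Discardable gpu.known_block_size / gpu.known_grid_size attributes on
//    any other function op:
func.func @device_helper() attributes {gpu.known_block_size = array<i32: 64, 2, 1>} {
  %tx = gpu.thread_id x   // inferred to lie in [0, 64)
  return
}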

This change removes the requirement that the known block or grid sizes
be stored on a gpu.func, instead allowing them on any function
implementing the FunctionOpInterface. This allows, for instance,
non-kernel functions that live in a func.func, or downstream use cases
that don't use gpu.func.
@llvmbot (Member) commented Jun 11, 2024

@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Krzysztof Drewniak (krzysz00)

Changes

This change removes the requirement that the known block or grid sizes be stored on a gpu.func, instead allowing them on any function implementing the FunctionOpInterface. This allows, for instance, non-kernel functions that live in a func.func, or downstream use cases that don't use gpu.func.


Full diff: https://github.com/llvm/llvm-project/pull/95166.diff

3 Files Affected:

  • (modified) mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h (+1-5)
  • (modified) mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp (+19-3)
  • (modified) mlir/test/Dialect/GPU/int-range-interface.mlir (+33)
diff --git a/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h b/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
index d067c70a90ea4..0f74768207205 100644
--- a/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
+++ b/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
@@ -57,11 +57,7 @@ struct GPUIndexIntrinsicOpLowering : public ConvertOpToLLVMPattern<Op> {
       break;
     }
 
-    Operation *function;
-    if (auto gpuFunc = op->template getParentOfType<gpu::GPUFuncOp>())
-      function = gpuFunc;
-    if (auto llvmFunc = op->template getParentOfType<LLVM::LLVMFuncOp>())
-      function = llvmFunc;
+    Operation *function = op->template getParentOfType<FunctionOpInterface>();
     if (!boundsAttrName.empty() && function) {
       if (auto attr = function->template getAttrOfType<DenseI32ArrayAttr>(
               boundsAttrName)) {
diff --git a/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp b/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
index 69017efb9a0e6..152884e23b929 100644
--- a/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
+++ b/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
@@ -8,6 +8,7 @@
 
 #include "mlir/Dialect/GPU/IR/GPUDialect.h"
 #include "mlir/IR/Matchers.h"
+#include "mlir/Interfaces/FunctionInterfaces.h"
 #include "mlir/Interfaces/InferIntRangeInterface.h"
 #include "llvm/ADT/STLForwardCompat.h"
 #include "llvm/Support/ErrorHandling.h"
@@ -54,6 +55,17 @@ static Value valueByDim(KernelDim3 dims, Dimension dim) {
 
 static uint64_t zext(uint32_t arg) { return static_cast<uint64_t>(arg); }
 
+static std::optional<uint32_t> getKnownLaunchAttr(FunctionOpInterface func,
+                                                  StringRef attrName,
+                                                  Dimension dim) {
+  auto bounds = func.getOperation()->getAttrOfType<DenseI32ArrayAttr>(attrName);
+  if (!bounds)
+    return std::nullopt;
+  if (bounds.size() <= static_cast<uint32_t>(dim))
+    return std::nullopt;
+  return bounds[static_cast<uint32_t>(dim)];
+}
+
 template <typename Op>
 static std::optional<uint64_t> getKnownLaunchDim(Op op, LaunchDims type) {
   Dimension dim = op.getDimension();
@@ -73,12 +85,16 @@ static std::optional<uint64_t> getKnownLaunchDim(Op op, LaunchDims type) {
       return value.getZExtValue();
   }
 
-  if (auto func = op->template getParentOfType<GPUFuncOp>()) {
+  if (auto func = op->template getParentOfType<FunctionOpInterface>()) {
     switch (type) {
     case LaunchDims::Block:
-      return llvm::transformOptional(func.getKnownBlockSize(dim), zext);
+      return llvm::transformOptional(
+          getKnownLaunchAttr(func, GPUFuncOp::getKnownBlockSizeAttrName(), dim),
+          zext);
     case LaunchDims::Grid:
-      return llvm::transformOptional(func.getKnownGridSize(dim), zext);
+      return llvm::transformOptional(
+          getKnownLaunchAttr(func, GPUFuncOp::getKnownGridSizeAttrName(), dim),
+          zext);
     }
   }
   return std::nullopt;
diff --git a/mlir/test/Dialect/GPU/int-range-interface.mlir b/mlir/test/Dialect/GPU/int-range-interface.mlir
index a0917a2fdf110..a6c74fec6e824 100644
--- a/mlir/test/Dialect/GPU/int-range-interface.mlir
+++ b/mlir/test/Dialect/GPU/int-range-interface.mlir
@@ -215,3 +215,36 @@ module attributes {gpu.container_module} {
   }
 }
 
+// -----
+
+// CHECK-LABEL: func @annotated_kernel
+module {
+  func.func @annotated_kernel()
+    attributes {gpu.known_block_size = array<i32: 8, 12, 16>,
+        gpu.known_grid_size = array<i32: 20, 24, 28>} {
+
+    %block_id_x = gpu.block_id x
+    %block_id_y = gpu.block_id y
+    %block_id_z = gpu.block_id z
+
+    // CHECK: test.reflect_bounds {smax = 19 : index, smin = 0 : index, umax = 19 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 23 : index, smin = 0 : index, umax = 23 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 27 : index, smin = 0 : index, umax = 27 : index, umin = 0 : index}
+    %block_id_x0 = test.reflect_bounds %block_id_x : index
+    %block_id_y0 = test.reflect_bounds %block_id_y : index
+    %block_id_z0 = test.reflect_bounds %block_id_z : index
+
+    %thread_id_x = gpu.thread_id x
+    %thread_id_y = gpu.thread_id y
+    %thread_id_z = gpu.thread_id z
+
+    // CHECK: test.reflect_bounds {smax = 7 : index, smin = 0 : index, umax = 7 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 11 : index, smin = 0 : index, umax = 11 : index, umin = 0 : index}
+    // CHECK: test.reflect_bounds {smax = 15 : index, smin = 0 : index, umax = 15 : index, umin = 0 : index}
+    %thread_id_x0 = test.reflect_bounds %thread_id_x : index
+    %thread_id_y0 = test.reflect_bounds %thread_id_y : index
+    %thread_id_z0 = test.reflect_bounds %thread_id_z : index
+
+    return
+  }
+}

@joker-eph (Collaborator):

I'm missing the motivation here. Shouldn't we just have all GPU code be inside gpu.func anyway?

@krzysz00 (Contributor, Author):

  1. I remember there being a note on Discourse recently saying that non-kernel GPU functions could be func.funcs
  2. IREE keeps its GPU code in a func.func and I want those InferIntRangeInterface implementations to work there

@krzysz00 (Contributor, Author):

Oh, and 3: this'll make it work for spirv.func, if that's a thing.

@joker-eph (Collaborator):

I remember there being a note on Discourse recently saying that non-kernel GPU functions could be func.funcs

Yes, but as far as I remember, the only reason for this is that we wanted to allow functions that work on both CPU and GPU: that is, if a function does not make use of GPU-specific features, it can be a func.func.

Oh, and, 3, this'll make it work for spirv.func if that's a thing

But wouldn't a spirv.func be the result of this conversion? That is, you perform dialect conversion from the GPU dialect to SPIR-V and get a spirv.func with the SPIR-V operation matching this GPU operation?

-      return llvm::transformOptional(func.getKnownGridSize(dim), zext);
+      return llvm::transformOptional(
+          getKnownLaunchAttr(func, GPUFuncOp::getKnownGridSizeAttrName(), dim),
+          zext);
Collaborator:

This seems iffy: it dispatches to discardable attributes instead of inherent attrs/properties.

Contributor (author):

It's already discardable on GPU functions

Collaborator:

I checked the implementation, and indeed getKnownBlockSize() is just a wrapper around the discardable attribute: but that seems like a bug to me? It should be inherent (and optional), I think.

Contributor (author):

It's house style for gpu.func (see also the kernel attribute, last I checked)

And I found my cite for "GPU code doesn't have to be in a GPU func": https://discourse.llvm.org/t/mlir-gpu-calling-gpu-func-that-is-not-a-kernel-a-k-a-device-function/79285.

And so there could be a pass that propagates these annotations from kernels to the func.funcs they call, which is an argument against making them inherent properties of GPU functions only.

So I'd argue that the known block size annotations are dialect-level metadata and that the fix here is to move the name to GPUDialect (and probably to go to a new-style attribute helper).
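(For illustration, a rough sketch of what such a propagation pass could produce; the module and function names here are invented, and the attribute forms follow this PR's final design:)

gpu.module @m {
  gpu.func @kernel() kernel attributes {known_block_size = array<i32: 64, 2, 1>} {
    func.call @helper() : () -> ()
    gpu.return
  }
  // After propagation, the dialect-level discardable attribute carries the
  // bound to the callee, so range inference works there too:
  func.func @helper() attributes {gpu.known_block_size = array<i32: 64, 2, 1>} {
    %tx = gpu.thread_id x   // now inferable as [0, 64)
    return
  }
}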

@joker-eph (Collaborator), Jun 11, 2024:

And so there could be a pass that propagates these annotations from kernels to the func.funcs they call, which is an argument against making them inherent properties of GPU functions only.

I would rather see a process where such func.funcs get turned into proper gpu.funcs instead.
Discardable attributes are supposed to be ... discardable :) So this seems like a design issue to me!

@joker-eph (Collaborator), Jun 11, 2024:

Yeah, honestly, I don't think gpu.module and gpu.func should exist.
I think gpu.module should be builtin.module {target = #gpu.{nvvm,rocdl,...}_target<...>} and that gpu.func should be func.func {gpu.kernel}.

That's a possibility, but I'm concerned about this being overly loose :(
(you know, it has a taste of the "json of IRs")

Contributor (author):

My thoughts on that target thing would be to implement those same interfaces for rocdl.target and the like.

And ... having looked around some more, it seems like gpu.module doesn't have the requirement that it live inside a gpu.container_module anymore (unless you want to use gpu.launch_func).

Contributor (author):

Though I think we just drifted from functions to modules.

As far as I'm concerned, the only difference between

gpu.func @f(...) known_block_sizes = [64, 2, 1] {
  ...
}

and

func.func @f(...) attributes {gpu.known_block_sizes = [64, 2, 1]} {
  ...
}

(or even my_custom.func in the above) is the somewhat higher risk of discardability on the optimization hint.

This PR's trying to generalize some already general handling for this particular hint so that we're not locking downstream into needing a very particular infrastructure (that might not compose with other compilation infrastructures because there's an extra level of module involved) just to get ThreadIdOp::inferIntRanges() working.

Member:

My claim is that operations like gpu.thread_id and gpu.shuffle are in a third class: abstractions around platform-specific GPU intrinsics. That is, these are operations meant to abstract across what are almost inevitably platform-specific intrinsics, allowing people to write code generation schemes that can target "a GPU" (though somewhere in their context they're likely to know which).

+1, this has always been my understanding and I'm not aware of any intended limitations as to where these have to live gpu.func/func.func/spirv.func/etc. None seem to be documented at the moment either AFAICT.

Collaborator:

None seem to be documented at the moment either AFAICT.

Right, hence why I commented above about the need to start with documenting our intent here. If we agree on the destination, it'll be easier to make fast progress on each step!

@krzysz00 (Contributor, Author):

During the conversion, the surrounding context could be either the source function (say, gpu.func) or the target function (say, llvm.func), depending on whether the surrounding function or its body gets rewritten first; hence the original conversion code had cases for both GPU's and LLVM's func ops, which this change generalizes.

The known sizes aren't inherent attributes of gpu.func either - see https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td#L434, which resolves to a discardable attribute.

-      function = gpuFunc;
-    if (auto llvmFunc = op->template getParentOfType<LLVM::LLVMFuncOp>())
-      function = llvmFunc;
+    Operation *function = op->template getParentOfType<FunctionOpInterface>();
Member:

Generalizing this seems fine to me until we have a non-discardable attribute on gpu.func

@kuhar (Member) commented Jun 12, 2024

I'd be leaning towards making this an inherent attribute on gpu.func but also allowing it as a discardable attribute on other function ops. This matches the existing support on llvm.func. I don't have any strong opinions wrt the sequencing here.

@joker-eph (Collaborator):

The APIs to access discardable and non-discardable attributes are supposed to be completely segregated, though: hence why I mentioned the need for an interface.

krzysz00 added 2 commits June 14, 2024 02:15
Make known_block_size and known_grid_size inherent attributes on gpu.func.

Also make them visible discardable attributes on the GPU dialect.

Remove those weird attribute name getters from GPUFuncOp

Also add upperBound attributes to all the index operations so that people
who want to make this information local and not context-dependent
can do so.

Haven't updated tests yet, but I hope this round of generalizations will
address some of the concerns.
@krzysz00 krzysz00 changed the title [mlir] Let GPU ID bounds work on any FunctionOpInterfaces [mlir][GPU] Improve handling of GPU bounds Jun 14, 2024
@krzysz00 (Contributor, Author):

PR description has been updated to reflect the radical change in approach that came out of discussion

@joker-eph joker-eph requested a review from grypp June 14, 2024 23:37
@krzysz00 krzysz00 requested review from kuhar and joker-eph June 16, 2024 05:57
Comment on lines 374 to 376
ROCDL::ROCDLDialect::KernelAttrHelper(&converter.getContext()).getName(),
ROCDL::ROCDLDialect::ReqdWorkGroupSizeAttrHelper(&converter.getContext())
.getName());
Collaborator:

Can you get a handle to the dialect first and then use it to get the various names?

      auto *rocdlDialect = converter.getContext().getLoadedDialect<ROCDL::ROCDLDialect>();

@joker-eph (Collaborator) left a comment:

That seems OK, but can we have a PR to document the GPU dialect layering? It would be great to not lose all the discussion we had here; it can be handy to have this encoded for future reference.

@krzysz00 (Contributor, Author):

@joker-eph The documentation updates are at #95812

@@ -309,8 +309,24 @@ void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {
       .Default([](Type) { llvm_unreachable("unexpected 'gpu' type kind"); });
 }
 
+static LogicalResult verifyKnownLaunchSizeAttr(Operation *op,
+                                               NamedAttribute attr) {
+  auto array = llvm::dyn_cast<DenseI32ArrayAttr>(attr.getValue());
Member:

nit: do we need the llvm:: here?

@krzysz00 krzysz00 merged commit 43fd4c4 into llvm:main Jun 18, 2024
5 of 6 checks passed
@grypp (Member) commented Jun 18, 2024

I apologize for joining the discussion a bit late. I understand the rationale behind introducing the known_block_size attribute.

However, I have some concerns regarding the following code snippet (it's in the example). As per CUDA (and other GPU languages), these attributes are restricted to GPU kernels. It appears, based on my understanding, that the proposed changes in this PR extend these attributes to func.func.

Could you please clarify the intended behavior of the following code? Specifically, should it be treated as a GPU kernel, a GPU function, or a host function?

  func.func @annotated_kernel()
    attributes {gpu.known_block_size = array<i32: 8, 12, 16>,
        gpu.known_grid_size = array<i32: 20, 24, 28>} {

If the goal is to permit GPU kernel attributes on func.func, then this PR needs a verifier that allows known_block_size/known_grid_size only on GPU kernels.

@krzysz00 (Contributor, Author):

@grypp The intent of that code is that the func.func could be modelling a GPU kernel when you're using a non-upstream lowering flow

@krzysz00 (Contributor, Author):

That is, the func.func is either a host or device function depending on the pipeline's context / if it's in a GPU module / ...

@grypp (Member) commented Jun 21, 2024

That is, the func.func is either a host or device function depending on the pipeline's context / if it's in a GPU module / ...

I understand your intention. However, I'm asking if we can use a verifier to ensure that the block size attribute cannot be present unless the func.func is within a gpu.module.

@krzysz00 (Contributor, Author):

I understand your intention. However, I'm asking if we can use a verifier to ensure that the block size attribute cannot be present unless the func.func is within a gpu.module.

The specific intent of this change is that that is not a requirement.

You are allowed to do

some.other.thing target(SomeGPUTargetAttr) {
  func.func {gpu.known_block_size = array<i32: 64, 2, 1>} {
    ...
    %N = gpu.thread_id x
    ...
  }
}

@krzysz00 (Contributor, Author):

If you set this attribute on a host function, the lowering for host functions will ignore it. If you had a gpu.thread_id x in a function that ends up being a host function ... lowering failure, since there's no XToLLVM for x86_64.
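(A sketch of that situation; the function name is illustrative:)

// The discardable attribute on a host function is simply ignored by host
// lowerings, but a stray gpu.thread_id has no host lowering pattern:
func.func @host_fn() attributes {gpu.known_block_size = array<i32: 64, 2, 1>} {
  %tx = gpu.thread_id x   // would fail to lower on an x86_64 (host) path
  return
}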
