[mlir][GPU] Improve handling of GPU bounds #95166
Conversation
This change removes the requirement that the known block or grid IDs be stored on a gpu.func; instead, it allows them on any function implementing FunctionOpInterface. This allows, for instance, non-kernel functions that live in a func.func, or downstream use cases that don't use gpu.func.
@llvm/pr-subscribers-mlir-gpu @llvm/pr-subscribers-mlir Author: Krzysztof Drewniak (krzysz00). Full diff: https://github.com/llvm/llvm-project/pull/95166.diff (3 files affected):
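For illustration, here is the shape this enables (adapted from the test added in this PR; the attribute spellings are taken from that test):

```mlir
// A plain func.func (not a gpu.func) carrying launch-bound annotations.
module {
  func.func @annotated() attributes {
      gpu.known_block_size = array<i32: 8, 12, 16>,
      gpu.known_grid_size = array<i32: 20, 24, 28>} {
    // Range inference can now bound this ID to [0, 8) even though the
    // enclosing op is not a gpu.func.
    %tid = gpu.thread_id x
    return
  }
}
```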
diff --git a/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h b/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
index d067c70a90ea4..0f74768207205 100644
--- a/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
+++ b/mlir/lib/Conversion/GPUCommon/IndexIntrinsicsOpLowering.h
@@ -57,11 +57,7 @@ struct GPUIndexIntrinsicOpLowering : public ConvertOpToLLVMPattern<Op> {
break;
}
- Operation *function;
- if (auto gpuFunc = op->template getParentOfType<gpu::GPUFuncOp>())
- function = gpuFunc;
- if (auto llvmFunc = op->template getParentOfType<LLVM::LLVMFuncOp>())
- function = llvmFunc;
+ Operation *function = op->template getParentOfType<FunctionOpInterface>();
if (!boundsAttrName.empty() && function) {
if (auto attr = function->template getAttrOfType<DenseI32ArrayAttr>(
boundsAttrName)) {
diff --git a/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp b/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
index 69017efb9a0e6..152884e23b929 100644
--- a/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
+++ b/mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
@@ -8,6 +8,7 @@
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Matchers.h"
+#include "mlir/Interfaces/FunctionInterfaces.h"
#include "mlir/Interfaces/InferIntRangeInterface.h"
#include "llvm/ADT/STLForwardCompat.h"
#include "llvm/Support/ErrorHandling.h"
@@ -54,6 +55,17 @@ static Value valueByDim(KernelDim3 dims, Dimension dim) {
static uint64_t zext(uint32_t arg) { return static_cast<uint64_t>(arg); }
+static std::optional<uint32_t> getKnownLaunchAttr(FunctionOpInterface func,
+ StringRef attrName,
+ Dimension dim) {
+ auto bounds = func.getOperation()->getAttrOfType<DenseI32ArrayAttr>(attrName);
+ if (!bounds)
+ return std::nullopt;
+  if (bounds.size() <= static_cast<uint32_t>(dim))
+ return std::nullopt;
+ return bounds[static_cast<uint32_t>(dim)];
+}
+
template <typename Op>
static std::optional<uint64_t> getKnownLaunchDim(Op op, LaunchDims type) {
Dimension dim = op.getDimension();
@@ -73,12 +85,16 @@ static std::optional<uint64_t> getKnownLaunchDim(Op op, LaunchDims type) {
return value.getZExtValue();
}
- if (auto func = op->template getParentOfType<GPUFuncOp>()) {
+ if (auto func = op->template getParentOfType<FunctionOpInterface>()) {
switch (type) {
case LaunchDims::Block:
- return llvm::transformOptional(func.getKnownBlockSize(dim), zext);
+ return llvm::transformOptional(
+ getKnownLaunchAttr(func, GPUFuncOp::getKnownBlockSizeAttrName(), dim),
+ zext);
case LaunchDims::Grid:
- return llvm::transformOptional(func.getKnownGridSize(dim), zext);
+ return llvm::transformOptional(
+ getKnownLaunchAttr(func, GPUFuncOp::getKnownGridSizeAttrName(), dim),
+ zext);
}
}
return std::nullopt;
diff --git a/mlir/test/Dialect/GPU/int-range-interface.mlir b/mlir/test/Dialect/GPU/int-range-interface.mlir
index a0917a2fdf110..a6c74fec6e824 100644
--- a/mlir/test/Dialect/GPU/int-range-interface.mlir
+++ b/mlir/test/Dialect/GPU/int-range-interface.mlir
@@ -215,3 +215,36 @@ module attributes {gpu.container_module} {
}
}
+// -----
+
+// CHECK-LABEL: func @annotated_kernel
+module {
+ func.func @annotated_kernel()
+ attributes {gpu.known_block_size = array<i32: 8, 12, 16>,
+ gpu.known_grid_size = array<i32: 20, 24, 28>} {
+
+ %block_id_x = gpu.block_id x
+ %block_id_y = gpu.block_id y
+ %block_id_z = gpu.block_id z
+
+ // CHECK: test.reflect_bounds {smax = 19 : index, smin = 0 : index, umax = 19 : index, umin = 0 : index}
+ // CHECK: test.reflect_bounds {smax = 23 : index, smin = 0 : index, umax = 23 : index, umin = 0 : index}
+ // CHECK: test.reflect_bounds {smax = 27 : index, smin = 0 : index, umax = 27 : index, umin = 0 : index}
+ %block_id_x0 = test.reflect_bounds %block_id_x : index
+ %block_id_y0 = test.reflect_bounds %block_id_y : index
+ %block_id_z0 = test.reflect_bounds %block_id_z : index
+
+ %thread_id_x = gpu.thread_id x
+ %thread_id_y = gpu.thread_id y
+ %thread_id_z = gpu.thread_id z
+
+ // CHECK: test.reflect_bounds {smax = 7 : index, smin = 0 : index, umax = 7 : index, umin = 0 : index}
+ // CHECK: test.reflect_bounds {smax = 11 : index, smin = 0 : index, umax = 11 : index, umin = 0 : index}
+ // CHECK: test.reflect_bounds {smax = 15 : index, smin = 0 : index, umax = 15 : index, umin = 0 : index}
+ %thread_id_x0 = test.reflect_bounds %thread_id_x : index
+ %thread_id_y0 = test.reflect_bounds %thread_id_y : index
+ %thread_id_z0 = test.reflect_bounds %thread_id_z : index
+
+ return
+ }
+}
I'm missing the motivation here. Shouldn't we just have all GPU code be inside gpu.func?
Oh, and, 3, this'll make it work for spirv.func.
Yes, but as far as I remember, the only reason for this is that we wanted to allow functions that work on CPU and GPU: that is, if a function does not make use of GPU-specific features then it can be a func.func.
But wouldn't a spirv.func be the result of this conversion? That is, you perform dialect conversion from the GPU dialect to SPIRV and you get a spirv.func with the spirv operation matching this GPU operation?
      return llvm::transformOptional(func.getKnownGridSize(dim), zext);
      return llvm::transformOptional(
          getKnownLaunchAttr(func, GPUFuncOp::getKnownGridSizeAttrName(), dim),
          zext);
This seems iffy: it dispatches to discardable attributes instead of inherent attrs/properties.
It's already discardable on GPU functions
I checked the implementation, and indeed getKnownBlockSize() is just a wrapper around the discardable attributes: but that seems like a bug to me? It should be inherent (and optional), I think.
It's house style for gpu.func (see also the kernel attribute, last I checked).
And I found my citation for "GPU code doesn't have to be in a GPU func": https://discourse.llvm.org/t/mlir-gpu-calling-gpu-func-that-is-not-a-kernel-a-k-a-device-function/79285 .
And so there could be a pass that propagates these annotations from kernels to the func.funcs they call, which is an argument against making them inherent properties of GPU functions only.
So I'd argue that the known block size annotations are dialect-level metadata and that the fix here is to move the name to GPUDialect (and probably to go to a new-style attribute helper).
And so there could be a pass that propagates these annotations from kernels to the func.funcs they call, which is an argument against making them inherent properties of GPU functions only.
I would rather see a process where such func.funcs get turned into proper gpu.funcs instead.
Discardable attributes are supposed to be ... discardable :) So this seems like a design issue to me here!
Yeah, honestly, I don't think gpu.module and gpu.func should exist.
I think gpu.module should be builtin.module {target = #gpu.{nvvm,rocdl,...}_target<...>} and that gpu.func should be func.func {gpu.kernel}.
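A hedged sketch of what that proposed shape might look like (hypothetical spellings; neither form exists today, and the elided target parameters are left as "..."):

```mlir
// Hypothetical: a plain module with a target attribute instead of gpu.module.
builtin.module attributes {target = #gpu.nvvm_target<...>} {
  // Hypothetical: a plain function tagged as a kernel instead of gpu.func.
  func.func @kernel() attributes {gpu.kernel} {
    return
  }
}
```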
That's a possibility, but I'm concerned about this being overly loose :(
(you know, it has a taste of the "json of IRs")
My thoughts on that target thing would be to implement those same interfaces for rocdl.target and the like.
And ... having looked around some more, it seems like gpu.module doesn't have the requirement that it live inside a gpu.container_module anymore (unless you want to use gpu.launch_func).
Though I think we just drifted from functions to modules.
As far as I'm concerned, the only difference between
gpu.func @f(...) known_block_sizes = [64, 2, 1] {
...
}
and
func.func @f(...) attributes {gpu.known_block_sizes = [64, 2, 1]} {
...
}
(or even my_custom.func in the above) is the somewhat higher risk of discardability on the optimization hint.
This PR's trying to generalize some already general handling for this particular hint so that we're not locking downstream into needing a very particular infrastructure (that might not compose with other compilation infrastructures because there's an extra level of module involved) just to get ThreadIdOp::inferIntRanges() working.
My claim is that operations like gpu.thread_id and gpu.shuffle are in a third class: abstractions around platform-specific GPU intrinsics. That is, these are operations meant to abstract across what are almost inevitably platform-specific intrinsics, allowing people to write code generation schemes that can target "a GPU" (though somewhere in their context they're likely to know which).
+1, this has always been my understanding and I'm not aware of any intended limitations as to where these have to live (gpu.func/func.func/spirv.func/etc.). None seem to be documented at the moment either AFAICT.
None seem to be documented at the moment either AFAICT.
Right, hence why I commented above about the need to start with documenting our intent here. If we agree on the destination, it'll be easier to make fast progress on each step!
During the conversion, the surrounding context could either be the source function (say,

The known sizes also aren't inherent attributes of
  function = gpuFunc;
if (auto llvmFunc = op->template getParentOfType<LLVM::LLVMFuncOp>())
  function = llvmFunc;
Operation *function = op->template getParentOfType<FunctionOpInterface>();
Generalizing this seems fine to me until we have a non-discardable attribute on gpu.func
      return llvm::transformOptional(func.getKnownGridSize(dim), zext);
      return llvm::transformOptional(
          getKnownLaunchAttr(func, GPUFuncOp::getKnownGridSizeAttrName(), dim),
          zext);
I'd be leaning towards making this an inherent attribute over a discardable one.
The APIs to access discardable and non-discardable attributes are supposed to be completely segregated though: hence why I mentioned the need for an interface.
Make known_block_size and known_grid_size inherent attributes on gpu.func. Also make them visible as discardable attributes on the GPU dialect. Remove the attribute name getters from GPUFuncOp. Also add upperBound attributes to all the index operations so that people who want to make this information local and not context-dependent can do so. Haven't updated tests yet, but I hope this round of generalizations will address some of the concerns.
Co-authored-by: Mehdi Amini <[email protected]>
PR description has been updated to reflect the radical change in approach that came out of discussion.
        ROCDL::ROCDLDialect::KernelAttrHelper(&converter.getContext()).getName(),
        ROCDL::ROCDLDialect::ReqdWorkGroupSizeAttrHelper(&converter.getContext())
            .getName());
Can you get a handle to the dialect first and then use it to get the various names?
auto *rocdlDialect = converter.getContext().getLoadedDialect<ROCDL::ROCDLDialect>();
That seems OK, but can we have a PR to document the GPU dialect layering? It would be great to not lose all the discussion we had here, it can be handy to have this encoded for future reference.
@joker-eph The documentation updates are at #95812 |
@@ -309,8 +309,24 @@ void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {
      .Default([](Type) { llvm_unreachable("unexpected 'gpu' type kind"); });
}

static LogicalResult verifyKnownLaunchSizeAttr(Operation *op,
                                               NamedAttribute attr) {
  auto array = llvm::dyn_cast<DenseI32ArrayAttr>(attr.getValue());
nit: do we need the llvm:: here?
I apologize for joining the discussion a bit late. I understand the rationale behind introducing these attributes. However, I have some concerns regarding the following code snippet (it's in the example). As per CUDA (and other GPU languages), these attributes are restricted to GPU kernels. It appears, based on my understanding, that the proposed changes in the PR are extending these attributes to func.func. Could you please clarify the intended behavior of the following code? Specifically, should it be treated as a GPU kernel, a GPU function, or a host function?
If the goal is to permit GPU kernel attributes on func.func
@grypp The intent of that code is that the
I understand your intention. However, I'm asking if we can use a verifier to ensure that the block size attribute cannot be present unless
The specific intent of this change is that that is not a requirement. You are allowed to do this.
If you set this attribute on a host function, the lowering for host functions will ignore it.
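A sketch of the permissiveness being described (hypothetical function name; the attribute spelling matches the test in this PR):

```mlir
// The annotation is a hint, not a kernel marker: on a device-side function
// it feeds range inference; on a host-side function the host lowering
// simply ignores it.
func.func @helper_callable_from_anywhere() attributes {
    gpu.known_block_size = array<i32: 64, 1, 1>} {
  // Inferred range [0, 64) when this is compiled for the device.
  %tid = gpu.thread_id x
  return
}
```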
This change reworks how range information for GPU dispatch IDs (block IDs, thread IDs, and so on) is handled.

- known_block_size and known_grid_size become inherent attributes of GPU functions. This makes them less clunky to work with. As a consequence, the gpu.func lowering patterns now only look at the inherent attributes when setting target-specific attributes on the llvm.func that they lower to.
- gpu.known_block_size and gpu.known_grid_size are made official dialect-level discardable attributes which can be placed on arbitrary functions. This allows for progressive lowerings (without this, a lowering for gpu.thread_id couldn't know about the bounds if it had already been moved from a gpu.func to an llvm.func) and allows for range information to be provided even when gpu.*_{id,dim} are being used outside of a gpu.func context.
- The index operations gain an upper_bound attribute, allowing for an alternate mode of operation where the bounds are specified locally and not inherited from the operation's context. These also allow handling of cases where the precise launch sizes aren't known, but can be bounded more precisely than the maximum of what any platform's API allows. (I'd like to thank @benvanik for pointing out that this could be useful.)

When inferring bounds (either for range inference or for setting range during lowering), these sources of information are consulted in order of specificity (upper_bound > inherent attribute > discardable attribute, except that dimension sizes check for known_*_bounds to see if they can be constant-folded before checking their upper_bound).

This patch also updates the documentation about the bounds and inference behavior to clarify what these attributes do when set and the consequences of setting them up incorrectly.
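Under stated assumptions about the final assembly format (the upper_bound spelling below is a sketch; the authoritative syntax is in the updated dialect docs), the precedence order can be illustrated as:

```mlir
func.func @precedence_demo() attributes {
    // Least specific: the dialect-level discardable attribute,
    // allowed on any function (inherent attributes on gpu.func
    // sit between this and upper_bound).
    gpu.known_block_size = array<i32: 8, 1, 1>} {
  // Most specific: a bound attached to the op itself, independent
  // of any surrounding function (assumed syntax).
  %a = gpu.thread_id x upper_bound 64
  // Otherwise the enclosing function's annotation applies:
  // inferred range [0, 8) here.
  %b = gpu.thread_id x
  return
}
```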