[mlir][gpu] Allow subgroup reductions over 1-d vector types #76015


Merged: 4 commits, Dec 21, 2023

Conversation

kuhar (Member) commented Dec 20, 2023

Each vector element is reduced independently, which is a form of multi-reduction.

The plan is to allow for gradual lowering of multi-reductions that results in fewer gpu.shuffle ops at the end:
1d vector.multi_reduction --> 1d gpu.subgroup_reduce --> smaller 1d gpu.subgroup_reduce --> packed gpu.shuffle over i32

For example, we can perform 2 independent f16 reductions with a series of gpu.shuffles over i32, reducing the final number of gpu.shuffles by 2x.
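To make the shuffle-count saving concrete, here is a minimal Python sketch (a simulation with hypothetical helper names, not the actual MLIR lowering): each lane packs its two f16 elements into one 32-bit word, and every xor-butterfly shuffle stage then moves both elements at once, so the whole vector is reduced with the same number of shuffles as a single scalar.

```python
import struct

def pack_2xf16(a, b):
    # Bitcast-style packing: two f16 values into one 32-bit word.
    return int.from_bytes(struct.pack("<2e", a, b), "little")

def unpack_2xf16(word):
    # Inverse of pack_2xf16: recover the two f16 values.
    return struct.unpack("<2e", word.to_bytes(4, "little"))

def packed_subgroup_reduce_add(lanes):
    """Simulate a subgroup 'add' reduction over vector<2xf16> per lane.

    `lanes` is a list of (f16, f16) pairs, one per work item; its length
    must be a power of two. Each butterfly stage costs one simulated
    32-bit xor-shuffle, regardless of the vector width.
    """
    words = [pack_2xf16(a, b) for a, b in lanes]
    num_shuffles = 0
    stride = len(lanes) // 2
    while stride >= 1:
        # One simulated 'gpu.shuffle xor' over i32 moves BOTH f16 elements.
        shuffled = [words[i ^ stride] for i in range(len(words))]
        num_shuffles += 1
        next_words = []
        for w, s in zip(words, shuffled):
            a0, b0 = unpack_2xf16(w)
            a1, b1 = unpack_2xf16(s)
            # Elementwise add on the unpacked halves, then repack.
            next_words.append(pack_2xf16(a0 + a1, b0 + b1))
        words = next_words
        stride //= 2
    # After log2(n) stages every lane holds the same result; unpack lane 0.
    return unpack_2xf16(words[0]), num_shuffles
```

For a 4-lane subgroup this takes 2 packed shuffles instead of the 4 that two separate scalar f16 reductions would need, matching the 2x figure above.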

llvmbot (Member) commented Dec 20, 2023

@llvm/pr-subscribers-mlir-spirv
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-gpu

Author: Jakub Kuderski (kuhar)

Changes

Each vector element is reduced independently, which is a form of multi-reduction.

The plan is to allow for gradual lowering of multi-reductions that results in fewer gpu.shuffle ops at the end:
1d vector.multi_reduction --> 1d gpu.subgroup_reduce --> smaller 1d gpu.subgroup_reduce --> packed gpu.shuffle over i32 or i64

For example, we can perform 4 independent f16 reductions with a series of gpu.shuffles over i64, reducing the final number of gpu.shuffles by 4x.


Full diff: https://github.com/llvm/llvm-project/pull/76015.diff

4 Files Affected:

  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (+12-4)
  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+10-1)
  • (modified) mlir/test/Dialect/GPU/invalid.mlir (+11-3)
  • (modified) mlir/test/Dialect/GPU/ops.mlir (+5)
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 2e21cd77d2d83b..7777dd58eba1f7 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -19,10 +19,11 @@ include "mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td"
 include "mlir/Dialect/GPU/IR/CompilationAttrs.td"
 include "mlir/Dialect/GPU/IR/ParallelLoopMapperAttr.td"
 include "mlir/Dialect/GPU/TransformOps/GPUDeviceMappingAttr.td"
+include "mlir/IR/CommonTypeConstraints.td"
 include "mlir/IR/EnumAttr.td"
-include "mlir/Interfaces/FunctionInterfaces.td"
 include "mlir/IR/SymbolInterfaces.td"
 include "mlir/Interfaces/DataLayoutInterfaces.td"
+include "mlir/Interfaces/FunctionInterfaces.td"
 include "mlir/Interfaces/InferIntRangeInterface.td"
 include "mlir/Interfaces/InferTypeOpInterface.td"
 include "mlir/Interfaces/SideEffectInterfaces.td"
@@ -1022,16 +1023,23 @@ def GPU_AllReduceOp : GPU_Op<"all_reduce",
   let hasRegionVerifier = 1;
 }
 
+def AnyIntegerOrFloatOr1DVector :
+  AnyTypeOf<[AnyIntegerOrFloat, VectorOfRankAndType<[1], [AnyIntegerOrFloat]>]>;
+
 def GPU_SubgroupReduceOp : GPU_Op<"subgroup_reduce", [SameOperandsAndResultType]> {
   let summary = "Reduce values among subgroup.";
   let description = [{
     The `subgroup_reduce` op reduces the value of every work item across a
     subgroup. The result is equal for all work items of a subgroup.
 
+    When the reduced value is of a vector type, each vector element is reduced
+    independently. Only 1-d vector types are allowed.
+
     Example:
 
     ```mlir
-    %1 = gpu.subgroup_reduce add %0 : (f32) -> (f32)
+    %1 = gpu.subgroup_reduce add %a : (f32) -> (f32)
+    %2 = gpu.subgroup_reduce add %b : (vector<4xf16>) -> (vector<4xf16>)
     ```
 
     If `uniform` flag is set either none or all work items of a subgroup
@@ -1044,11 +1052,11 @@ def GPU_SubgroupReduceOp : GPU_Op<"subgroup_reduce", [SameOperandsAndResultType]
   }];
 
   let arguments = (ins
-    AnyIntegerOrFloat:$value,
+    AnyIntegerOrFloatOr1DVector:$value,
     GPU_AllReduceOperationAttr:$op,
     UnitAttr:$uniform
   );
-  let results = (outs AnyIntegerOrFloat:$result);
+  let results = (outs AnyIntegerOrFloatOr1DVector:$result);
 
   let assemblyFormat = [{ custom<AllReduceOperation>($op) $value
                           (`uniform` $uniform^)? attr-dict
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 7c3330f4c238f8..dd482f305fcbc8 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -19,6 +19,7 @@
 #include "mlir/IR/BuiltinAttributes.h"
 #include "mlir/IR/BuiltinOps.h"
 #include "mlir/IR/BuiltinTypes.h"
+#include "mlir/IR/Diagnostics.h"
 #include "mlir/IR/DialectImplementation.h"
 #include "mlir/IR/Matchers.h"
 #include "mlir/IR/OpImplementation.h"
@@ -588,8 +589,16 @@ static void printAllReduceOperation(AsmPrinter &printer, Operation *op,
 //===----------------------------------------------------------------------===//
 
 LogicalResult gpu::SubgroupReduceOp::verify() {
+  Type elemType = getType();
+  if (auto vecTy = dyn_cast<VectorType>(elemType)) {
+    if (vecTy.isScalable())
+      return emitOpError() << "is not compatible with scalable vector types";
+
+    elemType = vecTy.getElementType();
+  }
+
   gpu::AllReduceOperation opName = getOp();
-  if (failed(verifyReduceOpAndType(opName, getType()))) {
+  if (failed(verifyReduceOpAndType(opName, elemType))) {
     return emitError() << '`' << gpu::stringifyAllReduceOperation(opName)
                        << "` reduction operation is not compatible with type "
                        << getType();
diff --git a/mlir/test/Dialect/GPU/invalid.mlir b/mlir/test/Dialect/GPU/invalid.mlir
index d8a40f89f80ac2..8a34d64326072b 100644
--- a/mlir/test/Dialect/GPU/invalid.mlir
+++ b/mlir/test/Dialect/GPU/invalid.mlir
@@ -333,9 +333,17 @@ func.func @reduce_invalid_op_type_maximumf(%arg0 : i32) {
 
 // -----
 
-func.func @subgroup_reduce_bad_type(%arg0 : vector<2xf32>) {
-  // expected-error@+1 {{'gpu.subgroup_reduce' op operand #0 must be Integer or Float}}
-  %res = gpu.subgroup_reduce add %arg0 : (vector<2xf32>) -> vector<2xf32>
+func.func @subgroup_reduce_bad_type(%arg0 : vector<2x2xf32>) {
+  // expected-error@+1 {{'gpu.subgroup_reduce' op operand #0 must be Integer or Float or vector of}}
+  %res = gpu.subgroup_reduce add %arg0 : (vector<2x2xf32>) -> vector<2x2xf32>
+  return
+}
+
+// -----
+
+func.func @subgroup_reduce_bad_type_scalable(%arg0 : vector<[2]xf32>) {
+  // expected-error@+1 {{is not compatible with scalable vector types}}
+  %res = gpu.subgroup_reduce add %arg0 : (vector<[2]xf32>) -> vector<[2]xf32>
   return
 }
 
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index 48193436415637..c3b548062638f1 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -84,6 +84,8 @@ module attributes {gpu.container_module} {
 
       %one = arith.constant 1.0 : f32
 
+      %vec = vector.broadcast %arg0 : f32 to vector<4xf32>
+
       // CHECK: %{{.*}} = gpu.all_reduce add %{{.*}} {
       // CHECK-NEXT: } : (f32) -> f32
       %sum = gpu.all_reduce add %one {} : (f32) -> (f32)
@@ -98,6 +100,9 @@ module attributes {gpu.container_module} {
       // CHECK: %{{.*}} = gpu.subgroup_reduce add %{{.*}} uniform : (f32) -> f32
       %sum_subgroup1 = gpu.subgroup_reduce add %one uniform : (f32) -> f32
 
+      // CHECK: %{{.*}} = gpu.subgroup_reduce add %{{.*}} : (vector<4xf32>) -> vector<4xf32>
+      %sum_subgroup2 = gpu.subgroup_reduce add %vec : (vector<4xf32>) -> vector<4xf32>
+
       %width = arith.constant 7 : i32
       %offset = arith.constant 3 : i32
       // CHECK: gpu.shuffle xor %{{.*}}, %{{.*}}, %{{.*}} : f32

grypp (Member) commented Dec 20, 2023

Thanks for your implementation. The goal of the PR sounds good to me.

> For example we can perform 4 independent f16 reductions with a series of gpu.shuffles over i64, reducing the final number of gpu.shuffles by 4x.

PTX only allows shuffling 32-bit registers, so 4xf16 needs two shuffles. Can the PR support that?

fabianmcg (Contributor) left a comment

Reducing the number of GPU shuffles in a multi-reduce op only makes sense if the type is smaller than 32 bits, as both NVIDIA and AMD only support 32-bit shuffles.

So my question is: how do you plan to lower large vectors (i.e., vector<2xf32>)?

I'd prefer only accepting vector<1xi32>, and let the user use vector.bitcast.

kuhar (Member, Author) commented Dec 20, 2023

> PTX only allows shuffling 32-bit registers, so 4xf16 needs two shuffles. Can the PR support that?

@grypp Sure, the lowering chain that I described can be parametrized with the largest shuffle type available.

> So my question is: how do you plan to lower large vectors (i.e., vector<2xf32>)?
>
> I'd prefer only accepting vector<1xi32>, and let the user use vector.bitcast.

@fabianmcg By having patterns to break them down into 'shuffleable' chunks. To continue with your example, we would do something like:

  1. Unroll %a = gpu.subgroup_reduce add %x : vector<2xf32> -> vector<2xf32> into per-element scalar reductions:
     %a = vector.extract %x[0] : f32 from vector<2xf32> // or extract_strided_slice of vector<1xf32>
     %b = vector.extract %x[1] : f32 from vector<2xf32>
     %c = gpu.subgroup_reduce add %a : f32 -> f32
     %d = gpu.subgroup_reduce add %b : f32 -> f32
     %e = vector.insert %c ...
     %f = vector.insert %d ...
  2. Lower gpu.subgroup_reduce add to shuffles if necessary.

If we only allowed vector<1xi32> we would lose the semantic information that the lowering should perform reduction over f32. This is fine for gpu.shuffle as the reduction part happens in between the data movement, but not for subgroup_reduce which does perform reductions.
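The unrolling step above can be modeled in a few lines of Python (an illustration with hypothetical names, not the real pattern code): element i of the vector is reduced across all lanes independently, mirroring the extract / scalar subgroup_reduce / insert sequence, and every work item ends up with the same reduced vector.

```python
def subgroup_reduce_vector(lanes, combine):
    """Model of the element-wise decomposition of a vector subgroup reduce.

    `lanes` holds one equal-length list per work item; `combine` is the
    scalar reduction (e.g. add). Each vector element is reduced across
    all lanes independently, like one scalar gpu.subgroup_reduce per
    element.
    """
    width = len(lanes[0])
    reduced = []
    for i in range(width):
        # Scalar reduction over element i of every lane.
        acc = lanes[0][i]
        for lane in lanes[1:]:
            acc = combine(acc, lane[i])
        reduced.append(acc)
    # Every work item of the subgroup observes the same reduced vector.
    return [list(reduced) for _ in lanes]
```

Because the semantic combine function is explicit here, later patterns are free to lower each scalar reduction to shuffles in whatever packed form the target supports.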

fabianmcg (Contributor) commented

> If we only allowed vector<1xi32> we would lose the semantic information that the lowering should perform reduction over f32. This is fine for gpu.shuffle as the reduction part happens in between the data movement, but not for subgroup_reduce which does perform reductions.

You're right, I was only thinking about the shuffling part. To be more specific, my concern is things like vector<4xf64> or larger. Don't get me wrong, I'm all in for supporting things like vector<4xi8>, but I'm not convinced it should support all sizes.

kuhar (Member, Author) commented Dec 20, 2023

> If we only allowed vector<1xi32> we would lose the semantic information that the lowering should perform reduction over f32. This is fine for gpu.shuffle as the reduction part happens in between the data movement, but not for subgroup_reduce which does perform reductions.
>
> You're right, I was only thinking about the shuffling part. To be more specific, my concern is things like vector<4xf64> or larger. Don't get me wrong, I'm all in for supporting things like vector<4xi8>, but I'm not convinced it should support all sizes.

The main point of supporting more sizes in gpu.subgroup_reduce (not gpu.shuffle) is to allow for a gradual lowering path. Also, SPIR-V is less restrictive here: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformFAdd and https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformShuffleXor. This is another reason why I don't want to encode this NVIDIA/AMD-specific detail into the type system -- we can always legalize these based on what we target.

fabianmcg (Contributor) commented

> Also, SPIR-V is less restrictive here: ... This is another reason why I don't want to encode this nvidia/amd gpu specific detail into the type system

That sounds good to me.

Hardcode84 (Contributor) left a comment

LGTM, thanks

qedawkins (Contributor) left a comment

LGTM
