[mlir][gpu] Allow subgroup reductions over 1-d vector types #76015


Merged: 4 commits, Dec 21, 2023

Conversation

kuhar (Member) commented Dec 20, 2023

Each vector element is reduced independently, which is a form of multi-reduction.

The plan is to allow for gradual lowering of multi-reductions that results in fewer gpu.shuffle ops at the end:
1d vector.multi_reduction --> 1d gpu.subgroup_reduce --> smaller 1d gpu.subgroup_reduce --> packed gpu.shuffle over i32

For example, we can perform 2 independent f16 reductions with a series of gpu.shuffles over i32, reducing the final number of gpu.shuffles by 2x.
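To make the shuffle-count saving concrete, here is a minimal Python sketch (a simulation with hypothetical helper names, not the actual MLIR lowering): each lane packs its two f16 elements into one 32-bit word, and every xor-butterfly shuffle stage then moves both elements at once, so the whole vector is reduced with the same number of shuffles as a single scalar.

```python
import struct

def pack_2xf16(a, b):
    # Bitcast-style packing: two f16 values into one 32-bit word.
    return int.from_bytes(struct.pack("<2e", a, b), "little")

def unpack_2xf16(word):
    # Inverse of pack_2xf16: recover the two f16 values.
    return struct.unpack("<2e", word.to_bytes(4, "little"))

def packed_subgroup_reduce_add(lanes):
    """Simulate a subgroup 'add' reduction over vector<2xf16> per lane.

    `lanes` is a list of (f16, f16) pairs, one per work item; its length
    must be a power of two. Each butterfly stage costs one simulated
    32-bit xor-shuffle, regardless of the vector width.
    """
    words = [pack_2xf16(a, b) for a, b in lanes]
    num_shuffles = 0
    stride = len(lanes) // 2
    while stride >= 1:
        # One simulated 'gpu.shuffle xor' over i32 moves BOTH f16 elements.
        shuffled = [words[i ^ stride] for i in range(len(words))]
        num_shuffles += 1
        next_words = []
        for w, s in zip(words, shuffled):
            a0, b0 = unpack_2xf16(w)
            a1, b1 = unpack_2xf16(s)
            # Elementwise add on the unpacked halves, then repack.
            next_words.append(pack_2xf16(a0 + a1, b0 + b1))
        words = next_words
        stride //= 2
    # After log2(n) stages every lane holds the same result; unpack lane 0.
    return unpack_2xf16(words[0]), num_shuffles
```

For a 4-lane subgroup this takes 2 packed shuffles instead of the 4 that two separate scalar f16 reductions would need, matching the 2x figure above.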

llvmbot (Member) commented Dec 20, 2023

@llvm/pr-subscribers-mlir-spirv
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-gpu

Author: Jakub Kuderski (kuhar)

Changes

Each vector element is reduced independently, which is a form of multi-reduction.

The plan is to allow for gradual lowering of multi-reductions that results in fewer gpu.shuffle ops at the end:
1d vector.multi_reduction --> 1d gpu.subgroup_reduce --> smaller 1d gpu.subgroup_reduce --> packed gpu.shuffle over i32 or i64

For example, we can perform 4 independent f16 reductions with a series of gpu.shuffles over i64, reducing the final number of gpu.shuffles by 4x.


Full diff: https://github.com/llvm/llvm-project/pull/76015.diff

4 Files Affected:

  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (+12-4)
  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+10-1)
  • (modified) mlir/test/Dialect/GPU/invalid.mlir (+11-3)
  • (modified) mlir/test/Dialect/GPU/ops.mlir (+5)
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 2e21cd77d2d83b..7777dd58eba1f7 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -19,10 +19,11 @@ include "mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td"
 include "mlir/Dialect/GPU/IR/CompilationAttrs.td"
 include "mlir/Dialect/GPU/IR/ParallelLoopMapperAttr.td"
 include "mlir/Dialect/GPU/TransformOps/GPUDeviceMappingAttr.td"
+include "mlir/IR/CommonTypeConstraints.td"
 include "mlir/IR/EnumAttr.td"
-include "mlir/Interfaces/FunctionInterfaces.td"
 include "mlir/IR/SymbolInterfaces.td"
 include "mlir/Interfaces/DataLayoutInterfaces.td"
+include "mlir/Interfaces/FunctionInterfaces.td"
 include "mlir/Interfaces/InferIntRangeInterface.td"
 include "mlir/Interfaces/InferTypeOpInterface.td"
 include "mlir/Interfaces/SideEffectInterfaces.td"
@@ -1022,16 +1023,23 @@ def GPU_AllReduceOp : GPU_Op<"all_reduce",
   let hasRegionVerifier = 1;
 }
 
+def AnyIntegerOrFloatOr1DVector :
+  AnyTypeOf<[AnyIntegerOrFloat, VectorOfRankAndType<[1], [AnyIntegerOrFloat]>]>;
+
 def GPU_SubgroupReduceOp : GPU_Op<"subgroup_reduce", [SameOperandsAndResultType]> {
   let summary = "Reduce values among subgroup.";
   let description = [{
     The `subgroup_reduce` op reduces the value of every work item across a
     subgroup. The result is equal for all work items of a subgroup.
 
+    When the reduced value is of a vector type, each vector element is reduced
+    independently. Only 1-d vector types are allowed.
+
     Example:
 
     ```mlir
-    %1 = gpu.subgroup_reduce add %0 : (f32) -> (f32)
+    %1 = gpu.subgroup_reduce add %a : (f32) -> (f32)
+    %2 = gpu.subgroup_reduce add %b : (vector<4xf16>) -> (vector<4xf16>)
     ```
 
     If `uniform` flag is set either none or all work items of a subgroup
@@ -1044,11 +1052,11 @@ def GPU_SubgroupReduceOp : GPU_Op<"subgroup_reduce", [SameOperandsAndResultType]
   }];
 
   let arguments = (ins
-    AnyIntegerOrFloat:$value,
+    AnyIntegerOrFloatOr1DVector:$value,
     GPU_AllReduceOperationAttr:$op,
     UnitAttr:$uniform
   );
-  let results = (outs AnyIntegerOrFloat:$result);
+  let results = (outs AnyIntegerOrFloatOr1DVector:$result);
 
   let assemblyFormat = [{ custom<AllReduceOperation>($op) $value
                           (`uniform` $uniform^)? attr-dict
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 7c3330f4c238f8..dd482f305fcbc8 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -19,6 +19,7 @@
 #include "mlir/IR/BuiltinAttributes.h"
 #include "mlir/IR/BuiltinOps.h"
 #include "mlir/IR/BuiltinTypes.h"
+#include "mlir/IR/Diagnostics.h"
 #include "mlir/IR/DialectImplementation.h"
 #include "mlir/IR/Matchers.h"
 #include "mlir/IR/OpImplementation.h"
@@ -588,8 +589,16 @@ static void printAllReduceOperation(AsmPrinter &printer, Operation *op,
 //===----------------------------------------------------------------------===//
 
 LogicalResult gpu::SubgroupReduceOp::verify() {
+  Type elemType = getType();
+  if (auto vecTy = dyn_cast<VectorType>(elemType)) {
+    if (vecTy.isScalable())
+      return emitOpError() << "is not compatible with scalable vector types";
+
+    elemType = vecTy.getElementType();
+  }
+
   gpu::AllReduceOperation opName = getOp();
-  if (failed(verifyReduceOpAndType(opName, getType()))) {
+  if (failed(verifyReduceOpAndType(opName, elemType))) {
     return emitError() << '`' << gpu::stringifyAllReduceOperation(opName)
                        << "` reduction operation is not compatible with type "
                        << getType();
diff --git a/mlir/test/Dialect/GPU/invalid.mlir b/mlir/test/Dialect/GPU/invalid.mlir
index d8a40f89f80ac2..8a34d64326072b 100644
--- a/mlir/test/Dialect/GPU/invalid.mlir
+++ b/mlir/test/Dialect/GPU/invalid.mlir
@@ -333,9 +333,17 @@ func.func @reduce_invalid_op_type_maximumf(%arg0 : i32) {
 
 // -----
 
-func.func @subgroup_reduce_bad_type(%arg0 : vector<2xf32>) {
-  // expected-error@+1 {{'gpu.subgroup_reduce' op operand #0 must be Integer or Float}}
-  %res = gpu.subgroup_reduce add %arg0 : (vector<2xf32>) -> vector<2xf32>
+func.func @subgroup_reduce_bad_type(%arg0 : vector<2x2xf32>) {
+  // expected-error@+1 {{'gpu.subgroup_reduce' op operand #0 must be Integer or Float or vector of}}
+  %res = gpu.subgroup_reduce add %arg0 : (vector<2x2xf32>) -> vector<2x2xf32>
+  return
+}
+
+// -----
+
+func.func @subgroup_reduce_bad_type_scalable(%arg0 : vector<[2]xf32>) {
+  // expected-error@+1 {{is not compatible with scalable vector types}}
+  %res = gpu.subgroup_reduce add %arg0 : (vector<[2]xf32>) -> vector<[2]xf32>
   return
 }
 
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index 48193436415637..c3b548062638f1 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -84,6 +84,8 @@ module attributes {gpu.container_module} {
 
       %one = arith.constant 1.0 : f32
 
+      %vec = vector.broadcast %arg0 : f32 to vector<4xf32>
+
       // CHECK: %{{.*}} = gpu.all_reduce add %{{.*}} {
       // CHECK-NEXT: } : (f32) -> f32
       %sum = gpu.all_reduce add %one {} : (f32) -> (f32)
@@ -98,6 +100,9 @@ module attributes {gpu.container_module} {
       // CHECK: %{{.*}} = gpu.subgroup_reduce add %{{.*}} uniform : (f32) -> f32
       %sum_subgroup1 = gpu.subgroup_reduce add %one uniform : (f32) -> f32
 
+      // CHECK: %{{.*}} = gpu.subgroup_reduce add %{{.*}} : (vector<4xf32>) -> vector<4xf32>
+      %sum_subgroup2 = gpu.subgroup_reduce add %vec : (vector<4xf32>) -> vector<4xf32>
+
       %width = arith.constant 7 : i32
       %offset = arith.constant 3 : i32
       // CHECK: gpu.shuffle xor %{{.*}}, %{{.*}}, %{{.*}} : f32

grypp (Member) commented Dec 20, 2023

Thanks for your implementation. The goal of the PR sounds good to me.

> For example we can perform 4 independent f16 reductions with a series of gpu.shuffles over i64, reducing the final number of gpu.shuffles by 4x.

PTX only allows shuffling 32-bit registers, so 4xf16 needs two shuffles. Can the PR support that?

fabianmcg (Contributor) left a comment

Reducing the number of GPU shuffles in a multi-reduce op only makes sense if the type is smaller than 32 bits, as both NVIDIA and AMD only support 32-bit shuffles.

So my question is: how do you plan to lower large vectors (i.e., vector<2xf32>)?

I'd prefer only accepting vector<1xi32>, and let the user use vector.bitcast.

kuhar (Member, Author) commented Dec 20, 2023

> PTX only allows shuffling 32-bit registers, so 4xf16 needs two shuffles. Can the PR support that?

@grypp Sure, the lowering chain that I described can be parametrized with the largest shuffle type available.

> So my question is: how do you plan to lower large vectors (i.e., vector<2xf32>)?
>
> I'd prefer only accepting vector<1xi32>, and let the user use vector.bitcast.

@fabianmcg By having patterns to break them down into 'shuffleable' chunks. To continue with your example, we would do something like:

  1. Unroll %a = gpu.subgroup_reduce add %x : vector<2xf32> -> vector<2xf32> into per-element scalar reductions:
     %a = vector.extract %x[0] : f32 from vector<2xf32> // or extract_strided_slice of vector<1xf32>
     %b = vector.extract %x[1] : f32 from vector<2xf32>
     %c = gpu.subgroup_reduce add %a : f32 -> f32
     %d = gpu.subgroup_reduce add %b : f32 -> f32
     %e = vector.insert %c ...
     %f = vector.insert %d ...
  2. Lower gpu.subgroup_reduce add to shuffles if necessary.

If we only allowed vector<1xi32> we would lose the semantic information that the lowering should perform reduction over f32. This is fine for gpu.shuffle as the reduction part happens in between the data movement, but not for subgroup_reduce which does perform reductions.
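The unrolling step above can be modeled in a few lines of Python (an illustration with hypothetical names, not the real pattern code): element i of the vector is reduced across all lanes independently, mirroring the extract / scalar subgroup_reduce / insert sequence, and every work item ends up with the same reduced vector.

```python
def subgroup_reduce_vector(lanes, combine):
    """Model of the element-wise decomposition of a vector subgroup reduce.

    `lanes` holds one equal-length list per work item; `combine` is the
    scalar reduction (e.g. add). Each vector element is reduced across
    all lanes independently, like one scalar gpu.subgroup_reduce per
    element.
    """
    width = len(lanes[0])
    reduced = []
    for i in range(width):
        # Scalar reduction over element i of every lane.
        acc = lanes[0][i]
        for lane in lanes[1:]:
            acc = combine(acc, lane[i])
        reduced.append(acc)
    # Every work item of the subgroup observes the same reduced vector.
    return [list(reduced) for _ in lanes]
```

Because the semantic combine function is explicit here, later patterns are free to lower each scalar reduction to shuffles in whatever packed form the target supports.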

fabianmcg (Contributor) commented

> If we only allowed vector<1xi32> we would lose the semantic information that the lowering should perform reduction over f32. This is fine for gpu.shuffle as the reduction part happens in between the data movement, but not for subgroup_reduce which does perform reductions.

You're right, I was only thinking about the shuffling part. To be more specific, my concern is things like vector<4xf64> or larger. Don't get me wrong, I'm all in for supporting things like vector<4xi8>, but I'm not convinced it should support all sizes.

kuhar (Member, Author) commented Dec 20, 2023

> If we only allowed vector<1xi32> we would lose the semantic information that the lowering should perform reduction over f32. This is fine for gpu.shuffle as the reduction part happens in between the data movement, but not for subgroup_reduce which does perform reductions.
>
> You're right, I was only thinking about the shuffling part. To be more specific, my concern is things like vector<4xf64> or larger. Don't get me wrong, I'm all in for supporting things like vector<4xi8>, but I'm not convinced it should support all sizes.

The main point of supporting more sizes in gpu.subgroup_reduce (not gpu.shuffle) is to allow for a gradual lowering path. Also, SPIR-V is less restrictive here: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformFAdd and https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformShuffleXor. This is another reason why I don't want to encode this NVIDIA/AMD-specific detail into the type system -- we can always legalize these based on what we target.

fabianmcg (Contributor) commented

> Also, SPIR-V is less restrictive here: ... This is another reason why I don't want to encode this nvidia/amd gpu specific detail into the type system

That sounds good to me.

Hardcode84 (Contributor) left a comment

LGTM, thanks

qedawkins (Contributor) left a comment

LGTM
