Commit ecaf2c3
[MLIR] Move warp_execute_on_lane_0 from vector to gpu (#116994)
Please see the related RFC here: https://discourse.llvm.org/t/rfc-move-execute-on-lane-0-from-vector-to-gpu-dialect/82989. This patch does exactly one thing: it moves the op to the gpu dialect.
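For downstream users, the visible change is the dialect prefix on the op and on its implicit terminator. The following before/after is an illustrative sketch, not taken from the patch: `"test.source"` is a placeholder op, and `%laneid` is assumed to be an `index` value in [0, 32).

```mlir
// Before this commit (vector dialect spelling):
%0 = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = "test.source"() : () -> (vector<32xf32>)
  vector.yield %v : vector<32xf32>
}

// After this commit (gpu dialect spelling, same semantics):
%0 = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = "test.source"() : () -> (vector<32xf32>)
  gpu.yield %v : vector<32xf32>
}
```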
1 parent 556ea52 commit ecaf2c3

15 files changed: +738 / -728 lines

15 files changed

+738
-728
lines changed

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

Lines changed: 138 additions & 0 deletions
@@ -1097,6 +1097,10 @@ def GPU_YieldOp : GPU_Op<"yield", [Pure, ReturnLike, Terminator]>,
     ```
   }];
 
+  let builders = [
+    OpBuilder<(ins), [{ /* nothing to do */ }]>
+  ];
+
   let assemblyFormat = "attr-dict ($values^ `:` type($values))?";
 }

@@ -2921,4 +2925,138 @@ def GPU_SetCsrPointersOp : GPU_Op<"set_csr_pointers", [GPU_AsyncOpInterface]> {
   }];
 }
 
+def GPU_WarpExecuteOnLane0Op : GPU_Op<"warp_execute_on_lane_0",
+    [DeclareOpInterfaceMethods<RegionBranchOpInterface, ["areTypesCompatible"]>,
+     SingleBlockImplicitTerminator<"gpu::YieldOp">,
+     RecursiveMemoryEffects]> {
+  let summary = "Executes operations in the associated region on thread #0 of "
+                "an SPMD program";
+  let description = [{
+    `warp_execute_on_lane_0` is an operation used to bridge the gap between
+    vector programming and SPMD programming models such as GPU SIMT. It allows
+    one to trivially convert a region of vector code meant to run on multiple
+    threads into a valid SPMD region, and then allows incremental
+    transformations that distribute vector operations onto the threads.
+
+    Any code present in the region is executed only on the first thread/lane,
+    as identified by the `laneid` operand. The `laneid` operand is an integer
+    ID in the range [0, `warp_size`). The `warp_size` attribute indicates the
+    number of lanes in a warp.
+
+    Operands are vector values distributed on all lanes that may be used by
+    the single-lane execution. The matching region argument is a vector of all
+    the values of those lanes available to the single active lane. The
+    distributed dimension is implicit based on the shape of the operand and
+    argument. The properties of the distribution may be described by extra
+    attributes (e.g. affine map).
+
+    Return values are distributed on all lanes using `laneid` as the index.
+    The vector is distributed based on the shape ratio between the vector type
+    of the yield and the result type.
+    If the shapes are the same, the value is broadcast to all lanes.
+    In the future the distribution can be made more explicit using affine_maps
+    and will support having multiple IDs.
+
+    Therefore the `warp_execute_on_lane_0` operation allows implicit copies
+    between lane 0 and the other lanes of the warp. When distributing a vector
+    from lane 0 to all the lanes, the data are distributed in a block-cyclic
+    way. For example `vector<64xf32>` gets distributed on 32 threads and maps
+    to `vector<2xf32>`, where thread 0 contains vector[0] and vector[1].
+
+    During lowering, values passed as operands and return values need to be
+    visible to different lanes within the warp. This would usually be done by
+    going through memory.
+
+    The region is *not* isolated from above. For values coming from the parent
+    region without going through operands, only the lane 0 value will be
+    accessible, so this generally only makes sense for uniform values.
+
+    Example:
+    ```
+    // Execute in parallel on all threads/lanes.
+    gpu.warp_execute_on_lane_0 (%laneid)[32] {
+      // Serial code running only on thread/lane 0.
+      ...
+    }
+    // Execute in parallel on all threads/lanes.
+    ```
+
+    This may be lowered to an scf.if region as below:
+    ```
+    // Execute in parallel on all threads/lanes.
+    %cnd = arith.cmpi eq, %laneid, %c0 : index
+    scf.if %cnd {
+      // Serial code running only on thread/lane 0.
+      ...
+    }
+    // Execute in parallel on all threads/lanes.
+    ```
+
+    When the region has operands and/or return values:
+    ```
+    // Execute in parallel on all threads/lanes.
+    %0 = gpu.warp_execute_on_lane_0(%laneid)[32]
+        args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
+    ^bb0(%arg0 : vector<128xi32>) :
+      // Serial code running only on thread/lane 0.
+      ...
+      gpu.yield %1 : vector<32xf32>
+    }
+    // Execute in parallel on all threads/lanes.
+    ```
+
+    Values at the region boundary go through memory:
+    ```
+    // Execute in parallel on all threads/lanes.
+    ...
+    // Store the data from each thread into memory and synchronize.
+    %tmp0 = memref.alloc() : memref<128xf32>
+    %tmp1 = memref.alloc() : memref<32xf32>
+    %cnd = arith.cmpi eq, %laneid, %c0 : index
+    vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
+    some_synchronization_primitive
+    scf.if %cnd {
+      // Serialized code running only on thread 0.
+      // Load the data from all the threads into a register from thread 0.
+      // This allows thread 0 to access data from all the threads.
+      %arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
+      ...
+      // Store the data from thread 0 into memory.
+      vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
+    }
+    // Synchronize and load the data in a block-cyclic way so that the
+    // vector is distributed on all threads.
+    some_synchronization_primitive
+    %0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<32xf32>
+    // Execute in parallel on all threads/lanes.
+    ```
+
+  }];
+
+  let hasVerifier = 1;
+  let hasCustomAssemblyFormat = 1;
+  let arguments = (ins Index:$laneid, I64Attr:$warp_size,
+                       Variadic<AnyType>:$args);
+  let results = (outs Variadic<AnyType>:$results);
+  let regions = (region SizedRegion<1>:$warpRegion);
+
+  let skipDefaultBuilders = 1;
+  let builders = [
+    OpBuilder<(ins "TypeRange":$resultTypes, "Value":$laneid,
+                   "int64_t":$warpSize)>,
+    // `blockArgTypes` are different from the `args` types, as they represent
+    // all the `args` instances visible to lane 0. Therefore we need to pass
+    // the types explicitly.
+    OpBuilder<(ins "TypeRange":$resultTypes, "Value":$laneid,
+                   "int64_t":$warpSize, "ValueRange":$args,
+                   "TypeRange":$blockArgTypes)>
+  ];
+
+  let extraClassDeclaration = [{
+    bool isDefinedOutsideOfRegion(Value value) {
+      return !getRegion().isAncestor(value.getParentRegion());
+    }
+  }];
+}
+
 #endif // GPU_OPS
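As a concrete illustration of the block-cyclic distribution rule in the description above, the following sketch distributes a 64-element vector over a 32-lane warp; `"test.def"` is a placeholder producer op, not part of the patch.

```mlir
// Each lane receives vector<2xf32>, since 64 / 32 = 2 elements per lane;
// lane 0 holds elements [0, 1], lane 1 holds [2, 3], and so on.
%r = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<2xf32>) {
  %v = "test.def"() : () -> (vector<64xf32>)
  gpu.yield %v : vector<64xf32>
}
```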

mlir/include/mlir/Dialect/Vector/IR/VectorOps.td

Lines changed: 0 additions & 133 deletions
@@ -2985,138 +2985,5 @@ def Vector_YieldOp : Vector_Op<"yield", [
   let assemblyFormat = "attr-dict ($operands^ `:` type($operands))?";
 }
 
-def Vector_WarpExecuteOnLane0Op : Vector_Op<"warp_execute_on_lane_0",
-    [DeclareOpInterfaceMethods<RegionBranchOpInterface, ["areTypesCompatible"]>,
-     SingleBlockImplicitTerminator<"vector::YieldOp">,
-     RecursiveMemoryEffects]> {
-  let summary = "Executes operations in the associated region on thread #0 of "
-                "an SPMD program";
-  let description = [{
-    `warp_execute_on_lane_0` is an operation used to bridge the gap between
-    vector programming and SPMD programming models such as GPU SIMT. It allows
-    one to trivially convert a region of vector code meant to run on multiple
-    threads into a valid SPMD region, and then allows incremental
-    transformations that distribute vector operations onto the threads.
-
-    Any code present in the region is executed only on the first thread/lane,
-    as identified by the `laneid` operand. The `laneid` operand is an integer
-    ID in the range [0, `warp_size`). The `warp_size` attribute indicates the
-    number of lanes in a warp.
-
-    Operands are vector values distributed on all lanes that may be used by
-    the single-lane execution. The matching region argument is a vector of all
-    the values of those lanes available to the single active lane. The
-    distributed dimension is implicit based on the shape of the operand and
-    argument. The properties of the distribution may be described by extra
-    attributes (e.g. affine map).
-
-    Return values are distributed on all lanes using `laneid` as the index.
-    The vector is distributed based on the shape ratio between the vector type
-    of the yield and the result type.
-    If the shapes are the same, the value is broadcast to all lanes.
-    In the future the distribution can be made more explicit using affine_maps
-    and will support having multiple IDs.
-
-    Therefore the `warp_execute_on_lane_0` operation allows implicit copies
-    between lane 0 and the other lanes of the warp. When distributing a vector
-    from lane 0 to all the lanes, the data are distributed in a block-cyclic
-    way. For example `vector<64xf32>` gets distributed on 32 threads and maps
-    to `vector<2xf32>`, where thread 0 contains vector[0] and vector[1].
-
-    During lowering, values passed as operands and return values need to be
-    visible to different lanes within the warp. This would usually be done by
-    going through memory.
-
-    The region is *not* isolated from above. For values coming from the parent
-    region without going through operands, only the lane 0 value will be
-    accessible, so this generally only makes sense for uniform values.
-
-    Example:
-    ```
-    // Execute in parallel on all threads/lanes.
-    vector.warp_execute_on_lane_0 (%laneid)[32] {
-      // Serial code running only on thread/lane 0.
-      ...
-    }
-    // Execute in parallel on all threads/lanes.
-    ```
-
-    This may be lowered to an scf.if region as below:
-    ```
-    // Execute in parallel on all threads/lanes.
-    %cnd = arith.cmpi eq, %laneid, %c0 : index
-    scf.if %cnd {
-      // Serial code running only on thread/lane 0.
-      ...
-    }
-    // Execute in parallel on all threads/lanes.
-    ```
-
-    When the region has operands and/or return values:
-    ```
-    // Execute in parallel on all threads/lanes.
-    %0 = vector.warp_execute_on_lane_0(%laneid)[32]
-        args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
-    ^bb0(%arg0 : vector<128xi32>) :
-      // Serial code running only on thread/lane 0.
-      ...
-      vector.yield %1 : vector<32xf32>
-    }
-    // Execute in parallel on all threads/lanes.
-    ```
-
-    Values at the region boundary go through memory:
-    ```
-    // Execute in parallel on all threads/lanes.
-    ...
-    // Store the data from each thread into memory and synchronize.
-    %tmp0 = memref.alloc() : memref<128xf32>
-    %tmp1 = memref.alloc() : memref<32xf32>
-    %cnd = arith.cmpi eq, %laneid, %c0 : index
-    vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
-    some_synchronization_primitive
-    scf.if %cnd {
-      // Serialized code running only on thread 0.
-      // Load the data from all the threads into a register from thread 0.
-      // This allows thread 0 to access data from all the threads.
-      %arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
-      ...
-      // Store the data from thread 0 into memory.
-      vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
-    }
-    // Synchronize and load the data in a block-cyclic way so that the
-    // vector is distributed on all threads.
-    some_synchronization_primitive
-    %0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<32xf32>
-    // Execute in parallel on all threads/lanes.
-    ```
-
-  }];
-
-  let hasVerifier = 1;
-  let hasCustomAssemblyFormat = 1;
-  let arguments = (ins Index:$laneid, I64Attr:$warp_size,
-                       Variadic<AnyType>:$args);
-  let results = (outs Variadic<AnyType>:$results);
-  let regions = (region SizedRegion<1>:$warpRegion);
-
-  let skipDefaultBuilders = 1;
-  let builders = [
-    OpBuilder<(ins "TypeRange":$resultTypes, "Value":$laneid,
-                   "int64_t":$warpSize)>,
-    // `blockArgTypes` are different from the `args` types, as they represent
-    // all the `args` instances visible to lane 0. Therefore we need to pass
-    // the types explicitly.
-    OpBuilder<(ins "TypeRange":$resultTypes, "Value":$laneid,
-                   "int64_t":$warpSize, "ValueRange":$args,
-                   "TypeRange":$blockArgTypes)>
-  ];
-
-  let extraClassDeclaration = [{
-    bool isDefinedOutsideOfRegion(Value value) {
-      return !getRegion().isAncestor(value.getParentRegion());
-    }
-  }];
-}
 
 #endif // MLIR_DIALECT_VECTOR_IR_VECTOR_OPS

mlir/include/mlir/Dialect/Vector/Transforms/VectorDistribution.h

Lines changed: 9 additions & 8 deletions
@@ -9,6 +9,7 @@
 #ifndef MLIR_DIALECT_VECTOR_TRANSFORMS_VECTORDISTRIBUTION_H_
 #define MLIR_DIALECT_VECTOR_TRANSFORMS_VECTORDISTRIBUTION_H_
 
+#include "mlir/Dialect/GPU/IR/GPUDialect.h"
 #include "mlir/Dialect/Vector/IR/VectorOps.h"
 
 namespace mlir {
@@ -23,15 +24,15 @@ struct WarpExecuteOnLane0LoweringOptions {
   /// type may be VectorType or a scalar) and be available for the current warp.
   /// If there are several warps running in parallel, the allocation needs to be
   /// split so that each warp has its own allocation.
-  using WarpAllocationFn =
-      std::function<Value(Location, OpBuilder &, WarpExecuteOnLane0Op, Type)>;
+  using WarpAllocationFn = std::function<Value(
+      Location, OpBuilder &, gpu::WarpExecuteOnLane0Op, Type)>;
   WarpAllocationFn warpAllocationFn = nullptr;
 
   /// Lambda function to let the user emit an operation to synchronize all the
   /// threads within a warp. After this operation all the threads can see any
   /// memory written before the operation.
   using WarpSyncronizationFn =
-      std::function<void(Location, OpBuilder &, WarpExecuteOnLane0Op)>;
+      std::function<void(Location, OpBuilder &, gpu::WarpExecuteOnLane0Op)>;
   WarpSyncronizationFn warpSyncronizationFn = nullptr;
 };

@@ -48,17 +49,17 @@ using DistributionMapFn = std::function<AffineMap(Value)>;
 ///
 /// Example:
 /// ```
-/// %0 = vector.warp_execute_on_lane_0(%id){
+/// %0 = gpu.warp_execute_on_lane_0(%id){
 ///   ...
 ///   vector.transfer_write %v, %A[%c0] : vector<32xf32>, memref<128xf32>
-///   vector.yield
+///   gpu.yield
 /// }
 /// ```
 /// To
 /// ```
-/// %r:3 = vector.warp_execute_on_lane_0(%id) -> (vector<1xf32>) {
+/// %r:3 = gpu.warp_execute_on_lane_0(%id) -> (vector<1xf32>) {
 ///   ...
-///   vector.yield %v : vector<32xf32>
+///   gpu.yield %v : vector<32xf32>
 /// }
 /// vector.transfer_write %v, %A[%id] : vector<1xf32>, memref<128xf32>
 ///
@@ -73,7 +74,7 @@ void populateDistributeTransferWriteOpPatterns(
 
 /// Move scalar operations with no dependency on the warp op outside of the
 /// region.
-void moveScalarUniformCode(WarpExecuteOnLane0Op op);
+void moveScalarUniformCode(gpu::WarpExecuteOnLane0Op op);
 
 /// Lambda signature to compute a warp shuffle of a given value of a given lane
 /// within a given warp size.
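For readers unfamiliar with the distribution pattern this header describes, here is a filled-in, hypothetical version of the before/after quoted in the documentation comment above. `"test.compute"` is a placeholder op, `%A` is assumed to be a `memref<128xf32>`, `%c0` and `%id` are `index` values, and the extra results implied by `%r:3` are omitted; the real pattern output may differ.

```mlir
// Before distribution: lane 0 writes the full vector from inside the region.
gpu.warp_execute_on_lane_0(%id)[32] {
  %v = "test.compute"() : () -> (vector<32xf32>)
  vector.transfer_write %v, %A[%c0] : vector<32xf32>, memref<128xf32>
  gpu.yield
}

// After distribution: the vector is yielded out and split across the warp,
// and each lane writes its own vector<1xf32> slice at an offset based on %id.
%r = gpu.warp_execute_on_lane_0(%id)[32] -> (vector<1xf32>) {
  %v = "test.compute"() : () -> (vector<32xf32>)
  gpu.yield %v : vector<32xf32>
}
vector.transfer_write %r, %A[%id] : vector<1xf32>, memref<128xf32>
```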
