
[mlir][linalg] Restrict scalable vectorisation #98639


Merged

Conversation

Contributor

@banach-space banach-space commented Jul 12, 2024

Updates vectorizeScalableVectorPrecondition so that scalable
vectorisation is only applied in well understood and tested scenarios.

It's unlikely that we would ever want an arbitrary dimension to be
scalable. Although the Linalg vectoriser should be flexible enough to
handle all possibilities, in practice:

  • in more "exotic" cases, we are likely to struggle with lowerings
    further down the compilation stack, and
  • it would be impractical given the limitations of LLVM (which usually
    reflect the limitations of actual hardware), e.g. no support for
    "scalable" arrays of scalable or fixed-width vectors (*).

Ultimately, the goal of this patch is to better document what's
currently supported. While this PR adds some new restrictions, no
existing tests are affected.

(*) At MLIR vector level that would correspond to e.g.
vector<[4]x8xf32>.
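
To make (*) concrete, a minimal sketch, assuming the standard vector-to-LLVM
type conversion in which only the trailing (innermost) dimension may be
scalable; the function names are illustrative only:

// Trailing scalable dim: has an LLVM counterpart
// (!llvm.array<4 x vector<[8]xf32>>), so it can be lowered.
func.func private @trailing_scalable(vector<4x[8]xf32>)

// Leading scalable dim: parses in MLIR, but it would require a "scalable"
// array of fixed-width vectors, which LLVM cannot express.
func.func private @leading_scalable(vector<[4]x8xf32>)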

@llvmbot
Member

llvmbot commented Jul 12, 2024

@llvm/pr-subscribers-mlir-linalg

Author: Andrzej Warzyński (banach-space)

Changes

Updates vectorizeScalableVectorPrecondition so that scalable
vectorisation is only applied in well understood and tested scenarios.

It's unlikely that we would ever want an arbitrary dimension to be
scalable. While the Linalg vectoriser should be flexible enough to
handle all possibilities:

  • in more "exotic" cases we are likely to struggle with lowerings
    further down the compilation stack,
  • it would be impractical given the limitations of LLVM (which usually
    reflect the limitations of actual hardware) - e.g. no support for
    "scalable" arrays of scalable or fixed width vectors (*).

Ultimately, the goal of this patch is to better document what's
currently supported. While this PR adds some new restrictions, no
existing tests are affected.

(*) At MLIR vector level that would correspond to e.g.
vector<[4]x8xf32>.


Full diff: https://github.com/llvm/llvm-project/pull/98639.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp (+64-9)
  • (modified) mlir/test/Dialect/Linalg/vectorization-unsupported.mlir (+66-1)
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index a4c0508d0d8fa..9741120946362 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -1936,7 +1936,8 @@ vectorizePadOpPrecondition(tensor::PadOp padOp,
   return success();
 }
 
-/// Preconditions for scalable vectors.
+/// Preconditions for scalable vectors. This is quite restrictive - it models
+/// the fact that in practice we would only make selected dimensions scalable.
 static LogicalResult
 vectorizeScalableVectorPrecondition(Operation *op,
                                     ArrayRef<int64_t> inputVectorSizes,
@@ -1944,18 +1945,72 @@ vectorizeScalableVectorPrecondition(Operation *op,
   assert(inputVectorSizes.size() == inputScalableVecDims.size() &&
          "Number of input vector sizes and scalable dims doesn't match");
 
-  if (inputVectorSizes.empty())
-    return success();
+  size_t numOfScalableDims =
+      llvm::count_if(inputScalableVecDims, [](bool flag) { return flag; });
 
-  bool isScalable = inputScalableVecDims.back();
-  if (!isScalable)
+  if (numOfScalableDims == 0)
     return success();
 
-  // Only element-wise and 1d depthwise conv ops supported in the presence of
-  // scalable dims.
   auto linalgOp = dyn_cast<LinalgOp>(op);
-  return success(linalgOp && (isElementwise(linalgOp) ||
-                              isa<linalg::DepthwiseConv1DNwcWcOp>(op)));
+
+  // Cond 1: There's been no need for scalable vectorisation of
+  // non-linalg Ops so far
+  if (!linalgOp)
+    return failure();
+
+  // Cond 2: There's been no need for more than 2 scalable dims so far
+  if (numOfScalableDims > 2)
+    return failure();
+
+  // Cond 3: Look at the configuration in `inputScalableVecDims` and verify that
+  // it matches one of the supported cases:
+  //  1. exactly 1 dim is scalable and that's the _last_ parallel dim
+  //  2. exactly 2 dims are scalable and those are the _last two adjacent_
+  //     parallel dims
+  // The 2nd restriction above means that only Matmul-like Ops are supported
+  // when 2 dims are scalable, e.g. :
+  //    * iterators = [parallel, parallel, reduction]
+  //    * scalable flags = [true, true, false]
+
+  // Find the first scalable flag
+  bool seenParalell = false;
+  auto iterators = linalgOp.getIteratorTypesArray();
+  SmallVector<bool> scalableFlags(inputScalableVecDims);
+  if (!scalableFlags.back()) {
+    while (!scalableFlags.back()) {
+      seenParalell |= (iterators.back() == utils::IteratorType::parallel);
+
+      iterators.pop_back();
+      scalableFlags.pop_back();
+    }
+  }
+
+  // TODO: Support scalable vectorisation for reduction dims
+  if (iterators.back() == utils::IteratorType::reduction)
+    return failure();
+
+  // If this is not the _last_ parallel dim, 1. above is not met
+  if (seenParalell)
+    return failure();
+
+  // If present, check the 2nd scalable dim. ATM, only Matmul-like Ops are
+  // supported, for which we expect the following config:
+  //    * iterators = [parallel, parallel, reduction]
+  //    * scalable flags = [true, true, false]
+  if (numOfScalableDims == 2) {
+    scalableFlags.pop_back();
+    iterators.pop_back();
+
+    if (!scalableFlags.back() ||
+        (iterators.back() != utils::IteratorType::parallel))
+      return failure();
+  }
+
+  // Cond 4: Only element-wise, matmul (incl. matmul_transpose_a) and 1d
+  // depthwise conv ops are supported in the presence of scalable vectors
+  return success(isElementwise(linalgOp) || isa<linalg::MatmulOp>(op) ||
+                 isa<linalg::MatmulTransposeAOp>(op) ||
+                 isa<linalg::DepthwiseConv1DNwcWcOp>(op));
 }
 
 LogicalResult mlir::linalg::vectorizeOpPrecondition(
diff --git a/mlir/test/Dialect/Linalg/vectorization-unsupported.mlir b/mlir/test/Dialect/Linalg/vectorization-unsupported.mlir
index 5d3c07c8e23c1..c7ec39b0dbfb3 100644
--- a/mlir/test/Dialect/Linalg/vectorization-unsupported.mlir
+++ b/mlir/test/Dialect/Linalg/vectorization-unsupported.mlir
@@ -110,7 +110,7 @@ module attributes {transform.with_named_sequence} {
   }
 }
 
-  // -----
+// -----
 
 func.func @test_pack_no_vectorize_dynamic_shape(%arg0: tensor<?xf32>, %arg1: tensor<4x16xf32>) -> tensor<4x16xf32> {
   %pad = arith.constant 0.000000e+00 : f32
@@ -126,3 +126,68 @@ module attributes {transform.with_named_sequence} {
     transform.yield
   }
 }
+
+// -----
+
+func.func @linalg_reduce_scalable(%input: tensor<?xf32>,
+                                  %acc: tensor<f32>) -> tensor<f32> {
+
+  // expected-error @+1 {{Attempted to vectorize, but failed}}
+  %0 = linalg.reduce ins(%input : tensor<?xf32>) outs(%acc : tensor<f32>) dimensions = [0]
+  (%in: f32, %init: f32) {
+    %0 = arith.addf %in, %init : f32
+    linalg.yield %0 : f32
+  }
+  return %0 : tensor<f32>
+}
+
+module attributes {transform.with_named_sequence} {
+  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+    %0 = transform.structured.match ops{["linalg.reduce"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+    transform.structured.vectorize %0 vector_sizes [[4]] : !transform.any_op
+    transform.yield
+  }
+}
+
+// -----
+
+func.func @linalg_generic_scalable_reduction_dim(%input: tensor<?x?xf32>,
+                                                 %acc: tensor<?xf32>) -> tensor<?xf32> {
+
+  // expected-error @+1 {{Attempted to vectorize, but failed}}
+  %0 = linalg.generic { indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
+                                         affine_map<(d0, d1) -> (d0)>],
+                        iterator_types = ["parallel", "reduction"] }
+    ins(%input : tensor<?x?xf32>)
+    outs(%acc : tensor<?xf32>) {
+    ^bb(%in: f32, %out: f32) :
+      %0 = arith.addf %in, %out : f32
+      linalg.yield %0 : f32
+    } -> tensor<?xf32>
+  return %0 : tensor<?xf32>
+}
+
+module attributes {transform.with_named_sequence} {
+  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+    %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+    transform.structured.vectorize %0 vector_sizes [1, [4]] : !transform.any_op
+    transform.yield
+  }
+}
+
+// -----
+
+func.func @linalg_matmul_scalable_leading_parallel_dim(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
+  // expected-error @+1 {{Attempted to vectorize, but failed}}
+  linalg.matmul ins(%A, %B: memref<?x?xf32>, memref<?x?xf32>)
+            outs(%C: memref<?x?xf32>)
+  return
+}
+
+module attributes {transform.with_named_sequence} {
+  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+    %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+    transform.structured.vectorize %matmul vector_sizes [[8], 16, 4] : !transform.any_op
+    transform.yield
+  }
+}
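
For contrast with the negative tests in the diff above, a minimal sketch of a
configuration the updated precondition is intended to accept, using the same
transform-dialect setup as those tests (the function name is illustrative):
only the last parallel dim of a matmul is made scalable.

func.func @matmul_trailing_scalable(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.matmul ins(%A, %B: memref<?x?xf32>, memref<?x?xf32>)
                outs(%C: memref<?x?xf32>)
  return
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!transform.any_op) -> !transform.any_op
    // N (the last parallel dim) is scalable; M and K stay fixed, i.e. case 1.
    // of "Cond 3" above.
    transform.structured.vectorize %matmul vector_sizes [8, [16], 4] : !transform.any_op
    transform.yield
  }
}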

@llvmbot
Member

llvmbot commented Jul 12, 2024

@llvm/pr-subscribers-mlir

Contributor

@zhaoshiz zhaoshiz left a comment

Thanks for creating this PR! Let me try to rebase my PR #97788 on top of it.

@banach-space
Contributor Author

Thanks for creating this PR! Let me try to rebase my PR #97788 on top of it.

Please let me know if you hit any issues. The intent is to make things easier for you. If I am failing, I'm happy to iterate :)


// Cond 3: Look at the configuration in `inputScalableVecDims` and verify that
// it matches one of the supported cases:
// 1. exactly 1 dim is scalable and that's the _last_ parallel dim
Contributor

This feels like a strong limitation. Using split reduction, we should be able to vectorize the K dimension in a matmul, right? And any arbitrary generic op. What is the main concern here? It should be ok as long as we have a single scalable dimension, isn't it?

Contributor

I'm reworking scalable vectorization of reductions (#97788) on top of this one. My goal is to allow linalg::ReduceOp and linalg::GenericOp with reduction iterators. I am testing with matvec and matmul. For now I'm restricting reductions to the last dim.

It should be ok as long as we have a single scalable dimension, isn't it?

At the MLIR level it seems OK: both vectorizing linalg and lowering vector multi-dim reduction produce reasonable results. But I have difficulties lowering to the LLVM dialect and LLVM IR, perhaps due to:

it would be impractical given the limitations of LLVM (which usually
reflect the limitations of actual hardware) - e.g. no support for
"scalable" arrays of scalable or fixed width vectors (*).
...
(*) At MLIR vector level that would correspond to e.g.
vector<[4]x8xf32>.

Here's an example:

func.func @linalg_reduce_scalable_leading_dim(%input: tensor<?x?xf32>,
                                              %acc: tensor<?xf32>) -> tensor<?xf32> {
  %0 = linalg.reduce ins(%input : tensor<?x?xf32>) outs(%acc : tensor<?xf32>) dimensions = [0]
  (%in: f32, %init: f32) {
    %0 = arith.addf %in, %init : f32
    linalg.yield %0 : f32
  }
  return %0 : tensor<?xf32>
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.reduce"]} in %arg1 : (!transform.any_op) -> !transform.any_op
    transform.structured.vectorize %0 vector_sizes [[4], 1] : !transform.any_op

    %func = transform.structured.match ops{["func.func"]} in %arg1
      : (!transform.any_op) -> !transform.any_op

    transform.apply_patterns to %func {
      transform.apply_patterns.vector.lower_masked_transfers
      transform.apply_patterns.vector.lower_multi_reduction lowering_strategy = "innerreduction"
    } : !transform.any_op

    transform.yield
  }
}

After linalg-vectorization:

module {
  func.func @linalg_reduce_scalable_leading_dim(%arg0: tensor<?x?xf32>, %arg1: tensor<?xf32>) -> tensor<?xf32> {
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %c1 = arith.constant 1 : index
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
    %c0_1 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.create_mask %dim, %dim_0 : vector<[4]x1xi1>
    %1 = vector.mask %0 { vector.transfer_read %arg0[%c0_1, %c0_1], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<[4]x1xf32> } : vector<[4]x1xi1> -> vector<[4]x1xf32>
    %cst_2 = arith.constant 0.000000e+00 : f32
    %2 = vector.create_mask %dim_0 : vector<1xi1>
    %3 = vector.mask %2 { vector.transfer_read %arg1[%c0_1], %cst_2 {in_bounds = [true]} : tensor<?xf32>, vector<1xf32> } : vector<1xi1> -> vector<1xf32>
    %4 = vector.mask %0 { vector.multi_reduction <add>, %1, %3 [0] : vector<[4]x1xf32> to vector<1xf32> } : vector<[4]x1xi1> -> vector<1xf32>
    %c0_3 = arith.constant 0 : index
    %5 = vector.mask %2 { vector.transfer_write %4, %arg1[%c0_3] {in_bounds = [true]} : vector<1xf32>, tensor<?xf32> } : vector<1xi1> -> tensor<?xf32>
    return %5 : tensor<?xf32>
  }
  module attributes {transform.with_named_sequence} {
  }
}

After lowering vector masked xfer and multi reduction:

module {
  func.func @linalg_reduce_scalable_leading_dim(%arg0: tensor<?x?xf32>, %arg1: tensor<?xf32>) -> tensor<?xf32> {
    %cst = arith.constant dense<0.000000e+00> : vector<1xf32>
    %cst_0 = arith.constant 0.000000e+00 : f32
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %dim_1 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
    %0 = vector.create_mask %dim, %dim_1 : vector<[4]x1xi1>
    %1 = vector.transfer_read %arg0[%c0, %c0], %cst_0, %0 {in_bounds = [true, true]} : tensor<?x?xf32>, vector<[4]x1xf32>
    %2 = vector.create_mask %dim_1 : vector<1xi1>
    %3 = vector.transfer_read %arg1[%c0], %cst_0, %2 {in_bounds = [true]} : tensor<?xf32>, vector<1xf32>
    %4 = vector.transpose %0, [1, 0] : vector<[4]x1xi1> to vector<1x[4]xi1>
    %5 = vector.transpose %1, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32>
    %6 = vector.extract %5[0] : vector<[4]xf32> from vector<1x[4]xf32>
    %7 = vector.extract %3[0] : f32 from vector<1xf32>
    %8 = vector.extract %4[0] : vector<[4]xi1> from vector<1x[4]xi1>
    %9 = vector.mask %8 { vector.reduction <add>, %6, %7 : vector<[4]xf32> into f32 } : vector<[4]xi1> -> f32
    %10 = vector.insertelement %9, %cst[%c0 : index] : vector<1xf32>
    %11 = vector.transfer_write %10, %arg1[%c0], %2 {in_bounds = [true]} : vector<1xf32>, tensor<?xf32>
    return %11 : tensor<?xf32>
  }
  module attributes {transform.with_named_sequence} {
  }
}

Trying to lower the above MLIR to LLVM with mlir-opt -test-lower-to-llvm:

module {
  func.func @linalg_reduce_scalable_leading_dim(%arg0: tensor<?x?xf32>, %arg1: tensor<?xf32>) -> tensor<?xf32> {
    %0 = llvm.mlir.constant(4 : i32) : i32
    %1 = llvm.mlir.constant(0 : i64) : i64
    %2 = llvm.mlir.undef : vector<[4]xi32>
    %3 = llvm.mlir.constant(0 : i32) : i32
    %4 = llvm.mlir.undef : vector<1xi32>
    %5 = llvm.mlir.constant(dense<0> : vector<1xi32>) : vector<1xi32>
    %6 = llvm.mlir.constant(dense<false> : vector<[4]xi1>) : vector<[4]xi1>
    %7 = llvm.mlir.constant(dense<0.000000e+00> : vector<1xf32>) : vector<1xf32>
    %8 = llvm.mlir.constant(0.000000e+00 : f32) : f32
    %9 = llvm.mlir.constant(1 : index) : i64
    %10 = builtin.unrealized_conversion_cast %9 : i64 to index
    %11 = llvm.mlir.constant(0 : index) : i64
    %12 = builtin.unrealized_conversion_cast %11 : i64 to index
    %dim = tensor.dim %arg0, %12 : tensor<?x?xf32>
    %13 = builtin.unrealized_conversion_cast %dim : index to i64
    %dim_0 = tensor.dim %arg0, %10 : tensor<?x?xf32>
    %14 = builtin.unrealized_conversion_cast %dim_0 : index to i64
--> %15 = vector.create_mask %dim, %dim_0 : vector<[4]x1xi1>
--> %16 = vector.transfer_read %arg0[%12, %12], %8, %15 {in_bounds = [true, true]} : tensor<?x?xf32>, vector<[4]x1xf32>
    %17 = llvm.trunc %14 : i64 to i32
    %18 = llvm.insertelement %17, %4[%3 : i32] : vector<1xi32>
    %19 = llvm.shufflevector %18, %4 [0] : vector<1xi32>
    %20 = llvm.icmp "sgt" %19, %5 : vector<1xi32>
--> %21 = vector.transfer_read %arg1[%12], %8, %20 {in_bounds = [true]} : tensor<?xf32>, vector<1xf32>
    %22 = llvm.intr.experimental.stepvector : vector<[4]xi32>
    %23 = llvm.trunc %13 : i64 to i32
    %24 = llvm.insertelement %23, %2[%3 : i32] : vector<[4]xi32>
    %25 = llvm.shufflevector %24, %2 [0, 0, 0, 0] : vector<[4]xi32>
    %26 = llvm.icmp "slt" %22, %25 : vector<[4]xi32>
    %27 = llvm.icmp "sgt" %14, %11 : i64
    %28 = llvm.select %27, %26, %6 : i1, vector<[4]xi1>
--> %29 = vector.shape_cast %16 : vector<[4]x1xf32> to vector<1x[4]xf32>
    %30 = builtin.unrealized_conversion_cast %29 : vector<1x[4]xf32> to !llvm.array<1 x vector<[4]xf32>>
    %31 = llvm.extractvalue %30[0] : !llvm.array<1 x vector<[4]xf32>>
    %32 = llvm.extractelement %21[%1 : i64] : vector<1xf32>
    %33 = "llvm.intr.vscale"() : () -> i64
    %34 = llvm.trunc %33 : i64 to i32
    %35 = llvm.mul %34, %0 : i32
    %36 = "llvm.intr.vp.reduce.fadd"(%32, %31, %28, %35) : (f32, vector<[4]xf32>, vector<[4]xi1>, i32) -> f32
    %37 = llvm.insertelement %36, %7[%11 : i64] : vector<1xf32>
--> %38 = vector.transfer_write %37, %arg1[%12], %20 {in_bounds = [true]} : vector<1xf32>, tensor<?xf32>
    return %38 : tensor<?xf32>
  }
  module attributes {transform.with_named_sequence} {
  }
}

Note that some vector ops (marked with -->) are not converted and results of builtin.unrealized_conversion_cast are still being used; mlir-translate --mlir-to-llvmir will fail because of these ops.

Contributor Author

This feels like a strong limitation.

Agreed. But this is only meant to document what we've tried so far and hence what we "advertise" as supported, just to make it clear to everyone who'd like to try this.

Also, the current pre-conditions require updating:

bool isScalable = inputScalableVecDims.back();
if (!isScalable)
return success();
// Only element-wise and 1d depthwise conv ops supported in the presence of
// scalable dims.
auto linalgOp = dyn_cast<LinalgOp>(op);
return success(linalgOp && (isElementwise(linalgOp) ||
isa<linalg::DepthwiseConv1DNwcWcOp>(op)));

Let me decompose that. This snippet ignores the fact that non-trailing dims can be (and are) scalable:

  bool isScalable = inputScalableVecDims.back();
  if (!isScalable)
    return success();

And this is missing linalg.matmul_transpose_a (which I think is misleading):

  return success(linalgOp && (isElementwise(linalgOp) ||
                              isa<linalg::DepthwiseConv1DNwcWcOp>(op)));

Using split reduction, we should be able to vectorize the K dimension in a matmul, right?
And any arbitrary generic op.

Yes, that's the plan. And as @zhaoshiz mentioned, there's #97788 to enable reductions. One step at a time 😅

@zhaoshiz
Contributor

Please let me know if you hit any issues. The intent is to make things easier for you. If I am failing, I'm happy to iterate :)

Yes, it indeed makes things easier for me, and all tests in #97788 are passing.
I'm now trying the reduction dims of matvec and matmul.

Contributor

@zhaoshiz zhaoshiz left a comment

#97788 works fine on top of this one, thanks!

Contributor

@dcaballe dcaballe left a comment

Ok, SG as it looks like there is ongoing work to improve some of the limitations.

@banach-space banach-space merged commit cfe043c into llvm:main Jul 19, 2024
7 checks passed
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024