[MLIR][TOSA] Add --tosa-reduce-transposes pass (#108260)

arteen1000 · web-flow · commit 00f239e48ab9 · 2024-09-13T19:16:55.000-07:00
----------
Motivation:
----------

Some legalization pathways introduce redundant tosa.TRANSPOSE
operations that result in avoidable data movement. For example,
PyTorch -&gt; TOSA contains a lot of unnecessary transposes due
to conversions between NCHW and NHWC.

We wish to remove all the ones that we can, since in general
it is possible to remove the overwhelming majority.

------------
Changes Made:
------------

- Add the --tosa-reduce-transposes pass
- Add TosaElementwiseOperator trait.

-------------------
High-Level Overview:
-------------------

The pass works through the transpose operators in the program. It begins
at some
transpose operator with an associated permutations tensor. It traverses
upwards
through the dependencies of this transpose and verifies that we
encounter only
operators with the TosaElementwiseOperator trait and terminate in either
constants, reshapes, or transposes.

We then evaluate whether there are any additional restrictions (the
transposes
it terminates in must invert the one we began at, and the reshapes must
be ones
in which we can fold the transpose into), and then we hoist the
transpose through
the intervening operators, folding it at the constants, reshapes, and
transposes.

Finally, we ensure that we do not need both the transposed form (the
form that
had the transpose hoisted through it) and the untransposed form (which
it was prior),
by analyzing the usages of those dependent operators of a given
transpose we are
attempting to hoist and replace.

If they are such that it would require both forms to be necessary, then
we do not
replace the hoisted transpose, causing the new chain to be dead.
Otherwise, we do
and the old chain (untransposed form) becomes dead. Only one chain will
ever then
be live, resulting in no duplication.

We then perform a simple one-pass DCE, so no canonicalization is
necessary.

--------------
Impact of Pass:
--------------

Patching the dense_resource artifacts (from PyTorch) with dense
attributes to
permit constant folding, we receive the following results.

Note that data movement represents total transpose data movement,
calculated
by noting which dimensions moved during the transpose.

///////////
MobilenetV3:
///////////

BEFORE total data movement: 11798776 B (11.25 MiB)
AFTER total data movement: 2998016 B (2.86 MiB)
74.6% of data movement removed.

BEFORE transposes: 82
AFTER transposes: 20
75.6% of transposes removed.

////////
ResNet18:
////////

BEFORE total data movement: 20596556 B (19.64 MiB)
AFTER total data movement: 1003520 B (0.96 MiB)
95.2% of data movement removed.

BEFORE transposes: 56
AFTER transposes: 5
91.1% of transposes removed.

////////
ResNet50:
////////

BEFORE total data movement: 83236172 B (79.3 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
96.4% of data movement removed

BEFORE transposes: 120
AFTER transposes: 7
94.2% of transposes removed.

/////////
ResNet101:
/////////

BEFORE total data movement: 124336460 B (118.58 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
97.6% of data movement removed

BEFORE transposes: 239
AFTER transposes: 7
97.1% of transposes removed.

/////////
ResNet152:
/////////

BEFORE total data movement: 175052108 B (166.94 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
98.3% of data movement removed

BEFORE transposes: 358
AFTER transposes: 7
98.0% of transposes removed.

////////
Overview:
////////

We see that we remove up to 98% of transposes and eliminate
up to 98.3% of redundant transpose data movement.

In the context of ResNet50, with 120 inferences per second,
we reduce dynamic transpose data bandwidth from 9.29 GiB/s
to 344.4 MiB/s.

-----------
Future Work:
-----------

(1) Evaluate tradeoffs with permitting ConstOp to be duplicated across
hoisted
    transposes with different permutation tensors.

(2) Expand the class of foldable upstream ReshapeOp we permit beyond
    N -&gt; 1x1x...x1xNx1x...x1x1.

(3) Enchance the pass to permit folding arbitrary transpose pairs,
beyond
    those that form the identity.

(4) Add support for more instructions besides TosaElementwiseOperator as
    the intervening ones (for example, the reduce_* operators).

(5) Support hoisting transposes up to an input parameter.

Signed-off-by: Arteen Abrishami &lt;arteen.abrishami@arm.com&gt;
diff --git a/mlir/include/mlir/Dialect/Tosa/IR/TosaOpBase.td b/mlir/include/mlir/Dialect/Tosa/IR/TosaOpBase.td
@@ -206,6 +206,15 @@ def Tosa_ExplicitValuePadOpQuantInfoBuilder : OpBuilder<
                                          input, paddings, pad_value);
   }]>;
 
+//===----------------------------------------------------------------------===//
+// TOSA Operator Trait.
+//===----------------------------------------------------------------------===//
+
+// Permits broadcasting. Elementwise trait is too strict.
+def TosaElementwiseOperator : NativeOpTrait<"TosaElementwiseOperator"> {
+  let cppNamespace = "mlir::OpTrait::tosa";
+}
+
 //===----------------------------------------------------------------------===//
 // TOSA Operator Class.
 //===----------------------------------------------------------------------===//
@@ -219,6 +228,7 @@ class Tosa_ElementwiseOp<string mnemonic, list<Trait> traits = []> :
               DeclareOpInterfaceMethods<InferShapedTypeOpInterface,
                                         ["inferReturnTypeComponents"]>,
               ResultsBroadcastableShape,
+              TosaElementwiseOperator,
               Pure])> {
   let assemblyFormat =
       "operands attr-dict `:` functional-type(operands, results)";
diff --git a/mlir/include/mlir/Dialect/Tosa/IR/TosaOps.h b/mlir/include/mlir/Dialect/Tosa/IR/TosaOps.h
@@ -84,6 +84,12 @@ class MulOperandsAndResultElementType
   }
 };
 
+/// This class indicates that an op is tosa-elementwise (permits broadcasting,
+/// unlike Elementwise trait).
+template <typename ConcreteType>
+class TosaElementwiseOperator
+    : public TraitBase<ConcreteType, TosaElementwiseOperator> {};
+
 } // namespace tosa
 } // namespace OpTrait
 
diff --git a/mlir/include/mlir/Dialect/Tosa/Transforms/Passes.td b/mlir/include/mlir/Dialect/Tosa/Transforms/Passes.td
@@ -126,4 +126,25 @@ def TosaValidation : Pass<"tosa-validate", "mlir::ModuleOp"> {
    ];
 }
 
+def TosaReduceTransposes : Pass<"tosa-reduce-transposes", "func::FuncOp"> {
+  let summary = "Reduce transposes through other operators";
+  let description = [{
+    Pass that identifies and reduces tosa.TRANSPOSE operations through chains
+    of operators.
+
+    The pass traverses dependencies of tosa.TRANSPOSE operations until they
+    terminate in either a tosa.RESHAPE that we can fold the hoisted
+    tosa.TRANSPOSE into, a tosa.TRANSPOSE that forms the identity with the
+    hoisted one, or a tosa.CONST with a dense elements attribute. It then
+    propagates the hoisted transform upward through the intervening operators
+    if the support is implemented. Finally, it observes that no duplication
+    will occur of both the chain that was hoisted through and the new chain
+    that results, and if so, it replaces the hoisted tosa.TRANSPOSE.
+
+    The pass has an important use-case in cleaning up the results of frameworks
+    that introduce a lot of data-layout transformations when legalizing to TOSA,
+    a common one being transformations between NHWC and NCHW layouts.
+  }];
+}
+
 #endif // MLIR_DIALECT_TOSA_TRANSFORMS_PASSES
diff --git a/mlir/lib/Dialect/Tosa/Transforms/CMakeLists.txt b/mlir/lib/Dialect/Tosa/Transforms/CMakeLists.txt
@@ -7,6 +7,7 @@ add_mlir_dialect_library(MLIRTosaTransforms
   TosaLayerwiseConstantFoldPass.cpp
   TosaMakeBroadcastable.cpp
   TosaOptionalDecompositions.cpp
+  TosaReduceTransposes.cpp
   TosaTypeConverters.cpp
   TosaValidation.cpp
 
diff --git a/mlir/lib/Dialect/Tosa/Transforms/TosaReduceTransposes.cpp b/mlir/lib/Dialect/Tosa/Transforms/TosaReduceTransposes.cpp
diff --git a/mlir/test/Dialect/Tosa/tosa-reduce-transposes.mlir b/mlir/test/Dialect/Tosa/tosa-reduce-transposes.mlir