[ARM]Adjust cost of muls in (U/S)MLAL and patterns #122713

nasherm · 2025-01-13T14:33:32Z

PR #117350 made changes to the SLP vectorizer which introduced a regression on some ARM benchmarks. Investigation narrowed it down to suboptimal codegen for benchmarks that previously only used scalar SMLAL instructions. The linked change meant the SLPVectorizer thought that these could be vectorized. This change makes the cost of muls in (U/S)MLAL patterns slightly cheaper to make sure scalar instructions are preferred in these cases over SLP vectorization

llvmbot · 2025-01-13T14:34:09Z

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Nashe Mncube (nasherm)

Changes

PR #117350 made changes to the SLP vectorizer which introduced a regression on some ARM benchmarks. This was due to the changes assuming that SExt/ZExt vector instructions have constant cost. This behaviour is expected for RISCV but not on ARM where we take into account source and destination type of SExt/ZExt instructions when calculating vector cost.

Full diff: https://github.com/llvm/llvm-project/pull/122713.diff

6 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
(modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+3)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+8)
(modified) llvm/lib/Target/ARM/ARMTargetTransformInfo.h (+5)
(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+2-1)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 752313ab15858c..bac6500ed7184d 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1184,6 +1184,11 @@ class TargetTransformInfo {
   /// \return true if vscale is known to be a power of 2
   bool isVScaleKnownToBeAPowerOfTwo() const;
 
+  // \return true if vector implementations assume that SExt and ZExt
+  // instructions have a fixed cost.
+  bool isSExtCostConstant() const;
+  bool isZExtCostConstant() const;
+
   /// \return True if the vectorization factor should be chosen to
   /// make the vector of the smallest element type match the size of a
   /// vector register. For wider element types, this could result in
@@ -2065,6 +2070,8 @@ class TargetTransformInfo::Concept {
   virtual std::optional<unsigned> getMaxVScale() const = 0;
   virtual std::optional<unsigned> getVScaleForTuning() const = 0;
   virtual bool isVScaleKnownToBeAPowerOfTwo() const = 0;
+  virtual bool isSExtCostConstant() const = 0;
+  virtual bool isZExtCostConstant() const = 0;
   virtual bool
   shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const = 0;
   virtual ElementCount getMinimumVF(unsigned ElemWidth,
@@ -2719,6 +2726,8 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   bool isVScaleKnownToBeAPowerOfTwo() const override {
     return Impl.isVScaleKnownToBeAPowerOfTwo();
   }
+  bool isSExtCostConstant() const override { return Impl.isSExtCostConstant(); }
+  bool isZExtCostConstant() const override { return Impl.isZExtCostConstant(); }
   bool shouldMaximizeVectorBandwidth(
       TargetTransformInfo::RegisterKind K) const override {
     return Impl.shouldMaximizeVectorBandwidth(K);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 9c74b2a0c31df1..3120e27139d358 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -530,6 +530,8 @@ class TargetTransformInfoImplBase {
   std::optional<unsigned> getMaxVScale() const { return std::nullopt; }
   std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
   bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
+  bool isSExtCostConstant() const { return true; }
+  bool isZExtCostConstant() const { return true; }
 
   bool
   shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c9f142d64ae9e4..5785aa8d74b61c 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -801,6 +801,9 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
   std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
   bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
 
+  bool isSExtCostConstant() const { return true; }
+  bool isZExtCostConstant() const { return true; }
+
   /// Estimate the overhead of scalarizing an instruction. Insert and Extract
   /// are set if the demanded result elements need to be inserted and/or
   /// extracted from vectors.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b32dffa9f0fe86..63baea591bac45 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1405,6 +1405,14 @@ bool TargetTransformInfo::isProfitableToSinkOperands(
   return TTIImpl->isProfitableToSinkOperands(I, OpsToSink);
 }
 
+bool TargetTransformInfo::isSExtCostConstant() const {
+  return TTIImpl->isSExtCostConstant();
+}
+
+bool TargetTransformInfo::isZExtCostConstant() const {
+  return TTIImpl->isZExtCostConstant();
+}
+
 bool TargetTransformInfo::isVectorShiftByScalarCheap(Type *Ty) const {
   return TTIImpl->isVectorShiftByScalarCheap(Ty);
 }
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
index 3a4f940088b2e3..c2522b53987a3b 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
@@ -109,6 +109,11 @@ class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {
 
   bool enableInterleavedAccessVectorization() { return true; }
 
+  // Cost model for vector SExt and ZExt takes into account
+  // source and destination vector type for MVE and NEON.
+  bool isSExtCostConstant() const { return false; }
+  bool isZExtCostConstant() const { return false; }
+
   TTI::AddressingModeKind
     getPreferredAddressingMode(const Loop *L, ScalarEvolution *SE) const;
 
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index c4582df89213d8..7813c1248f9fcf 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11440,7 +11440,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
                                 I->getOpcode());
           });
       if (IsArithmeticExtendedReduction &&
-          (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt))
+          (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt) &&
+          (TTI->isZExtCostConstant() && TTI->isSExtCostConstant()))
         return CommonCost;
       return CommonCost +
              TTI->getCastInstrCost(VecOpcode, VecTy, SrcVecTy, CCH, CostKind,

llvmbot · 2025-01-13T14:34:10Z

@llvm/pr-subscribers-backend-arm

Author: Nashe Mncube (nasherm)

Changes

PR #117350 made changes to the SLP vectorizer which introduced a regression on some ARM benchmarks. This was due to the changes assuming that SExt/ZExt vector instructions have constant cost. This behaviour is expected for RISCV but not on ARM where we take into account source and destination type of SExt/ZExt instructions when calculating vector cost.

Full diff: https://github.com/llvm/llvm-project/pull/122713.diff

6 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
(modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+3)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+8)
(modified) llvm/lib/Target/ARM/ARMTargetTransformInfo.h (+5)
(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+2-1)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 752313ab15858c..bac6500ed7184d 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1184,6 +1184,11 @@ class TargetTransformInfo {
   /// \return true if vscale is known to be a power of 2
   bool isVScaleKnownToBeAPowerOfTwo() const;
 
+  // \return true if vector implementations assume that SExt and ZExt
+  // instructions have a fixed cost.
+  bool isSExtCostConstant() const;
+  bool isZExtCostConstant() const;
+
   /// \return True if the vectorization factor should be chosen to
   /// make the vector of the smallest element type match the size of a
   /// vector register. For wider element types, this could result in
@@ -2065,6 +2070,8 @@ class TargetTransformInfo::Concept {
   virtual std::optional<unsigned> getMaxVScale() const = 0;
   virtual std::optional<unsigned> getVScaleForTuning() const = 0;
   virtual bool isVScaleKnownToBeAPowerOfTwo() const = 0;
+  virtual bool isSExtCostConstant() const = 0;
+  virtual bool isZExtCostConstant() const = 0;
   virtual bool
   shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const = 0;
   virtual ElementCount getMinimumVF(unsigned ElemWidth,
@@ -2719,6 +2726,8 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   bool isVScaleKnownToBeAPowerOfTwo() const override {
     return Impl.isVScaleKnownToBeAPowerOfTwo();
   }
+  bool isSExtCostConstant() const override { return Impl.isSExtCostConstant(); }
+  bool isZExtCostConstant() const override { return Impl.isZExtCostConstant(); }
   bool shouldMaximizeVectorBandwidth(
       TargetTransformInfo::RegisterKind K) const override {
     return Impl.shouldMaximizeVectorBandwidth(K);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 9c74b2a0c31df1..3120e27139d358 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -530,6 +530,8 @@ class TargetTransformInfoImplBase {
   std::optional<unsigned> getMaxVScale() const { return std::nullopt; }
   std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
   bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
+  bool isSExtCostConstant() const { return true; }
+  bool isZExtCostConstant() const { return true; }
 
   bool
   shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c9f142d64ae9e4..5785aa8d74b61c 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -801,6 +801,9 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
   std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
   bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
 
+  bool isSExtCostConstant() const { return true; }
+  bool isZExtCostConstant() const { return true; }
+
   /// Estimate the overhead of scalarizing an instruction. Insert and Extract
   /// are set if the demanded result elements need to be inserted and/or
   /// extracted from vectors.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b32dffa9f0fe86..63baea591bac45 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1405,6 +1405,14 @@ bool TargetTransformInfo::isProfitableToSinkOperands(
   return TTIImpl->isProfitableToSinkOperands(I, OpsToSink);
 }
 
+bool TargetTransformInfo::isSExtCostConstant() const {
+  return TTIImpl->isSExtCostConstant();
+}
+
+bool TargetTransformInfo::isZExtCostConstant() const {
+  return TTIImpl->isZExtCostConstant();
+}
+
 bool TargetTransformInfo::isVectorShiftByScalarCheap(Type *Ty) const {
   return TTIImpl->isVectorShiftByScalarCheap(Ty);
 }
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
index 3a4f940088b2e3..c2522b53987a3b 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
@@ -109,6 +109,11 @@ class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {
 
   bool enableInterleavedAccessVectorization() { return true; }
 
+  // Cost model for vector SExt and ZExt takes into account
+  // source and destination vector type for MVE and NEON.
+  bool isSExtCostConstant() const { return false; }
+  bool isZExtCostConstant() const { return false; }
+
   TTI::AddressingModeKind
     getPreferredAddressingMode(const Loop *L, ScalarEvolution *SE) const;
 
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index c4582df89213d8..7813c1248f9fcf 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11440,7 +11440,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
                                 I->getOpcode());
           });
       if (IsArithmeticExtendedReduction &&
-          (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt))
+          (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt) &&
+          (TTI->isZExtCostConstant() && TTI->isSExtCostConstant()))
         return CommonCost;
       return CommonCost +
              TTI->getCastInstrCost(VecOpcode, VecTy, SrcVecTy, CCH, CostKind,

alexey-bataev · 2025-01-13T14:53:04Z

You just need to modify TTI::getExtendedReductionCost function, no changes in SLP are required. Also, provide a test

nasherm · 2025-01-20T10:06:24Z

You just need to modify TTI::getExtendedReductionCost function, no changes in SLP are required. Also, provide a test

I've updated the patch with this in mind

davemgreen

The MVE instruction that getExtendedReductionCost is trying to model is vaddlv, which is a i64 vecreduce(v4i64 sext(v4i32)). So it would seem odd to add the cost of the extend again to it. BaseT::getExtendedReductionCost includes the cast cost too, so it would look like this is double-counting.

This is an example of MVE using the vaddlv instruction, which looks decent in this small example: https://llvm.godbolt.org/z/oTrMTcM4E

llvm/test/Transforms/SLPVectorizer/ARM/expensive-arithmetic-extended-reduction-mve.ll

nasherm · 2025-01-22T13:04:04Z

My most recent patch fixes the regression with a one line change to ARMTTI. From what I understand, MVE vectors greater than 128 bits aren't handled very well. It appears that the changes to SLP revealed a bug where we were allowing extensions of the type vadd v4i32 -> vadd v4i64.

@davemgreen does my fix make sense with this reasoning? I'm skeptical as I suspect I'm missing something which justifies the original code which allowed this extension.

davemgreen

It is usually the input type that matters for MVE, and a vadd v4i32 -> vadd v4i64 reduction is still a single vaddlv. The comment IIRC was talking about things like v8i32->i64 reductions, that would need to be split into two v4i32 reductions but is difficult under MVE due to the way predicates and extensions work.

There are some Loop-vectorize test that would need to be updated, but I believe we would still want to generate an extending reduction pattern. There might be something more complex going on where a mixture of MVE and scalar code doesn't perform as well as sticking with scalar. Or maybe the SLP vectorizer isn't handling multiple uses properly, or the scalar cost should be a little lower for SMLAL DSP instructions.

llvm/test/Transforms/SLPVectorizer/ARM/vadd-mve.ll

llvm/test/Transforms/SLPVectorizer/ARM/vadd-mve.s

nasherm · 2025-01-30T13:35:38Z

I've opted for adjusting the cost returned by getExtendedReductionCost for vadd reductions to be slightly more pessimistic. This returns performance numbers on regressed benchmarks while not changing SLPVectorizer and making minimal changes to the rest of the compiler.

davemgreen · 2025-02-03T12:41:21Z

There is a test failing, which I think says "we shouldn't do this". I would expect the cost of a VADDV and VADDLV to be about the same as other MVE instructions, unless we have a strong reason otherwise.

…SExt PR llvm#117350 made changes to the SLP vectorizer which introduced a regression on ARM vectorization benchmarks. This was due to the changes assuming that SExt/ZExt vector instructions have constant cost. This behaviour is expected for RISCV but not on ARM where we take into account source and destination type of SExt/ZExt instructions when calculating vector cost. Change-Id: I6f995dcde26e5aaf62b779b63e52988fb333f941

Change-Id: Ie2795e0e5ebb0589146eaf07c752410e307a36e6

Changes to the SLPVectorizer caused a regression in some benchmarks which targetted cores that support both DSP and MVE instructions. The particular regression has been reduced to a case where MUL instructions that are part of a chain of instructions that can be replaced with DSP SMLAL may also be vectorized. The generated code ends up being an inefficient of both scalar and vector ops rather than leaning to one or the other. By reducing the cost of these MUL instructions in these patterns we recover lost performance. Change-Id: I302817cf4fcd18a11d40fba430c44e034a36448b

nasherm · 2025-03-12T16:38:16Z

Update:
After revisiting this I narrowed down the cause of our regressions. Changes to the SLPVectorizer caused a regression in some benchmarks which targeted cores that support both DSP and MVE instructions. The particular regression has been reduced to a case where MUL instructions that are part of a chain of instructions that can be replaced with DSP SMLAL may also be vectorized. The generated code ends up being an inefficient combination of both scalar and vector ops rather than leaning to one or the other. By reducing the cost of these MUL instructions in these patterns we recover lost performance.

This means that my changes don't touch the SLPVectorizer and are reduced to just some changes to the cost model

davemgreen

Sounds good - it looks like this will work quite well.