-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[ARM]Adjust cost of muls in (U/S)MLAL and patterns #122713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-analysis Author: Nashe Mncube (nasherm) ChangesPR #117350 made changes to the SLP vectorizer which introduced a regression on some ARM benchmarks. This was due to the changes assuming that SExt/ZExt vector instructions have constant cost. This behaviour is expected for RISCV but not on ARM where we take into account source and destination type of SExt/ZExt instructions when calculating vector cost. Full diff: https://github.com/llvm/llvm-project/pull/122713.diff 6 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 752313ab15858c..bac6500ed7184d 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1184,6 +1184,11 @@ class TargetTransformInfo {
/// \return true if vscale is known to be a power of 2
bool isVScaleKnownToBeAPowerOfTwo() const;
+ // \return true if vector implementations assume that SExt and ZExt
+ // instructions have a fixed cost.
+ bool isSExtCostConstant() const;
+ bool isZExtCostConstant() const;
+
/// \return True if the vectorization factor should be chosen to
/// make the vector of the smallest element type match the size of a
/// vector register. For wider element types, this could result in
@@ -2065,6 +2070,8 @@ class TargetTransformInfo::Concept {
virtual std::optional<unsigned> getMaxVScale() const = 0;
virtual std::optional<unsigned> getVScaleForTuning() const = 0;
virtual bool isVScaleKnownToBeAPowerOfTwo() const = 0;
+ virtual bool isSExtCostConstant() const = 0;
+ virtual bool isZExtCostConstant() const = 0;
virtual bool
shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const = 0;
virtual ElementCount getMinimumVF(unsigned ElemWidth,
@@ -2719,6 +2726,8 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
bool isVScaleKnownToBeAPowerOfTwo() const override {
return Impl.isVScaleKnownToBeAPowerOfTwo();
}
+ bool isSExtCostConstant() const override { return Impl.isSExtCostConstant(); }
+ bool isZExtCostConstant() const override { return Impl.isZExtCostConstant(); }
bool shouldMaximizeVectorBandwidth(
TargetTransformInfo::RegisterKind K) const override {
return Impl.shouldMaximizeVectorBandwidth(K);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 9c74b2a0c31df1..3120e27139d358 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -530,6 +530,8 @@ class TargetTransformInfoImplBase {
std::optional<unsigned> getMaxVScale() const { return std::nullopt; }
std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
+ bool isSExtCostConstant() const { return true; }
+ bool isZExtCostConstant() const { return true; }
bool
shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c9f142d64ae9e4..5785aa8d74b61c 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -801,6 +801,9 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
+ bool isSExtCostConstant() const { return true; }
+ bool isZExtCostConstant() const { return true; }
+
/// Estimate the overhead of scalarizing an instruction. Insert and Extract
/// are set if the demanded result elements need to be inserted and/or
/// extracted from vectors.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b32dffa9f0fe86..63baea591bac45 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1405,6 +1405,14 @@ bool TargetTransformInfo::isProfitableToSinkOperands(
return TTIImpl->isProfitableToSinkOperands(I, OpsToSink);
}
+bool TargetTransformInfo::isSExtCostConstant() const {
+ return TTIImpl->isSExtCostConstant();
+}
+
+bool TargetTransformInfo::isZExtCostConstant() const {
+ return TTIImpl->isZExtCostConstant();
+}
+
bool TargetTransformInfo::isVectorShiftByScalarCheap(Type *Ty) const {
return TTIImpl->isVectorShiftByScalarCheap(Ty);
}
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
index 3a4f940088b2e3..c2522b53987a3b 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
@@ -109,6 +109,11 @@ class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {
bool enableInterleavedAccessVectorization() { return true; }
+ // Cost model for vector SExt and ZExt takes into account
+ // source and destination vector type for MVE and NEON.
+ bool isSExtCostConstant() const { return false; }
+ bool isZExtCostConstant() const { return false; }
+
TTI::AddressingModeKind
getPreferredAddressingMode(const Loop *L, ScalarEvolution *SE) const;
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index c4582df89213d8..7813c1248f9fcf 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11440,7 +11440,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
I->getOpcode());
});
if (IsArithmeticExtendedReduction &&
- (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt))
+ (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt) &&
+ (TTI->isZExtCostConstant() && TTI->isSExtCostConstant()))
return CommonCost;
return CommonCost +
TTI->getCastInstrCost(VecOpcode, VecTy, SrcVecTy, CCH, CostKind,
|
@llvm/pr-subscribers-backend-arm Author: Nashe Mncube (nasherm) ChangesPR #117350 made changes to the SLP vectorizer which introduced a regression on some ARM benchmarks. This was due to the changes assuming that SExt/ZExt vector instructions have constant cost. This behaviour is expected for RISCV but not on ARM where we take into account source and destination type of SExt/ZExt instructions when calculating vector cost. Full diff: https://github.com/llvm/llvm-project/pull/122713.diff 6 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 752313ab15858c..bac6500ed7184d 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1184,6 +1184,11 @@ class TargetTransformInfo {
/// \return true if vscale is known to be a power of 2
bool isVScaleKnownToBeAPowerOfTwo() const;
+ // \return true if vector implementations assume that SExt and ZExt
+ // instructions have a fixed cost.
+ bool isSExtCostConstant() const;
+ bool isZExtCostConstant() const;
+
/// \return True if the vectorization factor should be chosen to
/// make the vector of the smallest element type match the size of a
/// vector register. For wider element types, this could result in
@@ -2065,6 +2070,8 @@ class TargetTransformInfo::Concept {
virtual std::optional<unsigned> getMaxVScale() const = 0;
virtual std::optional<unsigned> getVScaleForTuning() const = 0;
virtual bool isVScaleKnownToBeAPowerOfTwo() const = 0;
+ virtual bool isSExtCostConstant() const = 0;
+ virtual bool isZExtCostConstant() const = 0;
virtual bool
shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const = 0;
virtual ElementCount getMinimumVF(unsigned ElemWidth,
@@ -2719,6 +2726,8 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
bool isVScaleKnownToBeAPowerOfTwo() const override {
return Impl.isVScaleKnownToBeAPowerOfTwo();
}
+ bool isSExtCostConstant() const override { return Impl.isSExtCostConstant(); }
+ bool isZExtCostConstant() const override { return Impl.isZExtCostConstant(); }
bool shouldMaximizeVectorBandwidth(
TargetTransformInfo::RegisterKind K) const override {
return Impl.shouldMaximizeVectorBandwidth(K);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 9c74b2a0c31df1..3120e27139d358 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -530,6 +530,8 @@ class TargetTransformInfoImplBase {
std::optional<unsigned> getMaxVScale() const { return std::nullopt; }
std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
+ bool isSExtCostConstant() const { return true; }
+ bool isZExtCostConstant() const { return true; }
bool
shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c9f142d64ae9e4..5785aa8d74b61c 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -801,6 +801,9 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
std::optional<unsigned> getVScaleForTuning() const { return std::nullopt; }
bool isVScaleKnownToBeAPowerOfTwo() const { return false; }
+ bool isSExtCostConstant() const { return true; }
+ bool isZExtCostConstant() const { return true; }
+
/// Estimate the overhead of scalarizing an instruction. Insert and Extract
/// are set if the demanded result elements need to be inserted and/or
/// extracted from vectors.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b32dffa9f0fe86..63baea591bac45 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1405,6 +1405,14 @@ bool TargetTransformInfo::isProfitableToSinkOperands(
return TTIImpl->isProfitableToSinkOperands(I, OpsToSink);
}
+bool TargetTransformInfo::isSExtCostConstant() const {
+ return TTIImpl->isSExtCostConstant();
+}
+
+bool TargetTransformInfo::isZExtCostConstant() const {
+ return TTIImpl->isZExtCostConstant();
+}
+
bool TargetTransformInfo::isVectorShiftByScalarCheap(Type *Ty) const {
return TTIImpl->isVectorShiftByScalarCheap(Ty);
}
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
index 3a4f940088b2e3..c2522b53987a3b 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
@@ -109,6 +109,11 @@ class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {
bool enableInterleavedAccessVectorization() { return true; }
+ // Cost model for vector SExt and ZExt takes into account
+ // source and destination vector type for MVE and NEON.
+ bool isSExtCostConstant() const { return false; }
+ bool isZExtCostConstant() const { return false; }
+
TTI::AddressingModeKind
getPreferredAddressingMode(const Loop *L, ScalarEvolution *SE) const;
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index c4582df89213d8..7813c1248f9fcf 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11440,7 +11440,8 @@ BoUpSLP::getEntryCost(const TreeEntry *E, ArrayRef<Value *> VectorizedVals,
I->getOpcode());
});
if (IsArithmeticExtendedReduction &&
- (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt))
+ (VecOpcode == Instruction::ZExt || VecOpcode == Instruction::SExt) &&
+ (TTI->isZExtCostConstant() && TTI->isSExtCostConstant()))
return CommonCost;
return CommonCost +
TTI->getCastInstrCost(VecOpcode, VecTy, SrcVecTy, CCH, CostKind,
|
You just need to modify TTI::getExtendedReductionCost function, no changes in SLP are required. Also, provide a test |
I've updated the patch with this in mind |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MVE instruction that getExtendedReductionCost is trying to model is vaddlv, which is a i64 vecreduce(v4i64 sext(v4i32))
. So it would seem odd to add the cost of the extend again to it. BaseT::getExtendedReductionCost includes the cast cost too, so it would look like this is double-counting.
This is an example of MVE using the vaddlv instruction, which looks decent in this small example: https://llvm.godbolt.org/z/oTrMTcM4E
llvm/test/Transforms/SLPVectorizer/ARM/expensive-arithmetic-extended-reduction-mve.ll
Outdated
Show resolved
Hide resolved
bb1060c
to
28b9d6a
Compare
My most recent patch fixes the regression with a one line change to ARMTTI. From what I understand, MVE vectors greater than 128 bits aren't handled very well. It appears that the changes to SLP revealed a bug where we were allowing extensions of the type @davemgreen does my fix make sense with this reasoning? I'm skeptical as I suspect I'm missing something which justifies the original code which allowed this extension. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is usually the input type that matters for MVE, and a vadd v4i32 -> vadd v4i64 reduction is still a single vaddlv. The comment IIRC was talking about things like v8i32->i64 reductions, that would need to be split into two v4i32 reductions but is difficult under MVE due to the way predicates and extensions work.
There are some Loop-vectorize test that would need to be updated, but I believe we would still want to generate an extending reduction pattern. There might be something more complex going on where a mixture of MVE and scalar code doesn't perform as well as sticking with scalar. Or maybe the SLP vectorizer isn't handling multiple uses properly, or the scalar cost should be a little lower for SMLAL DSP instructions.
I've opted for adjusting the cost returned by getExtendedReductionCost for vadd reductions to be slightly more pessimistic. This returns performance numbers on regressed benchmarks while not changing SLPVectorizer and making minimal changes to the rest of the compiler. |
There is a test failing, which I think says "we shouldn't do this". I would expect the cost of a VADDV and VADDLV to be about the same as other MVE instructions, unless we have a strong reason otherwise. |
…SExt PR llvm#117350 made changes to the SLP vectorizer which introduced a regression on ARM vectorization benchmarks. This was due to the changes assuming that SExt/ZExt vector instructions have constant cost. This behaviour is expected for RISCV but not on ARM where we take into account source and destination type of SExt/ZExt instructions when calculating vector cost. Change-Id: I6f995dcde26e5aaf62b779b63e52988fb333f941
Change-Id: Ie2795e0e5ebb0589146eaf07c752410e307a36e6
Changes to the SLPVectorizer caused a regression in some benchmarks which targetted cores that support both DSP and MVE instructions. The particular regression has been reduced to a case where MUL instructions that are part of a chain of instructions that can be replaced with DSP SMLAL may also be vectorized. The generated code ends up being an inefficient of both scalar and vector ops rather than leaning to one or the other. By reducing the cost of these MUL instructions in these patterns we recover lost performance. Change-Id: I302817cf4fcd18a11d40fba430c44e034a36448b
Update: This means that my changes don't touch the SLPVectorizer and are reduced to just some changes to the cost model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good - it looks like this will work quite well.
return false; | ||
|
||
auto IsSExtInst = [](const Value *V) -> bool { | ||
return (dyn_cast<SExtInst>(V)) ? true : false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return isa(V);
Is it worth adding ZExt at the same time for UMULL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Surprisingly no. Vectorization for my reproducer using unsigned types is done quite well
Correction. I double checked and yes this may be beneficial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So by adding ZExt we do avoid mixing vector and scalar instructions. The generated code doesn't take advantage of UMULL instructions however and just sticks with simple scalar ops. This was the behavior before the changes to the SLPVectorizer were made.
I do think that once this is merged there is further improvement to be done on SMLAL code gen which I intend to investigate i.e further code folding using DSP instructions. I could add UMULL codegen into my investigation as well.
Regardless adding ZExt support will just return previous behavior so I don't believe it to be a concern.
Change-Id: Ie8c40972ae07e568d767ace37b9dda0f77272569
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - other than the inline comments this looks good to me.
Change-Id: I0c20291a8a1d2cbc895890d02aea963323ee48cb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
Change-Id: I9ce62da8405a08ece5e36c7e1ea0bb93e6f0b6c4
Buildbot failures were caused by PR llvm#122713. This was due to unused captures in a lambda function. Change-Id: Ie0907396c2551190dd982bfa180934b4075271d8
auto IsExtInst = [](const Value *V) -> bool { | ||
return isa<ZExtInst>(V) || isa<SExtInst>(V); | ||
}; | ||
auto IsExtensionFromHalf = [&, IsExtInst](const Value *V) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsExtInst
not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Florian, yes. I saw it triggered some buildbot failures. I opened #132018 to fix this. Is it preferable to use this merged branch?
for (auto *U : I->users()) | ||
if (!IsExtInst(U)) | ||
return false; | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this just be
return all_of(I->users(), IsExtInst)
?
Buildbot failures were caused by PR #122713. This was due to unused captures in a lambda function.
PR #117350 made changes to the SLP vectorizer which introduced a regression on some ARM benchmarks. Investigation narrowed it down to suboptimal codegen for benchmarks that previously only used scalar SMLAL instructions. The linked change meant the SLPVectorizer thought that these could be vectorized. This change makes the cost of muls in (U/S)MLAL patterns slightly cheaper to make sure scalar instructions are preferred in these cases over SLP vectorization