[VPlan] Add support for in-loop AnyOf reductions #131830
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

Today, an AnyOf reduction will get neatly vectorized out-of-loop on RISC-V:

```c
int f(int *x, int y, int n) {
int z = 0;
for (int i = 0; i < n; i++)
if (x[i] == y)
z = 1;
return z;
}
```

```asm
.LBB0_5: # %vector.body
# =>This Inner Loop Header: Depth=1
vl2re32.v v10, (a3)
add a3, a3, a4
vsetvli zero, zero, e32, m2, ta, ma
vmseq.vx v9, v10, a1
vmor.mm v8, v8, v9
bne a3, a5, .LBB0_5
# %bb.6: # %middle.block
vcpop.m a3, v8
# ...
```

However, with EVL tail folding we get much worse codegen:

```asm
.LBB0_2: # %vector.body
# =>This Inner Loop Header: Depth=1
sub t0, a2, a3
sh2add a6, a3, a0
vsetvli t1, t0, e8, mf2, ta, ma
vsetvli a4, zero, e64, m4, ta, ma
vmv.v.x v16, t1
vmsleu.vv v9, v16, v12
vsetvli zero, t0, e32, m2, ta, ma
vle32.v v10, (a6)
sub a5, a5, a7
vsetvli a4, zero, e64, m4, ta, ma
vmsltu.vx v16, v12, t1
vmand.mm v9, v8, v9
vsetvli zero, zero, e32, m2, ta, ma
vmseq.vx v17, v10, a1
vmor.mm v8, v8, v17
vmand.mm v8, v8, v16
vmor.mm v8, v8, v9
add a3, a3, t1
bnez a5, .LBB0_2
# %bb.3: # %middle.block
vcpop.m a0, v8
snez a0, a0
ret
```

The issue is that we need an i1 vp.merge to preserve the tail elements on the final iteration, because the final reduction is performed across the entire vector:

```llvm
%9 = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
%10 = or <vscale x 4 x i1> %vec.phi, %9
%11 = call <vscale x 4 x i1> @llvm.vp.merge.nxv4i1(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i1> %10, <vscale x 4 x i1> %vec.phi, i32 %5)
```

However, on RISC-V there are no mask instructions that can preserve the tail, as per the specification:

> Mask destination tail elements are always treated as tail-agnostic, regardless of the setting of vta.

So the best lowering we have today is something like this:

```asm
vsetvli a1, zero, e64, m1, ta, ma
vid.v v10
vmsltu.vx v10, v10, a0
vmand.mm v9, v9, v10
vmandn.mm v8, v8, v9
vmand.mm v9, v0, v9
vmor.mm v0, v9, v8
```

One way we can avoid the vp.merge is to do an in-loop reduction, which for an i1 vector is cheap via vcpop.m:

```asm
.LBB0_2: # %vector.body
# =>This Inner Loop Header: Depth=1
sub a7, a2, a4
sh2add t0, a4, a0
vsetvli a7, a7, e32, m2, ta, ma
vle32.v v8, (t0)
sub a5, a5, a6
vmseq.vx v10, v8, a1
vcpop.m a3, v10
snez a3, a3
or t1, a3, t1
add a4, a4, a7
bnez a5, .LBB0_2
# %bb.3: # %middle.block
andi a0, t1, 1
```

This PR adds support for in-loop AnyOf reductions by emitting an or reduction. The resulting IR looks something like this:

```llvm
vector.body:
%vec.phi = phi i1 [ false, %for.body.preheader ], [ %9, %vector.body ]
...
%7 = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
%8 = tail call i1 @llvm.vp.reduce.or.nxv4i1(i1 false, <vscale x 4 x i1> %7, <vscale x 4 x i1> splat (i1 true), i32 %evl)
%.fr = freeze i1 %8
%9 = or i1 %.fr, %vec.phi
middle.block:
%rdx.select = select i1 %9, i32 0, i32 1
```

It still remains disabled by default; a later patch can opt into it when EVL tail folding is enabled on RISC-V.

Stacked on #131300.

Patch is 180.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/131830.diff

12 Files Affected:
```diff
diff --git a/llvm/include/llvm/Transforms/Utils/LoopUtils.h b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
index 1818ee03d2ec8..3ad7b8f17856c 100644
--- a/llvm/include/llvm/Transforms/Utils/LoopUtils.h
+++ b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
@@ -411,8 +411,8 @@ Value *createSimpleReduction(IRBuilderBase &B, Value *Src,
RecurKind RdxKind);
/// Overloaded function to generate vector-predication intrinsics for
/// reduction.
-Value *createSimpleReduction(VectorBuilder &VB, Value *Src,
- const RecurrenceDescriptor &Desc);
+Value *createSimpleReduction(VectorBuilder &VB, Value *Src, RecurKind RdxKind,
+ FastMathFlags FMFs);
/// Create a reduction of the given vector \p Src for a reduction of the
/// kind RecurKind::IAnyOf or RecurKind::FAnyOf. The reduction operation is
@@ -428,14 +428,12 @@ Value *createFindLastIVReduction(IRBuilderBase &B, Value *Src,
const RecurrenceDescriptor &Desc);
/// Create an ordered reduction intrinsic using the given recurrence
-/// descriptor \p Desc.
-Value *createOrderedReduction(IRBuilderBase &B,
- const RecurrenceDescriptor &Desc, Value *Src,
+/// kind \p Kind.
+Value *createOrderedReduction(IRBuilderBase &B, RecurKind Kind, Value *Src,
Value *Start);
/// Overloaded function to generate vector-predication intrinsics for ordered
/// reduction.
-Value *createOrderedReduction(VectorBuilder &VB,
- const RecurrenceDescriptor &Desc, Value *Src,
+Value *createOrderedReduction(VectorBuilder &VB, RecurKind Kind, Value *Src,
Value *Start);
/// Get the intersection (logical and) of all of the potential IR flags
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index f74ede4450ce5..a1dc74c9d0779 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -1184,7 +1184,7 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
// more expensive than out-of-loop reductions, and need to be costed more
// carefully.
unsigned ExpectedUses = 1;
- if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp)
+ if (isMinMaxRecurrenceKind(getRecurrenceKind()))
ExpectedUses = 2;
auto getNextInstruction = [&](Instruction *Cur) -> Instruction * {
@@ -1192,7 +1192,7 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
Instruction *UI = cast<Instruction>(User);
if (isa<PHINode>(UI))
continue;
- if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {
+ if (isMinMaxRecurrenceKind(Kind)) {
// We are expecting a icmp/select pair, which we go to the next select
// instruction if we can. We already know that Cur has 2 uses.
if (isa<SelectInst>(UI))
@@ -1204,11 +1204,13 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
return nullptr;
};
auto isCorrectOpcode = [&](Instruction *Cur) {
- if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {
+ if (isMinMaxRecurrenceKind(getRecurrenceKind())) {
Value *LHS, *RHS;
return SelectPatternResult::isMinOrMax(
matchSelectPattern(Cur, LHS, RHS).Flavor);
}
+ if (isAnyOfRecurrenceKind(getRecurrenceKind()))
+ return isa<SelectInst>(Cur);
// Recognize a call to the llvm.fmuladd intrinsic.
if (isFMulAddIntrinsic(Cur))
return true;
diff --git a/llvm/lib/Transforms/Utils/LoopUtils.cpp b/llvm/lib/Transforms/Utils/LoopUtils.cpp
index 185af8631454a..41f43a24e19e6 100644
--- a/llvm/lib/Transforms/Utils/LoopUtils.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUtils.cpp
@@ -1333,24 +1333,21 @@ Value *llvm::createSimpleReduction(IRBuilderBase &Builder, Value *Src,
}
Value *llvm::createSimpleReduction(VectorBuilder &VBuilder, Value *Src,
- const RecurrenceDescriptor &Desc) {
- RecurKind Kind = Desc.getRecurrenceKind();
+ RecurKind Kind, FastMathFlags FMFs) {
assert(!RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
!RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
"AnyOf or FindLastIV reductions are not supported.");
Intrinsic::ID Id = getReductionIntrinsicID(Kind);
auto *SrcTy = cast<VectorType>(Src->getType());
Type *SrcEltTy = SrcTy->getElementType();
- Value *Iden = getRecurrenceIdentity(Kind, SrcEltTy, Desc.getFastMathFlags());
+ Value *Iden = getRecurrenceIdentity(Kind, SrcEltTy, FMFs);
Value *Ops[] = {Iden, Src};
return VBuilder.createSimpleReduction(Id, SrcTy, Ops);
}
-Value *llvm::createOrderedReduction(IRBuilderBase &B,
- const RecurrenceDescriptor &Desc,
+Value *llvm::createOrderedReduction(IRBuilderBase &B, RecurKind Kind,
Value *Src, Value *Start) {
- assert((Desc.getRecurrenceKind() == RecurKind::FAdd ||
- Desc.getRecurrenceKind() == RecurKind::FMulAdd) &&
+ assert((Kind == RecurKind::FAdd || Kind == RecurKind::FMulAdd) &&
"Unexpected reduction kind");
assert(Src->getType()->isVectorTy() && "Expected a vector type");
assert(!Start->getType()->isVectorTy() && "Expected a scalar type");
@@ -1358,11 +1355,9 @@ Value *llvm::createOrderedReduction(IRBuilderBase &B,
return B.CreateFAddReduce(Start, Src);
}
-Value *llvm::createOrderedReduction(VectorBuilder &VBuilder,
- const RecurrenceDescriptor &Desc,
+Value *llvm::createOrderedReduction(VectorBuilder &VBuilder, RecurKind Kind,
Value *Src, Value *Start) {
- assert((Desc.getRecurrenceKind() == RecurKind::FAdd ||
- Desc.getRecurrenceKind() == RecurKind::FMulAdd) &&
+ assert((Kind == RecurKind::FAdd || Kind == RecurKind::FMulAdd) &&
"Unexpected reduction kind");
assert(Src->getType()->isVectorTy() && "Expected a vector type");
assert(!Start->getType()->isVectorTy() && "Expected a scalar type");
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cbfccaab01e27..bf4fd4c6af1c4 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5868,6 +5868,14 @@ LoopVectorizationCostModel::getReductionPatternCost(Instruction *I,
Intrinsic::ID MinMaxID = getMinMaxReductionIntrinsicOp(RK);
BaseCost = TTI.getMinMaxReductionCost(MinMaxID, VectorTy,
RdxDesc.getFastMathFlags(), CostKind);
+ } else if (RecurrenceDescriptor::isAnyOfRecurrenceKind(RK)) {
+ VectorType *BoolTy = VectorType::get(
+ Type::getInt1Ty(VectorTy->getContext()), VectorTy->getElementCount());
+ BaseCost =
+ TTI.getArithmeticReductionCost(Instruction::Or, BoolTy,
+ RdxDesc.getFastMathFlags(), CostKind) +
+ TTI.getArithmeticInstrCost(Instruction::Or, BoolTy->getScalarType(),
+ CostKind);
} else {
BaseCost = TTI.getArithmeticReductionCost(
RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);
@@ -9697,10 +9705,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();
RecurKind Kind = RdxDesc.getRecurrenceKind();
- assert(
- !RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
- !RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
- "AnyOf and FindLast reductions are not allowed for in-loop reductions");
+ assert(!RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
+ "FindLast reductions are not allowed for in-loop reductions");
// Collect the chain of "link" recipes for the reduction starting at PhiR.
SetVector<VPSingleDefRecipe *> Worklist;
@@ -9769,6 +9775,11 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
CurrentLinkI->getFastMathFlags());
LinkVPBB->insert(FMulRecipe, CurrentLink->getIterator());
VecOp = FMulRecipe;
+ } else if (RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind)) {
+ assert(isa<VPWidenSelectRecipe>(CurrentLink) &&
+ "must be a select recipe");
+ VecOp = CurrentLink->getOperand(0);
+ Kind = RecurKind::Or;
} else {
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {
if (isa<VPWidenRecipe>(CurrentLink)) {
@@ -9804,8 +9815,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
CondOp = RecipeBuilder.getBlockInMask(BB);
auto *RedRecipe = new VPReductionRecipe(
- RdxDesc, CurrentLinkI, PreviousLink, VecOp, CondOp,
- CM.useOrderedReductions(RdxDesc), CurrentLinkI->getDebugLoc());
+ Kind, RdxDesc.getFastMathFlags(), CurrentLinkI, PreviousLink, VecOp,
+ CondOp, CM.useOrderedReductions(RdxDesc),
+ CurrentLinkI->getDebugLoc());
// Append the recipe to the end of the VPBasicBlock because we need to
// ensure that it comes after all of it's inputs, including CondOp.
// Delete CurrentLink as it will be invalid if its operand is replaced
@@ -9929,10 +9941,17 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
// selected if the negated condition is true in any iteration.
if (Select->getOperand(1) == PhiR)
Cmp = Builder.createNot(Cmp);
- VPValue *Or = Builder.createOr(PhiR, Cmp);
- Select->getVPSingleValue()->replaceAllUsesWith(Or);
- // Delete Select now that it has invalid types.
- ToDelete.push_back(Select);
+
+ if (PhiR->isInLoop() && MinVF.isVector()) {
+ auto *Reduction = cast<VPReductionRecipe>(
+ *find_if(PhiR->users(), IsaPred<VPReductionRecipe>));
+ Reduction->setOperand(1, Cmp);
+ } else {
+ VPValue *Or = Builder.createOr(PhiR, Cmp);
+ Select->getVPSingleValue()->replaceAllUsesWith(Or);
+ // Delete Select now that it has invalid types.
+ ToDelete.push_back(Select);
+ }
// Convert the reduction phi to operate on bools.
PhiR->setOperand(0, Plan->getOrAddLiveIn(ConstantInt::getFalse(
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index ba24143e0b5b6..c16b4bed356df 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2239,22 +2239,21 @@ class VPInterleaveRecipe : public VPRecipeBase {
/// a vector operand into a scalar value, and adding the result to a chain.
/// The Operands are {ChainOp, VecOp, [Condition]}.
class VPReductionRecipe : public VPRecipeWithIRFlags {
- /// The recurrence decriptor for the reduction in question.
- const RecurrenceDescriptor &RdxDesc;
+ /// The recurrence kind for the reduction in question.
+ RecurKind RdxKind;
bool IsOrdered;
/// Whether the reduction is conditional.
bool IsConditional = false;
protected:
- VPReductionRecipe(const unsigned char SC, const RecurrenceDescriptor &R,
- Instruction *I, ArrayRef<VPValue *> Operands,
- VPValue *CondOp, bool IsOrdered, DebugLoc DL)
- : VPRecipeWithIRFlags(SC, Operands,
- isa_and_nonnull<FPMathOperator>(I)
- ? R.getFastMathFlags()
- : FastMathFlags(),
- DL),
- RdxDesc(R), IsOrdered(IsOrdered) {
+ VPReductionRecipe(const unsigned char SC, RecurKind RdxKind,
+ FastMathFlags FMFs, Instruction *I,
+ ArrayRef<VPValue *> Operands, VPValue *CondOp,
+ bool IsOrdered, DebugLoc DL)
+ : VPRecipeWithIRFlags(
+ SC, Operands,
+ isa_and_nonnull<FPMathOperator>(I) ? FMFs : FastMathFlags(), DL),
+ RdxKind(RdxKind), IsOrdered(IsOrdered) {
if (CondOp) {
IsConditional = true;
addOperand(CondOp);
@@ -2263,19 +2262,19 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
}
public:
- VPReductionRecipe(const RecurrenceDescriptor &R, Instruction *I,
+ VPReductionRecipe(RecurKind RdxKind, FastMathFlags FMFs, Instruction *I,
VPValue *ChainOp, VPValue *VecOp, VPValue *CondOp,
bool IsOrdered, DebugLoc DL = {})
- : VPReductionRecipe(VPDef::VPReductionSC, R, I,
+ : VPReductionRecipe(VPRecipeBase::VPReductionSC, RdxKind, FMFs, I,
ArrayRef<VPValue *>({ChainOp, VecOp}), CondOp,
IsOrdered, DL) {}
~VPReductionRecipe() override = default;
VPReductionRecipe *clone() override {
- return new VPReductionRecipe(RdxDesc, getUnderlyingInstr(), getChainOp(),
- getVecOp(), getCondOp(), IsOrdered,
- getDebugLoc());
+ return new VPReductionRecipe(RdxKind, getFastMathFlags(),
+ getUnderlyingInstr(), getChainOp(), getVecOp(),
+ getCondOp(), IsOrdered, getDebugLoc());
}
static inline bool classof(const VPRecipeBase *R) {
@@ -2301,9 +2300,11 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
VPSlotTracker &SlotTracker) const override;
#endif
- /// Return the recurrence decriptor for the in-loop reduction.
- const RecurrenceDescriptor &getRecurrenceDescriptor() const {
- return RdxDesc;
+ /// Return the recurrence kind for the in-loop reduction.
+ RecurKind getRecurrenceKind() const { return RdxKind; }
+ /// Return the opcode for the recurrence for the in-loop reduction.
+ unsigned getOpcode() const {
+ return RecurrenceDescriptor::getOpcode(RdxKind);
}
/// Return true if the in-loop reduction is ordered.
bool isOrdered() const { return IsOrdered; };
@@ -2328,7 +2329,8 @@ class VPReductionEVLRecipe : public VPReductionRecipe {
VPReductionEVLRecipe(VPReductionRecipe &R, VPValue &EVL, VPValue *CondOp,
DebugLoc DL = {})
: VPReductionRecipe(
- VPDef::VPReductionEVLSC, R.getRecurrenceDescriptor(),
+ VPDef::VPReductionEVLSC, R.getRecurrenceKind(),
+ R.getFastMathFlags(),
cast_or_null<Instruction>(R.getUnderlyingValue()),
ArrayRef<VPValue *>({R.getChainOp(), R.getVecOp(), &EVL}), CondOp,
R.isOrdered(), DL) {}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index d315dbe9b4170..9a13619ec56f8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -668,10 +668,10 @@ Value *VPInstruction::generate(VPTransformState &State) {
// Create the reduction after the loop. Note that inloop reductions create
// the target reduction in the loop using a Reduction recipe.
- if ((State.VF.isVector() ||
- RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) ||
- RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) &&
- !PhiR->isInLoop()) {
+ if (((State.VF.isVector() ||
+ RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) &&
+ !PhiR->isInLoop()) ||
+ RecurrenceDescriptor::isAnyOfRecurrenceKind(RK)) {
// TODO: Support in-order reductions based on the recurrence descriptor.
// All ops in the reduction inherit fast-math-flags from the recurrence
// descriptor.
@@ -2285,9 +2285,9 @@ void VPBlendRecipe::print(raw_ostream &O, const Twine &Indent,
void VPReductionRecipe::execute(VPTransformState &State) {
assert(!State.Lane && "Reduction being replicated.");
Value *PrevInChain = State.get(getChainOp(), /*IsScalar*/ true);
- RecurKind Kind = RdxDesc.getRecurrenceKind();
+ RecurKind Kind = getRecurrenceKind();
assert(!RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
- "In-loop AnyOf reductions aren't currently supported");
+ "In-loop AnyOf reduction should use Or reduction recipe");
// Propagate the fast-math flags carried by the underlying instruction.
IRBuilderBase::FastMathFlagGuard FMFGuard(State.Builder);
State.Builder.setFastMathFlags(getFastMathFlags());
@@ -2298,8 +2298,7 @@ void VPReductionRecipe::execute(VPTransformState &State) {
VectorType *VecTy = dyn_cast<VectorType>(NewVecOp->getType());
Type *ElementTy = VecTy ? VecTy->getElementType() : NewVecOp->getType();
- Value *Start =
- getRecurrenceIdentity(Kind, ElementTy, RdxDesc.getFastMathFlags());
+ Value *Start = getRecurrenceIdentity(Kind, ElementTy, getFastMathFlags());
if (State.VF.isVector())
Start = State.Builder.CreateVectorSplat(VecTy->getElementCount(), Start);
@@ -2311,21 +2310,20 @@ void VPReductionRecipe::execute(VPTransformState &State) {
if (IsOrdered) {
if (State.VF.isVector())
NewRed =
- createOrderedReduction(State.Builder, RdxDesc, NewVecOp, PrevInChain);
+ createOrderedReduction(State.Builder, Kind, NewVecOp, PrevInChain);
else
- NewRed = State.Builder.CreateBinOp(
- (Instruction::BinaryOps)RdxDesc.getOpcode(), PrevInChain, NewVecOp);
+ NewRed = State.Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(),
+ PrevInChain, NewVecOp);
PrevInChain = NewRed;
NextInChain = NewRed;
} else {
PrevInChain = State.get(getChainOp(), /*IsScalar*/ true);
NewRed = createSimpleReduction(State.Builder, NewVecOp, Kind);
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind))
- NextInChain = createMinMaxOp(State.Builder, RdxDesc.getRecurrenceKind(),
- NewRed, PrevInChain);
+ NextInChain = createMinMaxOp(State.Builder, Kind, NewRed, PrevInChain);
else
NextInChain = State.Builder.CreateBinOp(
- (Instruction::BinaryOps)RdxDesc.getOpcode(), NewRed, PrevInChain);
+ (Instruction::BinaryOps)getOpcode(), NewRed, PrevInChain);
}
State.set(this, NextInChain, /*IsScalar*/ true);
}
@@ -2336,10 +2334,9 @@ void VPReductionEVLRecipe::execute(VPTransformState &State) {
auto &Builder = State.Builder;
// Propagate the fast-math flags carried by the underlying instruction.
IRBuilderBase::FastMathFlagGuard FMFGuard(Builder);
- const RecurrenceDescriptor &RdxDesc = getRecurrenceDescriptor();
Builder.setFastMathFlags(getFastMathFlags());
- RecurKind Kind = RdxDesc.getRecurrenceKind();
+ RecurKind Kind = getRecurrenceKind();
Value *Prev = State.get(getChainOp(), /*IsScalar*/ true);
Value *VecOp = State.get(getVecOp());
Value *EVL = State.get(getEVL(), VPLane(0));
@@ -2356,25 +2353,23 @@ void VPReductionEVLRecipe::execute(VPTransformState &State) {
Value *NewRed;
if (isOrdered()) {
- NewRed = createOrderedReduction(VBuilder, RdxDesc, VecOp, Prev);
+ NewRed = createOrderedReduction(VBuilder, Kind, VecOp, Prev);
} else {
- NewRed = createSimpleReduction(VBuilder, VecOp, RdxDesc);
+ NewRed = createSimpleReduction(VBuilder, VecOp, Kind, getFastMathFlags());
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind))
NewRed = createMinMaxOp(Builder, Kind, NewRed, Prev);
else
- NewRed = Builder.CreateBinOp((Instruction::BinaryOps)RdxDesc.getOpcode(),
- NewRed, Prev);
+ NewRed = Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(), NewRed,
+ Prev);
}
State.set(this, NewRed, /*IsScalar*/ true);
}
InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
VPCostContext &Ctx) const {
- RecurKind RdxKind = RdxDesc.getRecurrenceKind();
+ RecurKind RdxKind = getRecurrenceKind();
Type *ElementTy = Ctx.Types.inferScalarType(this);
auto *VectorTy = cast<VectorType>(toVectorTy(ElementTy, VF));
- unsigned Opcode = RdxDesc.getOpcode();
- FastMathFlags FMFs = getFastMathFlags();
// TODO: Support any-of and in-loop reductions.
assert(
@@ -2386,20 +2381,17 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
ForceTargetInstructionCost.getNumOccurrences() > 0) &&
"In-loop reduction not implemented in VPlan-based cost model currently.");
- assert(...
```

[truncated]
The in-loop reductions may not be profitable at all.
Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.
I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.
The in-loop reduction should be a single vcpop.m, which I think should be cheap on most microarchitectures? On the spacemit-x60 it's 2 cycles: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html This is versus a regular comparison that is at least linear in EMUL, plus 4 mask instructions.
Nope, it is not.
Is there a specific microarchitecture that you can point to? RISCVSchedSiFive7.td and RISCVSchedSiFiveP600.td seem to imply that they are cheap/don't scale with VL. In any case we can disable it for a specific core in the hook.
> Is there a specific microarchitecture that you can point to? RISCVSchedSiFive7.td and RISCVSchedSiFiveP600.td seem to imply that they are cheap/don't scale with VL. In any case we can disable it for a specific core in the hook.
It would be good if this didn't require a dedicated hook, but instead just checked the cost of the operations, which should be marked as expensive on the CPUs where this is not profitable. Not sure if that is already happening in the current version, where we check the reduction cost independent of EVL AFAICT?
I was hoping we would just reuse the existing preferInLoopReductions hook. But yeah, it would be nice if we could have the cost model automatically select in-loop or out-of-loop reductions depending on the cost. We don't do it today for the tail folding style either; that's also a hook. We would need to create new VPlans for those? Is that something that could be explored in a separate PR?
@lukel97 We previously considered doing this to avoid generating the i1 vp.merge, but we ultimately decided against it. In our hardware, vcpop takes more cycles in the pipeline, causing snez to idle for too long. Perhaps the results vary depending on the hardware.
As I mentioned in #120405 (comment), another possible approach is to widen the type of the vp.merge, or to retain the original vectorization method, i.e. still use select in the vector loop instead of the or operation, and choose between them depending on TTI.
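For concreteness, a minimal IR sketch of the wide-merge option (hand-written and illustrative, reusing the value names from the PR description; not output from this PR):

```llvm
; Lanes where %cmp is set (and below %evl) become 1; all other lanes,
; including the tail, keep the old accumulator value. On RISC-V this can
; lower to a plain vmerge rather than i1 mask instructions.
%cmp = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
%acc = call <vscale x 4 x i32> @llvm.vp.merge.nxv4i32(<vscale x 4 x i1> %cmp, <vscale x 4 x i32> splat (i32 1), <vscale x 4 x i32> %vec.phi, i32 %evl)
```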
Thanks for the clarification, it looks like we have a difference in microarchitectures then. On the BPI-F3, the loop in the PR description is about 10% faster with an in-loop reduction vs an out-of-loop reduction.
But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?
…ctionOpChain. NFC

There are other types of recurrences with an icmp/fcmp opcode, AnyOf and FindLastIV, so don't rely on the opcode to detect them. This makes adding support for AnyOf in llvm#131830 easier.

Note that these currently fail the ExpectedUses/isCorrectOpcode checks anyway, so there shouldn't be any functional change.
I believe what we talked about doing internally was a vp.zext to i8, then an i8 vp.merge in the loop, with a vp.reduce.or after the loop. That avoids putting a vcpop.m in the loop.
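A minimal IR sketch of that suggestion (hand-written and illustrative; the value names are assumed, and a plain vector.reduce.or is used after the loop since every EVL has already been folded into the accumulator):

```llvm
vector.body:
  %vec.phi = phi <vscale x 4 x i8> [ zeroinitializer, %entry ], [ %acc, %vector.body ]
  ; ...
  %cmp = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
  ; Widen the i1 compare result to i8 so the accumulating merge is a plain
  ; vmerge instead of an i1 mask operation with no tail-preserving lowering.
  %cmp.i8 = call <vscale x 4 x i8> @llvm.vp.zext.nxv4i8.nxv4i1(<vscale x 4 x i1> %cmp, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %or = or <vscale x 4 x i8> %vec.phi, %cmp.i8
  ; Keep the tail lanes of the accumulator intact past %evl.
  %acc = call <vscale x 4 x i8> @llvm.vp.merge.nxv4i8(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i8> %or, <vscale x 4 x i8> %vec.phi, i32 %evl)

middle.block:
  %rdx = call i8 @llvm.vector.reduce.or.nxv4i8(<vscale x 4 x i8> %acc)
  %any = icmp ne i8 %rdx, 0
```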
I want to second this request. Even if there's a better option available for some (or even all) microarchitectures, there's no reason to block the functionality.
+1, this.
I just ran some tests; I think widening to i8 is also more profitable on the BPI-F3 vs vcpop.m, e.g.:

```asm
vsetvli a5, zero, e8, m1, ta, ma
vmv.v.i v9, 0
loop:
vsetvli a5, a7, e32, m1, ta, ma
vle32.v v8, (a0)
add a0, a0, a5
vmseq.vx v0, v8, zero
vsetvli zero, zero, e8, mf4, ta, ma
vmerge.vim v10, v9, 1, v0
vor.vv v11, v11, v10
sub a7, a7, a5
bnez a7, loop
exit:
vmsne.vi v10, v11, 0
vcpop.m a1, v10
```

This sounds like an approach all microarchitectures can agree on. Is anyone at SiFive already working on this? Otherwise I can take a look at it.
+1.
Go ahead.
…ctionOpChain. NFC (#132025)

There are other types of recurrences with an icmp/fcmp opcode, AnyOf and FindLastIV, so don't rely on the opcode to detect them. This makes adding support for AnyOf in #131830 easier.

Note that these currently fail the ExpectedUses/isCorrectOpcode checks anyway, so there shouldn't be any functional change.
Force-pushed from aba5052 to bb017c1.
This patch changes the preferInLoopReduction function to take a RecurKind instead of an unsigned Opcode. This makes it possible to distinguish non-arithmetic reductions such as min/max, AnyOf, and FindLastIV, and also helps unify IAnyOf with FAnyOf and IFindLastIV with FFindLastIV.

Related patches: #118393, #131830