[VPlan] Add support for in-loop AnyOf reductions #131830


Open: lukel97 wants to merge 1 commit into main from loop-vectorize/anyof-inloop

Conversation

@lukel97 (Contributor) commented Mar 18, 2025

Today, an AnyOf reduction will get neatly vectorized out-of-loop on RISC-V:

int f(int *x, int y, int n) {
  int z = 0;
  for (int i = 0; i < n; i++)
    if (x[i] == y)
      z = 1;
  return z;
}
.LBB0_5:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vl2re32.v	v10, (a3)
	add	a3, a3, a4
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v9, v10, a1
	vmor.mm	v8, v8, v9
	bne	a3, a5, .LBB0_5
# %bb.6:                                # %middle.block
	vcpop.m	a3, v8
	# ...

However, with EVL tail folding we get much worse codegen:

.LBB0_2:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	sub	t0, a2, a3
	sh2add	a6, a3, a0
	vsetvli	t1, t0, e8, mf2, ta, ma
	vsetvli	a4, zero, e64, m4, ta, ma
	vmv.v.x	v16, t1
	vmsleu.vv	v9, v16, v12
	vsetvli	zero, t0, e32, m2, ta, ma
	vle32.v	v10, (a6)
	sub	a5, a5, a7
	vsetvli	a4, zero, e64, m4, ta, ma
	vmsltu.vx	v16, v12, t1
	vmand.mm	v9, v8, v9
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v17, v10, a1
	vmor.mm	v8, v8, v17
	vmand.mm	v8, v8, v16
	vmor.mm	v8, v8, v9
	add	a3, a3, t1
	bnez	a5, .LBB0_2
# %bb.3:                                # %middle.block
	vcpop.m	a0, v8
	snez	a0, a0
	ret

The issue is that we need an i1 vp.merge to preserve the tail elements on the final iteration, because the final reduction is performed across the entire vector:

%9 = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
%10 = or <vscale x 4 x i1> %vec.phi, %9
%11 = call <vscale x 4 x i1> @llvm.vp.merge.nxv4i1(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i1> %10, <vscale x 4 x i1> %vec.phi, i32 %5)

However, on RISC-V there are no mask instructions that can preserve the tail, as per the specification:

> Mask destination tail elements are always treated as tail-agnostic, regardless of the setting of vta.

So the best lowering we have today is something like this:

      vsetvli a1, zero, e64, m1, ta, ma
      vid.v v10
      vmsltu.vx v10, v10, a0
      vmand.mm v9, v9, v10
      vmandn.mm v8, v8, v9
      vmand.mm v9, v0, v9
      vmor.mm v0, v9, v8

One way we can avoid the vp.merge is to do an in-loop reduction, which for an i1 vector is cheap via vcpop.m:

.LBB0_2:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	sub	a7, a2, a4
	sh2add	t0, a4, a0
	vsetvli	a7, a7, e32, m2, ta, ma
	vle32.v	v8, (t0)
	sub	a5, a5, a6
	vmseq.vx	v10, v8, a1
	vcpop.m	a3, v10
	snez	a3, a3
	or	t1, a3, t1
	add	a4, a4, a7
	bnez	a5, .LBB0_2
# %bb.3:                                # %middle.block
	andi	a0, t1, 1

This PR adds support for in-loop AnyOf reductions by emitting an or reduction. The resulting IR looks something like this:

vector.body:
  %vec.phi = phi i1 [ false, %for.body.preheader ], [ %9, %vector.body ]
  ...
  %7 = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
  %8 = tail call i1 @llvm.vp.reduce.or.nxv4i1(i1 false, <vscale x 4 x i1> %7, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %.fr = freeze i1 %8
  %9 = or i1 %.fr, %vec.phi

middle.block:
  %rdx.select = select i1 %9, i32 0, i32 1

It remains disabled by default; a later patch can opt into it when EVL tail folding is enabled on RISC-V.

Stacked on #131300

@llvmbot added the vectorizers, llvm:analysis, and llvm:transforms labels on Mar 18, 2025
@llvmbot (Member) commented Mar 18, 2025

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

(Same as the PR description above.)

Patch is 180.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/131830.diff

12 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Utils/LoopUtils.h (+5-7)
  • (modified) llvm/lib/Analysis/IVDescriptors.cpp (+5-3)
  • (modified) llvm/lib/Transforms/Utils/LoopUtils.cpp (+6-11)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+29-10)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+22-20)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+25-40)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-inloop-reduction.ll (+12-14)
  • (added) llvm/test/Transforms/LoopVectorize/select-cmp-blend.ll (+190)
  • (modified) llvm/test/Transforms/LoopVectorize/select-cmp-multiuse.ll (+431)
  • (modified) llvm/test/Transforms/LoopVectorize/select-cmp.ll (+1100)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+1-1)
  • (modified) llvm/unittests/Transforms/Vectorize/VPlanTest.cpp (+8-8)
diff --git a/llvm/include/llvm/Transforms/Utils/LoopUtils.h b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
index 1818ee03d2ec8..3ad7b8f17856c 100644
--- a/llvm/include/llvm/Transforms/Utils/LoopUtils.h
+++ b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
@@ -411,8 +411,8 @@ Value *createSimpleReduction(IRBuilderBase &B, Value *Src,
                              RecurKind RdxKind);
 /// Overloaded function to generate vector-predication intrinsics for
 /// reduction.
-Value *createSimpleReduction(VectorBuilder &VB, Value *Src,
-                             const RecurrenceDescriptor &Desc);
+Value *createSimpleReduction(VectorBuilder &VB, Value *Src, RecurKind RdxKind,
+                             FastMathFlags FMFs);
 
 /// Create a reduction of the given vector \p Src for a reduction of the
 /// kind RecurKind::IAnyOf or RecurKind::FAnyOf. The reduction operation is
@@ -428,14 +428,12 @@ Value *createFindLastIVReduction(IRBuilderBase &B, Value *Src,
                                  const RecurrenceDescriptor &Desc);
 
 /// Create an ordered reduction intrinsic using the given recurrence
-/// descriptor \p Desc.
-Value *createOrderedReduction(IRBuilderBase &B,
-                              const RecurrenceDescriptor &Desc, Value *Src,
+/// kind \p Kind.
+Value *createOrderedReduction(IRBuilderBase &B, RecurKind Kind, Value *Src,
                               Value *Start);
 /// Overloaded function to generate vector-predication intrinsics for ordered
 /// reduction.
-Value *createOrderedReduction(VectorBuilder &VB,
-                              const RecurrenceDescriptor &Desc, Value *Src,
+Value *createOrderedReduction(VectorBuilder &VB, RecurKind Kind, Value *Src,
                               Value *Start);
 
 /// Get the intersection (logical and) of all of the potential IR flags
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index f74ede4450ce5..a1dc74c9d0779 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -1184,7 +1184,7 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
   // more expensive than out-of-loop reductions, and need to be costed more
   // carefully.
   unsigned ExpectedUses = 1;
-  if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp)
+  if (isMinMaxRecurrenceKind(getRecurrenceKind()))
     ExpectedUses = 2;
 
   auto getNextInstruction = [&](Instruction *Cur) -> Instruction * {
@@ -1192,7 +1192,7 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
       Instruction *UI = cast<Instruction>(User);
       if (isa<PHINode>(UI))
         continue;
-      if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {
+      if (isMinMaxRecurrenceKind(Kind)) {
         // We are expecting a icmp/select pair, which we go to the next select
         // instruction if we can. We already know that Cur has 2 uses.
         if (isa<SelectInst>(UI))
@@ -1204,11 +1204,13 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
     return nullptr;
   };
   auto isCorrectOpcode = [&](Instruction *Cur) {
-    if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {
+    if (isMinMaxRecurrenceKind(getRecurrenceKind())) {
       Value *LHS, *RHS;
       return SelectPatternResult::isMinOrMax(
           matchSelectPattern(Cur, LHS, RHS).Flavor);
     }
+    if (isAnyOfRecurrenceKind(getRecurrenceKind()))
+      return isa<SelectInst>(Cur);
     // Recognize a call to the llvm.fmuladd intrinsic.
     if (isFMulAddIntrinsic(Cur))
       return true;
diff --git a/llvm/lib/Transforms/Utils/LoopUtils.cpp b/llvm/lib/Transforms/Utils/LoopUtils.cpp
index 185af8631454a..41f43a24e19e6 100644
--- a/llvm/lib/Transforms/Utils/LoopUtils.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUtils.cpp
@@ -1333,24 +1333,21 @@ Value *llvm::createSimpleReduction(IRBuilderBase &Builder, Value *Src,
 }
 
 Value *llvm::createSimpleReduction(VectorBuilder &VBuilder, Value *Src,
-                                   const RecurrenceDescriptor &Desc) {
-  RecurKind Kind = Desc.getRecurrenceKind();
+                                   RecurKind Kind, FastMathFlags FMFs) {
   assert(!RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
          !RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
          "AnyOf or FindLastIV reductions are not supported.");
   Intrinsic::ID Id = getReductionIntrinsicID(Kind);
   auto *SrcTy = cast<VectorType>(Src->getType());
   Type *SrcEltTy = SrcTy->getElementType();
-  Value *Iden = getRecurrenceIdentity(Kind, SrcEltTy, Desc.getFastMathFlags());
+  Value *Iden = getRecurrenceIdentity(Kind, SrcEltTy, FMFs);
   Value *Ops[] = {Iden, Src};
   return VBuilder.createSimpleReduction(Id, SrcTy, Ops);
 }
 
-Value *llvm::createOrderedReduction(IRBuilderBase &B,
-                                    const RecurrenceDescriptor &Desc,
+Value *llvm::createOrderedReduction(IRBuilderBase &B, RecurKind Kind,
                                     Value *Src, Value *Start) {
-  assert((Desc.getRecurrenceKind() == RecurKind::FAdd ||
-          Desc.getRecurrenceKind() == RecurKind::FMulAdd) &&
+  assert((Kind == RecurKind::FAdd || Kind == RecurKind::FMulAdd) &&
          "Unexpected reduction kind");
   assert(Src->getType()->isVectorTy() && "Expected a vector type");
   assert(!Start->getType()->isVectorTy() && "Expected a scalar type");
@@ -1358,11 +1355,9 @@ Value *llvm::createOrderedReduction(IRBuilderBase &B,
   return B.CreateFAddReduce(Start, Src);
 }
 
-Value *llvm::createOrderedReduction(VectorBuilder &VBuilder,
-                                    const RecurrenceDescriptor &Desc,
+Value *llvm::createOrderedReduction(VectorBuilder &VBuilder, RecurKind Kind,
                                     Value *Src, Value *Start) {
-  assert((Desc.getRecurrenceKind() == RecurKind::FAdd ||
-          Desc.getRecurrenceKind() == RecurKind::FMulAdd) &&
+  assert((Kind == RecurKind::FAdd || Kind == RecurKind::FMulAdd) &&
          "Unexpected reduction kind");
   assert(Src->getType()->isVectorTy() && "Expected a vector type");
   assert(!Start->getType()->isVectorTy() && "Expected a scalar type");
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cbfccaab01e27..bf4fd4c6af1c4 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5868,6 +5868,14 @@ LoopVectorizationCostModel::getReductionPatternCost(Instruction *I,
     Intrinsic::ID MinMaxID = getMinMaxReductionIntrinsicOp(RK);
     BaseCost = TTI.getMinMaxReductionCost(MinMaxID, VectorTy,
                                           RdxDesc.getFastMathFlags(), CostKind);
+  } else if (RecurrenceDescriptor::isAnyOfRecurrenceKind(RK)) {
+    VectorType *BoolTy = VectorType::get(
+        Type::getInt1Ty(VectorTy->getContext()), VectorTy->getElementCount());
+    BaseCost =
+        TTI.getArithmeticReductionCost(Instruction::Or, BoolTy,
+                                       RdxDesc.getFastMathFlags(), CostKind) +
+        TTI.getArithmeticInstrCost(Instruction::Or, BoolTy->getScalarType(),
+                                   CostKind);
   } else {
     BaseCost = TTI.getArithmeticReductionCost(
         RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);
@@ -9697,10 +9705,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
 
     const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();
     RecurKind Kind = RdxDesc.getRecurrenceKind();
-    assert(
-        !RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
-        !RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
-        "AnyOf and FindLast reductions are not allowed for in-loop reductions");
+    assert(!RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
+           "FindLast reductions are not allowed for in-loop reductions");
 
     // Collect the chain of "link" recipes for the reduction starting at PhiR.
     SetVector<VPSingleDefRecipe *> Worklist;
@@ -9769,6 +9775,11 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
             CurrentLinkI->getFastMathFlags());
         LinkVPBB->insert(FMulRecipe, CurrentLink->getIterator());
         VecOp = FMulRecipe;
+      } else if (RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind)) {
+        assert(isa<VPWidenSelectRecipe>(CurrentLink) &&
+               "must be a select recipe");
+        VecOp = CurrentLink->getOperand(0);
+        Kind = RecurKind::Or;
       } else {
         if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {
           if (isa<VPWidenRecipe>(CurrentLink)) {
@@ -9804,8 +9815,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
         CondOp = RecipeBuilder.getBlockInMask(BB);
 
       auto *RedRecipe = new VPReductionRecipe(
-          RdxDesc, CurrentLinkI, PreviousLink, VecOp, CondOp,
-          CM.useOrderedReductions(RdxDesc), CurrentLinkI->getDebugLoc());
+          Kind, RdxDesc.getFastMathFlags(), CurrentLinkI, PreviousLink, VecOp,
+          CondOp, CM.useOrderedReductions(RdxDesc),
+          CurrentLinkI->getDebugLoc());
       // Append the recipe to the end of the VPBasicBlock because we need to
       // ensure that it comes after all of it's inputs, including CondOp.
       // Delete CurrentLink as it will be invalid if its operand is replaced
@@ -9929,10 +9941,17 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       // selected if the negated condition is true in any iteration.
       if (Select->getOperand(1) == PhiR)
         Cmp = Builder.createNot(Cmp);
-      VPValue *Or = Builder.createOr(PhiR, Cmp);
-      Select->getVPSingleValue()->replaceAllUsesWith(Or);
-      // Delete Select now that it has invalid types.
-      ToDelete.push_back(Select);
+
+      if (PhiR->isInLoop() && MinVF.isVector()) {
+        auto *Reduction = cast<VPReductionRecipe>(
+            *find_if(PhiR->users(), IsaPred<VPReductionRecipe>));
+        Reduction->setOperand(1, Cmp);
+      } else {
+        VPValue *Or = Builder.createOr(PhiR, Cmp);
+        Select->getVPSingleValue()->replaceAllUsesWith(Or);
+        // Delete Select now that it has invalid types.
+        ToDelete.push_back(Select);
+      }
 
       // Convert the reduction phi to operate on bools.
       PhiR->setOperand(0, Plan->getOrAddLiveIn(ConstantInt::getFalse(
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index ba24143e0b5b6..c16b4bed356df 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2239,22 +2239,21 @@ class VPInterleaveRecipe : public VPRecipeBase {
 /// a vector operand into a scalar value, and adding the result to a chain.
 /// The Operands are {ChainOp, VecOp, [Condition]}.
 class VPReductionRecipe : public VPRecipeWithIRFlags {
-  /// The recurrence decriptor for the reduction in question.
-  const RecurrenceDescriptor &RdxDesc;
+  /// The recurrence kind for the reduction in question.
+  RecurKind RdxKind;
   bool IsOrdered;
   /// Whether the reduction is conditional.
   bool IsConditional = false;
 
 protected:
-  VPReductionRecipe(const unsigned char SC, const RecurrenceDescriptor &R,
-                    Instruction *I, ArrayRef<VPValue *> Operands,
-                    VPValue *CondOp, bool IsOrdered, DebugLoc DL)
-      : VPRecipeWithIRFlags(SC, Operands,
-                            isa_and_nonnull<FPMathOperator>(I)
-                                ? R.getFastMathFlags()
-                                : FastMathFlags(),
-                            DL),
-        RdxDesc(R), IsOrdered(IsOrdered) {
+  VPReductionRecipe(const unsigned char SC, RecurKind RdxKind,
+                    FastMathFlags FMFs, Instruction *I,
+                    ArrayRef<VPValue *> Operands, VPValue *CondOp,
+                    bool IsOrdered, DebugLoc DL)
+      : VPRecipeWithIRFlags(
+            SC, Operands,
+            isa_and_nonnull<FPMathOperator>(I) ? FMFs : FastMathFlags(), DL),
+        RdxKind(RdxKind), IsOrdered(IsOrdered) {
     if (CondOp) {
       IsConditional = true;
       addOperand(CondOp);
@@ -2263,19 +2262,19 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
   }
 
 public:
-  VPReductionRecipe(const RecurrenceDescriptor &R, Instruction *I,
+  VPReductionRecipe(RecurKind RdxKind, FastMathFlags FMFs, Instruction *I,
                     VPValue *ChainOp, VPValue *VecOp, VPValue *CondOp,
                     bool IsOrdered, DebugLoc DL = {})
-      : VPReductionRecipe(VPDef::VPReductionSC, R, I,
+      : VPReductionRecipe(VPRecipeBase::VPReductionSC, RdxKind, FMFs, I,
                           ArrayRef<VPValue *>({ChainOp, VecOp}), CondOp,
                           IsOrdered, DL) {}
 
   ~VPReductionRecipe() override = default;
 
   VPReductionRecipe *clone() override {
-    return new VPReductionRecipe(RdxDesc, getUnderlyingInstr(), getChainOp(),
-                                 getVecOp(), getCondOp(), IsOrdered,
-                                 getDebugLoc());
+    return new VPReductionRecipe(RdxKind, getFastMathFlags(),
+                                 getUnderlyingInstr(), getChainOp(), getVecOp(),
+                                 getCondOp(), IsOrdered, getDebugLoc());
   }
 
   static inline bool classof(const VPRecipeBase *R) {
@@ -2301,9 +2300,11 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
              VPSlotTracker &SlotTracker) const override;
 #endif
 
-  /// Return the recurrence decriptor for the in-loop reduction.
-  const RecurrenceDescriptor &getRecurrenceDescriptor() const {
-    return RdxDesc;
+  /// Return the recurrence kind for the in-loop reduction.
+  RecurKind getRecurrenceKind() const { return RdxKind; }
+  /// Return the opcode for the recurrence for the in-loop reduction.
+  unsigned getOpcode() const {
+    return RecurrenceDescriptor::getOpcode(RdxKind);
   }
   /// Return true if the in-loop reduction is ordered.
   bool isOrdered() const { return IsOrdered; };
@@ -2328,7 +2329,8 @@ class VPReductionEVLRecipe : public VPReductionRecipe {
   VPReductionEVLRecipe(VPReductionRecipe &R, VPValue &EVL, VPValue *CondOp,
                        DebugLoc DL = {})
       : VPReductionRecipe(
-            VPDef::VPReductionEVLSC, R.getRecurrenceDescriptor(),
+            VPDef::VPReductionEVLSC, R.getRecurrenceKind(),
+            R.getFastMathFlags(),
             cast_or_null<Instruction>(R.getUnderlyingValue()),
             ArrayRef<VPValue *>({R.getChainOp(), R.getVecOp(), &EVL}), CondOp,
             R.isOrdered(), DL) {}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index d315dbe9b4170..9a13619ec56f8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -668,10 +668,10 @@ Value *VPInstruction::generate(VPTransformState &State) {
 
     // Create the reduction after the loop. Note that inloop reductions create
     // the target reduction in the loop using a Reduction recipe.
-    if ((State.VF.isVector() ||
-         RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) ||
-         RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) &&
-        !PhiR->isInLoop()) {
+    if (((State.VF.isVector() ||
+          RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) &&
+         !PhiR->isInLoop()) ||
+        RecurrenceDescriptor::isAnyOfRecurrenceKind(RK)) {
       // TODO: Support in-order reductions based on the recurrence descriptor.
       // All ops in the reduction inherit fast-math-flags from the recurrence
       // descriptor.
@@ -2285,9 +2285,9 @@ void VPBlendRecipe::print(raw_ostream &O, const Twine &Indent,
 void VPReductionRecipe::execute(VPTransformState &State) {
   assert(!State.Lane && "Reduction being replicated.");
   Value *PrevInChain = State.get(getChainOp(), /*IsScalar*/ true);
-  RecurKind Kind = RdxDesc.getRecurrenceKind();
+  RecurKind Kind = getRecurrenceKind();
   assert(!RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
-         "In-loop AnyOf reductions aren't currently supported");
+         "In-loop AnyOf reduction should use Or reduction recipe");
   // Propagate the fast-math flags carried by the underlying instruction.
   IRBuilderBase::FastMathFlagGuard FMFGuard(State.Builder);
   State.Builder.setFastMathFlags(getFastMathFlags());
@@ -2298,8 +2298,7 @@ void VPReductionRecipe::execute(VPTransformState &State) {
     VectorType *VecTy = dyn_cast<VectorType>(NewVecOp->getType());
     Type *ElementTy = VecTy ? VecTy->getElementType() : NewVecOp->getType();
 
-    Value *Start =
-        getRecurrenceIdentity(Kind, ElementTy, RdxDesc.getFastMathFlags());
+    Value *Start = getRecurrenceIdentity(Kind, ElementTy, getFastMathFlags());
     if (State.VF.isVector())
       Start = State.Builder.CreateVectorSplat(VecTy->getElementCount(), Start);
 
@@ -2311,21 +2310,20 @@ void VPReductionRecipe::execute(VPTransformState &State) {
   if (IsOrdered) {
     if (State.VF.isVector())
       NewRed =
-          createOrderedReduction(State.Builder, RdxDesc, NewVecOp, PrevInChain);
+          createOrderedReduction(State.Builder, Kind, NewVecOp, PrevInChain);
     else
-      NewRed = State.Builder.CreateBinOp(
-          (Instruction::BinaryOps)RdxDesc.getOpcode(), PrevInChain, NewVecOp);
+      NewRed = State.Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(),
+                                         PrevInChain, NewVecOp);
     PrevInChain = NewRed;
     NextInChain = NewRed;
   } else {
     PrevInChain = State.get(getChainOp(), /*IsScalar*/ true);
     NewRed = createSimpleReduction(State.Builder, NewVecOp, Kind);
     if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind))
-      NextInChain = createMinMaxOp(State.Builder, RdxDesc.getRecurrenceKind(),
-                                   NewRed, PrevInChain);
+      NextInChain = createMinMaxOp(State.Builder, Kind, NewRed, PrevInChain);
     else
       NextInChain = State.Builder.CreateBinOp(
-          (Instruction::BinaryOps)RdxDesc.getOpcode(), NewRed, PrevInChain);
+          (Instruction::BinaryOps)getOpcode(), NewRed, PrevInChain);
   }
   State.set(this, NextInChain, /*IsScalar*/ true);
 }
@@ -2336,10 +2334,9 @@ void VPReductionEVLRecipe::execute(VPTransformState &State) {
   auto &Builder = State.Builder;
   // Propagate the fast-math flags carried by the underlying instruction.
   IRBuilderBase::FastMathFlagGuard FMFGuard(Builder);
-  const RecurrenceDescriptor &RdxDesc = getRecurrenceDescriptor();
   Builder.setFastMathFlags(getFastMathFlags());
 
-  RecurKind Kind = RdxDesc.getRecurrenceKind();
+  RecurKind Kind = getRecurrenceKind();
   Value *Prev = State.get(getChainOp(), /*IsScalar*/ true);
   Value *VecOp = State.get(getVecOp());
   Value *EVL = State.get(getEVL(), VPLane(0));
@@ -2356,25 +2353,23 @@ void VPReductionEVLRecipe::execute(VPTransformState &State) {
 
   Value *NewRed;
   if (isOrdered()) {
-    NewRed = createOrderedReduction(VBuilder, RdxDesc, VecOp, Prev);
+    NewRed = createOrderedReduction(VBuilder, Kind, VecOp, Prev);
   } else {
-    NewRed = createSimpleReduction(VBuilder, VecOp, RdxDesc);
+    NewRed = createSimpleReduction(VBuilder, VecOp, Kind, getFastMathFlags());
     if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind))
       NewRed = createMinMaxOp(Builder, Kind, NewRed, Prev);
     else
-      NewRed = Builder.CreateBinOp((Instruction::BinaryOps)RdxDesc.getOpcode(),
-                                   NewRed, Prev);
+      NewRed = Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(), NewRed,
+                                   Prev);
   }
   State.set(this, NewRed, /*IsScalar*/ true);
 }
 
 InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
                                                VPCostContext &Ctx) const {
-  RecurKind RdxKind = RdxDesc.getRecurrenceKind();
+  RecurKind RdxKind = getRecurrenceKind();
   Type *ElementTy = Ctx.Types.inferScalarType(this);
   auto *VectorTy = cast<VectorType>(toVectorTy(ElementTy, VF));
-  unsigned Opcode = RdxDesc.getOpcode();
-  FastMathFlags FMFs = getFastMathFlags();
 
   // TODO: Support any-of and in-loop reductions.
   assert(
@@ -2386,20 +2381,17 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
        ForceTargetInstructionCost.getNumOccurrences() > 0) &&
       "In-loop reduction not implemented in VPlan-based cost model currently.");
 
-  assert(...
[truncated]

@alexey-bataev (Member) commented:

The in-loop reductions may be completely not profitable

@lukel97 (Contributor, Author) commented Mar 18, 2025

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.
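
For illustration, here is a rough sketch of what such a target opt-in could look like. This is hypothetical, not code from this patch: the signature follows the later change that made the hook take a RecurKind (see the commit referenced at the end of this thread), and the EVL-tail-folding predicate is a made-up placeholder.

// Hypothetical sketch only; not the actual patch.
bool RISCVTTIImpl::preferInLoopReduction(RecurKind Kind, Type *Ty) const {
  // Prefer the in-loop (vcpop.m-based) form only for AnyOf recurrences,
  // where EVL tail folding would otherwise need an expensive i1 vp.merge.
  if (RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind))
    return PreferEVLTailFolding; // hypothetical tuning predicate
  return false;
}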

@alexey-bataev (Member) commented:

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

@lukel97 (Contributor, Author) commented Mar 18, 2025

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

The in-loop reduction should be a single vcpop.m, which I think should be cheap on most microarchitectures? On the spacemit-x60 it's 2 cycles: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html

This is versus at least a regular comparison (linear in EMUL) plus 4 mask instructions.

@alexey-bataev (Member) commented:

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

The in-loop reduction should be a single vcpop.m, which I think should be cheap on most microarchitectures?

Nope, it is not

@lukel97 (Contributor, Author) commented Mar 18, 2025

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

The in-loop reduction should be a single vcpop.m, which I think should be cheap on most microarchitectures?

Nope, it is not

Is there a specific microarchitecture that you can point to? RISCVSchedSiFive7.td and RISCVSchedSiFiveP600.td seem to imply that they are cheap/don't scale with VL. In any case we can disable it for a specific core in the hook.

@fhahn (Contributor) left a comment:

Is there a specific microarchitecture that you can point to? RISCVSchedSiFive7.td and RISCVSchedSiFiveP600.td seem to imply that they are cheap/don't scale with VL. In any case we can disable it for a specific core in the hook.

It would be good if this didn't require a dedicated hook, but instead just checked the cost of the operations, which should be marked as expensive on the CPUs where it is not profitable. Not sure if that is already happening in the current version, where we check the reduction cost independent of EVL, AFAICT?

@alexey-bataev requested a review from topperc on March 18, 2025
@lukel97 (Contributor, Author) commented Mar 18, 2025

It would be good if this wouldn't require a dedicated hook

I was hoping we could just reuse the existing preferInLoopReduction hook

But yeah it would be nice if we could have the cost model automatically select in-loop or out-of-loop reductions depending on the cost. We don't do it today for the tail folding style either, that's also a hook. We would need to create new VPlans for those? Is that something that could be explored in a separate PR?

@Mel-Chen (Contributor) left a comment:

@lukel97 We previously considered doing this to avoid generating i1 vp.merge, but we ultimately decided against it. In our hardware, vcpop takes more cycles in the pipeline, causing snez to idle for too long. Perhaps the results vary depending on the hardware.
As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge, or to retain the original vectorization method—i.e., still using select in the vector loop instead of the or operation, and choose the approach depending on TTI.

@lukel97 (Contributor, Author) commented Mar 19, 2025

@lukel97 We previously considered doing this to avoid generating i1 vp.merge, but we ultimately decided against it. In our hardware, vcpop takes more cycles in the pipeline, causing snez to idle for too long. Perhaps the results vary depending on the hardware. As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge, or to retain the original vectorization method—i.e., still using select in the vector loop instead of the or operation, and choose the approach depending on TTI.

Thanks for the clarification; it looks like we have a difference in microarchitectures then. On the BPI-F3, the loop in the PR description is about 10% faster with an in-loop reduction vs an out-of-loop reduction:

Details

luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.outofloop 102400000

 Performance counter stats for './anyof_rdx_test.outofloop 102400000' (10 runs):

            405.74 msec task-clock:u                     #    0.892 CPUs utilized            ( +-  7.88% )
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
           100,040      page-faults:u                    #  220.458 K/sec                  
       649,170,965      cycles:u                         #    1.431 GHz                      ( +-  7.88% )
       285,747,219      instructions:u                   #    0.39  insn per cycle           ( +-  0.02% )
         6,417,578      branches:u                       #   14.142 M/sec                    ( +-  0.00% )
             2,765      branch-misses:u                  #    0.04% of all branches          ( +-  2.09% )

            0.4550 +- 0.0320 seconds time elapsed  ( +-  7.03% )

luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.inloop 102400000

 Performance counter stats for './anyof_rdx_test.inloop 102400000' (10 runs):

            361.55 msec task-clock:u                     #    0.995 CPUs utilized            ( +-  0.30% )
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
           100,040      page-faults:u                    #  276.211 K/sec                  
       578,469,553      cycles:u                         #    1.597 GHz                      ( +-  0.30% )
       234,494,827      instructions:u                   #    0.40  insn per cycle           ( +-  0.00% )
         6,417,578      branches:u                       #   17.719 M/sec                    ( +-  0.00% )
             2,529      branch-misses:u                  #    0.04% of all branches          ( +-  1.66% )

           0.36327 +- 0.00108 seconds time elapsed  ( +-  0.30% )

But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Mar 19, 2025
…ctionOpChain. NFC

There are other types of recurrences with an icmp/fcmp opcode, AnyOf and FindLastIV, so don't rely on the opcode to detect them.
This makes adding support for AnyOf in llvm#131830 easier.

Note that these currently fail the ExpectedUses/isCorrectOpcode checks anyway, so there shouldn't be any functional change.
@topperc (Collaborator) commented Mar 19, 2025

I believe what we talked about doing internally was a vp.zext to i8 then an i8 vp.merge in the loop with a vp.reduce.or after the loop. That avoids putting a vcpop.m in the loop.
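
A rough IR sketch of that shape (hedged: the value names mimic the PR description's examples, but this exact IR is illustrative, and a plain vector.reduce.or is used for the post-loop step):

vector.body:
  %vec.phi = phi <vscale x 4 x i8> [ zeroinitializer, %entry ], [ %merge, %vector.body ]
  ...
  %cmp = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
  ; widen the i1 mask to i8 so the merge can be tail-preserving on RISC-V
  %ext = call <vscale x 4 x i8> @llvm.vp.zext.nxv4i8.nxv4i1(<vscale x 4 x i1> %cmp, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %or = or <vscale x 4 x i8> %vec.phi, %ext
  %merge = call <vscale x 4 x i8> @llvm.vp.merge.nxv4i8(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i8> %or, <vscale x 4 x i8> %vec.phi, i32 %evl)

middle.block:
  %rdx = call i8 @llvm.vector.reduce.or.nxv4i8(<vscale x 4 x i8> %merge)
  %any = icmp ne i8 %rdx, 0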

@preames (Collaborator) commented Mar 19, 2025

But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?

I want to second this request. Even if there's a better option available for some (or even all) micro-architectures, there's no reason to block the functionality.

@wangpc-pp (Contributor) commented:

But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?

I want to second this request. Even if there's a better option available for some (or even all) micro-architectures, there's no response to block the functionality.

+1, this vp.merge issue was also on my todo list but I haven't had time to dive into it. I'd like to see this supported as well. :-)

@lukel97 (Contributor, Author) commented Mar 20, 2025

As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge,

I believe what we talked about doing internally was a vp.zext to i8 then an i8 vp.merge in the loop with a vp.reduce.or after the loop. That avoids putting a vcpop.m in the loop.

I just ran some tests; widening to i8 also looks more profitable than vcpop.m on the BPI-F3, e.g.:

	vsetvli a5, zero, e8, m1, ta, ma
	vmv.v.i	v9, 0
loop:
	vsetvli	a5, a7, e32, m1, ta, ma
	vle32.v	v8, (a0)
	add	a0, a0, a5
	vmseq.vx	v0, v8, zero
	vsetvli	zero, zero, e8, mf4, ta, ma
	vmerge.vim	v10, v9, 1, v0
	vor.vv	v11, v11, v10
	sub	a7, a7, a5
	bnez	a7, loop
exit:
	vmsne.vi	v10, v11, 0
	vcpop.m	a1, v10

This sounds like an approach all microarchs can agree on. Is anyone at SiFive already working on this? Otherwise I can take a look at it.

@alexey-bataev (Member) commented:

As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge,

I believe what we talked about doing internally was a vp.zext to i8 then an i8 vp.merge in the loop with a vp.reduce.or after the loop. That avoids putting a vcpop.m in the loop.

I just ran some tests; widening to i8 also looks more profitable than vcpop.m on the BPI-F3, e.g.:

	vsetvli a5, zero, e8, m1, ta, ma
	vmv.v.i	v9, 0
loop:
	vsetvli	a5, a7, e32, m1, ta, ma
	vle32.v	v8, (a0)
	add	a0, a0, a5
	vmseq.vx	v0, v8, zero
	vsetvli	zero, zero, e8, mf4, ta, ma
	vmerge.vim	v10, v9, 1, v0
	vor.vv	v11, v11, v10
	sub	a7, a7, a5
	bnez	a7, loop
exit:
	vmsne.vi	v10, v11, 0
	vcpop.m	a1, v10

This sounds like an approach all microarchs can agree on.

+1.
Generally speaking, all such transformations should be cost-based decisions. There should be 3 VPlans: the original, the one with vcpop, and the one with extensions, and a cost-based decision should choose the best plan.
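
A purely illustrative driver for that decision, assuming hypothetical plumbing (how the candidate plans are built and this entry point itself are not existing LoopVectorize APIs):

// Illustrative only: choose among candidate VPlans by cost.
static VPlan *pickBestAnyOfPlan(ArrayRef<VPlanPtr> Plans, ElementCount VF,
                                VPCostContext &Ctx) {
  // Candidates: the original out-of-loop plan, the vcpop.m in-loop plan,
  // and the widened (i8 vp.merge) plan.
  VPlan *Best = nullptr;
  InstructionCost BestCost = InstructionCost::getMax();
  for (const VPlanPtr &P : Plans) {
    InstructionCost Cost = P->cost(VF, Ctx); // hypothetical costing call
    if (Cost < BestCost) {
      BestCost = Cost;
      Best = P.get();
    }
  }
  return Best;
}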

Is anyone at SiFive already working on this? Otherwise I can take a look at it.

Go ahead

lukel97 added a commit that referenced this pull request Mar 20, 2025
…ctionOpChain. NFC (#132025)

There are other types of recurrences with an icmp/fcmp opcode, AnyOf and
FindLastIV, so don't rely on the opcode to detect them.
This makes adding support for AnyOf in #131830 easier.

Note that these currently fail the ExpectedUses/isCorrectOpcode checks
anyway, so there shouldn't be any functional change.
@lukel97 force-pushed the loop-vectorize/anyof-inloop branch from aba5052 to bb017c1 on March 24, 2025
@lukel97 (Contributor, Author) commented Mar 24, 2025

Rebased now that #131300 is landed

From the discussion in #132180 it looks like there is still some interest in having this available for certain targets and cores to enable, and potentially if we ever allow vectorizing with larger LMULs on RISC-V.

Mel-Chen added a commit that referenced this pull request Apr 7, 2025
This patch changes the preferInLoopReduction function to take a
RecurKind instead of an unsigned Opcode.
This makes it possible to distinguish non-arithmetic reductions such as
min/max, AnyOf, and FindLastIV, and also helps unify IAnyOf with FAnyOf
and IFindLastIV with FFindLastIV.

Related patches: #118393, #131830