[VPlan] Add support for in-loop AnyOf reductions #131830


Open: lukel97 wants to merge 1 commit into main from loop-vectorize/anyof-inloop

Conversation

@lukel97 (Contributor) commented Mar 18, 2025

Today, an AnyOf reduction will get neatly vectorized out-of-loop on RISC-V:

int f(int *x, int y, int n) {
  int z = 0;
  for (int i = 0; i < n; i++)
    if (x[i] == y)
      z = 1;
  return z;
}
.LBB0_5:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vl2re32.v	v10, (a3)
	add	a3, a3, a4
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v9, v10, a1
	vmor.mm	v8, v8, v9
	bne	a3, a5, .LBB0_5
# %bb.6:                                # %middle.block
	vcpop.m	a3, v8
	# ...

However, with EVL tail folding we get much worse codegen:

.LBB0_2:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	sub	t0, a2, a3
	sh2add	a6, a3, a0
	vsetvli	t1, t0, e8, mf2, ta, ma
	vsetvli	a4, zero, e64, m4, ta, ma
	vmv.v.x	v16, t1
	vmsleu.vv	v9, v16, v12
	vsetvli	zero, t0, e32, m2, ta, ma
	vle32.v	v10, (a6)
	sub	a5, a5, a7
	vsetvli	a4, zero, e64, m4, ta, ma
	vmsltu.vx	v16, v12, t1
	vmand.mm	v9, v8, v9
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v17, v10, a1
	vmor.mm	v8, v8, v17
	vmand.mm	v8, v8, v16
	vmor.mm	v8, v8, v9
	add	a3, a3, t1
	bnez	a5, .LBB0_2
# %bb.3:                                # %middle.block
	vcpop.m	a0, v8
	snez	a0, a0
	ret

The issue is that we need an i1 vp.merge to preserve the tail elements on the final iteration, because the final reduction is performed across the entire vector:

%9 = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
%10 = or <vscale x 4 x i1> %vec.phi, %9
%11 = call <vscale x 4 x i1> @llvm.vp.merge.nxv4i1(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i1> %10, <vscale x 4 x i1> %vec.phi, i32 %5)

However, on RISC-V there are no mask instructions that can preserve the tail, as per the specification:

> Mask destination tail elements are always treated as tail-agnostic, regardless of the setting of vta.

So the best lowering we have today is something like this:

      vsetvli a1, zero, e64, m1, ta, ma
      vid.v v10
      vmsltu.vx v10, v10, a0
      vmand.mm v9, v9, v10
      vmandn.mm v8, v8, v9
      vmand.mm v9, v0, v9
      vmor.mm v0, v9, v8

One way we can avoid the vp.merge is to do an in-loop reduction, which for an i1 vector is cheap via vcpop.m:

.LBB0_2:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	sub	a7, a2, a4
	sh2add	t0, a4, a0
	vsetvli	a7, a7, e32, m2, ta, ma
	vle32.v	v8, (t0)
	sub	a5, a5, a6
	vmseq.vx	v10, v8, a1
	vcpop.m	a3, v10
	snez	a3, a3
	or	t1, a3, t1
	add	a4, a4, a7
	bnez	a5, .LBB0_2
# %bb.3:                                # %middle.block
	andi	a0, t1, 1

This PR adds support for in-loop AnyOf reductions by emitting an or reduction. The resulting IR looks something like this:

vector.body:
  %vec.phi = phi i1 [ false, %for.body.preheader ], [ %9, %vector.body ]
  ...
  %7 = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
  %8 = tail call i1 @llvm.vp.reduce.or.nxv4i1(i1 false, <vscale x 4 x i1> %7, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %.fr = freeze i1 %8
  %9 = or i1 %.fr, %vec.phi

middle.block:
  %rdx.select = select i1 %9, i32 0, i32 1

It remains disabled by default; a later patch can opt into it when EVL tail folding is enabled on RISC-V.

Stacked on #131300

@llvmbot added the vectorizers, llvm:analysis, and llvm:transforms labels on Mar 18, 2025
@llvmbot (Member) commented Mar 18, 2025

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

(Same as the PR description above.)

Patch is 180.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/131830.diff

12 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Utils/LoopUtils.h (+5-7)
  • (modified) llvm/lib/Analysis/IVDescriptors.cpp (+5-3)
  • (modified) llvm/lib/Transforms/Utils/LoopUtils.cpp (+6-11)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+29-10)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+22-20)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+25-40)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-inloop-reduction.ll (+12-14)
  • (added) llvm/test/Transforms/LoopVectorize/select-cmp-blend.ll (+190)
  • (modified) llvm/test/Transforms/LoopVectorize/select-cmp-multiuse.ll (+431)
  • (modified) llvm/test/Transforms/LoopVectorize/select-cmp.ll (+1100)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+1-1)
  • (modified) llvm/unittests/Transforms/Vectorize/VPlanTest.cpp (+8-8)
diff --git a/llvm/include/llvm/Transforms/Utils/LoopUtils.h b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
index 1818ee03d2ec8..3ad7b8f17856c 100644
--- a/llvm/include/llvm/Transforms/Utils/LoopUtils.h
+++ b/llvm/include/llvm/Transforms/Utils/LoopUtils.h
@@ -411,8 +411,8 @@ Value *createSimpleReduction(IRBuilderBase &B, Value *Src,
                              RecurKind RdxKind);
 /// Overloaded function to generate vector-predication intrinsics for
 /// reduction.
-Value *createSimpleReduction(VectorBuilder &VB, Value *Src,
-                             const RecurrenceDescriptor &Desc);
+Value *createSimpleReduction(VectorBuilder &VB, Value *Src, RecurKind RdxKind,
+                             FastMathFlags FMFs);
 
 /// Create a reduction of the given vector \p Src for a reduction of the
 /// kind RecurKind::IAnyOf or RecurKind::FAnyOf. The reduction operation is
@@ -428,14 +428,12 @@ Value *createFindLastIVReduction(IRBuilderBase &B, Value *Src,
                                  const RecurrenceDescriptor &Desc);
 
 /// Create an ordered reduction intrinsic using the given recurrence
-/// descriptor \p Desc.
-Value *createOrderedReduction(IRBuilderBase &B,
-                              const RecurrenceDescriptor &Desc, Value *Src,
+/// kind \p Kind.
+Value *createOrderedReduction(IRBuilderBase &B, RecurKind Kind, Value *Src,
                               Value *Start);
 /// Overloaded function to generate vector-predication intrinsics for ordered
 /// reduction.
-Value *createOrderedReduction(VectorBuilder &VB,
-                              const RecurrenceDescriptor &Desc, Value *Src,
+Value *createOrderedReduction(VectorBuilder &VB, RecurKind Kind, Value *Src,
                               Value *Start);
 
 /// Get the intersection (logical and) of all of the potential IR flags
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index f74ede4450ce5..a1dc74c9d0779 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -1184,7 +1184,7 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
   // more expensive than out-of-loop reductions, and need to be costed more
   // carefully.
   unsigned ExpectedUses = 1;
-  if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp)
+  if (isMinMaxRecurrenceKind(getRecurrenceKind()))
     ExpectedUses = 2;
 
   auto getNextInstruction = [&](Instruction *Cur) -> Instruction * {
@@ -1192,7 +1192,7 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
       Instruction *UI = cast<Instruction>(User);
       if (isa<PHINode>(UI))
         continue;
-      if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {
+      if (isMinMaxRecurrenceKind(Kind)) {
         // We are expecting a icmp/select pair, which we go to the next select
         // instruction if we can. We already know that Cur has 2 uses.
         if (isa<SelectInst>(UI))
@@ -1204,11 +1204,13 @@ RecurrenceDescriptor::getReductionOpChain(PHINode *Phi, Loop *L) const {
     return nullptr;
   };
   auto isCorrectOpcode = [&](Instruction *Cur) {
-    if (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp) {
+    if (isMinMaxRecurrenceKind(getRecurrenceKind())) {
       Value *LHS, *RHS;
       return SelectPatternResult::isMinOrMax(
           matchSelectPattern(Cur, LHS, RHS).Flavor);
     }
+    if (isAnyOfRecurrenceKind(getRecurrenceKind()))
+      return isa<SelectInst>(Cur);
     // Recognize a call to the llvm.fmuladd intrinsic.
     if (isFMulAddIntrinsic(Cur))
       return true;
diff --git a/llvm/lib/Transforms/Utils/LoopUtils.cpp b/llvm/lib/Transforms/Utils/LoopUtils.cpp
index 185af8631454a..41f43a24e19e6 100644
--- a/llvm/lib/Transforms/Utils/LoopUtils.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUtils.cpp
@@ -1333,24 +1333,21 @@ Value *llvm::createSimpleReduction(IRBuilderBase &Builder, Value *Src,
 }
 
 Value *llvm::createSimpleReduction(VectorBuilder &VBuilder, Value *Src,
-                                   const RecurrenceDescriptor &Desc) {
-  RecurKind Kind = Desc.getRecurrenceKind();
+                                   RecurKind Kind, FastMathFlags FMFs) {
   assert(!RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
          !RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
          "AnyOf or FindLastIV reductions are not supported.");
   Intrinsic::ID Id = getReductionIntrinsicID(Kind);
   auto *SrcTy = cast<VectorType>(Src->getType());
   Type *SrcEltTy = SrcTy->getElementType();
-  Value *Iden = getRecurrenceIdentity(Kind, SrcEltTy, Desc.getFastMathFlags());
+  Value *Iden = getRecurrenceIdentity(Kind, SrcEltTy, FMFs);
   Value *Ops[] = {Iden, Src};
   return VBuilder.createSimpleReduction(Id, SrcTy, Ops);
 }
 
-Value *llvm::createOrderedReduction(IRBuilderBase &B,
-                                    const RecurrenceDescriptor &Desc,
+Value *llvm::createOrderedReduction(IRBuilderBase &B, RecurKind Kind,
                                     Value *Src, Value *Start) {
-  assert((Desc.getRecurrenceKind() == RecurKind::FAdd ||
-          Desc.getRecurrenceKind() == RecurKind::FMulAdd) &&
+  assert((Kind == RecurKind::FAdd || Kind == RecurKind::FMulAdd) &&
          "Unexpected reduction kind");
   assert(Src->getType()->isVectorTy() && "Expected a vector type");
   assert(!Start->getType()->isVectorTy() && "Expected a scalar type");
@@ -1358,11 +1355,9 @@ Value *llvm::createOrderedReduction(IRBuilderBase &B,
   return B.CreateFAddReduce(Start, Src);
 }
 
-Value *llvm::createOrderedReduction(VectorBuilder &VBuilder,
-                                    const RecurrenceDescriptor &Desc,
+Value *llvm::createOrderedReduction(VectorBuilder &VBuilder, RecurKind Kind,
                                     Value *Src, Value *Start) {
-  assert((Desc.getRecurrenceKind() == RecurKind::FAdd ||
-          Desc.getRecurrenceKind() == RecurKind::FMulAdd) &&
+  assert((Kind == RecurKind::FAdd || Kind == RecurKind::FMulAdd) &&
          "Unexpected reduction kind");
   assert(Src->getType()->isVectorTy() && "Expected a vector type");
   assert(!Start->getType()->isVectorTy() && "Expected a scalar type");
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cbfccaab01e27..bf4fd4c6af1c4 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5868,6 +5868,14 @@ LoopVectorizationCostModel::getReductionPatternCost(Instruction *I,
     Intrinsic::ID MinMaxID = getMinMaxReductionIntrinsicOp(RK);
     BaseCost = TTI.getMinMaxReductionCost(MinMaxID, VectorTy,
                                           RdxDesc.getFastMathFlags(), CostKind);
+  } else if (RecurrenceDescriptor::isAnyOfRecurrenceKind(RK)) {
+    VectorType *BoolTy = VectorType::get(
+        Type::getInt1Ty(VectorTy->getContext()), VectorTy->getElementCount());
+    BaseCost =
+        TTI.getArithmeticReductionCost(Instruction::Or, BoolTy,
+                                       RdxDesc.getFastMathFlags(), CostKind) +
+        TTI.getArithmeticInstrCost(Instruction::Or, BoolTy->getScalarType(),
+                                   CostKind);
   } else {
     BaseCost = TTI.getArithmeticReductionCost(
         RdxDesc.getOpcode(), VectorTy, RdxDesc.getFastMathFlags(), CostKind);
@@ -9697,10 +9705,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
 
     const RecurrenceDescriptor &RdxDesc = PhiR->getRecurrenceDescriptor();
     RecurKind Kind = RdxDesc.getRecurrenceKind();
-    assert(
-        !RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
-        !RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
-        "AnyOf and FindLast reductions are not allowed for in-loop reductions");
+    assert(!RecurrenceDescriptor::isFindLastIVRecurrenceKind(Kind) &&
+           "FindLast reductions are not allowed for in-loop reductions");
 
     // Collect the chain of "link" recipes for the reduction starting at PhiR.
     SetVector<VPSingleDefRecipe *> Worklist;
@@ -9769,6 +9775,11 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
             CurrentLinkI->getFastMathFlags());
         LinkVPBB->insert(FMulRecipe, CurrentLink->getIterator());
         VecOp = FMulRecipe;
+      } else if (RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind)) {
+        assert(isa<VPWidenSelectRecipe>(CurrentLink) &&
+               "must be a select recipe");
+        VecOp = CurrentLink->getOperand(0);
+        Kind = RecurKind::Or;
       } else {
         if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind)) {
           if (isa<VPWidenRecipe>(CurrentLink)) {
@@ -9804,8 +9815,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
         CondOp = RecipeBuilder.getBlockInMask(BB);
 
       auto *RedRecipe = new VPReductionRecipe(
-          RdxDesc, CurrentLinkI, PreviousLink, VecOp, CondOp,
-          CM.useOrderedReductions(RdxDesc), CurrentLinkI->getDebugLoc());
+          Kind, RdxDesc.getFastMathFlags(), CurrentLinkI, PreviousLink, VecOp,
+          CondOp, CM.useOrderedReductions(RdxDesc),
+          CurrentLinkI->getDebugLoc());
       // Append the recipe to the end of the VPBasicBlock because we need to
       // ensure that it comes after all of it's inputs, including CondOp.
       // Delete CurrentLink as it will be invalid if its operand is replaced
@@ -9929,10 +9941,17 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       // selected if the negated condition is true in any iteration.
       if (Select->getOperand(1) == PhiR)
         Cmp = Builder.createNot(Cmp);
-      VPValue *Or = Builder.createOr(PhiR, Cmp);
-      Select->getVPSingleValue()->replaceAllUsesWith(Or);
-      // Delete Select now that it has invalid types.
-      ToDelete.push_back(Select);
+
+      if (PhiR->isInLoop() && MinVF.isVector()) {
+        auto *Reduction = cast<VPReductionRecipe>(
+            *find_if(PhiR->users(), IsaPred<VPReductionRecipe>));
+        Reduction->setOperand(1, Cmp);
+      } else {
+        VPValue *Or = Builder.createOr(PhiR, Cmp);
+        Select->getVPSingleValue()->replaceAllUsesWith(Or);
+        // Delete Select now that it has invalid types.
+        ToDelete.push_back(Select);
+      }
 
       // Convert the reduction phi to operate on bools.
       PhiR->setOperand(0, Plan->getOrAddLiveIn(ConstantInt::getFalse(
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index ba24143e0b5b6..c16b4bed356df 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2239,22 +2239,21 @@ class VPInterleaveRecipe : public VPRecipeBase {
 /// a vector operand into a scalar value, and adding the result to a chain.
 /// The Operands are {ChainOp, VecOp, [Condition]}.
 class VPReductionRecipe : public VPRecipeWithIRFlags {
-  /// The recurrence decriptor for the reduction in question.
-  const RecurrenceDescriptor &RdxDesc;
+  /// The recurrence kind for the reduction in question.
+  RecurKind RdxKind;
   bool IsOrdered;
   /// Whether the reduction is conditional.
   bool IsConditional = false;
 
 protected:
-  VPReductionRecipe(const unsigned char SC, const RecurrenceDescriptor &R,
-                    Instruction *I, ArrayRef<VPValue *> Operands,
-                    VPValue *CondOp, bool IsOrdered, DebugLoc DL)
-      : VPRecipeWithIRFlags(SC, Operands,
-                            isa_and_nonnull<FPMathOperator>(I)
-                                ? R.getFastMathFlags()
-                                : FastMathFlags(),
-                            DL),
-        RdxDesc(R), IsOrdered(IsOrdered) {
+  VPReductionRecipe(const unsigned char SC, RecurKind RdxKind,
+                    FastMathFlags FMFs, Instruction *I,
+                    ArrayRef<VPValue *> Operands, VPValue *CondOp,
+                    bool IsOrdered, DebugLoc DL)
+      : VPRecipeWithIRFlags(
+            SC, Operands,
+            isa_and_nonnull<FPMathOperator>(I) ? FMFs : FastMathFlags(), DL),
+        RdxKind(RdxKind), IsOrdered(IsOrdered) {
     if (CondOp) {
       IsConditional = true;
       addOperand(CondOp);
@@ -2263,19 +2262,19 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
   }
 
 public:
-  VPReductionRecipe(const RecurrenceDescriptor &R, Instruction *I,
+  VPReductionRecipe(RecurKind RdxKind, FastMathFlags FMFs, Instruction *I,
                     VPValue *ChainOp, VPValue *VecOp, VPValue *CondOp,
                     bool IsOrdered, DebugLoc DL = {})
-      : VPReductionRecipe(VPDef::VPReductionSC, R, I,
+      : VPReductionRecipe(VPRecipeBase::VPReductionSC, RdxKind, FMFs, I,
                           ArrayRef<VPValue *>({ChainOp, VecOp}), CondOp,
                           IsOrdered, DL) {}
 
   ~VPReductionRecipe() override = default;
 
   VPReductionRecipe *clone() override {
-    return new VPReductionRecipe(RdxDesc, getUnderlyingInstr(), getChainOp(),
-                                 getVecOp(), getCondOp(), IsOrdered,
-                                 getDebugLoc());
+    return new VPReductionRecipe(RdxKind, getFastMathFlags(),
+                                 getUnderlyingInstr(), getChainOp(), getVecOp(),
+                                 getCondOp(), IsOrdered, getDebugLoc());
   }
 
   static inline bool classof(const VPRecipeBase *R) {
@@ -2301,9 +2300,11 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
              VPSlotTracker &SlotTracker) const override;
 #endif
 
-  /// Return the recurrence decriptor for the in-loop reduction.
-  const RecurrenceDescriptor &getRecurrenceDescriptor() const {
-    return RdxDesc;
+  /// Return the recurrence kind for the in-loop reduction.
+  RecurKind getRecurrenceKind() const { return RdxKind; }
+  /// Return the opcode for the recurrence for the in-loop reduction.
+  unsigned getOpcode() const {
+    return RecurrenceDescriptor::getOpcode(RdxKind);
   }
   /// Return true if the in-loop reduction is ordered.
   bool isOrdered() const { return IsOrdered; };
@@ -2328,7 +2329,8 @@ class VPReductionEVLRecipe : public VPReductionRecipe {
   VPReductionEVLRecipe(VPReductionRecipe &R, VPValue &EVL, VPValue *CondOp,
                        DebugLoc DL = {})
       : VPReductionRecipe(
-            VPDef::VPReductionEVLSC, R.getRecurrenceDescriptor(),
+            VPDef::VPReductionEVLSC, R.getRecurrenceKind(),
+            R.getFastMathFlags(),
             cast_or_null<Instruction>(R.getUnderlyingValue()),
             ArrayRef<VPValue *>({R.getChainOp(), R.getVecOp(), &EVL}), CondOp,
             R.isOrdered(), DL) {}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index d315dbe9b4170..9a13619ec56f8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -668,10 +668,10 @@ Value *VPInstruction::generate(VPTransformState &State) {
 
     // Create the reduction after the loop. Note that inloop reductions create
     // the target reduction in the loop using a Reduction recipe.
-    if ((State.VF.isVector() ||
-         RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) ||
-         RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) &&
-        !PhiR->isInLoop()) {
+    if (((State.VF.isVector() ||
+          RecurrenceDescriptor::isFindLastIVRecurrenceKind(RK)) &&
+         !PhiR->isInLoop()) ||
+        RecurrenceDescriptor::isAnyOfRecurrenceKind(RK)) {
       // TODO: Support in-order reductions based on the recurrence descriptor.
       // All ops in the reduction inherit fast-math-flags from the recurrence
       // descriptor.
@@ -2285,9 +2285,9 @@ void VPBlendRecipe::print(raw_ostream &O, const Twine &Indent,
 void VPReductionRecipe::execute(VPTransformState &State) {
   assert(!State.Lane && "Reduction being replicated.");
   Value *PrevInChain = State.get(getChainOp(), /*IsScalar*/ true);
-  RecurKind Kind = RdxDesc.getRecurrenceKind();
+  RecurKind Kind = getRecurrenceKind();
   assert(!RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind) &&
-         "In-loop AnyOf reductions aren't currently supported");
+         "In-loop AnyOf reduction should use Or reduction recipe");
   // Propagate the fast-math flags carried by the underlying instruction.
   IRBuilderBase::FastMathFlagGuard FMFGuard(State.Builder);
   State.Builder.setFastMathFlags(getFastMathFlags());
@@ -2298,8 +2298,7 @@ void VPReductionRecipe::execute(VPTransformState &State) {
     VectorType *VecTy = dyn_cast<VectorType>(NewVecOp->getType());
     Type *ElementTy = VecTy ? VecTy->getElementType() : NewVecOp->getType();
 
-    Value *Start =
-        getRecurrenceIdentity(Kind, ElementTy, RdxDesc.getFastMathFlags());
+    Value *Start = getRecurrenceIdentity(Kind, ElementTy, getFastMathFlags());
     if (State.VF.isVector())
       Start = State.Builder.CreateVectorSplat(VecTy->getElementCount(), Start);
 
@@ -2311,21 +2310,20 @@ void VPReductionRecipe::execute(VPTransformState &State) {
   if (IsOrdered) {
     if (State.VF.isVector())
       NewRed =
-          createOrderedReduction(State.Builder, RdxDesc, NewVecOp, PrevInChain);
+          createOrderedReduction(State.Builder, Kind, NewVecOp, PrevInChain);
     else
-      NewRed = State.Builder.CreateBinOp(
-          (Instruction::BinaryOps)RdxDesc.getOpcode(), PrevInChain, NewVecOp);
+      NewRed = State.Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(),
+                                         PrevInChain, NewVecOp);
     PrevInChain = NewRed;
     NextInChain = NewRed;
   } else {
     PrevInChain = State.get(getChainOp(), /*IsScalar*/ true);
     NewRed = createSimpleReduction(State.Builder, NewVecOp, Kind);
     if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind))
-      NextInChain = createMinMaxOp(State.Builder, RdxDesc.getRecurrenceKind(),
-                                   NewRed, PrevInChain);
+      NextInChain = createMinMaxOp(State.Builder, Kind, NewRed, PrevInChain);
     else
       NextInChain = State.Builder.CreateBinOp(
-          (Instruction::BinaryOps)RdxDesc.getOpcode(), NewRed, PrevInChain);
+          (Instruction::BinaryOps)getOpcode(), NewRed, PrevInChain);
   }
   State.set(this, NextInChain, /*IsScalar*/ true);
 }
@@ -2336,10 +2334,9 @@ void VPReductionEVLRecipe::execute(VPTransformState &State) {
   auto &Builder = State.Builder;
   // Propagate the fast-math flags carried by the underlying instruction.
   IRBuilderBase::FastMathFlagGuard FMFGuard(Builder);
-  const RecurrenceDescriptor &RdxDesc = getRecurrenceDescriptor();
   Builder.setFastMathFlags(getFastMathFlags());
 
-  RecurKind Kind = RdxDesc.getRecurrenceKind();
+  RecurKind Kind = getRecurrenceKind();
   Value *Prev = State.get(getChainOp(), /*IsScalar*/ true);
   Value *VecOp = State.get(getVecOp());
   Value *EVL = State.get(getEVL(), VPLane(0));
@@ -2356,25 +2353,23 @@ void VPReductionEVLRecipe::execute(VPTransformState &State) {
 
   Value *NewRed;
   if (isOrdered()) {
-    NewRed = createOrderedReduction(VBuilder, RdxDesc, VecOp, Prev);
+    NewRed = createOrderedReduction(VBuilder, Kind, VecOp, Prev);
   } else {
-    NewRed = createSimpleReduction(VBuilder, VecOp, RdxDesc);
+    NewRed = createSimpleReduction(VBuilder, VecOp, Kind, getFastMathFlags());
     if (RecurrenceDescriptor::isMinMaxRecurrenceKind(Kind))
       NewRed = createMinMaxOp(Builder, Kind, NewRed, Prev);
     else
-      NewRed = Builder.CreateBinOp((Instruction::BinaryOps)RdxDesc.getOpcode(),
-                                   NewRed, Prev);
+      NewRed = Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(), NewRed,
+                                   Prev);
   }
   State.set(this, NewRed, /*IsScalar*/ true);
 }
 
 InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
                                                VPCostContext &Ctx) const {
-  RecurKind RdxKind = RdxDesc.getRecurrenceKind();
+  RecurKind RdxKind = getRecurrenceKind();
   Type *ElementTy = Ctx.Types.inferScalarType(this);
   auto *VectorTy = cast<VectorType>(toVectorTy(ElementTy, VF));
-  unsigned Opcode = RdxDesc.getOpcode();
-  FastMathFlags FMFs = getFastMathFlags();
 
   // TODO: Support any-of and in-loop reductions.
   assert(
@@ -2386,20 +2381,17 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
        ForceTargetInstructionCost.getNumOccurrences() > 0) &&
       "In-loop reduction not implemented in VPlan-based cost model currently.");
 
-  assert(...
[truncated]

@alexey-bataev (Member) commented:

The in-loop reductions may be completely not profitable

@lukel97 (Contributor, Author) commented Mar 18, 2025

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.
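
For illustration, here is a rough sketch of what such a target opt-in could look like. This is hypothetical, not code from this patch: the signature follows the later change that made the hook take a RecurKind (see the commit referenced at the end of this thread), and the EVL-tail-folding predicate is a made-up placeholder.

// Hypothetical sketch only; not the actual patch.
bool RISCVTTIImpl::preferInLoopReduction(RecurKind Kind, Type *Ty) const {
  // Prefer the in-loop (vcpop.m-based) form only for AnyOf recurrences,
  // where EVL tail folding would otherwise need an expensive i1 vp.merge.
  if (RecurrenceDescriptor::isAnyOfRecurrenceKind(Kind))
    return PreferEVLTailFolding; // hypothetical tuning predicate
  return false;
}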

@alexey-bataev (Member) commented:

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

@lukel97 (Contributor, Author) commented Mar 18, 2025

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

The in-loop reduction should be a single vcpop.m, which I think should be cheap on most microarchitectures? On the spacemit-x60 it's 2 cycles: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html

This is versus at least a regular comparison (linear in EMUL) plus 4 mask instructions.

@alexey-bataev (Member) commented:

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

The in-loop reduction should be a single vcpop.m, which I think should be cheap on most microarchitectures?

Nope, it is not

@lukel97 (Contributor, Author) commented Mar 18, 2025

The in-loop reductions may be completely not profitable

Yes, this is only profitable for EVL tail folding due to the vp.merge. The plan is to eventually use the TargetTransformInfo::preferInLoopReduction hook to only enable this for AnyOf reductions with EVL tail folding.

I mean, the in-loop reduction may be (significantly!!!) less profitable than vp.merge.

The in-loop reduction should be a single vcpop.m, which I think should be cheap on most microarchitectures?

Nope, it is not

Is there a specific microarchitecture that you can point to? RISCVSchedSiFive7.td and RISCVSchedSiFiveP600.td seem to imply that they are cheap/don't scale with VL. In any case we can disable it for a specific core in the hook.

@fhahn (Contributor) left a comment:

Is there a specific microarchitecture that you can point to? RISCVSchedSiFive7.td and RISCVSchedSiFiveP600.td seem to imply that they are cheap/don't scale with VL. In any case we can disable it for a specific core in the hook.

It would be good if this didn't require a dedicated hook, but instead just checked the cost of the operations, which should be marked as expensive on the CPUs where it is not profitable. Not sure if that is already happening in the current version, where we check the reduction cost independent of EVL, AFAICT?

@alexey-bataev requested a review from topperc on March 18, 2025
@lukel97 (Contributor, Author) commented Mar 18, 2025

It would be good if this wouldn't require a dedicated hook

I was hoping we could just reuse the existing preferInLoopReduction hook

But yeah it would be nice if we could have the cost model automatically select in-loop or out-of-loop reductions depending on the cost. We don't do it today for the tail folding style either, that's also a hook. We would need to create new VPlans for those? Is that something that could be explored in a separate PR?

@Mel-Chen (Contributor) left a comment:

@lukel97 We previously considered doing this to avoid generating i1 vp.merge, but we ultimately decided against it. In our hardware, vcpop takes more cycles in the pipeline, causing snez to idle for too long. Perhaps the results vary depending on the hardware.
As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge, or to retain the original vectorization method—i.e., still using select in the vector loop instead of the or operation, and choose the approach depending on TTI.

@lukel97 (Contributor, Author) commented Mar 19, 2025

@lukel97 We previously considered doing this to avoid generating i1 vp.merge, but we ultimately decided against it. In our hardware, vcpop takes more cycles in the pipeline, causing snez to idle for too long. Perhaps the results vary depending on the hardware. As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge, or to retain the original vectorization method—i.e., still using select in the vector loop instead of the or operation, and choose the approach depending on TTI.

Thanks for the clarification; it looks like we have a difference in microarchitectures then. On the BPI-F3, the loop in the PR description is about 10% faster with an in-loop reduction vs an out-of-loop reduction:

Details

luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.outofloop 102400000

 Performance counter stats for './anyof_rdx_test.outofloop 102400000' (10 runs):

            405.74 msec task-clock:u                     #    0.892 CPUs utilized            ( +-  7.88% )
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
           100,040      page-faults:u                    #  220.458 K/sec                  
       649,170,965      cycles:u                         #    1.431 GHz                      ( +-  7.88% )
       285,747,219      instructions:u                   #    0.39  insn per cycle           ( +-  0.02% )
         6,417,578      branches:u                       #   14.142 M/sec                    ( +-  0.00% )
             2,765      branch-misses:u                  #    0.04% of all branches          ( +-  2.09% )

            0.4550 +- 0.0320 seconds time elapsed  ( +-  7.03% )

luke@bananapif3-16gb:~$ perf stat -r10 ./anyof_rdx_test.inloop 102400000

 Performance counter stats for './anyof_rdx_test.inloop 102400000' (10 runs):

            361.55 msec task-clock:u                     #    0.995 CPUs utilized            ( +-  0.30% )
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
           100,040      page-faults:u                    #  276.211 K/sec                  
       578,469,553      cycles:u                         #    1.597 GHz                      ( +-  0.30% )
       234,494,827      instructions:u                   #    0.40  insn per cycle           ( +-  0.00% )
         6,417,578      branches:u                       #   17.719 M/sec                    ( +-  0.00% )
             2,529      branch-misses:u                  #    0.04% of all branches          ( +-  1.66% )

           0.36327 +- 0.00108 seconds time elapsed  ( +-  0.30% )

But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Mar 19, 2025
…ctionOpChain. NFC

There are other types of recurrences with an icmp/fcmp opcode, AnyOf and FindLastIV, so don't rely on the opcode to detect them.
This makes adding support for AnyOf in llvm#131830 easier.

Note that these currently fail the ExpectedUses/isCorrectOpcode checks anyway, so there shouldn't be any functional change.
@topperc (Collaborator) commented Mar 19, 2025

I believe what we talked about doing internally was a vp.zext to i8 then an i8 vp.merge in the loop with a vp.reduce.or after the loop. That avoids putting a vcpop.m in the loop.
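
A rough IR sketch of that shape (hedged: the value names mimic the PR description's examples, but this exact IR is illustrative, and a plain vector.reduce.or is used for the post-loop step):

vector.body:
  %vec.phi = phi <vscale x 4 x i8> [ zeroinitializer, %entry ], [ %merge, %vector.body ]
  ...
  %cmp = icmp eq <vscale x 4 x i32> %vp.op.load, %broadcast.splat
  ; widen the i1 mask to i8 so the merge can be tail-preserving on RISC-V
  %ext = call <vscale x 4 x i8> @llvm.vp.zext.nxv4i8.nxv4i1(<vscale x 4 x i1> %cmp, <vscale x 4 x i1> splat (i1 true), i32 %evl)
  %or = or <vscale x 4 x i8> %vec.phi, %ext
  %merge = call <vscale x 4 x i8> @llvm.vp.merge.nxv4i8(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i8> %or, <vscale x 4 x i8> %vec.phi, i32 %evl)

middle.block:
  %rdx = call i8 @llvm.vector.reduce.or.nxv4i8(<vscale x 4 x i8> %merge)
  %any = icmp ne i8 %rdx, 0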

@preames (Collaborator) commented Mar 19, 2025

But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?

I want to second this request. Even if there's a better option available for some (or even all) micro-architectures, there's no reason to block the functionality.

@wangpc-pp (Contributor) commented:

But since this PR only adds support, should we move the discussion on enabling it/adding a tuning flag to another PR?

I want to second this request. Even if there's a better option available for some (or even all) micro-architectures, there's no response to block the functionality.

+1, this vp.merge issue was also on my todo list but I haven't had time to dive into it. I'd like to see this supported as well. :-)

@lukel97 (Contributor, Author) commented Mar 20, 2025

As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge,

I believe what we talked about doing internally was a vp.zext to i8 then an i8 vp.merge in the loop with a vp.reduce.or after the loop. That avoids putting a vcpop.m in the loop.

I just ran some tests; widening to i8 also looks more profitable than vcpop.m on the BPI-F3, e.g.:

	vsetvli a5, zero, e8, m1, ta, ma
	vmv.v.i	v9, 0
loop:
	vsetvli	a5, a7, e32, m1, ta, ma
	vle32.v	v8, (a0)
	add	a0, a0, a5
	vmseq.vx	v0, v8, zero
	vsetvli	zero, zero, e8, mf4, ta, ma
	vmerge.vim	v10, v9, 1, v0
	vor.vv	v11, v11, v10
	sub	a7, a7, a5
	bnez	a7, loop
exit:
	vmsne.vi	v10, v11, 0
	vcpop.m	a1, v10

This sounds like an approach all microarchs can agree on. Is anyone at SiFive already working on this? Otherwise I can take a look at it.

@alexey-bataev (Member) commented:

As I mentioned in #120405 (comment), another possible approach is to widen the type of vp.merge,

I believe what we talked about doing internally was a vp.zext to i8 then an i8 vp.merge in the loop with a vp.reduce.or after the loop. That avoids putting a vcpop.m in the loop.

I just ran some tests; widening to i8 also looks more profitable than vcpop.m on the BPI-F3, e.g.:

	vsetvli a5, zero, e8, m1, ta, ma
	vmv.v.i	v9, 0
loop:
	vsetvli	a5, a7, e32, m1, ta, ma
	vle32.v	v8, (a0)
	add	a0, a0, a5
	vmseq.vx	v0, v8, zero
	vsetvli	zero, zero, e8, mf4, ta, ma
	vmerge.vim	v10, v9, 1, v0
	vor.vv	v11, v11, v10
	sub	a7, a7, a5
	bnez	a7, loop
exit:
	vmsne.vi	v10, v11, 0
	vcpop.m	a1, v10

This sounds like an approach all microarchs can agree on.

+1.
Generally speaking, all such transformations should be cost-based decisions. There should be 3 VPlans: the original, the one with vcpop, and the one with extensions, and a cost-based decision should choose the best plan.
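
A purely illustrative driver for that decision, assuming hypothetical plumbing (how the candidate plans are built and this entry point itself are not existing LoopVectorize APIs):

// Illustrative only: choose among candidate VPlans by cost.
static VPlan *pickBestAnyOfPlan(ArrayRef<VPlanPtr> Plans, ElementCount VF,
                                VPCostContext &Ctx) {
  // Candidates: the original out-of-loop plan, the vcpop.m in-loop plan,
  // and the widened (i8 vp.merge) plan.
  VPlan *Best = nullptr;
  InstructionCost BestCost = InstructionCost::getMax();
  for (const VPlanPtr &P : Plans) {
    InstructionCost Cost = P->cost(VF, Ctx); // hypothetical costing call
    if (Cost < BestCost) {
      BestCost = Cost;
      Best = P.get();
    }
  }
  return Best;
}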

Is anyone at SiFive already working on this? Otherwise I can take a look at it.

Go ahead

lukel97 added a commit that referenced this pull request Mar 20, 2025
…ctionOpChain. NFC (#132025)

There are other types of recurrences with an icmp/fcmp opcode, AnyOf and
FindLastIV, so don't rely on the opcode to detect them.
This makes adding support for AnyOf in #131830 easier.

Note that these currently fail the ExpectedUses/isCorrectOpcode checks
anyway, so there shouldn't be any functional change.
@lukel97 force-pushed the loop-vectorize/anyof-inloop branch from aba5052 to bb017c1 on March 24, 2025
@lukel97 (Contributor, Author) commented Mar 24, 2025

Rebased now that #131300 is landed

From the discussion in #132180 it looks like there is still some interest in having this available for certain targets and cores to enable, and potentially if we ever allow vectorizing with larger LMULs on RISC-V.

Mel-Chen added a commit that referenced this pull request Apr 7, 2025
This patch changes the preferInLoopReduction function to take a
RecurKind instead of an unsigned Opcode.
This makes it possible to distinguish non-arithmetic reductions such as
min/max, AnyOf, and FindLastIV, and also helps unify IAnyOf with FAnyOf
and IFindLastIV with FFindLastIV.

Related patches: #118393, #131830