[VPlan] Remove ILV::sinkScalarOperands. #136023

fhahn · 2025-04-16T20:34:06Z

Remove legacy ILV sinkScalarOperands, which is superseded by the
sinkScalarOperands VPlan transforms.

There are a few cases that aren't handled by VPlan's sinkScalarOperands,
because the recipes don't support replicating. Those are pointer
inductions and blends.

We could probably improve this further, by allowing replication for more
recipes, but I don't think the extra complexity is warranted.

Depends on #136021.
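
For readers unfamiliar with the sinking being discussed, here is a minimal sketch of the worklist idea on a toy IR. It is loosely modeled on the legacy function removed in the diff below, but it is simplified (no phi-aware use check, no separate reanalysis list) and it is not VPlan's implementation. `Inst`, `allUsersInBlock` and `sinkOperands` are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy IR, not LLVM's: an instruction lives in a block and refers to its
// operands by index into one flat vector. All names are hypothetical.
struct Inst {
  int Block;                 // Id of the block the instruction lives in.
  std::vector<int> Operands; // Indices of the operand instructions.
  bool Sinkable;             // False for phis, loads, side-effecting ops.
};

// Returns true if every user of instruction Def lives in block PredBB.
static bool allUsersInBlock(const std::vector<Inst> &Insts, int Def,
                            int PredBB) {
  for (const Inst &User : Insts)
    for (int Op : User.Operands)
      if (Op == Def && User.Block != PredBB)
        return false;
  return true;
}

// Sinks operands of PredInst into its block, repeating until a full pass
// moves nothing. Returns the number of instructions that were moved.
static int sinkOperands(std::vector<Inst> &Insts, int PredInst) {
  assert(PredInst >= 0 && static_cast<std::size_t>(PredInst) < Insts.size());
  const int PredBB = Insts[PredInst].Block;
  int NumSunk = 0;
  bool Changed;
  do {
    Changed = false;
    std::vector<int> Worklist = Insts[PredInst].Operands;
    while (!Worklist.empty()) {
      int I = Worklist.back();
      Worklist.pop_back();
      if (!Insts[I].Sinkable)
        continue;
      if (Insts[I].Block != PredBB) {
        // Only legal if all users are already in the predicated block;
        // otherwise leave it and retry on the next pass.
        if (!allUsersInBlock(Insts, I, PredBB))
          continue;
        Insts[I].Block = PredBB;
        ++NumSunk;
        Changed = true;
      }
      // I is in PredBB now, so its operands become candidates as well.
      for (int Op : Insts[I].Operands)
        Worklist.push_back(Op);
    }
  } while (Changed);
  return NumSunk;
}
```

In this model, an address computation used only by a predicated store is moved next to the store, while a value with another user outside the predicated block stays where it is.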

llvmbot · 2025-04-16T20:34:41Z

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Patch is 130.20 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/136023.diff

23 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (-88)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+10-4)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/blend-costs.ll (+19-19)
(modified) llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll (+6-6)
(modified) llvm/test/Transforms/LoopVectorize/X86/cost-model.ll (+10-7)
(modified) llvm/test/Transforms/LoopVectorize/X86/pr51366-sunk-instruction-used-outside-of-loop.ll (+2-2)
(modified) llvm/test/Transforms/LoopVectorize/X86/small-size.ll (+28-28)
(modified) llvm/test/Transforms/LoopVectorize/debugloc.ll (+4-2)
(modified) llvm/test/Transforms/LoopVectorize/dereferenceable-info-from-assumption-constant-size.ll (+15-15)
(modified) llvm/test/Transforms/LoopVectorize/float-induction.ll (+9-9)
(modified) llvm/test/Transforms/LoopVectorize/if-pred-stores.ll (+50-50)
(modified) llvm/test/Transforms/LoopVectorize/induction.ll (+56-56)
(modified) llvm/test/Transforms/LoopVectorize/load-deref-pred-align.ll (+46-46)
(modified) llvm/test/Transforms/LoopVectorize/load-of-struct-deref-pred.ll (+44-44)
(modified) llvm/test/Transforms/LoopVectorize/loop-form.ll (+2-2)
(modified) llvm/test/Transforms/LoopVectorize/pointer-induction.ll (+6-6)
(modified) llvm/test/Transforms/LoopVectorize/select-cmp-multiuse.ll (+4-4)
(modified) llvm/test/Transforms/LoopVectorize/select-cmp-predicated.ll (+5-5)
(modified) llvm/test/Transforms/LoopVectorize/struct-return.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll (+7-7)
(modified) llvm/test/Transforms/LoopVectorize/trip-count-expansion-may-introduce-ub.ll (+6-6)
(modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+2-1)
(modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll (+2-1)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index dd7f05465a50b..d2f93bb7de2c8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -541,10 +541,6 @@ class InnerLoopVectorizer {
 protected:
   friend class LoopVectorizationPlanner;
 
-  /// Iteratively sink the scalarized operands of a predicated instruction into
-  /// the block that was created for it.
-  void sinkScalarOperands(Instruction *PredInst);
-
   /// Returns (and creates if needed) the trip count of the widened loop.
   Value *getOrCreateVectorTripCount(BasicBlock *InsertBlock);
 
@@ -629,9 +625,6 @@ class InnerLoopVectorizer {
   /// A list of all bypass blocks. The first block is the entry of the loop.
   SmallVector<BasicBlock *, 4> LoopBypassBlocks;
 
-  /// Store instructions that were predicated.
-  SmallVector<Instruction *, 4> PredicatedInstructions;
-
   /// Trip count of the original loop.
   Value *TripCount = nullptr;
 
@@ -2385,15 +2378,12 @@ void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,
 
   // End if-block.
   VPRegionBlock *Parent = RepRecipe->getParent()->getParent();
-  bool IfPredicateInstr = Parent ? Parent->isReplicator() : false;
   assert(
       (Parent || !RepRecipe->getParent()->getPlan()->getVectorLoopRegion() ||
        all_of(RepRecipe->operands(),
               [](VPValue *Op) { return Op->isDefinedOutsideLoopRegions(); })) &&
       "Expected a recipe is either within a region or all of its operands "
       "are defined outside the vectorized region.");
-  if (IfPredicateInstr)
-    PredicatedInstructions.push_back(Cloned);
 }
 
 Value *
@@ -2867,8 +2857,6 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State) {
   if (!State.Plan->getVectorLoopRegion())
     return;
 
-  for (Instruction *PI : PredicatedInstructions)
-    sinkScalarOperands(&*PI);
 
   VPRegionBlock *VectorRegion = State.Plan->getVectorLoopRegion();
   VPBasicBlock *HeaderVPBB = VectorRegion->getEntryBasicBlock();
@@ -2895,82 +2883,6 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State) {
                                VF.getKnownMinValue() * UF);
 }
 
-void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) {
-  // The basic block and loop containing the predicated instruction.
-  auto *PredBB = PredInst->getParent();
-  auto *VectorLoop = LI->getLoopFor(PredBB);
-
-  // Initialize a worklist with the operands of the predicated instruction.
-  SetVector<Value *> Worklist(PredInst->op_begin(), PredInst->op_end());
-
-  // Holds instructions that we need to analyze again. An instruction may be
-  // reanalyzed if we don't yet know if we can sink it or not.
-  SmallVector<Instruction *, 8> InstsToReanalyze;
-
-  // Returns true if a given use occurs in the predicated block. Phi nodes use
-  // their operands in their corresponding predecessor blocks.
-  auto IsBlockOfUsePredicated = [&](Use &U) -> bool {
-    auto *I = cast<Instruction>(U.getUser());
-    BasicBlock *BB = I->getParent();
-    if (auto *Phi = dyn_cast<PHINode>(I))
-      BB = Phi->getIncomingBlock(
-          PHINode::getIncomingValueNumForOperand(U.getOperandNo()));
-    return BB == PredBB;
-  };
-
-  // Iteratively sink the scalarized operands of the predicated instruction
-  // into the block we created for it. When an instruction is sunk, it's
-  // operands are then added to the worklist. The algorithm ends after one pass
-  // through the worklist doesn't sink a single instruction.
-  bool Changed;
-  do {
-    // Add the instructions that need to be reanalyzed to the worklist, and
-    // reset the changed indicator.
-    Worklist.insert_range(InstsToReanalyze);
-    InstsToReanalyze.clear();
-    Changed = false;
-
-    while (!Worklist.empty()) {
-      auto *I = dyn_cast<Instruction>(Worklist.pop_back_val());
-
-      // We can't sink an instruction if it is a phi node, is not in the loop,
-      // may have side effects or may read from memory.
-      // TODO: Could do more granular checking to allow sinking
-      // a load past non-store instructions.
-      if (!I || isa<PHINode>(I) || !VectorLoop->contains(I) ||
-          I->mayHaveSideEffects() || I->mayReadFromMemory())
-          continue;
-
-      // If the instruction is already in PredBB, check if we can sink its
-      // operands. In that case, VPlan's sinkScalarOperands() succeeded in
-      // sinking the scalar instruction I, hence it appears in PredBB; but it
-      // may have failed to sink I's operands (recursively), which we try
-      // (again) here.
-      if (I->getParent() == PredBB) {
-        Worklist.insert_range(I->operands());
-        continue;
-      }
-
-      // It's legal to sink the instruction if all its uses occur in the
-      // predicated block. Otherwise, there's nothing to do yet, and we may
-      // need to reanalyze the instruction.
-      if (!llvm::all_of(I->uses(), IsBlockOfUsePredicated)) {
-        InstsToReanalyze.push_back(I);
-        continue;
-      }
-
-      // Move the instruction to the beginning of the predicated block, and add
-      // it's operands to the worklist.
-      I->moveBefore(PredBB->getFirstInsertionPt());
-      Worklist.insert_range(I->operands());
-
-      // The sinking may have enabled other instructions to be sunk, so we will
-      // need to iterate.
-      Changed = true;
-    }
-  } while (Changed);
-}
-
 void InnerLoopVectorizer::fixNonInductionPHIs(VPTransformState &State) {
   auto Iter = vp_depth_first_deep(Plan.getEntry());
   for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(Iter)) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index d0cb1c3ac590f..c89a07e11c5cc 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -164,7 +164,8 @@ static bool sinkScalarOperands(VPlan &Plan) {
         return true;
       NeedsDuplicating = UI->onlyFirstLaneUsed(SinkCandidate);
       // We only know how to duplicate VPRecipeRecipes for now.
-      return NeedsDuplicating && isa<VPReplicateRecipe>(SinkCandidate);
+      return NeedsDuplicating &&
+             isa<VPReplicateRecipe, VPScalarIVStepsRecipe>(SinkCandidate);
     };
     if (!all_of(SinkCandidate->users(), CanSinkWithUser))
       continue;
@@ -172,9 +173,14 @@ static bool sinkScalarOperands(VPlan &Plan) {
     if (NeedsDuplicating) {
       if (ScalarVFOnly)
         continue;
-      Instruction *I = SinkCandidate->getUnderlyingInstr();
-      auto *Clone = new VPReplicateRecipe(I, SinkCandidate->operands(), true);
-      // TODO: add ".cloned" suffix to name of Clone's VPValue.
+      VPSingleDefRecipe *Clone;
+      if (isa<VPReplicateRecipe>(SinkCandidate)) {
+        Instruction *I = SinkCandidate->getUnderlyingInstr();
+        Clone = new VPReplicateRecipe(I, SinkCandidate->operands(), true);
+        // TODO: add ".cloned" suffix to name of Clone's VPValue.
+      } else {
+        Clone = SinkCandidate->clone();
+      }
 
       Clone->insertBefore(SinkCandidate);
       SinkCandidate->replaceUsesWithIf(Clone, [SinkTo](VPUser &U, unsigned) {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/blend-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/blend-costs.ll
index e302bf195ef8e..3c8bbaa46f275 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/blend-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/blend-costs.ll
@@ -211,8 +211,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP5:%.*]] = xor <16 x i1> [[BROADCAST_SPLAT]], splat (i1 true)
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE30:.*]] ]
-; CHECK-NEXT:    [[IV:%.*]] = add i32 [[INDEX]], 0
+; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE30:.*]] ]
 ; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i32 [[IV]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[GEP_SRC]], i32 0
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP2]], align 1
@@ -224,7 +223,8 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP8:%.*]] = extractelement <16 x i1> [[TMP7]], i32 0
 ; CHECK-NEXT:    br i1 [[TMP8]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
 ; CHECK:       [[PRED_STORE_IF]]:
-; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[IV]]
+; CHECK-NEXT:    [[TMP72:%.*]] = add i32 [[IV]], 0
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP72]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 0
 ; CHECK-NEXT:    store i8 [[TMP10]], ptr [[TMP9]], align 1
 ; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
@@ -232,7 +232,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <16 x i1> [[TMP7]], i32 1
 ; CHECK-NEXT:    br i1 [[TMP11]], label %[[PRED_STORE_IF1:.*]], label %[[PRED_STORE_CONTINUE2:.*]]
 ; CHECK:       [[PRED_STORE_IF1]]:
-; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[INDEX]], 1
+; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[IV]], 1
 ; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP12]]
 ; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 1
 ; CHECK-NEXT:    store i8 [[TMP14]], ptr [[TMP13]], align 1
@@ -241,7 +241,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <16 x i1> [[TMP7]], i32 2
 ; CHECK-NEXT:    br i1 [[TMP15]], label %[[PRED_STORE_IF3:.*]], label %[[PRED_STORE_CONTINUE4:.*]]
 ; CHECK:       [[PRED_STORE_IF3]]:
-; CHECK-NEXT:    [[TMP16:%.*]] = add i32 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP16:%.*]] = add i32 [[IV]], 2
 ; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP16]]
 ; CHECK-NEXT:    [[TMP18:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 2
 ; CHECK-NEXT:    store i8 [[TMP18]], ptr [[TMP17]], align 1
@@ -250,7 +250,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <16 x i1> [[TMP7]], i32 3
 ; CHECK-NEXT:    br i1 [[TMP19]], label %[[PRED_STORE_IF5:.*]], label %[[PRED_STORE_CONTINUE6:.*]]
 ; CHECK:       [[PRED_STORE_IF5]]:
-; CHECK-NEXT:    [[TMP20:%.*]] = add i32 [[INDEX]], 3
+; CHECK-NEXT:    [[TMP20:%.*]] = add i32 [[IV]], 3
 ; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP20]]
 ; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 3
 ; CHECK-NEXT:    store i8 [[TMP22]], ptr [[TMP21]], align 1
@@ -259,7 +259,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP23:%.*]] = extractelement <16 x i1> [[TMP7]], i32 4
 ; CHECK-NEXT:    br i1 [[TMP23]], label %[[PRED_STORE_IF7:.*]], label %[[PRED_STORE_CONTINUE8:.*]]
 ; CHECK:       [[PRED_STORE_IF7]]:
-; CHECK-NEXT:    [[TMP24:%.*]] = add i32 [[INDEX]], 4
+; CHECK-NEXT:    [[TMP24:%.*]] = add i32 [[IV]], 4
 ; CHECK-NEXT:    [[TMP25:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP24]]
 ; CHECK-NEXT:    [[TMP26:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 4
 ; CHECK-NEXT:    store i8 [[TMP26]], ptr [[TMP25]], align 1
@@ -268,7 +268,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP27:%.*]] = extractelement <16 x i1> [[TMP7]], i32 5
 ; CHECK-NEXT:    br i1 [[TMP27]], label %[[PRED_STORE_IF9:.*]], label %[[PRED_STORE_CONTINUE10:.*]]
 ; CHECK:       [[PRED_STORE_IF9]]:
-; CHECK-NEXT:    [[TMP28:%.*]] = add i32 [[INDEX]], 5
+; CHECK-NEXT:    [[TMP28:%.*]] = add i32 [[IV]], 5
 ; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP28]]
 ; CHECK-NEXT:    [[TMP30:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 5
 ; CHECK-NEXT:    store i8 [[TMP30]], ptr [[TMP29]], align 1
@@ -277,7 +277,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP31:%.*]] = extractelement <16 x i1> [[TMP7]], i32 6
 ; CHECK-NEXT:    br i1 [[TMP31]], label %[[PRED_STORE_IF11:.*]], label %[[PRED_STORE_CONTINUE12:.*]]
 ; CHECK:       [[PRED_STORE_IF11]]:
-; CHECK-NEXT:    [[TMP32:%.*]] = add i32 [[INDEX]], 6
+; CHECK-NEXT:    [[TMP32:%.*]] = add i32 [[IV]], 6
 ; CHECK-NEXT:    [[TMP33:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP32]]
 ; CHECK-NEXT:    [[TMP34:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 6
 ; CHECK-NEXT:    store i8 [[TMP34]], ptr [[TMP33]], align 1
@@ -286,7 +286,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP35:%.*]] = extractelement <16 x i1> [[TMP7]], i32 7
 ; CHECK-NEXT:    br i1 [[TMP35]], label %[[PRED_STORE_IF13:.*]], label %[[PRED_STORE_CONTINUE14:.*]]
 ; CHECK:       [[PRED_STORE_IF13]]:
-; CHECK-NEXT:    [[TMP36:%.*]] = add i32 [[INDEX]], 7
+; CHECK-NEXT:    [[TMP36:%.*]] = add i32 [[IV]], 7
 ; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP36]]
 ; CHECK-NEXT:    [[TMP38:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 7
 ; CHECK-NEXT:    store i8 [[TMP38]], ptr [[TMP37]], align 1
@@ -295,7 +295,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP39:%.*]] = extractelement <16 x i1> [[TMP7]], i32 8
 ; CHECK-NEXT:    br i1 [[TMP39]], label %[[PRED_STORE_IF15:.*]], label %[[PRED_STORE_CONTINUE16:.*]]
 ; CHECK:       [[PRED_STORE_IF15]]:
-; CHECK-NEXT:    [[TMP40:%.*]] = add i32 [[INDEX]], 8
+; CHECK-NEXT:    [[TMP40:%.*]] = add i32 [[IV]], 8
 ; CHECK-NEXT:    [[TMP41:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP40]]
 ; CHECK-NEXT:    [[TMP42:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 8
 ; CHECK-NEXT:    store i8 [[TMP42]], ptr [[TMP41]], align 1
@@ -304,7 +304,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP43:%.*]] = extractelement <16 x i1> [[TMP7]], i32 9
 ; CHECK-NEXT:    br i1 [[TMP43]], label %[[PRED_STORE_IF17:.*]], label %[[PRED_STORE_CONTINUE18:.*]]
 ; CHECK:       [[PRED_STORE_IF17]]:
-; CHECK-NEXT:    [[TMP44:%.*]] = add i32 [[INDEX]], 9
+; CHECK-NEXT:    [[TMP44:%.*]] = add i32 [[IV]], 9
 ; CHECK-NEXT:    [[TMP45:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP44]]
 ; CHECK-NEXT:    [[TMP46:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 9
 ; CHECK-NEXT:    store i8 [[TMP46]], ptr [[TMP45]], align 1
@@ -313,7 +313,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP47:%.*]] = extractelement <16 x i1> [[TMP7]], i32 10
 ; CHECK-NEXT:    br i1 [[TMP47]], label %[[PRED_STORE_IF19:.*]], label %[[PRED_STORE_CONTINUE20:.*]]
 ; CHECK:       [[PRED_STORE_IF19]]:
-; CHECK-NEXT:    [[TMP48:%.*]] = add i32 [[INDEX]], 10
+; CHECK-NEXT:    [[TMP48:%.*]] = add i32 [[IV]], 10
 ; CHECK-NEXT:    [[TMP49:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP48]]
 ; CHECK-NEXT:    [[TMP50:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 10
 ; CHECK-NEXT:    store i8 [[TMP50]], ptr [[TMP49]], align 1
@@ -322,7 +322,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP51:%.*]] = extractelement <16 x i1> [[TMP7]], i32 11
 ; CHECK-NEXT:    br i1 [[TMP51]], label %[[PRED_STORE_IF21:.*]], label %[[PRED_STORE_CONTINUE22:.*]]
 ; CHECK:       [[PRED_STORE_IF21]]:
-; CHECK-NEXT:    [[TMP52:%.*]] = add i32 [[INDEX]], 11
+; CHECK-NEXT:    [[TMP52:%.*]] = add i32 [[IV]], 11
 ; CHECK-NEXT:    [[TMP53:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP52]]
 ; CHECK-NEXT:    [[TMP54:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 11
 ; CHECK-NEXT:    store i8 [[TMP54]], ptr [[TMP53]], align 1
@@ -331,7 +331,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP55:%.*]] = extractelement <16 x i1> [[TMP7]], i32 12
 ; CHECK-NEXT:    br i1 [[TMP55]], label %[[PRED_STORE_IF23:.*]], label %[[PRED_STORE_CONTINUE24:.*]]
 ; CHECK:       [[PRED_STORE_IF23]]:
-; CHECK-NEXT:    [[TMP56:%.*]] = add i32 [[INDEX]], 12
+; CHECK-NEXT:    [[TMP56:%.*]] = add i32 [[IV]], 12
 ; CHECK-NEXT:    [[TMP57:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP56]]
 ; CHECK-NEXT:    [[TMP58:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 12
 ; CHECK-NEXT:    store i8 [[TMP58]], ptr [[TMP57]], align 1
@@ -340,7 +340,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP59:%.*]] = extractelement <16 x i1> [[TMP7]], i32 13
 ; CHECK-NEXT:    br i1 [[TMP59]], label %[[PRED_STORE_IF25:.*]], label %[[PRED_STORE_CONTINUE26:.*]]
 ; CHECK:       [[PRED_STORE_IF25]]:
-; CHECK-NEXT:    [[TMP60:%.*]] = add i32 [[INDEX]], 13
+; CHECK-NEXT:    [[TMP60:%.*]] = add i32 [[IV]], 13
 ; CHECK-NEXT:    [[TMP61:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP60]]
 ; CHECK-NEXT:    [[TMP62:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 13
 ; CHECK-NEXT:    store i8 [[TMP62]], ptr [[TMP61]], align 1
@@ -349,7 +349,7 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP63:%.*]] = extractelement <16 x i1> [[TMP7]], i32 14
 ; CHECK-NEXT:    br i1 [[TMP63]], label %[[PRED_STORE_IF27:.*]], label %[[PRED_STORE_CONTINUE28:.*]]
 ; CHECK:       [[PRED_STORE_IF27]]:
-; CHECK-NEXT:    [[TMP64:%.*]] = add i32 [[INDEX]], 14
+; CHECK-NEXT:    [[TMP64:%.*]] = add i32 [[IV]], 14
 ; CHECK-NEXT:    [[TMP65:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP64]]
 ; CHECK-NEXT:    [[TMP66:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 14
 ; CHECK-NEXT:    store i8 [[TMP66]], ptr [[TMP65]], align 1
@@ -358,13 +358,13 @@ define void @test_blend_feeding_replicated_store_2(ptr noalias %src, ptr %dst, i
 ; CHECK-NEXT:    [[TMP67:%.*]] = extractelement <16 x i1> [[TMP7]], i32 15
 ; CHECK-NEXT:    br i1 [[TMP67]], label %[[PRED_STORE_IF29:.*]], label %[[PRED_STORE_CONTINUE30]]
 ; CHECK:       [[PRED_STORE_IF29]]:
-; CHECK-NEXT:    [[TMP68:%.*]] = add i32 [[INDEX]], 15
+; CHECK-NEXT:    [[TMP68:%.*]] = add i32 [[IV]], 15
 ; CHECK-NEXT:    [[TMP69:%.*]] = getelementptr inbounds i8, ptr [[DST]], i32 [[TMP68]]
 ; CHECK-NEXT:    [[TMP70:%.*]] = extractelement <16 x i8> [[PREDPHI]], i32 15
 ; CHECK-NEXT:    store i8 [[TMP70]], ptr [[TMP69]], align 1
 ; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE30]]
 ; CHECK:       [[PRED_STORE_CONTINUE30]]:
-; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 16
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[IV]], 16
 ; CHECK-NEXT:    [[TMP71:%.*]] = icmp eq i32 [[INDEX_NEXT]], 96
 ; CHECK-NEXT:    br i1 [[TMP71]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll b/llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll
index d8713bdda689a..827612cfe36d5 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll
@@ -1045,12 +1045,12 @@ define void @uniform_store_of_loop_varying(ptr noalias nocapture %a, ptr noalias
 ; TF-FIXEDLEN-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; TF-FIXEDLEN:       [[VECTOR_BODY]]:
 ; TF-FIXEDLEN-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE6:.*]] ]
-; TF-FIXEDLEN-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 0
-; TF-FIXEDLEN-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[TMP0]], i64 1025)
-; TF-FIXEDLEN-NEXT:    [[TMP1:%.*]] = extractelement <4 x i1> [[ACTIVE_LANE_MASK]], i32 0
-; TF-FIXEDLEN-NEXT:    br i1 [[TMP1]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
+; TF-FIXEDLEN-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[INDEX]], i64 1025)
+; TF-FIXEDLEN-NEXT:    [[TMP0:%.*]] = extractelement <4 x i1> [[ACTIVE_...
[truncated]

fhahn · 2025-04-16T20:36:15Z

For reference, here's what the diff would look like without #136021: #136024

github-actions · 2025-04-16T20:36:18Z

✅ With the latest revision this PR passed the C/C++ code formatter.

fhahn

This should be ready now that #136021 landed.

ayalz

Very nice milestone! Adding some minor comments.

There are a few cases that aren't handled by VPlan's sinkScalarOperands,
because the recipes don't support replicating. Those are pointer
inductions and blends.

Can some note be left behind documenting this opportunity?

ayalz · 2025-04-21T21:11:20Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -2384,15 +2377,12 @@ void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,



Potential follow-up: could scalarizeInstruction() now move from ILV to VPReplicateRecipe::execute(), with some handling of AC/AssumeInst's?

Yep, that was the main motivation for removing it :)

ayalz · 2025-04-21T21:12:18Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -2384,15 +2377,12 @@ void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,

  // End if-block.
  VPRegionBlock *Parent = RepRecipe->getParent()->getParent();


Parent is now used only by assert.

Move into assert
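
As an illustration of the pattern being requested here: a local that is read only by an `assert` becomes unused once asserts are compiled out with `NDEBUG`, so the lookup can be written inside the assert instead. The sketch below uses hypothetical stand-in types (`Region`, `Block`, `finishScalarize`), not LLVM's classes.

```cpp
#include <cassert>

// Hypothetical stand-ins for a recipe's parent chain; not LLVM's classes.
struct Region {
  bool IsReplicator;
};
struct Block {
  const Region *Parent; // Null when the block is not nested in a region.
};

// Before (sketch): the local below is only read by the assert, so builds
// with NDEBUG defined are left with an unused variable.
//   const Region *Parent = BB.Parent;
//   assert((!Parent || Parent->IsReplicator) && "...");
//
// After: the lookup is written inside the assert and disappears with it.
inline void finishScalarize(const Block &BB, int &Emitted) {
  assert((BB.Parent == nullptr || BB.Parent->IsReplicator) &&
         "expected no region or a replicate region");
  ++Emitted;
}
```

The expression moved into the assert has to be free of side effects, since it is not evaluated at all when `NDEBUG` is defined.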

ayalz · 2025-04-22T07:01:07Z

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll

+; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[TMP4]]
+; CHECK-NEXT:    [[NEXT_GEP2:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[TMP5]]
+; CHECK-NEXT:    [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[TMP6]]


Another pointer induction case. GEPs hopefully end up joining the stores that use them, for optimized accessing modes, and are presumably costed as such.

Cost of pointer inductions is assumed free at the moment IIRC

ayalz · 2025-04-22T07:02:55Z

llvm/test/Transforms/LoopVectorize/pointer-induction.ll

+; CHECK-NEXT:    [[TMP11:%.*]] = add i64 [[OFFSET_IDX]], -1
+; CHECK-NEXT:    [[TMP14:%.*]] = add i64 [[OFFSET_IDX]], -2
+; CHECK-NEXT:    [[TMP17:%.*]] = add i64 [[OFFSET_IDX]], -3
 ; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr null, i64 [[TMP2]]
+; CHECK-NEXT:    [[NEXT_GEP2:%.*]] = getelementptr i8, ptr null, i64 [[TMP11]]
+; CHECK-NEXT:    [[NEXT_GEP3:%.*]] = getelementptr i8, ptr null, i64 [[TMP14]]
+; CHECK-NEXT:    [[NEXT_GEP4:%.*]] = getelementptr i8, ptr null, i64 [[TMP17]]


Pointer induction case.

ayalz · 2025-04-22T07:09:55Z

llvm/test/Transforms/LoopVectorize/X86/small-size.ll

+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr i8, ptr [[SRC:%.*]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP2:%.*]] = getelementptr i8, ptr [[TMP7]], i64 2
+; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[TMP8]], i64 4
+; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP4:%.*]] = getelementptr i8, ptr [[TMP13]], i64 6
 ; CHECK-NEXT:    [[OFFSET_IDX5:%.*]] = shl i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[DST:%.*]], i64 [[OFFSET_IDX5]]
+; CHECK-NEXT:    [[NEXT_GEP7:%.*]] = getelementptr i8, ptr [[TMP14]], i64 4
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX5]]
+; CHECK-NEXT:    [[NEXT_GEP8:%.*]] = getelementptr i8, ptr [[TMP19]], i64 8
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX5]]
+; CHECK-NEXT:    [[NEXT_GEP9:%.*]] = getelementptr i8, ptr [[TMP20]], i64 12


Sinking GEPs can help localize their access patterns for the potential benefit of subsequent passes, but are they possibly sunk later?

If beneficial, e.g. because the GEP can be folded into address mode, this is done by CodeGenPrepare. Also, subsequent runs of instcombine should perform sinking as well.

ayalz · 2025-04-22T07:11:36Z

llvm/test/Transforms/LoopVectorize/X86/pr51366-sunk-instruction-used-outside-of-loop.ll

+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[SRC:%.*]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i64 [[TMP1]]


Here and below are cases of GEP sinkings.

ayalz · 2025-04-22T07:20:46Z

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll

+; TFA_INTERLEAVE-NEXT:    [[TMP18:%.*]] = xor i1 [[TMP6]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP20:%.*]] = xor i1 [[TMP9]], true


Independent: these NOTs can be eliminated by flipping the earlier fcmp's or later selects.

Yep, unfortunately the change to do so has surfaced a legacy/VPlan-cost model divergence I still need to investigate.
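
A small sketch of the two rewrites mentioned above, written as plain C++ rather than IR. It assumes the compare/invert/select shape from the CHECK lines (`fcmp ogt`, `xor ..., true`, `select`); the function names are hypothetical. When flipping the compare, the inverse of an ordered greater-than is unordered-or-less-equal, so NaN has to land on the "true" side.

```cpp
#include <cassert>
#include <cmath>

// Shape seen in the CHECK lines: compare, invert, select.
inline double withNot(double X) {
  bool Cmp = X > 0.0; // fcmp ogt
  bool NotCmp = !Cmp; // xor i1 %cmp, true
  return NotCmp ? 1.0 : 0.0;
}

// Option 1: keep the compare and swap the select arms instead.
inline double swappedSelect(double X) { return (X > 0.0) ? 0.0 : 1.0; }

// Option 2: flip the compare. The inverse of "ordered and greater than" is
// "unordered or less-or-equal", so NaN must evaluate to true here.
inline double flippedCompare(double X) {
  bool Ule = std::isnan(X) || X <= 0.0;
  return Ule ? 1.0 : 0.0;
}

// Returns true if all three forms produce the same result for X.
inline bool formsAgree(double X) {
  return withNot(X) == swappedSelect(X) && withNot(X) == flippedCompare(X);
}
```

Using a plain `X <= 0.0` for the flipped compare would differ from the original for NaN inputs, which is why the unordered case is spelled out.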

ayalz · 2025-04-22T07:46:25Z

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll

+; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[TMP27:%.*]], %[[TMP19:.*]] ]
+; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi i1 [ [[ACTIVE_LANE_MASK_ENTRY]], %[[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[TMP19]] ]
+; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK2:%.*]] = phi i1 [ [[ACTIVE_LANE_MASK_ENTRY1]], %[[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT6:%.*]], %[[TMP19]] ]


Redundant change?

Kept original names here, thanks

ayalz · 2025-04-22T12:47:35Z

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll

 ; TFA_INTERLEAVE-NEXT:    [[TMP5:%.*]] = tail call double @llvm.exp.f64(double [[TMP4]]) #[[ATTR7:[0-9]+]]
-; TFA_INTERLEAVE-NEXT:    [[TMP6:%.*]] = fcmp ogt double [[TMP5]], 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    [[TMP7:%.*]] = xor i1 [[TMP6]], true
-; TFA_INTERLEAVE-NEXT:    [[TMP24:%.*]] = select i1 [[TMP7]], double 1.000000e+00, double 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    store double [[TMP24]], ptr [[P]], align 8
-; TFA_INTERLEAVE-NEXT:    br label %[[PRED_STORE_CONTINUE]]
-; TFA_INTERLEAVE:       [[PRED_STORE_CONTINUE]]:
-; TFA_INTERLEAVE-NEXT:    br i1 [[ACTIVE_LANE_MASK2]], label %[[PRED_STORE_IF4:.*]], label %[[PRED_STORE_CONTINUE5]]
-; TFA_INTERLEAVE:       [[PRED_STORE_IF4]]:
 ; TFA_INTERLEAVE-NEXT:    [[TMP8:%.*]] = tail call double @llvm.exp.f64(double [[TMP4]]) #[[ATTR7]]
+; TFA_INTERLEAVE-NEXT:    [[TMP6:%.*]] = fcmp ogt double [[TMP5]], 0.000000e+00
 ; TFA_INTERLEAVE-NEXT:    [[TMP9:%.*]] = fcmp ogt double [[TMP8]], 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    [[TMP10:%.*]] = xor i1 [[TMP9]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP18:%.*]] = xor i1 [[TMP6]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP20:%.*]] = xor i1 [[TMP9]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP10:%.*]] = select i1 [[ACTIVE_LANE_MASK]], i1 [[TMP18]], i1 false
+; TFA_INTERLEAVE-NEXT:    [[TMP21:%.*]] = select i1 [[ACTIVE_LANE_MASK2]], i1 [[TMP20]], i1 false
 ; TFA_INTERLEAVE-NEXT:    [[TMP26:%.*]] = select i1 [[TMP10]], double 1.000000e+00, double 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    store double [[TMP26]], ptr [[P]], align 8
-; TFA_INTERLEAVE-NEXT:    br label %[[PRED_STORE_CONTINUE5]]
-; TFA_INTERLEAVE:       [[PRED_STORE_CONTINUE5]]:
+; TFA_INTERLEAVE-NEXT:    [[PREDPHI3:%.*]] = select i1 [[TMP21]], double 1.000000e+00, double 0.000000e+00
+; TFA_INTERLEAVE-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[ACTIVE_LANE_MASK2]], double [[PREDPHI3]], double [[TMP26]]
+; TFA_INTERLEAVE-NEXT:    [[TMP13:%.*]] = xor i1 [[ACTIVE_LANE_MASK]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP14:%.*]] = xor i1 [[ACTIVE_LANE_MASK2]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP15:%.*]] = xor i1 [[TMP13]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP16:%.*]] = xor i1 [[TMP14]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP17:%.*]] = or i1 [[TMP15]], [[TMP16]]
+; TFA_INTERLEAVE-NEXT:    br i1 [[TMP17]], label %[[BB18:.*]], label %[[TMP19]]
+; TFA_INTERLEAVE:       [[BB18]]:
+; TFA_INTERLEAVE-NEXT:    store double [[SPEC_SELECT]], ptr [[P]], align 8
+; TFA_INTERLEAVE-NEXT:    br label %[[TMP19]]
+; TFA_INTERLEAVE:       [[TMP19]]:


Seems better to have a single store if (ACTIVE_LANE_MASK || ACTIVE_LANE_MASK2) than duplicate the store and have two identical cases one for each term separately. Better fold the double computation into one given that TMP5==TMP8? Or even better - this store and its computation appears to be fully invariant - best fold the loop into a single scalar iteration?

Agreed that there are a number of other simplifications that could be applied.
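
One such simplification can be sketched outside of LLVM IR. In the CHECK lines above, the store guard `TMP17` is built by negating each lane mask twice and then OR-ing the results, which should reduce to a plain `ACTIVE_LANE_MASK || ACTIVE_LANE_MASK2`. The following is a minimal C++ sketch of that boolean identity only; the function names are hypothetical and are not part of the patch or of LLVM.

```cpp
#include <cassert>

// Hypothetical helper mirroring the emitted i1 sequence TMP13..TMP17:
// each mask is XOR-ed with true twice, then the two results are OR-ed.
bool guardAsEmitted(bool mask1, bool mask2) {
  bool t13 = mask1 ^ true; // xor i1 ACTIVE_LANE_MASK, true
  bool t14 = mask2 ^ true; // xor i1 ACTIVE_LANE_MASK2, true
  bool t15 = t13 ^ true;   // second negation restores mask1
  bool t16 = t14 ^ true;   // second negation restores mask2
  return t15 | t16;        // or i1 TMP15, TMP16
}

// Hypothetical simplified form suggested in the review comment.
bool guardSimplified(bool mask1, bool mask2) { return mask1 || mask2; }

// Exhaustively compares both forms over all four mask combinations.
bool guardsAgree() {
  for (int m1 = 0; m1 < 2; ++m1)
    for (int m2 = 0; m2 < 2; ++m2)
      if (guardAsEmitted(m1 != 0, m2 != 0) != guardSimplified(m1 != 0, m2 != 0))
        return false;
  return true;
}
```

This only covers the branch condition; folding the duplicated `llvm.exp` calls or hoisting the invariant store, as also raised in the comment, would need separate reasoning.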

fhahn

Can some note be left behind documenting this opportunity?

Tried to add notes for the missed pointer induction cases

fhahn · 2025-04-22T20:42:34Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -2384,15 +2377,12 @@ void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,



Yep, that was the main motivation for removing it :)

fhahn · 2025-04-22T20:43:25Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -2384,15 +2377,12 @@ void InnerLoopVectorizer::scalarizeInstruction(const Instruction *Instr,

  // End if-block.
  VPRegionBlock *Parent = RepRecipe->getParent()->getParent();


Move into assert

fhahn · 2025-04-22T20:44:31Z

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll

+; TFA_INTERLEAVE-NEXT:    [[TMP18:%.*]] = xor i1 [[TMP6]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP20:%.*]] = xor i1 [[TMP9]], true


Yep, unfortunately the change to do so has surfaced a legacy/VPlan cost-model divergence that I still need to investigate.

fhahn · 2025-04-22T20:46:06Z

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll

+; TFA_INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[TMP27:%.*]], %[[TMP19:.*]] ]
+; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi i1 [ [[ACTIVE_LANE_MASK_ENTRY]], %[[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[TMP19]] ]
+; TFA_INTERLEAVE-NEXT:    [[ACTIVE_LANE_MASK2:%.*]] = phi i1 [ [[ACTIVE_LANE_MASK_ENTRY1]], %[[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT6:%.*]], %[[TMP19]] ]


Kept original names here, thanks

fhahn · 2025-04-22T20:51:54Z

llvm/test/Transforms/LoopVectorize/AArch64/masked-call.ll

 ; TFA_INTERLEAVE-NEXT:    [[TMP5:%.*]] = tail call double @llvm.exp.f64(double [[TMP4]]) #[[ATTR7:[0-9]+]]
-; TFA_INTERLEAVE-NEXT:    [[TMP6:%.*]] = fcmp ogt double [[TMP5]], 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    [[TMP7:%.*]] = xor i1 [[TMP6]], true
-; TFA_INTERLEAVE-NEXT:    [[TMP24:%.*]] = select i1 [[TMP7]], double 1.000000e+00, double 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    store double [[TMP24]], ptr [[P]], align 8
-; TFA_INTERLEAVE-NEXT:    br label %[[PRED_STORE_CONTINUE]]
-; TFA_INTERLEAVE:       [[PRED_STORE_CONTINUE]]:
-; TFA_INTERLEAVE-NEXT:    br i1 [[ACTIVE_LANE_MASK2]], label %[[PRED_STORE_IF4:.*]], label %[[PRED_STORE_CONTINUE5]]
-; TFA_INTERLEAVE:       [[PRED_STORE_IF4]]:
 ; TFA_INTERLEAVE-NEXT:    [[TMP8:%.*]] = tail call double @llvm.exp.f64(double [[TMP4]]) #[[ATTR7]]
+; TFA_INTERLEAVE-NEXT:    [[TMP6:%.*]] = fcmp ogt double [[TMP5]], 0.000000e+00
 ; TFA_INTERLEAVE-NEXT:    [[TMP9:%.*]] = fcmp ogt double [[TMP8]], 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    [[TMP10:%.*]] = xor i1 [[TMP9]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP18:%.*]] = xor i1 [[TMP6]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP20:%.*]] = xor i1 [[TMP9]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP10:%.*]] = select i1 [[ACTIVE_LANE_MASK]], i1 [[TMP18]], i1 false
+; TFA_INTERLEAVE-NEXT:    [[TMP21:%.*]] = select i1 [[ACTIVE_LANE_MASK2]], i1 [[TMP20]], i1 false
 ; TFA_INTERLEAVE-NEXT:    [[TMP26:%.*]] = select i1 [[TMP10]], double 1.000000e+00, double 0.000000e+00
-; TFA_INTERLEAVE-NEXT:    store double [[TMP26]], ptr [[P]], align 8
-; TFA_INTERLEAVE-NEXT:    br label %[[PRED_STORE_CONTINUE5]]
-; TFA_INTERLEAVE:       [[PRED_STORE_CONTINUE5]]:
+; TFA_INTERLEAVE-NEXT:    [[PREDPHI3:%.*]] = select i1 [[TMP21]], double 1.000000e+00, double 0.000000e+00
+; TFA_INTERLEAVE-NEXT:    [[SPEC_SELECT:%.*]] = select i1 [[ACTIVE_LANE_MASK2]], double [[PREDPHI3]], double [[TMP26]]
+; TFA_INTERLEAVE-NEXT:    [[TMP13:%.*]] = xor i1 [[ACTIVE_LANE_MASK]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP14:%.*]] = xor i1 [[ACTIVE_LANE_MASK2]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP15:%.*]] = xor i1 [[TMP13]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP16:%.*]] = xor i1 [[TMP14]], true
+; TFA_INTERLEAVE-NEXT:    [[TMP17:%.*]] = or i1 [[TMP15]], [[TMP16]]
+; TFA_INTERLEAVE-NEXT:    br i1 [[TMP17]], label %[[BB18:.*]], label %[[TMP19]]
+; TFA_INTERLEAVE:       [[BB18]]:
+; TFA_INTERLEAVE-NEXT:    store double [[SPEC_SELECT]], ptr [[P]], align 8
+; TFA_INTERLEAVE-NEXT:    br label %[[TMP19]]
+; TFA_INTERLEAVE:       [[TMP19]]:


Agreed that there are a number of other simplifications that could be applied.

fhahn · 2025-04-22T20:53:08Z

llvm/test/Transforms/LoopVectorize/X86/small-size.ll

+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr i8, ptr [[SRC:%.*]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP2:%.*]] = getelementptr i8, ptr [[TMP7]], i64 2
+; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[TMP8]], i64 4
+; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP4:%.*]] = getelementptr i8, ptr [[TMP13]], i64 6
 ; CHECK-NEXT:    [[OFFSET_IDX5:%.*]] = shl i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[DST:%.*]], i64 [[OFFSET_IDX5]]
+; CHECK-NEXT:    [[NEXT_GEP7:%.*]] = getelementptr i8, ptr [[TMP14]], i64 4
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX5]]
+; CHECK-NEXT:    [[NEXT_GEP8:%.*]] = getelementptr i8, ptr [[TMP19]], i64 8
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX5]]
+; CHECK-NEXT:    [[NEXT_GEP9:%.*]] = getelementptr i8, ptr [[TMP20]], i64 12


If beneficial, e.g. because the GEP can be folded into address mode, this is done by CodeGenPrepare. Also, subsequent runs of instcombine should perform sinking as well.

fhahn · 2025-04-22T20:56:00Z

llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll

+; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[TMP4]]
+; CHECK-NEXT:    [[NEXT_GEP2:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[TMP5]]
+; CHECK-NEXT:    [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[PTR1]], i64 [[TMP6]]


Cost of pointer inductions is assumed free at the moment IIRC

ayalz

Thanks, ship it!

Remove legacy ILV sinkScalarOperands, which is superseded by the sinkScalarOperands VPlan transforms. There are a few cases that aren't handled by VPlan's sinkScalarOperands, because the recipes don't support replicating. Those are pointer inductions and blends. We could probably improve this further, by allowing replication for more recipes, but I don't think the extra complexity is warranted. Depends on llvm/llvm-project#136021. PR: llvm/llvm-project#136023

fhahn requested review from rengolin, ayalz and aniragil April 16, 2025 20:34

llvmbot added vectorizers llvm:transforms labels Apr 16, 2025

fhahn force-pushed the remove-ilvsink-scalar-operands2 branch from 89d6915 to 42b9e34 Compare April 16, 2025 20:40

fhahn force-pushed the remove-ilvsink-scalar-operands2 branch from 42b9e34 to c4a9bac Compare April 21, 2025 18:06

fhahn commented Apr 21, 2025

ayalz reviewed Apr 22, 2025

fhahn added 2 commits April 22, 2025 21:40

Merge remote-tracking branch 'origin/main' into remove-ilvsink-scalar-operands2 (96a0568)

!fixup address latest comments, thanks (bf0d6f0)

fhahn commented Apr 22, 2025

ayalz approved these changes Apr 24, 2025

fhahn merged commit 15bb1db into llvm:main Apr 24, 2025
11 checks passed

fhahn deleted the remove-ilvsink-scalar-operands2 branch April 24, 2025 07:37

		; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[SRC:%.*]], i64 [[TMP0]]
		; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i64 [[TMP1]]
