[SLP] Improve block traversal in getSpillCost() #128620

mikhailramalho · 2025-02-25T03:00:49Z

This is a WIP patch due to the compilation time regressions, up to 7% on gcc_r/gcc_s.

Previously, getSpillCost would skip in between blocks when traversing instructions backward. If one of the missing blocks has a function call, the existing logic would lead to incorrect spill cost calculations.

The new implementation:

Uses post_order traversal to visit blocks
Tracks live entries across basic blocks
Computes reachable blocks once upfront using depth_first_ext

Performance improvements:

Reduces execution time of SPEC CPU benchmark 544.nab_r by 9.92%
Reduces code size of 508.namd by 1.73%

This optimization improves vectorization decisions by making spill cost estimation more accurate, particularly for code with complex control flow.

This is a WIP patch due to the compilation time regressions, up to 7% on gcc_r/gcc_s. Previously, getSpillCost would skip in between blocks when traversing instructions backward. If one of the missing blocks has a function call, the existing logic would lead to incorrect spill cost calculations. The new implementation: - Uses post_order traversal to visit blocks - Tracks live entries across basic blocks - Computes reachable blocks once upfront using depth_first_ext - Maintains correct cost calculation for diamond-shaped control flow Performance improvements: - Reduces execution time of SPEC CPU benchmark 544.nab_r by 9.92% - Reduces code size of 508.namd by 1.73% This optimization improves vectorization decisions by making spill cost estimation more accurate, particularly for code with complex control flow.

llvmbot · 2025-02-25T03:01:29Z

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Mikhail R. Gadelha (mikhailramalho)

Changes

This is a WIP patch due to the compilation time regressions, up to 7% on gcc_r/gcc_s.

Previously, getSpillCost would skip in between blocks when traversing instructions backward. If one of the missing blocks has a function call, the existing logic would lead to incorrect spill cost calculations.

The new implementation:

Uses post_order traversal to visit blocks
Tracks live entries across basic blocks
Computes reachable blocks once upfront using depth_first_ext

Performance improvements:

Reduces execution time of SPEC CPU benchmark 544.nab_r by 9.92%
Reduces code size of 508.namd by 1.73%

This optimization improves vectorization decisions by making spill cost estimation more accurate, particularly for code with complex control flow.

Full diff: https://github.com/llvm/llvm-project/pull/128620.diff

3 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+52-74)
(modified) llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll (+12-4)
(modified) llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll (+16-6)

diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 3d660b63309d4..2d23307c18839 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -12415,68 +12415,48 @@ InstructionCost BoUpSLP::getSpillCost() {
   InstructionCost Cost = 0;
 
   SmallPtrSet<const TreeEntry *, 4> LiveEntries;
-  const TreeEntry *Prev = nullptr;
-
-  // The entries in VectorizableTree are not necessarily ordered by their
-  // position in basic blocks. Collect them and order them by dominance so later
-  // instructions are guaranteed to be visited first. For instructions in
-  // different basic blocks, we only scan to the beginning of the block, so
-  // their order does not matter, as long as all instructions in a basic block
-  // are grouped together. Using dominance ensures a deterministic order.
-  SmallVector<TreeEntry *, 16> OrderedEntries;
-  for (const auto &TEPtr : VectorizableTree) {
-    if (TEPtr->isGather())
-      continue;
-    OrderedEntries.push_back(TEPtr.get());
-  }
-  llvm::stable_sort(OrderedEntries, [&](const TreeEntry *TA,
-                                        const TreeEntry *TB) {
-    Instruction &A = getLastInstructionInBundle(TA);
-    Instruction &B = getLastInstructionInBundle(TB);
-    auto *NodeA = DT->getNode(A.getParent());
-    auto *NodeB = DT->getNode(B.getParent());
-    assert(NodeA && "Should only process reachable instructions");
-    assert(NodeB && "Should only process reachable instructions");
-    assert((NodeA == NodeB) == (NodeA->getDFSNumIn() == NodeB->getDFSNumIn()) &&
-           "Different nodes should have different DFS numbers");
-    if (NodeA != NodeB)
-      return NodeA->getDFSNumIn() > NodeB->getDFSNumIn();
-    return B.comesBefore(&A);
-  });
 
-  for (const TreeEntry *TE : OrderedEntries) {
-    if (!Prev) {
-      Prev = TE;
+  const TreeEntry *Root = VectorizableTree.front().get();
+  BasicBlock *RootBB = cast<Instruction>(Root->Scalars[0])->getParent();
+
+  // Compute what nodes are reachable from the leaves to the roots
+  df_iterator_default_set<const BasicBlock *> ReachableFromLeaves;
+  for (auto &TE : VectorizableTree) {
+    if (TE->isGather())
       continue;
-    }
+    auto *BB = getLastInstructionInBundle(TE.get()).getParent();
+    for (const BasicBlock *X : depth_first_ext(BB, ReachableFromLeaves))
+      ReachableFromLeaves.insert(X);
+  }
 
-    LiveEntries.erase(Prev);
-    for (unsigned I : seq<unsigned>(Prev->getNumOperands())) {
-      const TreeEntry *Op = getVectorizedOperand(Prev, I);
-      if (!Op)
-        continue;
-      assert(!Op->isGather() && "Expected vectorized operand.");
-      LiveEntries.insert(Op);
-    }
+  DenseSet<const BasicBlock *> Reachable;
+  for (const BasicBlock *X : inverse_depth_first(RootBB))
+    Reachable.insert(X);
+  set_intersect(Reachable, ReachableFromLeaves);
 
-    LLVM_DEBUG({
-      dbgs() << "SLP: #LV: " << LiveEntries.size();
-      for (auto *X : LiveEntries)
-        X->dump();
-      dbgs() << ", Looking at ";
-      TE->dump();
-    });
+  DenseSet<const TreeEntry *> Defined;
 
-    // Now find the sequence of instructions between PrevInst and Inst.
-    unsigned NumCalls = 0;
-    const Instruction *PrevInst = &getLastInstructionInBundle(Prev);
-    BasicBlock::const_reverse_iterator
-        InstIt = ++getLastInstructionInBundle(TE).getIterator().getReverse(),
-        PrevInstIt = PrevInst->getIterator().getReverse();
-    while (InstIt != PrevInstIt) {
-      if (PrevInstIt == PrevInst->getParent()->rend()) {
-        PrevInstIt = getLastInstructionInBundle(TE).getParent()->rbegin();
-        continue;
+  // Iterate the tree from the root, post order so that all uses appear before
+  // definitions.
+  // TODO: LiveEntries are shared across all paths, so this may overestimate.
+  for (BasicBlock *BB : post_order(RootBB->getParent())) {
+    if (!Reachable.contains(BB))
+      continue;
+
+    for (Instruction &I : reverse(*BB)) {
+      for (const auto *TE : getTreeEntries(&I)) {
+        if (TE->isGather())
+          continue;
+        LiveEntries.erase(TE);
+        Defined.insert(TE);
+        for (unsigned Idx : seq<unsigned>(TE->getNumOperands())) {
+          const TreeEntry *Op = getVectorizedOperand(TE, Idx);
+          if (!Op)
+            continue;
+          assert(!Op->isGather() && "Expected vectorized operand.");
+          if (!Defined.contains(Op))
+            LiveEntries.insert(Op);
+        }
       }
 
       auto NoCallIntrinsic = [this](const Instruction *I) {
@@ -12497,26 +12477,24 @@ InstructionCost BoUpSLP::getSpillCost() {
       // Debug information does not impact spill cost.
       // Vectorized calls, represented as vector intrinsics, do not impact spill
       // cost.
-      if (const auto *CB = dyn_cast<CallBase>(&*PrevInstIt);
-          CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
-        NumCalls++;
-
-      ++PrevInstIt;
-    }
+      if (const auto *CB = dyn_cast<CallBase>(&I);
+          CB && !NoCallIntrinsic(CB) && !isVectorized(CB)) {
+        SmallVector<Type *, 4> EntriesTypes;
+        for (const TreeEntry *TE : LiveEntries) {
+          auto *ScalarTy = TE->getMainOp()->getType();
+          auto It = MinBWs.find(TE);
+          if (It != MinBWs.end())
+            ScalarTy =
+                IntegerType::get(ScalarTy->getContext(), It->second.first);
+          EntriesTypes.push_back(
+              getWidenedType(ScalarTy, TE->getVectorFactor()));
+        }
 
-    if (NumCalls) {
-      SmallVector<Type *, 4> EntriesTypes;
-      for (const TreeEntry *TE : LiveEntries) {
-        auto *ScalarTy = TE->getMainOp()->getType();
-        auto It = MinBWs.find(TE);
-        if (It != MinBWs.end())
-          ScalarTy = IntegerType::get(ScalarTy->getContext(), It->second.first);
-        EntriesTypes.push_back(getWidenedType(ScalarTy, TE->getVectorFactor()));
+        LLVM_DEBUG(dbgs() << "SLP: " << LiveEntries.size()
+                          << " entries alive over call:" << I << "\n");
+        Cost += TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
       }
-      Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
     }
-
-    Prev = TE;
   }
 
   return Cost;
diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
index 8cb620f870331..fc71643f6a51d 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
@@ -1740,7 +1740,9 @@ entry:
 define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-LABEL: define void @f
 ; CHECK-SAME: (i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; CHECK-NEXT:    br i1 [[C]], label [[FOO:%.*]], label [[BAR:%.*]]
 ; CHECK:       foo:
 ; CHECK-NEXT:    [[Y0:%.*]] = load float, ptr [[R]], align 4
@@ -1751,12 +1753,16 @@ define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-NEXT:    [[Z1:%.*]] = call float @fabsf(float [[Z0]])
 ; CHECK-NEXT:    br label [[BAZ]]
 ; CHECK:       baz:
-; CHECK-NEXT:    store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; CHECK-NEXT:    store i64 [[X0]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT:    store i64 [[X1]], ptr [[Q1]], align 8
 ; CHECK-NEXT:    ret void
 ;
 ; DEFAULT-LABEL: define void @f
 ; DEFAULT-SAME: (i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR1]] {
-; DEFAULT-NEXT:    [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; DEFAULT-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; DEFAULT-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; DEFAULT-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; DEFAULT-NEXT:    br i1 [[C]], label [[FOO:%.*]], label [[BAR:%.*]]
 ; DEFAULT:       foo:
 ; DEFAULT-NEXT:    [[Y0:%.*]] = load float, ptr [[R]], align 4
@@ -1767,7 +1773,9 @@ define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; DEFAULT-NEXT:    [[Z1:%.*]] = call float @fabsf(float [[Z0]])
 ; DEFAULT-NEXT:    br label [[BAZ]]
 ; DEFAULT:       baz:
-; DEFAULT-NEXT:    store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; DEFAULT-NEXT:    store i64 [[X0]], ptr [[Q]], align 8
+; DEFAULT-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; DEFAULT-NEXT:    store i64 [[X1]], ptr [[Q1]], align 8
 ; DEFAULT-NEXT:    ret void
 ;
   %x0 = load i64, ptr %p
diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
index b0c25bc4cc1f2..5e91616c2beb9 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
@@ -7,7 +7,9 @@ declare void @g()
 define void @f0(i1 %c, ptr %p, ptr %q) {
 ; CHECK-LABEL: define void @f0(
 ; CHECK-SAME: i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]]) #[[ATTR0:[0-9]+]] {
-; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; CHECK-NEXT:    br i1 [[C]], label %[[FOO:.*]], label %[[BAR:.*]]
 ; CHECK:       [[FOO]]:
 ; CHECK-NEXT:    call void @g()
@@ -20,7 +22,9 @@ define void @f0(i1 %c, ptr %p, ptr %q) {
 ; CHECK-NEXT:    call void @g()
 ; CHECK-NEXT:    br label %[[BAZ]]
 ; CHECK:       [[BAZ]]:
-; CHECK-NEXT:    store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; CHECK-NEXT:    store i64 [[X0]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT:    store i64 [[X1]], ptr [[Q1]], align 8
 ; CHECK-NEXT:    ret void
 ;
   %x0 = load i64, ptr %p
@@ -50,10 +54,13 @@ define void @f1(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-LABEL: define void @f1(
 ; CHECK-SAME: i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:  [[ENTRY:.*:]]
-; CHECK-NEXT:    [[TMP0:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; CHECK-NEXT:    br i1 [[C]], label %[[FOO:.*]], label %[[BAR:.*]]
 ; CHECK:       [[FOO]]:
-; CHECK-NEXT:    [[TMP1:%.*]] = add <2 x i64> [[TMP0]], splat (i64 1)
+; CHECK-NEXT:    [[Y0:%.*]] = add i64 [[X0]], 1
+; CHECK-NEXT:    [[Y1:%.*]] = add i64 [[X1]], 1
 ; CHECK-NEXT:    br label %[[BAZ:.*]]
 ; CHECK:       [[BAR]]:
 ; CHECK-NEXT:    call void @g()
@@ -61,8 +68,11 @@ define void @f1(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-NEXT:    call void @g()
 ; CHECK-NEXT:    br label %[[BAZ]]
 ; CHECK:       [[BAZ]]:
-; CHECK-NEXT:    [[TMP2:%.*]] = phi <2 x i64> [ [[TMP1]], %[[FOO]] ], [ [[TMP0]], %[[BAR]] ]
-; CHECK-NEXT:    store <2 x i64> [[TMP2]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[PHI0:%.*]] = phi i64 [ [[Y0]], %[[FOO]] ], [ [[X0]], %[[BAR]] ]
+; CHECK-NEXT:    [[PHI1:%.*]] = phi i64 [ [[Y1]], %[[FOO]] ], [ [[X1]], %[[BAR]] ]
+; CHECK-NEXT:    store i64 [[PHI0]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT:    store i64 [[PHI1]], ptr [[Q1]], align 8
 ; CHECK-NEXT:    ret void
 ;
 entry:

mikhailramalho · 2025-02-25T03:02:05Z

Current compile time regression on SPEC:

Program                                       compile_time              
                                              lhs          rhs     diff 
INT2017rate/502.gcc_r/502.gcc_r               1213.14      1296.42  6.9%
INT2017speed/602.gcc_s/602.gcc_s              1226.93      1296.10  5.6%
FP2017spee...96.specrand_fs/996.specrand_fs      1.61         1.69  5.2%
FP2017rate/508.namd_r/508.namd_r                83.55        87.05  4.2%
FP2017rate/538.imagick_r/538.imagick_r         240.84       244.25  1.4%
INT2017rat...00.perlbench_r/500.perlbench_r    215.02       217.01  0.9%
FP2017speed/638.imagick_s/638.imagick_s        226.80       228.73  0.9%
FP2017rate/526.blender_r/526.blender_r        1910.20      1923.96  0.7%
INT2017spe...00.perlbench_s/600.perlbench_s    217.00       218.35  0.6%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s     26.76        26.89  0.5%
INT2017speed/657.xz_s/657.xz_s                  87.78        88.22  0.5%
FP2017rate/510.parest_r/510.parest_r          1838.97      1847.39  0.5%
FP2017rate/511.povray_r/511.povray_r           200.60       201.43  0.4%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s   1536.31      1542.30  0.4%
INT2017rate/525.x264_r/525.x264_r               81.10        81.36  0.3%
INT2017speed/625.x264_s/625.x264_s              81.27        81.51  0.3%
FP2017rate/544.nab_r/544.nab_r                  30.93        31.00  0.2%
INT2017rate/505.mcf_r/505.mcf_r                 12.43        12.46  0.2%
INT2017speed/641.leela_s/641.leela_s            68.88        69.01  0.2%
INT2017speed/605.mcf_s/605.mcf_s                12.48        12.49  0.1%
INT2017rat...99.specrand_ir/999.specrand_ir      1.78         1.78  0.0%
INT2017rat...23.xalancbmk_r/523.xalancbmk_r   1539.76      1539.29 -0.0%
INT2017rate/520.omnetpp_r/520.omnetpp_r        631.83       631.48 -0.1%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s    630.56       629.81 -0.1%
INT2017rate/557.xz_r/557.xz_r                   88.09        87.89 -0.2%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r     26.95        26.80 -0.6%
FP2017rate...97.specrand_fr/997.specrand_fr      1.78         1.77 -0.6%
INT2017rate/541.leela_r/541.leela_r             69.31        68.84 -0.7%
FP2017speed/644.nab_s/644.nab_s                 29.05        28.74 -1.1%
FP2017speed/619.lbm_s/619.lbm_s                  2.78         2.75 -1.3%
INT2017spe...98.specrand_is/998.specrand_is      1.80         1.78 -1.4%
FP2017rate/519.lbm_r/519.lbm_r                   2.85         2.81 -1.4%
                           Geomean difference                       0.7%

alexey-bataev · 2025-02-25T11:21:30Z

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

+    for (Instruction &I : reverse(*BB)) {
+      for (const auto *TE : getTreeEntries(&I)) {


I'm afraid this may affect compile time, better to invent something else here

Do you have any specific suggestions? This does impact compile time (as noted in the review), but also is a pretty big win on at least one benchmark.

Maybe iterate over the tree and check if the main instruction has the required parent?

On this particular point, we can probably invert this loop. We can walk the vectorization tree, and then from the last instruction in the bundle, traverse the CFG back to the next defining instruction in the vectorization tree.

I think you are describing our first approach, but there was a limitation, on the new test we added (https://github.com/mikhailramalho/llvm-project/blob/a6300f77974b9a706eec966af1f1835173251623/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll#L53):

entry / \ foo bar* \ / baz * only bar has function calls

In this case, we get three entries that we can traverse backward:
(1) baz -> baz
(2) baz -> foo
(3) foo -> entry

None go through bar, so we don't add the calls in that block to the final cost...

That's when we decided to go with collecting all reachable blocks instead, so we wouldn't miss these calls

Is getTreeEntries not a map lookup? I thought it would have been relatively cheap

Yes, it's cheap, but the basic blocks might be very large and contain several thousands instructions, so iterating over instructions in the basic block may hurt the perf

preames · 2025-02-25T16:15:44Z

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

-      assert(!Op->isGather() && "Expected vectorized operand.");
-      LiveEntries.insert(Op);
-    }
+  DenseSet<const BasicBlock *> Reachable;


Algorithmically, this doesn't look quite right. You're computing the set of blocks reachable from all defs to the root. What you actually want to have explored is the set of blocks reachable between each def and it's respective set of uses. Your set the union of all of the smaller sets, but we want to respect liveness though the graph.

I think you're going to need to precompute the number of live values per block (and transition points within blocks) given the def/use information. Once you have that, I'd suggest walking the CFG in the existing loop below (i.e. change the early exit at end of block case), and cache don't precompute which blocks have been scanned so far. In particular, cache the number of calls per block which is the primary thing you need at that point.

@alexey-bataev is there some invariant on the graph which simplifies this in any way? Do we statically know any properties of the CFG or vector tree which allow us to avoid full general liveness here?

Only that the instructions in vector nodes do not have deps of any kind between them

Algorithmically, this doesn't look quite right. You're computing the set of blocks reachable from all defs to the root. What you actually want to have explored is the set of blocks reachable between each def and it's respective set of uses. Your set the union of all of the smaller sets, but we want to respect liveness though the graph.

Computing the set of reachable blocks up front is a compile-time optimisation to cull the number of blocks visited. The actual exploration is a post order traversal starting from the root, so all uses come before defs (except for phis)

I think you're going to need to precompute the number of live values per block (and transition points within blocks) given the def/use information.

I'm not sure I'm 100% following this, could you elaborate on this a bit more. To do this do we not need to walk the CFG starting from the root anyway?

Taking a complete stab in the dark here, but my gut feeling is that most of the compile time slowdown comes from SLP trees that are inside massive loops where everything is reachable. I wonder if we can have a heuristic to ignore the backedges when deciding what to traverse?

preames · 2025-02-25T16:19:11Z

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

-        InstIt = ++getLastInstructionInBundle(TE).getIterator().getReverse(),
-        PrevInstIt = PrevInst->getIterator().getReverse();
-    while (InstIt != PrevInstIt) {
-      if (PrevInstIt == PrevInst->getParent()->rend()) {


If I'm reading the old code right, we only scanned the prefix of the block containing the using bundle, and then ignored any calls between definition and the end of it's block, plus any in intervening blocks?

If so, a couple of question/suggestions.

What happens if you add a huge cost for any case not contained in the same basic block? (i.e. strongly disincentivize long cross block cases)

What happens if you handle only "easy" control flow? Things like a series of if-else or if-clause, with some small finite depth?

Is there some invariant which insures none of these nodes can be in loops? If not, do we need to scale by trip count in some way?

preames · 2025-02-25T16:26:00Z

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

+    for (Instruction &I : reverse(*BB)) {
+      for (const auto *TE : getTreeEntries(&I)) {


On this particular point, we can probably invert this loop. We can walk the vectorization tree, and then from the last instruction in the bundle, traverse the CFG back to the next defining instruction in the vectorization tree.

mikhailramalho requested review from preames, RKSimon, lukel97 and alexey-bataev February 25, 2025 03:00

llvmbot added vectorizers llvm:transforms labels Feb 25, 2025

alexey-bataev reviewed Feb 25, 2025

View reviewed changes

preames reviewed Feb 25, 2025

View reviewed changes

preames mentioned this pull request Feb 28, 2025

[SLP]Fix/improve getSpillCost analysis #129258

Merged

mikhailramalho closed this Mar 8, 2025

mikhailramalho deleted the slp-spillcost-fix branch April 23, 2025 17:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SLP] Improve block traversal in getSpillCost() #128620

[SLP] Improve block traversal in getSpillCost() #128620

Uh oh!

mikhailramalho commented Feb 25, 2025

Uh oh!

llvmbot commented Feb 25, 2025 •

edited

Loading

Uh oh!

mikhailramalho commented Feb 25, 2025 •

edited

Loading

Uh oh!

alexey-bataev Feb 25, 2025

Uh oh!

preames Feb 25, 2025

Uh oh!

alexey-bataev Feb 25, 2025

Uh oh!

preames Feb 25, 2025

Uh oh!

mikhailramalho Feb 25, 2025

Uh oh!

lukel97 Feb 26, 2025

Uh oh!

alexey-bataev Feb 26, 2025

Uh oh!

preames Feb 25, 2025

Uh oh!

alexey-bataev Feb 25, 2025

Uh oh!

lukel97 Feb 26, 2025 •

edited

Loading

Uh oh!

preames Feb 25, 2025

Uh oh!

preames Feb 25, 2025

Uh oh!

Uh oh!

		for (Instruction &I : reverse(*BB)) {
		for (const auto *TE : getTreeEntries(&I)) {

[SLP] Improve block traversal in getSpillCost() #128620

[SLP] Improve block traversal in getSpillCost() #128620

Uh oh!

Conversation

mikhailramalho commented Feb 25, 2025

Uh oh!

llvmbot commented Feb 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mikhailramalho commented Feb 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lukel97 Feb 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

llvmbot commented Feb 25, 2025 •

edited

Loading

mikhailramalho commented Feb 25, 2025 •

edited

Loading

lukel97 Feb 26, 2025 •

edited

Loading