
[SLP] Improve block traversal in getSpillCost() #128620


Closed

Conversation

mikhailramalho
Member

This is a WIP patch due to compile-time regressions of up to 7% on gcc_r/gcc_s.

Previously, getSpillCost would skip the blocks in between when traversing instructions backward. If one of the skipped blocks contains a function call, the existing logic leads to an incorrect spill cost.

The new implementation:

  • Uses post_order traversal to visit blocks
  • Tracks live entries across basic blocks
  • Computes reachable blocks once upfront using depth_first_ext
  • Maintains correct cost calculation for diamond-shaped control flow

Performance improvements:

  • Reduces execution time of SPEC CPU benchmark 544.nab_r by 9.92%
  • Reduces code size of 508.namd by 1.73%

This optimization improves vectorization decisions by making spill cost estimation more accurate, particularly for code with complex control flow.
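In outline, the new cost loop has the following shape (a simplified sketch condensed from the patch below; updateLiveEntries and liveEntryTypes are hypothetical stand-ins for inline code in the actual diff):

  // Visit blocks in post order so a bundle's uses are seen before its
  // definition; only blocks on a path from some bundle to the root matter.
  for (BasicBlock *BB : post_order(RootBB->getParent())) {
    if (!Reachable.contains(BB))
      continue;
    for (Instruction &I : reverse(*BB)) {
      // The last instruction of a bundle ends its live range; its
      // vectorized operands become live.
      for (const TreeEntry *TE : getTreeEntries(&I))
        updateLiveEntries(TE);
      // Each scalar call crossed while entries are live incurs spill cost.
      if (const auto *CB = dyn_cast<CallBase>(&I);
          CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
        Cost += TTI->getCostOfKeepingLiveOverCall(liveEntryTypes());
    }
  }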

@llvmbot
Member

llvmbot commented Feb 25, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Mikhail R. Gadelha (mikhailramalho)



Full diff: https://github.com/llvm/llvm-project/pull/128620.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+52-74)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll (+12-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll (+16-6)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 3d660b63309d4..2d23307c18839 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -12415,68 +12415,48 @@ InstructionCost BoUpSLP::getSpillCost() {
   InstructionCost Cost = 0;
 
   SmallPtrSet<const TreeEntry *, 4> LiveEntries;
-  const TreeEntry *Prev = nullptr;
-
-  // The entries in VectorizableTree are not necessarily ordered by their
-  // position in basic blocks. Collect them and order them by dominance so later
-  // instructions are guaranteed to be visited first. For instructions in
-  // different basic blocks, we only scan to the beginning of the block, so
-  // their order does not matter, as long as all instructions in a basic block
-  // are grouped together. Using dominance ensures a deterministic order.
-  SmallVector<TreeEntry *, 16> OrderedEntries;
-  for (const auto &TEPtr : VectorizableTree) {
-    if (TEPtr->isGather())
-      continue;
-    OrderedEntries.push_back(TEPtr.get());
-  }
-  llvm::stable_sort(OrderedEntries, [&](const TreeEntry *TA,
-                                        const TreeEntry *TB) {
-    Instruction &A = getLastInstructionInBundle(TA);
-    Instruction &B = getLastInstructionInBundle(TB);
-    auto *NodeA = DT->getNode(A.getParent());
-    auto *NodeB = DT->getNode(B.getParent());
-    assert(NodeA && "Should only process reachable instructions");
-    assert(NodeB && "Should only process reachable instructions");
-    assert((NodeA == NodeB) == (NodeA->getDFSNumIn() == NodeB->getDFSNumIn()) &&
-           "Different nodes should have different DFS numbers");
-    if (NodeA != NodeB)
-      return NodeA->getDFSNumIn() > NodeB->getDFSNumIn();
-    return B.comesBefore(&A);
-  });
 
-  for (const TreeEntry *TE : OrderedEntries) {
-    if (!Prev) {
-      Prev = TE;
+  const TreeEntry *Root = VectorizableTree.front().get();
+  BasicBlock *RootBB = cast<Instruction>(Root->Scalars[0])->getParent();
+
+  // Compute what nodes are reachable from the leaves to the roots
+  df_iterator_default_set<const BasicBlock *> ReachableFromLeaves;
+  for (auto &TE : VectorizableTree) {
+    if (TE->isGather())
       continue;
-    }
+    auto *BB = getLastInstructionInBundle(TE.get()).getParent();
+    for (const BasicBlock *X : depth_first_ext(BB, ReachableFromLeaves))
+      ReachableFromLeaves.insert(X);
+  }
 
-    LiveEntries.erase(Prev);
-    for (unsigned I : seq<unsigned>(Prev->getNumOperands())) {
-      const TreeEntry *Op = getVectorizedOperand(Prev, I);
-      if (!Op)
-        continue;
-      assert(!Op->isGather() && "Expected vectorized operand.");
-      LiveEntries.insert(Op);
-    }
+  DenseSet<const BasicBlock *> Reachable;
+  for (const BasicBlock *X : inverse_depth_first(RootBB))
+    Reachable.insert(X);
+  set_intersect(Reachable, ReachableFromLeaves);
 
-    LLVM_DEBUG({
-      dbgs() << "SLP: #LV: " << LiveEntries.size();
-      for (auto *X : LiveEntries)
-        X->dump();
-      dbgs() << ", Looking at ";
-      TE->dump();
-    });
+  DenseSet<const TreeEntry *> Defined;
 
-    // Now find the sequence of instructions between PrevInst and Inst.
-    unsigned NumCalls = 0;
-    const Instruction *PrevInst = &getLastInstructionInBundle(Prev);
-    BasicBlock::const_reverse_iterator
-        InstIt = ++getLastInstructionInBundle(TE).getIterator().getReverse(),
-        PrevInstIt = PrevInst->getIterator().getReverse();
-    while (InstIt != PrevInstIt) {
-      if (PrevInstIt == PrevInst->getParent()->rend()) {
-        PrevInstIt = getLastInstructionInBundle(TE).getParent()->rbegin();
-        continue;
+  // Iterate the tree from the root, post order so that all uses appear before
+  // definitions.
+  // TODO: LiveEntries are shared across all paths, so this may overestimate.
+  for (BasicBlock *BB : post_order(RootBB->getParent())) {
+    if (!Reachable.contains(BB))
+      continue;
+
+    for (Instruction &I : reverse(*BB)) {
+      for (const auto *TE : getTreeEntries(&I)) {
+        if (TE->isGather())
+          continue;
+        LiveEntries.erase(TE);
+        Defined.insert(TE);
+        for (unsigned Idx : seq<unsigned>(TE->getNumOperands())) {
+          const TreeEntry *Op = getVectorizedOperand(TE, Idx);
+          if (!Op)
+            continue;
+          assert(!Op->isGather() && "Expected vectorized operand.");
+          if (!Defined.contains(Op))
+            LiveEntries.insert(Op);
+        }
       }
 
       auto NoCallIntrinsic = [this](const Instruction *I) {
@@ -12497,26 +12477,24 @@ InstructionCost BoUpSLP::getSpillCost() {
       // Debug information does not impact spill cost.
       // Vectorized calls, represented as vector intrinsics, do not impact spill
       // cost.
-      if (const auto *CB = dyn_cast<CallBase>(&*PrevInstIt);
-          CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
-        NumCalls++;
-
-      ++PrevInstIt;
-    }
+      if (const auto *CB = dyn_cast<CallBase>(&I);
+          CB && !NoCallIntrinsic(CB) && !isVectorized(CB)) {
+        SmallVector<Type *, 4> EntriesTypes;
+        for (const TreeEntry *TE : LiveEntries) {
+          auto *ScalarTy = TE->getMainOp()->getType();
+          auto It = MinBWs.find(TE);
+          if (It != MinBWs.end())
+            ScalarTy =
+                IntegerType::get(ScalarTy->getContext(), It->second.first);
+          EntriesTypes.push_back(
+              getWidenedType(ScalarTy, TE->getVectorFactor()));
+        }
 
-    if (NumCalls) {
-      SmallVector<Type *, 4> EntriesTypes;
-      for (const TreeEntry *TE : LiveEntries) {
-        auto *ScalarTy = TE->getMainOp()->getType();
-        auto It = MinBWs.find(TE);
-        if (It != MinBWs.end())
-          ScalarTy = IntegerType::get(ScalarTy->getContext(), It->second.first);
-        EntriesTypes.push_back(getWidenedType(ScalarTy, TE->getVectorFactor()));
+        LLVM_DEBUG(dbgs() << "SLP: " << LiveEntries.size()
+                          << " entries alive over call:" << I << "\n");
+        Cost += TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
       }
-      Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
     }
-
-    Prev = TE;
   }
 
   return Cost;
diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
index 8cb620f870331..fc71643f6a51d 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
@@ -1740,7 +1740,9 @@ entry:
 define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-LABEL: define void @f
 ; CHECK-SAME: (i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; CHECK-NEXT:    br i1 [[C]], label [[FOO:%.*]], label [[BAR:%.*]]
 ; CHECK:       foo:
 ; CHECK-NEXT:    [[Y0:%.*]] = load float, ptr [[R]], align 4
@@ -1751,12 +1753,16 @@ define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-NEXT:    [[Z1:%.*]] = call float @fabsf(float [[Z0]])
 ; CHECK-NEXT:    br label [[BAZ]]
 ; CHECK:       baz:
-; CHECK-NEXT:    store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; CHECK-NEXT:    store i64 [[X0]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT:    store i64 [[X1]], ptr [[Q1]], align 8
 ; CHECK-NEXT:    ret void
 ;
 ; DEFAULT-LABEL: define void @f
 ; DEFAULT-SAME: (i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR1]] {
-; DEFAULT-NEXT:    [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; DEFAULT-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; DEFAULT-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; DEFAULT-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; DEFAULT-NEXT:    br i1 [[C]], label [[FOO:%.*]], label [[BAR:%.*]]
 ; DEFAULT:       foo:
 ; DEFAULT-NEXT:    [[Y0:%.*]] = load float, ptr [[R]], align 4
@@ -1767,7 +1773,9 @@ define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; DEFAULT-NEXT:    [[Z1:%.*]] = call float @fabsf(float [[Z0]])
 ; DEFAULT-NEXT:    br label [[BAZ]]
 ; DEFAULT:       baz:
-; DEFAULT-NEXT:    store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; DEFAULT-NEXT:    store i64 [[X0]], ptr [[Q]], align 8
+; DEFAULT-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; DEFAULT-NEXT:    store i64 [[X1]], ptr [[Q1]], align 8
 ; DEFAULT-NEXT:    ret void
 ;
   %x0 = load i64, ptr %p
diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
index b0c25bc4cc1f2..5e91616c2beb9 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
@@ -7,7 +7,9 @@ declare void @g()
 define void @f0(i1 %c, ptr %p, ptr %q) {
 ; CHECK-LABEL: define void @f0(
 ; CHECK-SAME: i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]]) #[[ATTR0:[0-9]+]] {
-; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; CHECK-NEXT:    br i1 [[C]], label %[[FOO:.*]], label %[[BAR:.*]]
 ; CHECK:       [[FOO]]:
 ; CHECK-NEXT:    call void @g()
@@ -20,7 +22,9 @@ define void @f0(i1 %c, ptr %p, ptr %q) {
 ; CHECK-NEXT:    call void @g()
 ; CHECK-NEXT:    br label %[[BAZ]]
 ; CHECK:       [[BAZ]]:
-; CHECK-NEXT:    store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; CHECK-NEXT:    store i64 [[X0]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT:    store i64 [[X1]], ptr [[Q1]], align 8
 ; CHECK-NEXT:    ret void
 ;
   %x0 = load i64, ptr %p
@@ -50,10 +54,13 @@ define void @f1(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-LABEL: define void @f1(
 ; CHECK-SAME: i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:  [[ENTRY:.*:]]
-; CHECK-NEXT:    [[TMP0:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT:    [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT:    [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT:    [[X1:%.*]] = load i64, ptr [[P1]], align 8
 ; CHECK-NEXT:    br i1 [[C]], label %[[FOO:.*]], label %[[BAR:.*]]
 ; CHECK:       [[FOO]]:
-; CHECK-NEXT:    [[TMP1:%.*]] = add <2 x i64> [[TMP0]], splat (i64 1)
+; CHECK-NEXT:    [[Y0:%.*]] = add i64 [[X0]], 1
+; CHECK-NEXT:    [[Y1:%.*]] = add i64 [[X1]], 1
 ; CHECK-NEXT:    br label %[[BAZ:.*]]
 ; CHECK:       [[BAR]]:
 ; CHECK-NEXT:    call void @g()
@@ -61,8 +68,11 @@ define void @f1(i1 %c, ptr %p, ptr %q, ptr %r) {
 ; CHECK-NEXT:    call void @g()
 ; CHECK-NEXT:    br label %[[BAZ]]
 ; CHECK:       [[BAZ]]:
-; CHECK-NEXT:    [[TMP2:%.*]] = phi <2 x i64> [ [[TMP1]], %[[FOO]] ], [ [[TMP0]], %[[BAR]] ]
-; CHECK-NEXT:    store <2 x i64> [[TMP2]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[PHI0:%.*]] = phi i64 [ [[Y0]], %[[FOO]] ], [ [[X0]], %[[BAR]] ]
+; CHECK-NEXT:    [[PHI1:%.*]] = phi i64 [ [[Y1]], %[[FOO]] ], [ [[X1]], %[[BAR]] ]
+; CHECK-NEXT:    store i64 [[PHI0]], ptr [[Q]], align 8
+; CHECK-NEXT:    [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT:    store i64 [[PHI1]], ptr [[Q1]], align 8
 ; CHECK-NEXT:    ret void
 ;
 entry:

@mikhailramalho
Member Author

mikhailramalho commented Feb 25, 2025

Current compile time regression on SPEC:

Program                                       compile_time              
                                              lhs          rhs     diff 
INT2017rate/502.gcc_r/502.gcc_r               1213.14      1296.42  6.9%
INT2017speed/602.gcc_s/602.gcc_s              1226.93      1296.10  5.6%
FP2017spee...96.specrand_fs/996.specrand_fs      1.61         1.69  5.2%
FP2017rate/508.namd_r/508.namd_r                83.55        87.05  4.2%
FP2017rate/538.imagick_r/538.imagick_r         240.84       244.25  1.4%
INT2017rat...00.perlbench_r/500.perlbench_r    215.02       217.01  0.9%
FP2017speed/638.imagick_s/638.imagick_s        226.80       228.73  0.9%
FP2017rate/526.blender_r/526.blender_r        1910.20      1923.96  0.7%
INT2017spe...00.perlbench_s/600.perlbench_s    217.00       218.35  0.6%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s     26.76        26.89  0.5%
INT2017speed/657.xz_s/657.xz_s                  87.78        88.22  0.5%
FP2017rate/510.parest_r/510.parest_r          1838.97      1847.39  0.5%
FP2017rate/511.povray_r/511.povray_r           200.60       201.43  0.4%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s   1536.31      1542.30  0.4%
INT2017rate/525.x264_r/525.x264_r               81.10        81.36  0.3%
INT2017speed/625.x264_s/625.x264_s              81.27        81.51  0.3%
FP2017rate/544.nab_r/544.nab_r                  30.93        31.00  0.2%
INT2017rate/505.mcf_r/505.mcf_r                 12.43        12.46  0.2%
INT2017speed/641.leela_s/641.leela_s            68.88        69.01  0.2%
INT2017speed/605.mcf_s/605.mcf_s                12.48        12.49  0.1%
INT2017rat...99.specrand_ir/999.specrand_ir      1.78         1.78  0.0%
INT2017rat...23.xalancbmk_r/523.xalancbmk_r   1539.76      1539.29 -0.0%
INT2017rate/520.omnetpp_r/520.omnetpp_r        631.83       631.48 -0.1%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s    630.56       629.81 -0.1%
INT2017rate/557.xz_r/557.xz_r                   88.09        87.89 -0.2%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r     26.95        26.80 -0.6%
FP2017rate...97.specrand_fr/997.specrand_fr      1.78         1.77 -0.6%
INT2017rate/541.leela_r/541.leela_r             69.31        68.84 -0.7%
FP2017speed/644.nab_s/644.nab_s                 29.05        28.74 -1.1%
FP2017speed/619.lbm_s/619.lbm_s                  2.78         2.75 -1.3%
INT2017spe...98.specrand_is/998.specrand_is      1.80         1.78 -1.4%
FP2017rate/519.lbm_r/519.lbm_r                   2.85         2.81 -1.4%
                           Geomean difference                       0.7%

Comment on lines +12446 to +12447
for (Instruction &I : reverse(*BB)) {
for (const auto *TE : getTreeEntries(&I)) {
Member


I'm afraid this may affect compile time, better to invent something else here

Collaborator


Do you have any specific suggestions? This does impact compile time (as noted in the review), but it is also a pretty big win on at least one benchmark.

Member


Maybe iterate over the tree and check if the main instruction has the required parent?

Collaborator


On this particular point, we can probably invert this loop. We can walk the vectorization tree, and then from the last instruction in the bundle, traverse the CFG back to the next defining instruction in the vectorization tree.
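In rough form (a hypothetical sketch of this suggestion, not code from the patch; costOfCallsBackToDefs is a made-up helper):

  // For each bundle, walk the CFG backward from its last instruction,
  // counting scalar calls, until the last instructions of its vectorized
  // operand bundles are reached.
  for (const TreeEntry *TE : OrderedEntries)
    Cost += costOfCallsBackToDefs(getLastInstructionInBundle(TE));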

Member Author


I think you are describing our first approach, but it had a limitation on the new test we added (https://github.com/mikhailramalho/llvm-project/blob/a6300f77974b9a706eec966af1f1835173251623/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll#L53):

       entry
      /     \
    foo     bar*
      \     /
       baz

  * only bar has function calls

In this case, we get three entries that we can traverse backward:
(1) baz -> baz
(2) baz -> foo
(3) foo -> entry

None go through bar, so we don't add the calls in that block to the final cost...

That's when we decided to go with collecting all reachable blocks instead, so we wouldn't miss these calls. The reduced IR below shows the shape of that test.
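For reference, the CFG in that test reduces to roughly this IR (reconstructed from the f1 CHECK lines in the diff; the unused %r argument is dropped):

  declare void @g()

  define void @f1(i1 %c, ptr %p, ptr %q) {
  entry:
    %x0 = load i64, ptr %p
    %p1 = getelementptr i64, ptr %p, i64 1
    %x1 = load i64, ptr %p1
    br i1 %c, label %foo, label %bar
  foo:                                     ; no calls on this path
    %y0 = add i64 %x0, 1
    %y1 = add i64 %x1, 1
    br label %baz
  bar:                                     ; the only block with calls
    call void @g()
    call void @g()
    br label %baz
  baz:
    %phi0 = phi i64 [ %y0, %foo ], [ %x0, %bar ]
    %phi1 = phi i64 [ %y1, %foo ], [ %x1, %bar ]
    store i64 %phi0, ptr %q
    %q1 = getelementptr i64, ptr %q, i64 1
    store i64 %phi1, ptr %q1
    ret void
  }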

Contributor


Is getTreeEntries not a map lookup? I thought it would have been relatively cheap.

Member


Yes, it's cheap, but basic blocks might be very large and contain several thousand instructions, so iterating over all the instructions in a basic block may hurt performance.

assert(!Op->isGather() && "Expected vectorized operand.");
LiveEntries.insert(Op);
}
DenseSet<const BasicBlock *> Reachable;
Collaborator


Algorithmically, this doesn't look quite right. You're computing the set of blocks reachable from all defs to the root. What you actually want to explore is the set of blocks reachable between each def and its respective set of uses. Your set is the union of all the smaller sets, but we want to respect liveness through the graph.

I think you're going to need to precompute the number of live values per block (and the transition points within blocks) given the def/use information. Once you have that, I'd suggest walking the CFG in the existing loop below (i.e. changing the early-exit-at-end-of-block case), and caching (not precomputing) which blocks have been scanned so far. In particular, cache the number of calls per block, which is the primary thing you need at that point.
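A hypothetical sketch of just that caching part (NoCallIntrinsic and isVectorized as in the patch; getNumCalls is a made-up name):

  // Count relevant scalar calls per block once, on first visit, so repeated
  // backward walks don't rescan large blocks.
  DenseMap<const BasicBlock *, unsigned> CallsPerBlock;
  auto getNumCalls = [&](const BasicBlock *BB) {
    auto [It, Inserted] = CallsPerBlock.try_emplace(BB, 0);
    if (Inserted)
      for (const Instruction &I : *BB)
        if (const auto *CB = dyn_cast<CallBase>(&I);
            CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
          ++It->second;
    return It->second;
  };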

@alexey-bataev is there some invariant on the graph which simplifies this in any way? Do we statically know any properties of the CFG or vector tree which allow us to avoid full general liveness here?

Member


Only that the instructions in vector nodes do not have deps of any kind between them

Contributor

lukel97 commented Feb 26, 2025


> Algorithmically, this doesn't look quite right. You're computing the set of blocks reachable from all defs to the root. What you actually want to explore is the set of blocks reachable between each def and its respective set of uses. Your set is the union of all the smaller sets, but we want to respect liveness through the graph.

Computing the set of reachable blocks up front is a compile-time optimisation to cull the number of blocks visited. The actual exploration is a post order traversal starting from the root, so all uses come before defs (except for phis)

> I think you're going to need to precompute the number of live values per block (and the transition points within blocks) given the def/use information.

I'm not sure I'm 100% following this; could you elaborate a bit more? To do this, don't we need to walk the CFG starting from the root anyway?

Taking a complete stab in the dark here, but my gut feeling is that most of the compile-time slowdown comes from SLP trees inside massive loops where everything is reachable. I wonder if we could have a heuristic that ignores backedges when deciding what to traverse?
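One possible shape for that heuristic (a hypothetical sketch, not part of the patch; FindFunctionBackedges comes from llvm/Analysis/CFG.h, and StartBB stands for each bundle's block):

  // Collect the function's backedges once, then do the reachability walk
  // with a worklist that never follows them, so a tree inside a big loop
  // doesn't pull the whole loop body into the reachable set.
  SmallVector<std::pair<const BasicBlock *, const BasicBlock *>, 8> BackEdges;
  FindFunctionBackedges(F, BackEdges);
  SmallPtrSet<const BasicBlock *, 16> Visited;
  SmallVector<const BasicBlock *, 16> Worklist{StartBB};
  while (!Worklist.empty()) {
    const BasicBlock *BB = Worklist.pop_back_val();
    if (!Visited.insert(BB).second)
      continue;
    for (const BasicBlock *Succ : successors(BB))
      if (!llvm::is_contained(BackEdges, std::make_pair(BB, Succ)))
        Worklist.push_back(Succ);
  }

Skipping backedges would change which calls are counted for values live across loop iterations, so this only shows the traversal side of the idea.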

InstIt = ++getLastInstructionInBundle(TE).getIterator().getReverse(),
PrevInstIt = PrevInst->getIterator().getReverse();
while (InstIt != PrevInstIt) {
if (PrevInstIt == PrevInst->getParent()->rend()) {
Collaborator


If I'm reading the old code right, we only scanned the prefix of the block containing the using bundle, and then ignored any calls between the definition and the end of its block, plus any in intervening blocks?

If so, a couple of questions/suggestions.

  1. What happens if you add a huge cost for any case not contained in the same basic block? (i.e. strongly disincentivize long cross block cases)
  2. What happens if you handle only "easy" control flow? Things like a series of if-else or if-clause, with some small finite depth?
  3. Is there some invariant which ensures none of these nodes can be in loops? If not, do we need to scale by trip count in some way?


@mikhailramalho deleted the slp-spillcost-fix branch April 23, 2025 17:14