[SLP] Improve block traversal in getSpillCost() #128620
Conversation
This is a WIP patch due to the compilation time regressions, up to 7% on gcc_r/gcc_s.

Previously, getSpillCost would skip the in-between blocks when traversing instructions backward. If one of the skipped blocks has a function call, the existing logic would lead to incorrect spill cost calculations.

The new implementation:

- Uses post_order traversal to visit blocks
- Tracks live entries across basic blocks
- Computes reachable blocks once upfront using depth_first_ext
- Maintains correct cost calculation for diamond-shaped control flow

Performance improvements:

- Reduces execution time of SPEC CPU benchmark 544.nab_r by 9.92%
- Reduces code size of 508.namd by 1.73%

This optimization improves vectorization decisions by making spill cost estimation more accurate, particularly for code with complex control flow.
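Condensed, the core of the new walk looks roughly like this (a simplified sketch of the change in the diff below; the reachability computation, MinBWs adjustment and debug output are trimmed, and `vectorizedOperands`/`typesOf` are stand-in helpers rather than real functions):

```cpp
// Simplified sketch of the new getSpillCost() traversal (not the verbatim patch).
InstructionCost Cost = 0;
SmallPtrSet<const TreeEntry *, 4> LiveEntries; // operands whose def has not been seen yet
DenseSet<const TreeEntry *> Defined;           // entries whose def was already visited

// Post-order over the CFG visits uses before definitions; Reachable (computed
// once with depth_first_ext/inverse_depth_first, see the diff) culls the walk
// to blocks that lie between the tree's leaves and its root.
for (BasicBlock *BB : post_order(RootBB->getParent())) {
  if (!Reachable.contains(BB))
    continue;
  for (Instruction &I : reverse(*BB)) {
    // A vectorized instruction retires its own entry and makes its operands live.
    for (const TreeEntry *TE : getTreeEntries(&I)) {
      if (TE->isGather())
        continue;
      LiveEntries.erase(TE);
      Defined.insert(TE);
      for (const TreeEntry *Op : vectorizedOperands(TE)) // stand-in helper
        if (!Defined.contains(Op))
          LiveEntries.insert(Op);
    }
    // Every non-intrinsic, non-vectorized call pays for keeping the currently
    // live entries in registers across the call.
    if (const auto *CB = dyn_cast<CallBase>(&I);
        CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
      Cost += TTI->getCostOfKeepingLiveOverCall(typesOf(LiveEntries)); // stand-in helper
  }
}
```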
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Mikhail R. Gadelha (mikhailramalho)

Full diff: https://github.com/llvm/llvm-project/pull/128620.diff

3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 3d660b63309d4..2d23307c18839 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -12415,68 +12415,48 @@ InstructionCost BoUpSLP::getSpillCost() {
InstructionCost Cost = 0;
SmallPtrSet<const TreeEntry *, 4> LiveEntries;
- const TreeEntry *Prev = nullptr;
-
- // The entries in VectorizableTree are not necessarily ordered by their
- // position in basic blocks. Collect them and order them by dominance so later
- // instructions are guaranteed to be visited first. For instructions in
- // different basic blocks, we only scan to the beginning of the block, so
- // their order does not matter, as long as all instructions in a basic block
- // are grouped together. Using dominance ensures a deterministic order.
- SmallVector<TreeEntry *, 16> OrderedEntries;
- for (const auto &TEPtr : VectorizableTree) {
- if (TEPtr->isGather())
- continue;
- OrderedEntries.push_back(TEPtr.get());
- }
- llvm::stable_sort(OrderedEntries, [&](const TreeEntry *TA,
- const TreeEntry *TB) {
- Instruction &A = getLastInstructionInBundle(TA);
- Instruction &B = getLastInstructionInBundle(TB);
- auto *NodeA = DT->getNode(A.getParent());
- auto *NodeB = DT->getNode(B.getParent());
- assert(NodeA && "Should only process reachable instructions");
- assert(NodeB && "Should only process reachable instructions");
- assert((NodeA == NodeB) == (NodeA->getDFSNumIn() == NodeB->getDFSNumIn()) &&
- "Different nodes should have different DFS numbers");
- if (NodeA != NodeB)
- return NodeA->getDFSNumIn() > NodeB->getDFSNumIn();
- return B.comesBefore(&A);
- });
- for (const TreeEntry *TE : OrderedEntries) {
- if (!Prev) {
- Prev = TE;
+ const TreeEntry *Root = VectorizableTree.front().get();
+ BasicBlock *RootBB = cast<Instruction>(Root->Scalars[0])->getParent();
+
+ // Compute what nodes are reachable from the leaves to the roots
+ df_iterator_default_set<const BasicBlock *> ReachableFromLeaves;
+ for (auto &TE : VectorizableTree) {
+ if (TE->isGather())
continue;
- }
+ auto *BB = getLastInstructionInBundle(TE.get()).getParent();
+ for (const BasicBlock *X : depth_first_ext(BB, ReachableFromLeaves))
+ ReachableFromLeaves.insert(X);
+ }
- LiveEntries.erase(Prev);
- for (unsigned I : seq<unsigned>(Prev->getNumOperands())) {
- const TreeEntry *Op = getVectorizedOperand(Prev, I);
- if (!Op)
- continue;
- assert(!Op->isGather() && "Expected vectorized operand.");
- LiveEntries.insert(Op);
- }
+ DenseSet<const BasicBlock *> Reachable;
+ for (const BasicBlock *X : inverse_depth_first(RootBB))
+ Reachable.insert(X);
+ set_intersect(Reachable, ReachableFromLeaves);
- LLVM_DEBUG({
- dbgs() << "SLP: #LV: " << LiveEntries.size();
- for (auto *X : LiveEntries)
- X->dump();
- dbgs() << ", Looking at ";
- TE->dump();
- });
+ DenseSet<const TreeEntry *> Defined;
- // Now find the sequence of instructions between PrevInst and Inst.
- unsigned NumCalls = 0;
- const Instruction *PrevInst = &getLastInstructionInBundle(Prev);
- BasicBlock::const_reverse_iterator
- InstIt = ++getLastInstructionInBundle(TE).getIterator().getReverse(),
- PrevInstIt = PrevInst->getIterator().getReverse();
- while (InstIt != PrevInstIt) {
- if (PrevInstIt == PrevInst->getParent()->rend()) {
- PrevInstIt = getLastInstructionInBundle(TE).getParent()->rbegin();
- continue;
+ // Iterate the tree from the root, post order so that all uses appear before
+ // definitions.
+ // TODO: LiveEntries are shared across all paths, so this may overestimate.
+ for (BasicBlock *BB : post_order(RootBB->getParent())) {
+ if (!Reachable.contains(BB))
+ continue;
+
+ for (Instruction &I : reverse(*BB)) {
+ for (const auto *TE : getTreeEntries(&I)) {
+ if (TE->isGather())
+ continue;
+ LiveEntries.erase(TE);
+ Defined.insert(TE);
+ for (unsigned Idx : seq<unsigned>(TE->getNumOperands())) {
+ const TreeEntry *Op = getVectorizedOperand(TE, Idx);
+ if (!Op)
+ continue;
+ assert(!Op->isGather() && "Expected vectorized operand.");
+ if (!Defined.contains(Op))
+ LiveEntries.insert(Op);
+ }
}
auto NoCallIntrinsic = [this](const Instruction *I) {
@@ -12497,26 +12477,24 @@ InstructionCost BoUpSLP::getSpillCost() {
// Debug information does not impact spill cost.
// Vectorized calls, represented as vector intrinsics, do not impact spill
// cost.
- if (const auto *CB = dyn_cast<CallBase>(&*PrevInstIt);
- CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
- NumCalls++;
-
- ++PrevInstIt;
- }
+ if (const auto *CB = dyn_cast<CallBase>(&I);
+ CB && !NoCallIntrinsic(CB) && !isVectorized(CB)) {
+ SmallVector<Type *, 4> EntriesTypes;
+ for (const TreeEntry *TE : LiveEntries) {
+ auto *ScalarTy = TE->getMainOp()->getType();
+ auto It = MinBWs.find(TE);
+ if (It != MinBWs.end())
+ ScalarTy =
+ IntegerType::get(ScalarTy->getContext(), It->second.first);
+ EntriesTypes.push_back(
+ getWidenedType(ScalarTy, TE->getVectorFactor()));
+ }
- if (NumCalls) {
- SmallVector<Type *, 4> EntriesTypes;
- for (const TreeEntry *TE : LiveEntries) {
- auto *ScalarTy = TE->getMainOp()->getType();
- auto It = MinBWs.find(TE);
- if (It != MinBWs.end())
- ScalarTy = IntegerType::get(ScalarTy->getContext(), It->second.first);
- EntriesTypes.push_back(getWidenedType(ScalarTy, TE->getVectorFactor()));
+ LLVM_DEBUG(dbgs() << "SLP: " << LiveEntries.size()
+ << " entries alive over call:" << I << "\n");
+ Cost += TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
}
- Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
}
-
- Prev = TE;
}
return Cost;
diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
index 8cb620f870331..fc71643f6a51d 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/math-function.ll
@@ -1740,7 +1740,9 @@ entry:
define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
; CHECK-LABEL: define void @f
; CHECK-SAME: (i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR1]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT: [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT: [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT: [[X1:%.*]] = load i64, ptr [[P1]], align 8
; CHECK-NEXT: br i1 [[C]], label [[FOO:%.*]], label [[BAR:%.*]]
; CHECK: foo:
; CHECK-NEXT: [[Y0:%.*]] = load float, ptr [[R]], align 4
@@ -1751,12 +1753,16 @@ define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
; CHECK-NEXT: [[Z1:%.*]] = call float @fabsf(float [[Z0]])
; CHECK-NEXT: br label [[BAZ]]
; CHECK: baz:
-; CHECK-NEXT: store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; CHECK-NEXT: store i64 [[X0]], ptr [[Q]], align 8
+; CHECK-NEXT: [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT: store i64 [[X1]], ptr [[Q1]], align 8
; CHECK-NEXT: ret void
;
; DEFAULT-LABEL: define void @f
; DEFAULT-SAME: (i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR1]] {
-; DEFAULT-NEXT: [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; DEFAULT-NEXT: [[X0:%.*]] = load i64, ptr [[P]], align 8
+; DEFAULT-NEXT: [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; DEFAULT-NEXT: [[X1:%.*]] = load i64, ptr [[P1]], align 8
; DEFAULT-NEXT: br i1 [[C]], label [[FOO:%.*]], label [[BAR:%.*]]
; DEFAULT: foo:
; DEFAULT-NEXT: [[Y0:%.*]] = load float, ptr [[R]], align 4
@@ -1767,7 +1773,9 @@ define void @f(i1 %c, ptr %p, ptr %q, ptr %r) {
; DEFAULT-NEXT: [[Z1:%.*]] = call float @fabsf(float [[Z0]])
; DEFAULT-NEXT: br label [[BAZ]]
; DEFAULT: baz:
-; DEFAULT-NEXT: store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; DEFAULT-NEXT: store i64 [[X0]], ptr [[Q]], align 8
+; DEFAULT-NEXT: [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; DEFAULT-NEXT: store i64 [[X1]], ptr [[Q1]], align 8
; DEFAULT-NEXT: ret void
;
%x0 = load i64, ptr %p
diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
index b0c25bc4cc1f2..5e91616c2beb9 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll
@@ -7,7 +7,9 @@ declare void @g()
define void @f0(i1 %c, ptr %p, ptr %q) {
; CHECK-LABEL: define void @f0(
; CHECK-SAME: i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]]) #[[ATTR0:[0-9]+]] {
-; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT: [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT: [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT: [[X1:%.*]] = load i64, ptr [[P1]], align 8
; CHECK-NEXT: br i1 [[C]], label %[[FOO:.*]], label %[[BAR:.*]]
; CHECK: [[FOO]]:
; CHECK-NEXT: call void @g()
@@ -20,7 +22,9 @@ define void @f0(i1 %c, ptr %p, ptr %q) {
; CHECK-NEXT: call void @g()
; CHECK-NEXT: br label %[[BAZ]]
; CHECK: [[BAZ]]:
-; CHECK-NEXT: store <2 x i64> [[TMP1]], ptr [[Q]], align 8
+; CHECK-NEXT: store i64 [[X0]], ptr [[Q]], align 8
+; CHECK-NEXT: [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT: store i64 [[X1]], ptr [[Q1]], align 8
; CHECK-NEXT: ret void
;
%x0 = load i64, ptr %p
@@ -50,10 +54,13 @@ define void @f1(i1 %c, ptr %p, ptr %q, ptr %r) {
; CHECK-LABEL: define void @f1(
; CHECK-SAME: i1 [[C:%.*]], ptr [[P:%.*]], ptr [[Q:%.*]], ptr [[R:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: [[ENTRY:.*:]]
-; CHECK-NEXT: [[TMP0:%.*]] = load <2 x i64>, ptr [[P]], align 8
+; CHECK-NEXT: [[X0:%.*]] = load i64, ptr [[P]], align 8
+; CHECK-NEXT: [[P1:%.*]] = getelementptr i64, ptr [[P]], i64 1
+; CHECK-NEXT: [[X1:%.*]] = load i64, ptr [[P1]], align 8
; CHECK-NEXT: br i1 [[C]], label %[[FOO:.*]], label %[[BAR:.*]]
; CHECK: [[FOO]]:
-; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> [[TMP0]], splat (i64 1)
+; CHECK-NEXT: [[Y0:%.*]] = add i64 [[X0]], 1
+; CHECK-NEXT: [[Y1:%.*]] = add i64 [[X1]], 1
; CHECK-NEXT: br label %[[BAZ:.*]]
; CHECK: [[BAR]]:
; CHECK-NEXT: call void @g()
@@ -61,8 +68,11 @@ define void @f1(i1 %c, ptr %p, ptr %q, ptr %r) {
; CHECK-NEXT: call void @g()
; CHECK-NEXT: br label %[[BAZ]]
; CHECK: [[BAZ]]:
-; CHECK-NEXT: [[TMP2:%.*]] = phi <2 x i64> [ [[TMP1]], %[[FOO]] ], [ [[TMP0]], %[[BAR]] ]
-; CHECK-NEXT: store <2 x i64> [[TMP2]], ptr [[Q]], align 8
+; CHECK-NEXT: [[PHI0:%.*]] = phi i64 [ [[Y0]], %[[FOO]] ], [ [[X0]], %[[BAR]] ]
+; CHECK-NEXT: [[PHI1:%.*]] = phi i64 [ [[Y1]], %[[FOO]] ], [ [[X1]], %[[BAR]] ]
+; CHECK-NEXT: store i64 [[PHI0]], ptr [[Q]], align 8
+; CHECK-NEXT: [[Q1:%.*]] = getelementptr i64, ptr [[Q]], i64 1
+; CHECK-NEXT: store i64 [[PHI1]], ptr [[Q1]], align 8
; CHECK-NEXT: ret void
;
entry:
Current compile time regression on SPEC:
```cpp
    for (Instruction &I : reverse(*BB)) {
      for (const auto *TE : getTreeEntries(&I)) {
```
I'm afraid this may affect compile time, better to invent something else here
Do you have any specific suggestions? This does impact compile time (as noted in the review), but also is a pretty big win on at least one benchmark.
Maybe iterate over the tree and check if the main instruction has the required parent?
On this particular point, we can probably invert this loop. We can walk the vectorization tree, and then from the last instruction in the bundle, traverse the CFG back to the next defining instruction in the vectorization tree.
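Roughly, that suggestion reads as something like the following (a hypothetical sketch only; `vectorizedOperands` and `countCallsBetween` are placeholder helpers, not existing code):

```cpp
// Hypothetical sketch of the suggested inversion: iterate the vectorization
// tree, and for each bundle walk the CFG backwards from its last instruction
// to the bundles that define its operands, counting calls along the way.
for (const std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
  if (TEPtr->isGather())
    continue;
  const TreeEntry *TE = TEPtr.get();
  Instruction &UseInst = getLastInstructionInBundle(TE);
  for (const TreeEntry *Op : vectorizedOperands(TE)) {         // placeholder
    Instruction &DefInst = getLastInstructionInBundle(Op);
    // Placeholder for a backwards CFG scan from UseInst that stops once it
    // reaches DefInst, counting non-intrinsic, non-vectorized calls.
    unsigned NumCalls = countCallsBetween(&DefInst, &UseInst); // placeholder
    if (NumCalls)
      Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(
                             {getWidenedType(Op->getMainOp()->getType(),
                                             Op->getVectorFactor())});
  }
}
```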
I think you are describing our first approach, but there was a limitation on the new test we added (https://github.com/mikhailramalho/llvm-project/blob/a6300f77974b9a706eec966af1f1835173251623/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll#L53):

```
   entry
   /   \
 foo    bar*
   \   /
    baz

 * only bar has function calls
```

In this case, we get three entries that we can traverse backward:

(1) baz -> baz
(2) baz -> foo
(3) foo -> entry

None go through bar, so we don't add the calls in that block to the final cost...

That's when we decided to go with collecting all reachable blocks instead, so we wouldn't miss these calls.
Is `getTreeEntries` not a map lookup? I thought it would have been relatively cheap.
Yes, it's cheap, but the basic blocks might be very large and contain several thousand instructions, so iterating over the instructions in the basic block may hurt the perf.
```cpp
        assert(!Op->isGather() && "Expected vectorized operand.");
        LiveEntries.insert(Op);
      }
  DenseSet<const BasicBlock *> Reachable;
```
Algorithmically, this doesn't look quite right. You're computing the set of blocks reachable from all defs to the root. What you actually want to have explored is the set of blocks reachable between each def and its respective set of uses. Your set is the union of all of the smaller sets, but we want to respect liveness through the graph.

I think you're going to need to precompute the number of live values per block (and transition points within blocks) given the def/use information. Once you have that, I'd suggest walking the CFG in the existing loop below (i.e. change the early-exit-at-end-of-block case), and cache (don't precompute) which blocks have been scanned so far. In particular, cache the number of calls per block, which is the primary thing you need at that point.
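As an illustration of the caching part of this, something like the following could work (a hypothetical sketch, not part of the patch; `NoCallIntrinsic` and `isVectorized` are the existing helpers already used in the current loop):

```cpp
// Count the spill-relevant calls in a block once, on first visit, and reuse
// the cached count whenever the backwards walk crosses that block again.
DenseMap<const BasicBlock *, unsigned> CallsPerBlock;

auto getNumCallsInBlock = [&](const BasicBlock *BB) -> unsigned {
  auto It = CallsPerBlock.find(BB);
  if (It != CallsPerBlock.end())
    return It->second;
  unsigned NumCalls = 0;
  for (const Instruction &I : *BB)
    if (const auto *CB = dyn_cast<CallBase>(&I);
        CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
      ++NumCalls;
  return CallsPerBlock[BB] = NumCalls;
};
```

This only pays off for blocks the walk crosses in their entirety; the partial prefix of the block holding the use (and the suffix of the def's block) would still need a per-instruction scan.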
@alexey-bataev is there some invariant on the graph which simplifies this in any way? Do we statically know any properties of the CFG or vector tree which allow us to avoid full general liveness here?
Only that the instructions in vector nodes do not have deps of any kind between them
> Algorithmically, this doesn't look quite right. You're computing the set of blocks reachable from all defs to the root. What you actually want to have explored is the set of blocks reachable between each def and its respective set of uses. Your set is the union of all of the smaller sets, but we want to respect liveness through the graph.

Computing the set of reachable blocks up front is a compile-time optimisation to cull the number of blocks visited. The actual exploration is a post-order traversal starting from the root, so all uses come before defs (except for phis).

> I think you're going to need to precompute the number of live values per block (and transition points within blocks) given the def/use information.

I'm not sure I'm 100% following this, could you elaborate on it a bit more? To do this, do we not need to walk the CFG starting from the root anyway?

Taking a complete stab in the dark here, but my gut feeling is that most of the compile-time slowdown comes from SLP trees that are inside massive loops where everything is reachable. I wonder if we can have a heuristic to ignore the backedges when deciding what to traverse?
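For reference, one way such a heuristic could look (an untested sketch; it assumes using FindFunctionBackedges from llvm/Analysis/CFG.h is acceptable here, and it would replace the inverse_depth_first computation of Reachable):

```cpp
// Untested sketch: compute the function's backedges once, then walk the CFG
// backwards from the root block while ignoring backedges, so a tree sitting
// inside a big loop does not pull the entire loop body into Reachable.
SmallVector<std::pair<const BasicBlock *, const BasicBlock *>, 8> Backedges;
FindFunctionBackedges(*RootBB->getParent(), Backedges);
DenseSet<std::pair<const BasicBlock *, const BasicBlock *>> BackedgeSet(
    Backedges.begin(), Backedges.end());

DenseSet<const BasicBlock *> Reachable;
SmallVector<const BasicBlock *, 16> Worklist{RootBB};
while (!Worklist.empty()) {
  const BasicBlock *BB = Worklist.pop_back_val();
  if (!Reachable.insert(BB).second)
    continue;
  for (const BasicBlock *Pred : predecessors(BB))
    if (!BackedgeSet.contains({Pred, BB})) // only follow non-backedge edges
      Worklist.push_back(Pred);
}
```

This would underestimate the spill cost of calls inside the loop body, so it is purely a compile-time/accuracy trade-off.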
```cpp
        InstIt = ++getLastInstructionInBundle(TE).getIterator().getReverse(),
        PrevInstIt = PrevInst->getIterator().getReverse();
    while (InstIt != PrevInstIt) {
      if (PrevInstIt == PrevInst->getParent()->rend()) {
```
If I'm reading the old code right, we only scanned the prefix of the block containing the using bundle, and then ignored any calls between the definition and the end of its block, plus any in intervening blocks?

If so, a couple of questions/suggestions.

- What happens if you add a huge cost for any case not contained in the same basic block? (i.e. strongly disincentivize long cross-block cases; a rough sketch follows after this list)
- What happens if you handle only "easy" control flow? Things like a series of if-else or if-clause, with some small finite depth?
- Is there some invariant which ensures none of these nodes can be in loops? If not, do we need to scale by trip count in some way?
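A rough sketch of the first option, expressed in terms of the existing loop's Prev/TE variables (the penalty value is an arbitrary placeholder, not something from the patch):

```cpp
// Keep the cheap same-block backwards scan, and simply pessimize any pair of
// bundles that live in different blocks instead of modelling the calls in
// between exactly.
const InstructionCost CrossBlockSpillPenalty = 100; // placeholder magnitude
Instruction &PrevInst = getLastInstructionInBundle(Prev);
Instruction &CurInst = getLastInstructionInBundle(TE);
if (PrevInst.getParent() != CurInst.getParent()) {
  // Strongly disincentivize keeping values live across block boundaries.
  Cost += CrossBlockSpillPenalty;
} else {
  // ... existing same-block backwards scan counting calls ...
}
```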