[SLP] Improve block traversal in getSpillCost() #128620
@@ -12415,68 +12415,48 @@ InstructionCost BoUpSLP::getSpillCost() {
   InstructionCost Cost = 0;

   SmallPtrSet<const TreeEntry *, 4> LiveEntries;
-  const TreeEntry *Prev = nullptr;
-
-  // The entries in VectorizableTree are not necessarily ordered by their
-  // position in basic blocks. Collect them and order them by dominance so later
-  // instructions are guaranteed to be visited first. For instructions in
-  // different basic blocks, we only scan to the beginning of the block, so
-  // their order does not matter, as long as all instructions in a basic block
-  // are grouped together. Using dominance ensures a deterministic order.
-  SmallVector<TreeEntry *, 16> OrderedEntries;
-  for (const auto &TEPtr : VectorizableTree) {
-    if (TEPtr->isGather())
-      continue;
-    OrderedEntries.push_back(TEPtr.get());
-  }
-  llvm::stable_sort(OrderedEntries, [&](const TreeEntry *TA,
-                                        const TreeEntry *TB) {
-    Instruction &A = getLastInstructionInBundle(TA);
-    Instruction &B = getLastInstructionInBundle(TB);
-    auto *NodeA = DT->getNode(A.getParent());
-    auto *NodeB = DT->getNode(B.getParent());
-    assert(NodeA && "Should only process reachable instructions");
-    assert(NodeB && "Should only process reachable instructions");
-    assert((NodeA == NodeB) == (NodeA->getDFSNumIn() == NodeB->getDFSNumIn()) &&
-           "Different nodes should have different DFS numbers");
-    if (NodeA != NodeB)
-      return NodeA->getDFSNumIn() > NodeB->getDFSNumIn();
-    return B.comesBefore(&A);
-  });
-
-  for (const TreeEntry *TE : OrderedEntries) {
-    if (!Prev) {
-      Prev = TE;
-      continue;
-    }
-
-    LiveEntries.erase(Prev);
-    for (unsigned I : seq<unsigned>(Prev->getNumOperands())) {
-      const TreeEntry *Op = getVectorizedOperand(Prev, I);
-      if (!Op)
-        continue;
-      assert(!Op->isGather() && "Expected vectorized operand.");
-      LiveEntries.insert(Op);
-    }
-
-    LLVM_DEBUG({
-      dbgs() << "SLP: #LV: " << LiveEntries.size();
-      for (auto *X : LiveEntries)
-        X->dump();
-      dbgs() << ", Looking at ";
-      TE->dump();
-    });
-
-    // Now find the sequence of instructions between PrevInst and Inst.
-    unsigned NumCalls = 0;
-    const Instruction *PrevInst = &getLastInstructionInBundle(Prev);
-    BasicBlock::const_reverse_iterator
-        InstIt = ++getLastInstructionInBundle(TE).getIterator().getReverse(),
-        PrevInstIt = PrevInst->getIterator().getReverse();
-    while (InstIt != PrevInstIt) {
-      if (PrevInstIt == PrevInst->getParent()->rend()) {
Review comment: If I'm reading the old code right, we only scanned the prefix of the block containing the using bundle, and then ignored any calls between the definition and the end of its block, plus any in intervening blocks? If so, a couple of questions/suggestions.
-        PrevInstIt = getLastInstructionInBundle(TE).getParent()->rbegin();
-        continue;
-      }
+  const TreeEntry *Root = VectorizableTree.front().get();
+  BasicBlock *RootBB = cast<Instruction>(Root->Scalars[0])->getParent();
+
+  // Compute what nodes are reachable from the leaves to the roots
+  df_iterator_default_set<const BasicBlock *> ReachableFromLeaves;
+  for (auto &TE : VectorizableTree) {
+    if (TE->isGather())
+      continue;
+    auto *BB = getLastInstructionInBundle(TE.get()).getParent();
+    for (const BasicBlock *X : depth_first_ext(BB, ReachableFromLeaves))
+      ReachableFromLeaves.insert(X);
+  }
+
+  DenseSet<const BasicBlock *> Reachable;
+  for (const BasicBlock *X : inverse_depth_first(RootBB))
+    Reachable.insert(X);
+  set_intersect(Reachable, ReachableFromLeaves);
+
+  DenseSet<const TreeEntry *> Defined;
+
+  // Iterate the tree from the root, post order so that all uses appear before
+  // definitions.
+  // TODO: LiveEntries are shared across all paths, so this may overestimate.
+  for (BasicBlock *BB : post_order(RootBB->getParent())) {
+    if (!Reachable.contains(BB))
+      continue;
+
+    for (Instruction &I : reverse(*BB)) {
+      for (const auto *TE : getTreeEntries(&I)) {
+        if (TE->isGather())
+          continue;
+        LiveEntries.erase(TE);
+        Defined.insert(TE);
+        for (unsigned Idx : seq<unsigned>(TE->getNumOperands())) {
+          const TreeEntry *Op = getVectorizedOperand(TE, Idx);
+          if (!Op)
+            continue;
+          assert(!Op->isGather() && "Expected vectorized operand.");
+          if (!Defined.contains(Op))
+            LiveEntries.insert(Op);
+        }
+      }
Review thread on lines +12446 to +12447 (the `for (Instruction &I : reverse(*BB))` / `for (const auto *TE : getTreeEntries(&I))` loops):

Review comment: I'm afraid this may affect compile time; better to invent something else here.

Reply: Do you have any specific suggestions? This does impact compile time (as noted in the review), but it is also a pretty big win on at least one benchmark.

Reply: Maybe iterate over the tree and check if the main instruction has the required parent?

Reply: On this particular point, we can probably invert this loop. We can walk the vectorization tree, and then from the last instruction in the bundle, traverse the CFG back to the next defining instruction in the vectorization tree.

Reply: I think you are describing our first approach, but there was a limitation on the new test we added (https://github.com/mikhailramalho/llvm-project/blob/a6300f77974b9a706eec966af1f1835173251623/llvm/test/Transforms/SLPVectorizer/RISCV/spillcost.ll#L53). In that case, we get three entries that we can traverse backward from, but none of those backward walks passes through the blocks containing the calls, so the calls would be missed. That's when we decided to go with collecting all reachable blocks instead, so we wouldn't miss these calls.

Reply: Is that cheap?

Reply: Yes, it's cheap, but the basic blocks might be very large and contain several thousand instructions, so iterating over the instructions in a basic block may hurt performance.
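The reachable-block culling described in the reply above can be illustrated in isolation. The following is a minimal standalone sketch, not the SLPVectorizer code: the CFG, the block names, and the choice of "entry" as the leaf block and "bb4" as the root block are all made up. It keeps only the blocks that are forward-reachable from a leaf bundle's block and that also reach the root's block, so a call on either side of a diamond is still scanned, while blocks off every def-to-root path are dropped.

```cpp
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

using Block = std::string;
using CFG = std::map<Block, std::vector<Block>>; // successor lists

// Iterative DFS collecting every block reachable from Start in graph G.
static void dfs(const CFG &G, const Block &Start, std::set<Block> &Seen) {
  std::vector<Block> Work{Start};
  while (!Work.empty()) {
    Block B = Work.back();
    Work.pop_back();
    if (!Seen.insert(B).second)
      continue;
    auto It = G.find(B);
    if (It != G.end())
      for (const Block &S : It->second)
        Work.push_back(S);
  }
}

int main() {
  // Hypothetical CFG: a diamond between the leaf defs and the root use,
  // plus one block ("dead") that reaches the root but not from any leaf.
  CFG Succ = {{"entry", {"bb1"}}, {"bb1", {"bb2", "bb3"}}, {"bb2", {"bb4"}},
              {"bb3", {"bb4"}},   {"bb4", {}},             {"dead", {"bb4"}}};
  CFG Pred; // reversed edges, for the walk back from the root
  for (auto &[B, Ss] : Succ)
    for (const Block &S : Ss)
      Pred[S].push_back(B);

  std::set<Block> FromLeaves, ToRoot;
  dfs(Succ, "entry", FromLeaves); // leaf bundle assumed to live in "entry"
  dfs(Pred, "bb4", ToRoot);       // root bundle assumed to live in "bb4"

  // Intersection = blocks that can lie on a def-to-root path. Both branches
  // of the diamond survive, so a call in either bb2 or bb3 gets counted,
  // while "dead" is culled before any instruction is visited.
  for (const Block &B : FromLeaves)
    if (ToRoot.count(B))
      std::printf("scan %s\n", B.c_str());
}
```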

       auto NoCallIntrinsic = [this](const Instruction *I) {
@@ -12497,26 +12477,24 @@ InstructionCost BoUpSLP::getSpillCost() {
       // Debug information does not impact spill cost.
       // Vectorized calls, represented as vector intrinsics, do not impact spill
       // cost.
-      if (const auto *CB = dyn_cast<CallBase>(&*PrevInstIt);
-          CB && !NoCallIntrinsic(CB) && !isVectorized(CB))
-        NumCalls++;
-
-      ++PrevInstIt;
-    }
-
-    if (NumCalls) {
-      SmallVector<Type *, 4> EntriesTypes;
-      for (const TreeEntry *TE : LiveEntries) {
-        auto *ScalarTy = TE->getMainOp()->getType();
-        auto It = MinBWs.find(TE);
-        if (It != MinBWs.end())
-          ScalarTy = IntegerType::get(ScalarTy->getContext(), It->second.first);
-        EntriesTypes.push_back(getWidenedType(ScalarTy, TE->getVectorFactor()));
-      }
-      Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
-    }
-
-    Prev = TE;
-  }
+      if (const auto *CB = dyn_cast<CallBase>(&I);
+          CB && !NoCallIntrinsic(CB) && !isVectorized(CB)) {
+        SmallVector<Type *, 4> EntriesTypes;
+        for (const TreeEntry *TE : LiveEntries) {
+          auto *ScalarTy = TE->getMainOp()->getType();
+          auto It = MinBWs.find(TE);
+          if (It != MinBWs.end())
+            ScalarTy =
+                IntegerType::get(ScalarTy->getContext(), It->second.first);
+          EntriesTypes.push_back(
+              getWidenedType(ScalarTy, TE->getVectorFactor()));
+        }
+        LLVM_DEBUG(dbgs() << "SLP: " << LiveEntries.size()
+                          << " entries alive over call:" << I << "\n");
+        Cost += TTI->getCostOfKeepingLiveOverCall(EntriesTypes);
+      }
+    }
+  }

   return Cost;
Review comment: Algorithmically, this doesn't look quite right. You're computing the set of blocks reachable from all defs to the root. What you actually want to have explored is the set of blocks reachable between each def and its respective set of uses. Your set is the union of all of the smaller sets, but we want to respect liveness through the graph.

I think you're going to need to precompute the number of live values per block (and transition points within blocks) given the def/use information. Once you have that, I'd suggest walking the CFG in the existing loop below (i.e. change the early exit at the end-of-block case), and cache, rather than precompute, which blocks have been scanned so far. In particular, cache the number of calls per block, which is the primary thing you need at that point.

@alexey-bataev, is there some invariant on the graph which simplifies this in any way? Do we statically know any properties of the CFG or vector tree which allow us to avoid full general liveness here?

Reply: Only that the instructions in vector nodes do not have deps of any kind between them.
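The per-block call-count caching suggested in the comment above can be sketched in isolation. This is a rough, hypothetical sketch, not the SLP code: the InstKind enum, the CallCountCache type, and the idea of modelling a block as a vector of instruction kinds are all invented here purely to show the memoisation pattern (each block's instructions are walked at most once, however many def/use pairs cross it).

```cpp
#include <cstdio>
#include <map>
#include <vector>

enum class InstKind { Plain, Call, DebugInfo };
using Block = std::vector<InstKind>;

struct CallCountCache {
  std::map<const Block *, unsigned> Counts;

  unsigned callsIn(const Block &BB) {
    auto It = Counts.find(&BB);
    if (It != Counts.end())
      return It->second; // already scanned: O(1) on every later query
    unsigned N = 0;
    for (InstKind K : BB)
      if (K == InstKind::Call) // debug info / vectorized intrinsics excluded
        ++N;
    return Counts[&BB] = N;
  }
};

int main() {
  Block BB1{InstKind::Plain, InstKind::Call, InstKind::DebugInfo};
  Block BB2{InstKind::Call, InstKind::Call};
  CallCountCache Cache;
  // Two different def/use pairs both crossing BB1 and BB2: the second round
  // of queries hits the cache instead of re-walking possibly huge blocks.
  unsigned Calls = Cache.callsIn(BB1) + Cache.callsIn(BB2);
  Calls += Cache.callsIn(BB1) + Cache.callsIn(BB2);
  std::printf("calls seen: %u\n", Calls);
}
```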
Reply: Computing the set of reachable blocks up front is a compile-time optimisation to cull the number of blocks visited. The actual exploration is a post-order traversal starting from the root, so all uses come before defs (except for phis).

I'm not sure I'm 100% following this; could you elaborate on it a bit more? To do this, do we not need to walk the CFG starting from the root anyway?

Taking a complete stab in the dark here, but my gut feeling is that most of the compile-time slowdown comes from SLP trees that are inside massive loops where everything is reachable. I wonder if we can have a heuristic to ignore the backedges when deciding what to traverse?
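To make the traversal described in the reply above concrete, here is a minimal standalone sketch of the bookkeeping, assuming toy types (ToyInst, ToyBlock) and a unit cost per live entry per call; it is not the SLPVectorizer implementation and does not model the real TTI cost. Blocks are assumed to already be arranged in post order from the root, and instructions are scanned in reverse, so a use makes its operands live before their defs are reached and kill them.

```cpp
#include <cstdio>
#include <set>
#include <string>
#include <vector>

struct ToyInst {
  bool IsCall = false;
  int Defines = -1;          // tree entry defined here, if any
  std::vector<int> Operands; // tree entries used by that definition
};
using ToyBlock = std::vector<ToyInst>;

int main() {
  // Hypothetical blocks already in post order. Entry 0 is the root; it uses
  // entries 1 and 2, which are defined earlier, with a call in between.
  std::vector<ToyBlock> PostOrder = {
      {{/*call*/ false, /*def*/ 0, {1, 2}}}, // root block
      {{/*call*/ true, -1, {}}},             // a block containing a call
      {{false, 1, {}}, {false, 2, {}}},      // defs of the root's operands
  };

  std::set<int> Live, Defined;
  unsigned Cost = 0;
  for (const ToyBlock &BB : PostOrder) {
    for (auto It = BB.rbegin(); It != BB.rend(); ++It) {
      if (It->Defines >= 0) {
        Live.erase(It->Defines);      // its value is produced here: no longer live
        Defined.insert(It->Defines);
        for (int Op : It->Operands)
          if (!Defined.count(Op))
            Live.insert(Op);          // operand becomes live above this use
      }
      if (It->IsCall)
        Cost += Live.size(); // stand-in for getCostOfKeepingLiveOverCall
    }
  }
  std::printf("toy spill cost: %u\n", Cost); // 2: entries 1 and 2 live over the call
}
```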