-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[SROA] Use SetVector for PromotableAllocas #105809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SROA] Use SetVector for PromotableAllocas #105809
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-llvm-transforms Author: Bartłomiej Chmiel (b-chmiel) ChangesWhen compiling large SystemVerilog designs transpiled by https://github.com/verilator/verilator, The PromotableAllocas field is represented as a Assuming that PromotableAllocas contains only unique raw pointers to allocas, SmallPtrSet may be used for storing them. Note: I'm creating @chandlerc BenchmarksInternal benchmarkBase version timed out after 9 hours, improved version finished in 37 minutes. Minimal example benchmarkThis test mimics our internal benchmark; creates a lot of allocas considered in SROA. gen.cpp - generates a test file #include <fstream>
int main() {
constexpr int fields = 100000;
std::ofstream of{"out.cpp"};
of << "#include <random>\n";
of << " struct VlWide final {\n";
of << "\tstd::uint32_t m_storage[5];\n";
of << "};\n";
of << "int main() {\n";
of << "\tunsigned int rnd = rand();\n";
for (auto i = 0; i < fields; ++i)
of << "\tVlWide tmp_" << i << "{rnd};\n";
of << "\treturn 0;\n";
of << "}\n";
return 0;
} Makefile - compiles both generate script and generated file
Run with llvm-test-suiteCTMark
Full diff: https://github.com/llvm/llvm-project/pull/105809.diff 1 Files Affected:
diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index 26b62cb79cdedf..e13dfed5adb458 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -198,7 +198,7 @@ class SROA {
SmallSetVector<AllocaInst *, 16> PostPromotionWorklist;
/// A collection of alloca instructions we can directly promote.
- std::vector<AllocaInst *> PromotableAllocas;
+ SmallPtrSet<AllocaInst *, 16> PromotableAllocas;
/// A worklist of PHIs to speculate prior to promoting allocas.
///
@@ -4769,9 +4769,8 @@ bool SROA::presplitLoadsAndStores(AllocaInst &AI, AllocaSlices &AS) {
// Finally, don't try to promote any allocas that new require re-splitting.
// They have already been added to the worklist above.
- llvm::erase_if(PromotableAllocas, [&](AllocaInst *AI) {
- return ResplitPromotableAllocas.count(AI);
- });
+ for (auto *RPA : ResplitPromotableAllocas)
+ PromotableAllocas.erase(RPA);
return true;
}
@@ -4933,7 +4932,7 @@ AllocaInst *SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS,
}
if (PHIUsers.empty() && SelectUsers.empty()) {
// Promote the alloca.
- PromotableAllocas.push_back(NewAI);
+ PromotableAllocas.insert(NewAI);
} else {
// If we have either PHIs or Selects to speculate, add them to those
// worklists and re-queue the new alloca so that we promote in on the
@@ -5568,7 +5567,9 @@ bool SROA::promoteAllocas(Function &F) {
LLVM_DEBUG(dbgs() << "Not promoting allocas with mem2reg!\n");
} else {
LLVM_DEBUG(dbgs() << "Promoting allocas with mem2reg...\n");
- PromoteMemToReg(PromotableAllocas, DTU->getDomTree(), AC);
+ PromoteMemToReg(
+ std::vector(PromotableAllocas.begin(), PromotableAllocas.end()),
+ DTU->getDomTree(), AC);
}
PromotableAllocas.clear();
@@ -5585,7 +5586,7 @@ std::pair<bool /*Changed*/, bool /*CFGChanged*/> SROA::runSROA(Function &F) {
if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) {
if (DL.getTypeAllocSize(AI->getAllocatedType()).isScalable() &&
isAllocaPromotable(AI))
- PromotableAllocas.push_back(AI);
+ PromotableAllocas.insert(AI);
else
Worklist.insert(AI);
}
@@ -5609,10 +5610,10 @@ std::pair<bool /*Changed*/, bool /*CFGChanged*/> SROA::runSROA(Function &F) {
// Remove the deleted allocas from various lists so that we don't try to
// continue processing them.
if (!DeletedAllocas.empty()) {
- auto IsInSet = [&](AllocaInst *AI) { return DeletedAllocas.count(AI); };
- Worklist.remove_if(IsInSet);
- PostPromotionWorklist.remove_if(IsInSet);
- llvm::erase_if(PromotableAllocas, IsInSet);
+ Worklist.set_subtract(DeletedAllocas);
+ PostPromotionWorklist.set_subtract(DeletedAllocas);
+ for (auto *DA : DeletedAllocas)
+ PromotableAllocas.erase(DA);
DeletedAllocas.clear();
}
}
|
83e73d7
to
eddc4db
Compare
Rebased to make CI pass. |
Pretty sure this is going to cause non-determinism, e.g. because mem2reg will insert phi nodes in different order. You'd need a set container with stable order like SetVector -- though I think that one doesn't actually have efficient erase. |
Maybe |
Thanks for pointing it out. I'd tried previously SmallSetVector, but it wasn't efficient enough.
Thanks, I'll try this data structure out and promptly report back with results. |
@nikic I'd tested several possibilities (including Microbenchmarks (100 runs):
|
@nikic Ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Does this version fully address the SROA scalability issue for you, or only partially? I'd expect this version to be fast when removing elements that are not in the vector, but still equally slow if they are in the vector, as this requires a linear scan. Actually, that case may be even slower, because it will now do a linear scan per removed element, rather than one to remove all matching elements. Is removing elements not in the vector the common case? |
Optimize SROA pass for large number of allocas by speeding-up PromotableAllocas erase operation. The optimization involves using SmallPtrSet which proves to be efficient since PromotableAllocas is used only for manipulating unique pointers. Signed-off-by: Bartłomiej Chmiel <[email protected]>
Signed-off-by: Bartłomiej Chmiel <[email protected]>
e2d5b9f
to
b0e3b7e
Compare
Thanks, this fully addresses my issue with SROA scalability. In my case the number of elements is so large (~0.5 billion elements) that any operation on Still, In the common case, when elements fit in a vector, the duration of this phase is negligible and there is no noticeable difference between |
@b-chmiel Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again. If you don't get any reports, no action is required from you. Your changes are working as expected, well done! |
When compiling large SystemVerilog designs transpiled by https://github.com/verilator/verilator,
clang
compilation hangs during SROA phase.The PromotableAllocas field is represented as a
std::vector
. In our case this number is close to 500 000 000 which makes random search-delete onstd::vector
inefficient.Assuming that PromotableAllocas contains only unique raw pointers to allocas, SmallPtrSet may be used for storing them.
Note: I'm creating
std::vector
from SmallPtrSet inSROA::promoteAllocas
to match signature ofPromoteMem2Reg
. However, thePromoteMem2Reg
constructor makes yet another copy of this structure (using its begin/end iterators). Do You think there is a better way to optimize this? For example usingstd::move
and adjustingPromoteMem2Reg
interface?@chandlerc
Benchmarks
Internal benchmark
Base version timed out after 9 hours, improved version finished in 37 minutes.
Minimal example benchmark
This test mimics our internal benchmark; creates a lot of allocas considered in SROA.
In 10 test runs, the improved version was 6% faster than the base one.
gen.cpp - generates a test file
Makefile - compiles both generate script and generated file
Run with
make
.llvm-test-suite
CTMark
compile_time
results for base (commit b05c554) and improvedclang
versions of 100 runs: