[SROA] Use SetVector for PromotableAllocas #105809

b-chmiel · 2024-08-23T09:56:48Z

When compiling large SystemVerilog designs transpiled by https://github.com/verilator/verilator, clang compilation hangs during SROA phase.

The PromotableAllocas field is represented as a std::vector. In our case this number is close to 500 000 000 which makes random search-delete on std::vector inefficient.

Assuming that PromotableAllocas contains only unique raw pointers to allocas, SmallPtrSet may be used for storing them.

Note: I'm creating std::vector from SmallPtrSet in SROA::promoteAllocas to match signature of PromoteMem2Reg. However, the PromoteMem2Reg constructor makes yet another copy of this structure (using its begin/end iterators). Do You think there is a better way to optimize this? For example using std::move and adjusting PromoteMem2Reg interface?

@chandlerc

Benchmarks

Internal benchmark

Base version timed out after 9 hours, improved version finished in 37 minutes.

Minimal example benchmark

This test mimics our internal benchmark; creates a lot of allocas considered in SROA.
In 10 test runs, the improved version was 6% faster than the base one.

gen.cpp - generates a test file

#include <fstream>
int main() {
  constexpr int fields = 100000;
  std::ofstream of{"out.cpp"};

  of << "#include <random>\n";

  of << " struct VlWide final {\n";
  of << "\tstd::uint32_t m_storage[5];\n";
  of << "};\n";

  of << "int main() {\n";
  of << "\tunsigned int rnd = rand();\n";
  for (auto i = 0; i < fields; ++i)
    of << "\tVlWide tmp_" << i << "{rnd};\n";
  of << "\treturn 0;\n";
  of << "}\n";
  return 0;
}

Makefile - compiles both generate script and generated file

.PHONY = clean

out.o: out.cpp
    $(CXX) -c -O1 -emit-llvm -mllvm -stats -o $@ $<

out.cpp: gen.o
    ./gen.o

gen.o: gen.cpp
    $(CXX) -O3 -o $@ $<

clean:
    - rm *.o out.cpp

Run with make.

llvm-test-suite

CTMark compile_time results for base (commit b05c554) and improved clang versions of 100 runs:

Program                                       compile_time                                                                                                                                                                                                                            
                                               base         improved diff                                                                                                                                                                
tramp3d-v4/tramp3d-v4                           6.13         6.28    2.4%                                                                                                                                                               
sqlite3/sqlite3                                 1.21         1.24    2.1%                                                                                                                                                               
lencod/lencod                                   4.43         4.45    0.5%                                                                                                                                                               
SPASS/SPASS                                     5.84         5.87    0.4%                                                                                                                                                               
Bullet/bullet                                  27.64        27.58   -0.2%                                                                                                                                                               
consumer-typeset/consumer-typeset               4.37         4.34   -0.8%                                                                                                                                                               
7zip/7zip-benchmark                            71.99        71.43   -0.8%                                                                                                                                                               
ClamAV/clamscan                                 5.42         5.38   -0.8%                                                                                                                                                               
kimwitu++/kc                                   11.64        11.52   -1.0%                                                                                                                                                               
mafft/pairlocalalign                            2.31         2.27   -1.6%                                                                                                                                                                                          
Geomean difference                        0.0%                                                                                                                                                                     
compile_time                                                                                                                                                                                                                      
l/r           base   improved       diff                                                                                                                                                                                                
count  10.000000    10.000000  10.000000                                                                                                                                                                                                
mean   14.099270    14.034550  0.000112                                                                                                                                                                                                 
std    21.708482    21.533859  0.013314                                                                                                                                                                                                 
min    1.211500     1.236500  -0.016101                                                                                                                                                                                                 
25%    4.384250     4.364425  -0.007946                                                                                                                                                                                                 
50%    5.632550     5.621750  -0.004951                                                                                                                                                                                                 
75%    10.262475    10.211450  0.004309                                                                                                                                                                                                 
max    71.992200    71.425300  0.024213

github-actions · 2024-08-23T09:57:06Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2024-08-23T09:57:39Z

@llvm/pr-subscribers-llvm-transforms

Author: Bartłomiej Chmiel (b-chmiel)

Changes

When compiling large SystemVerilog designs transpiled by https://github.com/verilator/verilator, clang compilation hangs during SROA phase.

The PromotableAllocas field is represented as a std::vector. In our case this number is close to 500 000 000 which makes random search-delete on std::vector inefficient.

Assuming that PromotableAllocas contains only unique raw pointers to allocas, SmallPtrSet may be used for storing them.

Note: I'm creating std::vector from SmallPtrSet in SROA::promoteAllocas to match signature of PromoteMem2Reg. However, the PromoteMem2Reg constructor makes yet another copy of this structure (using its begin/end iterators). Do You think there is a better way to optimize this? For example using std::move and adjusting PromoteMem2Reg interface?

@chandlerc

Benchmarks

Internal benchmark

Base version timed out after 9 hours, improved version finished in 37 minutes.

Minimal example benchmark

This test mimics our internal benchmark; creates a lot of allocas considered in SROA.
In 10 test runs, the improved version was 6% faster than the base one.

gen.cpp - generates a test file

#include &lt;fstream&gt;
int main() {
  constexpr int fields = 100000;
  std::ofstream of{"out.cpp"};

  of &lt;&lt; "#include &lt;random&gt;\n";

  of &lt;&lt; " struct VlWide final {\n";
  of &lt;&lt; "\tstd::uint32_t m_storage[5];\n";
  of &lt;&lt; "};\n";

  of &lt;&lt; "int main() {\n";
  of &lt;&lt; "\tunsigned int rnd = rand();\n";
  for (auto i = 0; i &lt; fields; ++i)
    of &lt;&lt; "\tVlWide tmp_" &lt;&lt; i &lt;&lt; "{rnd};\n";
  of &lt;&lt; "\treturn 0;\n";
  of &lt;&lt; "}\n";
  return 0;
}

Makefile - compiles both generate script and generated file

.PHONY = clean

out.o: out.cpp
    $(CXX) -c -O1 -emit-llvm -mllvm -stats -o $@ $&lt;

out.cpp: gen.o
    ./gen.o

gen.o: gen.cpp
    $(CXX) -O3 -o $@ $&lt;

clean:
    - rm *.o out.cpp

Run with make.

llvm-test-suite

CTMark compile_time results for base (commit b05c554) and improved clang versions of 100 runs:

Program                                       compile_time                                                                                                                                                                                                                            
                                               base         improved diff                                                                                                                                                                
tramp3d-v4/tramp3d-v4                           6.13         6.28    2.4%                                                                                                                                                               
sqlite3/sqlite3                                 1.21         1.24    2.1%                                                                                                                                                               
lencod/lencod                                   4.43         4.45    0.5%                                                                                                                                                               
SPASS/SPASS                                     5.84         5.87    0.4%                                                                                                                                                               
Bullet/bullet                                  27.64        27.58   -0.2%                                                                                                                                                               
consumer-typeset/consumer-typeset               4.37         4.34   -0.8%                                                                                                                                                               
7zip/7zip-benchmark                            71.99        71.43   -0.8%                                                                                                                                                               
ClamAV/clamscan                                 5.42         5.38   -0.8%                                                                                                                                                               
kimwitu++/kc                                   11.64        11.52   -1.0%                                                                                                                                                               
mafft/pairlocalalign                            2.31         2.27   -1.6%                                                                                                                                                                                          
Geomean difference                        0.0%                                                                                                                                                                     
compile_time                                                                                                                                                                                                                      
l/r           base   improved       diff                                                                                                                                                                                                
count  10.000000    10.000000  10.000000                                                                                                                                                                                                
mean   14.099270    14.034550  0.000112                                                                                                                                                                                                 
std    21.708482    21.533859  0.013314                                                                                                                                                                                                 
min    1.211500     1.236500  -0.016101                                                                                                                                                                                                 
25%    4.384250     4.364425  -0.007946                                                                                                                                                                                                 
50%    5.632550     5.621750  -0.004951                                                                                                                                                                                                 
75%    10.262475    10.211450  0.004309                                                                                                                                                                                                 
max    71.992200    71.425300  0.024213

Full diff: https://github.com/llvm/llvm-project/pull/105809.diff

1 Files Affected:

(modified) llvm/lib/Transforms/Scalar/SROA.cpp (+12-11)

diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index 26b62cb79cdedf..e13dfed5adb458 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -198,7 +198,7 @@ class SROA {
   SmallSetVector<AllocaInst *, 16> PostPromotionWorklist;
 
   /// A collection of alloca instructions we can directly promote.
-  std::vector<AllocaInst *> PromotableAllocas;
+  SmallPtrSet<AllocaInst *, 16> PromotableAllocas;
 
   /// A worklist of PHIs to speculate prior to promoting allocas.
   ///
@@ -4769,9 +4769,8 @@ bool SROA::presplitLoadsAndStores(AllocaInst &AI, AllocaSlices &AS) {
 
   // Finally, don't try to promote any allocas that new require re-splitting.
   // They have already been added to the worklist above.
-  llvm::erase_if(PromotableAllocas, [&](AllocaInst *AI) {
-    return ResplitPromotableAllocas.count(AI);
-  });
+  for (auto *RPA : ResplitPromotableAllocas)
+    PromotableAllocas.erase(RPA);
 
   return true;
 }
@@ -4933,7 +4932,7 @@ AllocaInst *SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS,
     }
     if (PHIUsers.empty() && SelectUsers.empty()) {
       // Promote the alloca.
-      PromotableAllocas.push_back(NewAI);
+      PromotableAllocas.insert(NewAI);
     } else {
       // If we have either PHIs or Selects to speculate, add them to those
       // worklists and re-queue the new alloca so that we promote in on the
@@ -5568,7 +5567,9 @@ bool SROA::promoteAllocas(Function &F) {
     LLVM_DEBUG(dbgs() << "Not promoting allocas with mem2reg!\n");
   } else {
     LLVM_DEBUG(dbgs() << "Promoting allocas with mem2reg...\n");
-    PromoteMemToReg(PromotableAllocas, DTU->getDomTree(), AC);
+    PromoteMemToReg(
+        std::vector(PromotableAllocas.begin(), PromotableAllocas.end()),
+        DTU->getDomTree(), AC);
   }
 
   PromotableAllocas.clear();
@@ -5585,7 +5586,7 @@ std::pair<bool /*Changed*/, bool /*CFGChanged*/> SROA::runSROA(Function &F) {
     if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) {
       if (DL.getTypeAllocSize(AI->getAllocatedType()).isScalable() &&
           isAllocaPromotable(AI))
-        PromotableAllocas.push_back(AI);
+        PromotableAllocas.insert(AI);
       else
         Worklist.insert(AI);
     }
@@ -5609,10 +5610,10 @@ std::pair<bool /*Changed*/, bool /*CFGChanged*/> SROA::runSROA(Function &F) {
       // Remove the deleted allocas from various lists so that we don't try to
       // continue processing them.
       if (!DeletedAllocas.empty()) {
-        auto IsInSet = [&](AllocaInst *AI) { return DeletedAllocas.count(AI); };
-        Worklist.remove_if(IsInSet);
-        PostPromotionWorklist.remove_if(IsInSet);
-        llvm::erase_if(PromotableAllocas, IsInSet);
+        Worklist.set_subtract(DeletedAllocas);
+        PostPromotionWorklist.set_subtract(DeletedAllocas);
+        for (auto *DA : DeletedAllocas)
+          PromotableAllocas.erase(DA);
         DeletedAllocas.clear();
       }
     }

b-chmiel · 2024-08-23T14:42:11Z

Rebased to make CI pass.

nikic · 2024-08-23T15:48:17Z

Pretty sure this is going to cause non-determinism, e.g. because mem2reg will insert phi nodes in different order. You'd need a set container with stable order like SetVector -- though I think that one doesn't actually have efficient erase.

s-barannikov · 2024-08-24T06:49:22Z

You'd need a set container with stable order like SetVector -- though I think that one doesn't actually have efficient erase.

Maybe SetVector<T, std::deque<T>> can be used?

b-chmiel · 2024-08-26T08:24:25Z

Pretty sure this is going to cause non-determinism, e.g. because mem2reg will insert phi nodes in different order. You'd need a set container with stable order like SetVector -- though I think that one doesn't actually have efficient erase.

Thanks for pointing it out. I'd tried previously SmallSetVector, but it wasn't efficient enough.

You'd need a set container with stable order like SetVector -- though I think that one doesn't actually have efficient erase.

Maybe SetVector<T, std::deque<T>> can be used?

Thanks, I'll try this data structure out and promptly report back with results.

b-chmiel · 2024-08-27T08:59:01Z

@nikic I'd tested several possibilities (including std::deque, SmallSetVector and derivatives) and SetVector<AllocaInst *, SmallVector<AllocaInst *>, SmallPtrSet<AllocaInst *, 16>, 16> brought the best results while still being a stable container.

Microbenchmarks (100 runs):

Program                                       compile_time
                                              base         improved diff
consumer-typeset/consumer-typeset               4.17         4.19    0.6%
lencod/lencod                                   4.20         4.22    0.4%
7zip/7zip-benchmark                            68.23        68.37    0.2%
sqlite3/sqlite3                                 1.17         1.17   -0.0%
kimwitu++/kc                                   10.87        10.83   -0.3%
ClamAV/clamscan                                 5.26         5.24   -0.3%
Bullet/bullet                                  26.45        26.33   -0.5%
SPASS/SPASS                                     5.69         5.64   -0.8%
mafft/pairlocalalign                            2.26         2.23   -1.2%
tramp3d-v4/tramp3d-v4                           6.29         6.20   -1.5%
                           Geomean difference                       -0.3%
      compile_time
l/r           base   improved       diff
count  10.000000    10.000000  10.000000
mean   13.458060    13.441530 -0.003470
std    20.556258    20.595136  0.006564
min    1.168200     1.168000  -0.014726
25%    4.175600     4.196700  -0.006884
50%    5.474550     5.443800  -0.003395
75%    9.721975     9.671825   0.001486
max    68.229400    68.368500  0.005544

b-chmiel · 2024-09-03T08:59:08Z

@nikic Ping.

nikic

LGTM

nikic · 2024-09-03T21:15:01Z

Does this version fully address the SROA scalability issue for you, or only partially?

I'd expect this version to be fast when removing elements that are not in the vector, but still equally slow if they are in the vector, as this requires a linear scan. Actually, that case may be even slower, because it will now do a linear scan per removed element, rather than one to remove all matching elements.

Is removing elements not in the vector the common case?

Optimize SROA pass for large number of allocas by speeding-up PromotableAllocas erase operation. The optimization involves using SmallPtrSet which proves to be efficient since PromotableAllocas is used only for manipulating unique pointers. Signed-off-by: Bartłomiej Chmiel <[email protected]>

Signed-off-by: Bartłomiej Chmiel <[email protected]>

b-chmiel · 2024-09-04T10:39:02Z

Thanks, this fully addresses my issue with SROA scalability. In my case the number of elements is so large (~0.5 billion elements) that any operation on PromotableAllocas is a bottleneck, so optimizing removal from O(n^2) to O(nlog2n) makes compilation feasible.

Still, PromoteMemToReg is a bottleneck and compilation consumes a lot of operational memory (>100G), but for now I'll focus on optimizing code generation on the Verilator side.

In the common case, when elements fit in a vector, the duration of this phase is negligible and there is no noticeable difference between PromotableAllocas representations. I'd set small factor to 16 (similarly to other structures there) to stop Set-operation constant factors from affecting performance in the typical case.

github-actions · 2024-09-04T12:11:10Z

@b-chmiel Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

llvmbot added the llvm:transforms label Aug 23, 2024

dtcxzyw requested a review from nikic August 23, 2024 13:19

b-chmiel force-pushed the sroa-large-number-of-allocas-optimization branch from 83e73d7 to eddc4db Compare August 23, 2024 14:33

b-chmiel changed the title ~~[SROA] Use SmallPtrSet for PromotableAllocas~~ [SROA] Use SetVector for PromotableAllocas Aug 27, 2024

nikic approved these changes Sep 3, 2024

View reviewed changes

b-chmiel added 2 commits September 4, 2024 10:17

[SROA] Use SetVector for PromotableAllocas

b0e3b7e

Signed-off-by: Bartłomiej Chmiel <[email protected]>

b-chmiel force-pushed the sroa-large-number-of-allocas-optimization branch from e2d5b9f to b0e3b7e Compare September 4, 2024 08:17

nikic merged commit c2b92a4 into llvm:main Sep 4, 2024
8 checks passed

mgielda deleted the sroa-large-number-of-allocas-optimization branch November 18, 2024 17:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SROA] Use SetVector for PromotableAllocas #105809

[SROA] Use SetVector for PromotableAllocas #105809

Uh oh!

b-chmiel commented Aug 23, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Aug 23, 2024

Uh oh!

llvmbot commented Aug 23, 2024

Benchmarks

Internal benchmark

Minimal example benchmark

llvm-test-suite

Uh oh!

b-chmiel commented Aug 23, 2024

Uh oh!

nikic commented Aug 23, 2024

Uh oh!

s-barannikov commented Aug 24, 2024

Uh oh!

b-chmiel commented Aug 26, 2024

Uh oh!

b-chmiel commented Aug 27, 2024

Uh oh!

b-chmiel commented Sep 3, 2024

Uh oh!

nikic left a comment

Uh oh!

nikic commented Sep 3, 2024

Uh oh!

b-chmiel commented Sep 4, 2024

Uh oh!

Uh oh!

github-actions bot commented Sep 4, 2024

Uh oh!

Uh oh!

[SROA] Use SetVector for PromotableAllocas #105809

[SROA] Use SetVector for PromotableAllocas #105809

Uh oh!

Conversation

b-chmiel commented Aug 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

Internal benchmark

Minimal example benchmark

llvm-test-suite

Uh oh!

github-actions bot commented Aug 23, 2024

Uh oh!

llvmbot commented Aug 23, 2024

Benchmarks

Internal benchmark

Minimal example benchmark

llvm-test-suite

Uh oh!

b-chmiel commented Aug 23, 2024

Uh oh!

nikic commented Aug 23, 2024

Uh oh!

s-barannikov commented Aug 24, 2024

Uh oh!

b-chmiel commented Aug 26, 2024

Uh oh!

b-chmiel commented Aug 27, 2024

Uh oh!

b-chmiel commented Sep 3, 2024

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

nikic commented Sep 3, 2024

Uh oh!

b-chmiel commented Sep 4, 2024

Uh oh!

Uh oh!

github-actions bot commented Sep 4, 2024

Uh oh!

Uh oh!

b-chmiel commented Aug 23, 2024 •

edited

Loading