Skip to content

[CodeLayout] cache-directed sort: limit max chain size #69039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 22, 2023

Conversation

MaskRay
Copy link
Member

@MaskRay MaskRay commented Oct 14, 2023

When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).

   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown

Add cl option cdsort-max-chain-size, which is similar to
ext-tsp-max-chain-size, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
builds, the "Total Sort sections" time is measured as follows:

  • -mllvm -cdsort-max-chain-size=64: 1.321813
  • -mllvm -cdsort-max-chain-size=128: 2.030425
  • -mllvm -cdsort-max-chain-size=256: 2.927684
  • -mllvm -cdsort-max-chain-size=512: 5.493106
  • unlimited: 9 minutes

The rest part takes 6.8s.

@llvmbot
Copy link
Member

llvmbot commented Oct 14, 2023

@llvm/pr-subscribers-llvm-transforms

Author: Fangrui Song (MaskRay)

Changes

When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).

   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown

Add cl option cds-max-chain-size, which is similar to
ext-tsp-max-chain-size, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
builds, the "Total Sort sections" time is measured as follows:

  • -mllvm -cds-max-chain-size=64: 1.321813
  • -mllvm -cds-max-chain-size=128: 2.030425
  • -mllvm -cds-max-chain-size=256: 2.927684
  • -mllvm -cds-max-chain-size=512: 5.493106
  • unlimited: 9 minutes

The rest part takes 6.8s.


Full diff: https://github.com/llvm/llvm-project/pull/69039.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/CodeLayout.cpp (+9)
  • (modified) llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp (+13)
diff --git a/llvm/lib/Transforms/Utils/CodeLayout.cpp b/llvm/lib/Transforms/Utils/CodeLayout.cpp
index d9e302d8b4fa54d..5411cf7a6c7f6f8 100644
--- a/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ b/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -62,6 +62,12 @@ cl::opt<bool> ApplyExtTspWithoutProfile(
     "ext-tsp-apply-without-profile",
     cl::desc("Whether to apply ext-tsp placement for instances w/o profile"),
     cl::init(true), cl::Hidden);
+
+namespace codelayout {
+cl::opt<unsigned>
+    CDMaxChainSize("cds-max-chain-size", cl::Hidden, cl::init(128),
+                   cl::desc("The maximum size of a chain to create"));
+}
 } // namespace llvm
 
 // Algorithm-specific params for Ext-TSP. The values are tuned for the best
@@ -1158,6 +1164,9 @@ class CDSortImpl {
         // Ignore loop edges.
         if (Edge->isSelfEdge())
           continue;
+        if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
+            CDMaxChainSize)
+          continue;
 
         // Compute the gain of merging the two chains.
         MergeGainT Gain = getBestMergeGain(Edge);
diff --git a/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp b/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
index ce42f703229bd01..f2ac74df9afa35b 100644
--- a/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
+++ b/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
@@ -1,4 +1,5 @@
 #include "llvm/Transforms/Utils/CodeLayout.h"
+#include "llvm/Support/CommandLine.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
 #include <vector>
@@ -7,6 +8,10 @@ using namespace llvm;
 using namespace llvm::codelayout;
 using testing::ElementsAreArray;
 
+namespace llvm::codelayout {
+extern cl::opt<unsigned> CDMaxChainSize;
+}
+
 namespace {
 TEST(CodeLayout, ThreeFunctions) {
   // Place the most likely successor (2) first.
@@ -40,6 +45,14 @@ TEST(CodeLayout, HotChain) {
     const std::vector<uint64_t> CallOffsets(std::size(Edges), 5);
     auto Order = computeCacheDirectedLayout(Sizes, Counts, Edges, CallOffsets);
     EXPECT_THAT(Order, ElementsAreArray({0, 3, 4, 2, 1}));
+
+    // -cds-max-chain-size disables forming a larger chain and therefore may
+    // change the result.
+    unsigned Saved = CDMaxChainSize;
+    CDMaxChainSize.setValue(3);
+    Order = computeCacheDirectedLayout(Sizes, Counts, Edges, CallOffsets);
+    EXPECT_THAT(Order, ElementsAreArray({0, 3, 4, 1, 2}));
+    CDMaxChainSize.setValue(Saved);
   }
 }
 

@rlavaee
Copy link
Contributor

rlavaee commented Oct 16, 2023

Can you share perf. improvement numbers to compare with results in #68638?

BTW, I have a local patch to improve performance for larger chains (based on my speculation https://reviews.llvm.org/D152834#inline-1500317). If this can wait for a while, I will try to send a PR in one week.

@MaskRay
Copy link
Member Author

MaskRay commented Oct 16, 2023

Can you share perf. improvement numbers to compare with results in #68638?

BTW, I have a local patch to improve performance for larger chains (based on my speculation reviews.llvm.org/D152834#inline-1500317). If this can wait for a while, I will try to send a PR in one week.

Personally I suspect that changing the max-chain-size limit has a very limited impact. Is there a script so that everybody can verify https://reviews.llvm.org/D152834#inline-1500317 and check performance improvement? I am only able to benchmark Clang if someone shares with me input data. (For instance, I can build llvm-project or Linux kernel with a new version of Clang).

AIUI llvm::erase_value(HotChains, From); is just to support LLVM_DEBUG(... HotChains.size()). If I comment it out:

--- i/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ w/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -1305,3 +1305,3 @@ private:
     // Remove the chain from the list of active chains.
-    llvm::erase_value(HotChains, From);
+    //llvm::erase_value(HotChains, From);
   }
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 2> 0
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  544.51s user 4.70s system 102% cpu 8:54.62 total

Without any change, --call-graph-profile-sort=cdsort takes about 9 minutes. Therefore, llvm::erase_value(HotChains, From); is problematic but is not a bottleneck.

@rlavaee
Copy link
Contributor

rlavaee commented Oct 16, 2023

The bottleneck is the edges running between the different chains. As the chains get larger, they tend to have more chain edges with other chains. Currently std::vector<std::pair<ChainT*, ChainEdge *>> ChainT:Edges performs well if the number of chain edges is relatively small. Another simple idea is to store them in a DenseMap.

@spupyrev
Copy link
Contributor

spupyrev commented Oct 16, 2023

Thanks a lot for speeding this up!

Can we apply the density-based condition, as in #68617? Besides making the code faster, it often helps for quality too. I'd keep the max-chain-size to be 512, if possible (assuming in conjunction with the density-based check the runtime is low).

@rlavaee
Copy link
Contributor

rlavaee commented Oct 17, 2023

Thanks for experimenting with my PR. I have another optimization too that I will try myself. It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?)

@spupyrev
Copy link
Contributor

It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?)

I'll try to measure perf impact over the next couple of days and report here

@MaskRay
Copy link
Member Author

MaskRay commented Oct 17, 2023

Thank you for the responses!

Can we apply the density-based condition, as in #68617? Besides making the code faster, it often helps for quality too. I'd keep the max-chain-size to be 512, if possible (assuming in conjunction with the density-based check the runtime is low).

I am curious whether cdsort-max-chain-size=128 or 512 has a noticeable performance regression for your applications.
I am nervous about 512 as it is quite slow to justify making the default...

I am happy to bootstrap Clang and check Clang compilation performance if you give me a script on accessable workloads (building llvm-project or linux kernel, for example).

% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=hfsort --threads=8
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=hfsort  14.86s user 3.84s system 501% cpu 3.732 total
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.71s user 3.94s system 360% cpu 5.735 total
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=512
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  20.15s user 3.95s system 257% cpu 9.347 total

Adding something like MaxMergeDensityRatio from #68617 has very little effect. If I set the pruning threshold to 1.5 (far more effective than 100; a value as small as 1.5 should not be acceptable for optimizing):

--- i/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ w/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -1167,6 +1167,9 @@ private:
         if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
             CDMaxChainSize)
           continue;
+        auto [mn, mx] = std::minmax(Edge->srcChain()->numBlocks(), Edge->dstChain()->numBlocks());
+        if (mx/mn > 1.5)
+          continue;

         // Compute the gain of merging the two chains.
         MergeGainT Gain = getBestMergeGain(Edge);

I'll get this (nearly no effect):

% repeat 2 time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128 # if (mx/mn > 1.5) continue
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.83s user 3.91s system 355% cpu 5.843 total
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.77s user 4.01s system 356% cpu 5.826 total

If I switch to density:

--- i/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ w/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -1167,6 +1167,9 @@ private:
         if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
             CDMaxChainSize)
           continue;
+        auto [mn, mx] = std::minmax(Edge->srcChain()->density(), Edge->dstChain()->density());
+        if (mx/mn > 2)
+          continue;

         // Compute the gain of merging the two chains.
         MergeGainT Gain = getBestMergeGain(Edge);
% repeat 2 time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128 # density if (mx/mn > 2) continue
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.96s user 3.99s system 353% cpu 5.934 total
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.86s user 4.03s system 354% cpu 5.892 total

Thanks for experimenting with my PR. I have another optimization too that I will try myself. It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?)

Without applying this cdsort-max-chain-size=128 patch, using DenseMap instead of std::vector for ChainEdges (https://github.com/rlavaee/llvm-project/tree/improve-cdsort) makes the link like 10+ seconds faster out of the total link time (9 min)... (less effective than I thought)

@MaskRay MaskRay force-pushed the codelayout/max-chain-size branch from a3c979a to 9403e71 Compare October 17, 2023 00:38

namespace codelayout {
cl::opt<unsigned>
CDMaxChainSize("cdsort-max-chain-size", cl::Hidden, cl::init(128),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to keep and use the new option next to other cdsort options, e.g., CacheEntries, CacheSize? It would be also great to unify their namings

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cdsort option, different from ext-tsp. Perhaps place this after all the other ext-tsp options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a place to add the option

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I figured it out.

Other options are still named cds-*, but this new one is named cdsort-max-chain-size. I think it makes sense to reduce the number of abbreviations. As we have used cdsort and various case variants, it makes sense to avoid cds-*.

Perhaps land a separate commit to rename existing cds-* options to cdsort-*?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was exactly my motivation for the earlier comment on "unifying their namings"; sorry for such brevity

When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see llvm#68638).
```
   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown
```

Add cl option `cds-max-chain-size`, which is similar to
`ext-tsp-max-chain-size`, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
builds, the "Total Sort sections" time is measured as follows:

* -mllvm  -cds-max-chain-size=64: 1.321813
* -mllvm -cds-max-chain-size=128: 2.030425
* -mllvm -cds-max-chain-size=256: 2.927684
* -mllvm -cds-max-chain-size=512: 5.493106
* unlimited: 9 minutes

The rest part takes 6.8s.
@MaskRay MaskRay force-pushed the codelayout/max-chain-size branch from 9403e71 to 8a47e7c Compare October 19, 2023 20:42
@MaskRay MaskRay merged commit a244183 into llvm:main Oct 22, 2023
@MaskRay MaskRay deleted the codelayout/max-chain-size branch October 22, 2023 23:50
Guzhu-AMD pushed a commit to GPUOpen-Drivers/llvm-project that referenced this pull request Oct 26, 2023
Local branch amd-gfx 923f1fe Merged main:508a697acd18 into amd-gfx:57bd89831925
Remote branch main a244183 [CodeLayout] cache-directed sort: limit max chain size (llvm#69039)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants