[CodeLayout] cache-directed sort: limit max chain size #69039

MaskRay · 2023-10-14T02:04:13Z

When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).

   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown

Add cl option cdsort-max-chain-size, which is similar to
ext-tsp-max-chain-size, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
builds, the "Total Sort sections" time is measured as follows:

-mllvm -cdsort-max-chain-size=64: 1.321813
-mllvm -cdsort-max-chain-size=128: 2.030425
-mllvm -cdsort-max-chain-size=256: 2.927684
-mllvm -cdsort-max-chain-size=512: 5.493106
unlimited: 9 minutes

The rest part takes 6.8s.

llvmbot · 2023-10-14T02:05:22Z

@llvm/pr-subscribers-llvm-transforms

Author: Fangrui Song (MaskRay)

Changes

When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).

   4.6%  20.7Mi    .text.hot
   3.5%  15.9Mi    .text
   3.4%  15.2Mi    .text.unknown

Add cl option cds-max-chain-size, which is similar to
ext-tsp-max-chain-size, and set it to 128, to improve performance.

In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
builds, the "Total Sort sections" time is measured as follows:

-mllvm -cds-max-chain-size=64: 1.321813
-mllvm -cds-max-chain-size=128: 2.030425
-mllvm -cds-max-chain-size=256: 2.927684
-mllvm -cds-max-chain-size=512: 5.493106
unlimited: 9 minutes

The rest part takes 6.8s.

Full diff: https://github.com/llvm/llvm-project/pull/69039.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Utils/CodeLayout.cpp (+9)
(modified) llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp (+13)

diff --git a/llvm/lib/Transforms/Utils/CodeLayout.cpp b/llvm/lib/Transforms/Utils/CodeLayout.cpp
index d9e302d8b4fa54d..5411cf7a6c7f6f8 100644
--- a/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ b/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -62,6 +62,12 @@ cl::opt<bool> ApplyExtTspWithoutProfile(
     "ext-tsp-apply-without-profile",
     cl::desc("Whether to apply ext-tsp placement for instances w/o profile"),
     cl::init(true), cl::Hidden);
+
+namespace codelayout {
+cl::opt<unsigned>
+    CDMaxChainSize("cds-max-chain-size", cl::Hidden, cl::init(128),
+                   cl::desc("The maximum size of a chain to create"));
+}
 } // namespace llvm
 
 // Algorithm-specific params for Ext-TSP. The values are tuned for the best
@@ -1158,6 +1164,9 @@ class CDSortImpl {
         // Ignore loop edges.
         if (Edge->isSelfEdge())
           continue;
+        if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
+            CDMaxChainSize)
+          continue;
 
         // Compute the gain of merging the two chains.
         MergeGainT Gain = getBestMergeGain(Edge);
diff --git a/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp b/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
index ce42f703229bd01..f2ac74df9afa35b 100644
--- a/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
+++ b/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
@@ -1,4 +1,5 @@
 #include "llvm/Transforms/Utils/CodeLayout.h"
+#include "llvm/Support/CommandLine.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
 #include <vector>
@@ -7,6 +8,10 @@ using namespace llvm;
 using namespace llvm::codelayout;
 using testing::ElementsAreArray;
 
+namespace llvm::codelayout {
+extern cl::opt<unsigned> CDMaxChainSize;
+}
+
 namespace {
 TEST(CodeLayout, ThreeFunctions) {
   // Place the most likely successor (2) first.
@@ -40,6 +45,14 @@ TEST(CodeLayout, HotChain) {
     const std::vector<uint64_t> CallOffsets(std::size(Edges), 5);
     auto Order = computeCacheDirectedLayout(Sizes, Counts, Edges, CallOffsets);
     EXPECT_THAT(Order, ElementsAreArray({0, 3, 4, 2, 1}));
+
+    // -cds-max-chain-size disables forming a larger chain and therefore may
+    // change the result.
+    unsigned Saved = CDMaxChainSize;
+    CDMaxChainSize.setValue(3);
+    Order = computeCacheDirectedLayout(Sizes, Counts, Edges, CallOffsets);
+    EXPECT_THAT(Order, ElementsAreArray({0, 3, 4, 1, 2}));
+    CDMaxChainSize.setValue(Saved);
   }
 }

rlavaee · 2023-10-16T15:08:30Z

Can you share perf. improvement numbers to compare with results in #68638?

BTW, I have a local patch to improve performance for larger chains (based on my speculation https://reviews.llvm.org/D152834#inline-1500317). If this can wait for a while, I will try to send a PR in one week.

MaskRay · 2023-10-16T21:43:43Z

Can you share perf. improvement numbers to compare with results in #68638?

BTW, I have a local patch to improve performance for larger chains (based on my speculation reviews.llvm.org/D152834#inline-1500317). If this can wait for a while, I will try to send a PR in one week.

Personally I suspect that changing the max-chain-size limit has a very limited impact. Is there a script so that everybody can verify https://reviews.llvm.org/D152834#inline-1500317 and check performance improvement? I am only able to benchmark Clang if someone shares with me input data. (For instance, I can build llvm-project or Linux kernel with a new version of Clang).

AIUI llvm::erase_value(HotChains, From); is just to support LLVM_DEBUG(... HotChains.size()). If I comment it out:

--- i/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ w/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -1305,3 +1305,3 @@ private:
     // Remove the chain from the list of active chains.
-    llvm::erase_value(HotChains, From);
+    //llvm::erase_value(HotChains, From);
   }

% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 2> 0
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  544.51s user 4.70s system 102% cpu 8:54.62 total

Without any change, --call-graph-profile-sort=cdsort takes about 9 minutes. Therefore, llvm::erase_value(HotChains, From); is problematic but is not a bottleneck.

rlavaee · 2023-10-16T23:20:37Z

The bottleneck is the edges running between the different chains. As the chains get larger, they tend to have more chain edges with other chains. Currently std::vector<std::pair<ChainT*, ChainEdge *>> ChainT:Edges performs well if the number of chain edges is relatively small. Another simple idea is to store them in a DenseMap.

llvm/lib/Transforms/Utils/CodeLayout.cpp

spupyrev · 2023-10-16T23:32:19Z

Thanks a lot for speeding this up!

Can we apply the density-based condition, as in #68617? Besides making the code faster, it often helps for quality too. I'd keep the max-chain-size to be 512, if possible (assuming in conjunction with the density-based check the runtime is low).

rlavaee · 2023-10-17T00:05:26Z

Thanks for experimenting with my PR. I have another optimization too that I will try myself. It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?)

spupyrev · 2023-10-17T00:24:10Z

It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?)

I'll try to measure perf impact over the next couple of days and report here

MaskRay · 2023-10-17T00:36:52Z

Thank you for the responses!

Can we apply the density-based condition, as in #68617? Besides making the code faster, it often helps for quality too. I'd keep the max-chain-size to be 512, if possible (assuming in conjunction with the density-based check the runtime is low).

I am curious whether cdsort-max-chain-size=128 or 512 has a noticeable performance regression for your applications.
I am nervous about 512 as it is quite slow to justify making the default...

I am happy to bootstrap Clang and check Clang compilation performance if you give me a script on accessable workloads (building llvm-project or linux kernel, for example).

% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=hfsort --threads=8
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=hfsort  14.86s user 3.84s system 501% cpu 3.732 total
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.71s user 3.94s system 360% cpu 5.735 total
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=512
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  20.15s user 3.95s system 257% cpu 9.347 total

Adding something like MaxMergeDensityRatio from #68617 has very little effect. If I set the pruning threshold to 1.5 (far more effective than 100; a value as small as 1.5 should not be acceptable for optimizing):

--- i/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ w/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -1167,6 +1167,9 @@ private:
         if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
             CDMaxChainSize)
           continue;
+        auto [mn, mx] = std::minmax(Edge->srcChain()->numBlocks(), Edge->dstChain()->numBlocks());
+        if (mx/mn > 1.5)
+          continue;

         // Compute the gain of merging the two chains.
         MergeGainT Gain = getBestMergeGain(Edge);

I'll get this (nearly no effect):

% repeat 2 time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128 # if (mx/mn > 1.5) continue
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.83s user 3.91s system 355% cpu 5.843 total
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.77s user 4.01s system 356% cpu 5.826 total

If I switch to density:

--- i/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ w/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -1167,6 +1167,9 @@ private:
         if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
             CDMaxChainSize)
           continue;
+        auto [mn, mx] = std::minmax(Edge->srcChain()->density(), Edge->dstChain()->density());
+        if (mx/mn > 2)
+          continue;

         // Compute the gain of merging the two chains.
         MergeGainT Gain = getBestMergeGain(Edge);

% repeat 2 time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128 # density if (mx/mn > 2) continue
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.96s user 3.99s system 353% cpu 5.934 total
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.86s user 4.03s system 354% cpu 5.892 total

Thanks for experimenting with my PR. I have another optimization too that I will try myself. It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?)

Without applying this cdsort-max-chain-size=128 patch, using DenseMap instead of std::vector for ChainEdges (https://github.com/rlavaee/llvm-project/tree/improve-cdsort) makes the link like 10+ seconds faster out of the total link time (9 min)... (less effective than I thought)

spupyrev · 2023-10-19T13:47:33Z

llvm/lib/Transforms/Utils/CodeLayout.cpp

+
+namespace codelayout {
+cl::opt<unsigned>
+    CDMaxChainSize("cdsort-max-chain-size", cl::Hidden, cl::init(128),


Does it make sense to keep and use the new option next to other cdsort options, e.g., CacheEntries, CacheSize? It would be also great to unify their namings

This is a cdsort option, different from ext-tsp. Perhaps place this after all the other ext-tsp options?

This would be a place to add the option

Thanks! I figured it out.

Other options are still named cds-*, but this new one is named cdsort-max-chain-size. I think it makes sense to reduce the number of abbreviations. As we have used cdsort and various case variants, it makes sense to avoid cds-*.

Perhaps land a separate commit to rename existing cds-* options to cdsort-*?

Yes, that was exactly my motivation for the earlier comment on "unifying their namings"; sorry for such brevity

When linking an executable with a slightly larger executable, ld.lld --call-graph-profile-sort=cdsort can be very slow (see llvm#68638). ``` 4.6% 20.7Mi .text.hot 3.5% 15.9Mi .text 3.4% 15.2Mi .text.unknown ``` Add cl option `cds-max-chain-size`, which is similar to `ext-tsp-max-chain-size`, and set it to 128, to improve performance. In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace" builds, the "Total Sort sections" time is measured as follows: * -mllvm -cds-max-chain-size=64: 1.321813 * -mllvm -cds-max-chain-size=128: 2.030425 * -mllvm -cds-max-chain-size=256: 2.927684 * -mllvm -cds-max-chain-size=512: 5.493106 * unlimited: 9 minutes The rest part takes 6.8s.

Local branch amd-gfx 923f1fe Merged main:508a697acd18 into amd-gfx:57bd89831925 Remote branch main a244183 [CodeLayout] cache-directed sort: limit max chain size (llvm#69039)

MaskRay requested review from rlavaee, david-xl, spupyrev, aaupov, htyu and maksfb October 14, 2023 02:04

llvmbot added the llvm:transforms label Oct 14, 2023

spupyrev reviewed Oct 16, 2023

View reviewed changes

llvm/lib/Transforms/Utils/CodeLayout.cpp Outdated Show resolved Hide resolved

MaskRay force-pushed the codelayout/max-chain-size branch from a3c979a to 9403e71 Compare October 17, 2023 00:38

spupyrev reviewed Oct 19, 2023

View reviewed changes

MaskRay added 2 commits October 19, 2023 13:17

Add MaxChainSize to CDSortConfig

8a47e7c

MaskRay force-pushed the codelayout/max-chain-size branch from 9403e71 to 8a47e7c Compare October 19, 2023 20:42

spupyrev approved these changes Oct 20, 2023

View reviewed changes

MaskRay merged commit a244183 into llvm:main Oct 22, 2023

MaskRay deleted the codelayout/max-chain-size branch October 22, 2023 23:50

spupyrev mentioned this pull request Oct 23, 2023

[BOLT] Rename cds to cdsort #69966

Merged

banach-space mentioned this pull request Oct 24, 2023

[mlir][vector] Add scalable vectors to tests for vector.contract #70039

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CodeLayout] cache-directed sort: limit max chain size #69039

[CodeLayout] cache-directed sort: limit max chain size #69039

Uh oh!

MaskRay commented Oct 14, 2023 •

edited

Loading

Uh oh!

llvmbot commented Oct 14, 2023

Uh oh!

rlavaee commented Oct 16, 2023

Uh oh!

MaskRay commented Oct 16, 2023

Uh oh!

rlavaee commented Oct 16, 2023

Uh oh!

Uh oh!

spupyrev commented Oct 16, 2023 •

edited

Loading

Uh oh!

rlavaee commented Oct 17, 2023

Uh oh!

spupyrev commented Oct 17, 2023

Uh oh!

MaskRay commented Oct 17, 2023 •

edited

Loading

Uh oh!

spupyrev Oct 19, 2023

Uh oh!

MaskRay Oct 19, 2023

Uh oh!

spupyrev Oct 19, 2023

Uh oh!

MaskRay Oct 19, 2023

Uh oh!

spupyrev Oct 19, 2023

Uh oh!

Uh oh!

[CodeLayout] cache-directed sort: limit max chain size #69039

[CodeLayout] cache-directed sort: limit max chain size #69039

Uh oh!

Conversation

MaskRay commented Oct 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Oct 14, 2023

Uh oh!

rlavaee commented Oct 16, 2023

Uh oh!

MaskRay commented Oct 16, 2023

Uh oh!

rlavaee commented Oct 16, 2023

Uh oh!

Uh oh!

spupyrev commented Oct 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rlavaee commented Oct 17, 2023

Uh oh!

spupyrev commented Oct 17, 2023

Uh oh!

MaskRay commented Oct 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

spupyrev Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

MaskRay Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

spupyrev Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

MaskRay Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

spupyrev Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MaskRay commented Oct 14, 2023 •

edited

Loading

spupyrev commented Oct 16, 2023 •

edited

Loading

MaskRay commented Oct 17, 2023 •

edited

Loading