-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[CodeLayout] cache-directed sort: limit max chain size #69039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-llvm-transforms Author: Fangrui Song (MaskRay) ChangesWhen linking an executable with a slightly larger executable,
Add cl option In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
The rest part takes 6.8s. Full diff: https://github.com/llvm/llvm-project/pull/69039.diff 2 Files Affected:
diff --git a/llvm/lib/Transforms/Utils/CodeLayout.cpp b/llvm/lib/Transforms/Utils/CodeLayout.cpp
index d9e302d8b4fa54d..5411cf7a6c7f6f8 100644
--- a/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ b/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -62,6 +62,12 @@ cl::opt<bool> ApplyExtTspWithoutProfile(
"ext-tsp-apply-without-profile",
cl::desc("Whether to apply ext-tsp placement for instances w/o profile"),
cl::init(true), cl::Hidden);
+
+namespace codelayout {
+cl::opt<unsigned>
+ CDMaxChainSize("cds-max-chain-size", cl::Hidden, cl::init(128),
+ cl::desc("The maximum size of a chain to create"));
+}
} // namespace llvm
// Algorithm-specific params for Ext-TSP. The values are tuned for the best
@@ -1158,6 +1164,9 @@ class CDSortImpl {
// Ignore loop edges.
if (Edge->isSelfEdge())
continue;
+ if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
+ CDMaxChainSize)
+ continue;
// Compute the gain of merging the two chains.
MergeGainT Gain = getBestMergeGain(Edge);
diff --git a/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp b/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
index ce42f703229bd01..f2ac74df9afa35b 100644
--- a/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
+++ b/llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
@@ -1,4 +1,5 @@
#include "llvm/Transforms/Utils/CodeLayout.h"
+#include "llvm/Support/CommandLine.h"
#include "gmock/gmock.h"
#include "gtest/gtest.h"
#include <vector>
@@ -7,6 +8,10 @@ using namespace llvm;
using namespace llvm::codelayout;
using testing::ElementsAreArray;
+namespace llvm::codelayout {
+extern cl::opt<unsigned> CDMaxChainSize;
+}
+
namespace {
TEST(CodeLayout, ThreeFunctions) {
// Place the most likely successor (2) first.
@@ -40,6 +45,14 @@ TEST(CodeLayout, HotChain) {
const std::vector<uint64_t> CallOffsets(std::size(Edges), 5);
auto Order = computeCacheDirectedLayout(Sizes, Counts, Edges, CallOffsets);
EXPECT_THAT(Order, ElementsAreArray({0, 3, 4, 2, 1}));
+
+ // -cds-max-chain-size disables forming a larger chain and therefore may
+ // change the result.
+ unsigned Saved = CDMaxChainSize;
+ CDMaxChainSize.setValue(3);
+ Order = computeCacheDirectedLayout(Sizes, Counts, Edges, CallOffsets);
+ EXPECT_THAT(Order, ElementsAreArray({0, 3, 4, 1, 2}));
+ CDMaxChainSize.setValue(Saved);
}
}
|
Can you share perf. improvement numbers to compare with results in #68638? BTW, I have a local patch to improve performance for larger chains (based on my speculation https://reviews.llvm.org/D152834#inline-1500317). If this can wait for a while, I will try to send a PR in one week. |
Personally I suspect that changing the max-chain-size limit has a very limited impact. Is there a script so that everybody can verify https://reviews.llvm.org/D152834#inline-1500317 and check performance improvement? I am only able to benchmark Clang if someone shares with me input data. (For instance, I can build llvm-project or Linux kernel with a new version of Clang). AIUI
Without any change, |
The bottleneck is the edges running between the different chains. As the chains get larger, they tend to have more chain edges with other chains. Currently |
Thanks a lot for speeding this up! Can we apply the density-based condition, as in #68617? Besides making the code faster, it often helps for quality too. I'd keep the max-chain-size to be 512, if possible (assuming in conjunction with the density-based check the runtime is low). |
Thanks for experimenting with my PR. I have another optimization too that I will try myself. It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?) |
I'll try to measure perf impact over the next couple of days and report here |
Thank you for the responses!
I am curious whether I am happy to bootstrap Clang and check Clang compilation performance if you give me a script on accessable workloads (building llvm-project or linux kernel, for example).
Adding something like
I'll get this (nearly no effect):
If I switch to density:
Without applying this |
a3c979a
to
9403e71
Compare
|
||
namespace codelayout { | ||
cl::opt<unsigned> | ||
CDMaxChainSize("cdsort-max-chain-size", cl::Hidden, cl::init(128), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to keep and use the new option next to other cdsort options, e.g., CacheEntries
, CacheSize
? It would be also great to unify their namings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a cdsort option, different from ext-tsp. Perhaps place this after all the other ext-tsp
options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a place to add the option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I figured it out.
Other options are still named cds-*
, but this new one is named cdsort-max-chain-size
. I think it makes sense to reduce the number of abbreviations. As we have used cdsort
and various case variants, it makes sense to avoid cds-*
.
Perhaps land a separate commit to rename existing cds-*
options to cdsort-*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was exactly my motivation for the earlier comment on "unifying their namings"; sorry for such brevity
When linking an executable with a slightly larger executable, ld.lld --call-graph-profile-sort=cdsort can be very slow (see llvm#68638). ``` 4.6% 20.7Mi .text.hot 3.5% 15.9Mi .text 3.4% 15.2Mi .text.unknown ``` Add cl option `cds-max-chain-size`, which is similar to `ext-tsp-max-chain-size`, and set it to 128, to improve performance. In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace" builds, the "Total Sort sections" time is measured as follows: * -mllvm -cds-max-chain-size=64: 1.321813 * -mllvm -cds-max-chain-size=128: 2.030425 * -mllvm -cds-max-chain-size=256: 2.927684 * -mllvm -cds-max-chain-size=512: 5.493106 * unlimited: 9 minutes The rest part takes 6.8s.
9403e71
to
8a47e7c
Compare
Local branch amd-gfx 923f1fe Merged main:508a697acd18 into amd-gfx:57bd89831925 Remote branch main a244183 [CodeLayout] cache-directed sort: limit max chain size (llvm#69039)
When linking an executable with a slightly larger executable,
ld.lld --call-graph-profile-sort=cdsort can be very slow (see #68638).
Add cl option
cdsort-max-chain-size
, which is similar toext-tsp-max-chain-size
, and set it to 128, to improve performance.In `ld.lld @response.txt --threads=4 --call-graph-profile-sort=cdsort --time-trace"
builds, the "Total Sort sections" time is measured as follows:
The rest part takes 6.8s.