
[Sample Profile] Check hot callsite threshold when inlining a function with a sample profile #93286


Merged
merged 2 commits into from
May 28, 2024

Conversation

huangjd
Contributor

@huangjd huangjd commented May 24, 2024

Currently, if a callsite is hot as determined by the sample profile, it is unconditionally inlined barring invalid cases (such as recursion). The inline cost check should still apply, because a function's hotness and its inline cost are two different things.
For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.

@huangjd huangjd added the PGO Profile Guided Optimizations label May 24, 2024
@huangjd huangjd requested review from MaskRay, snehasish and david-xl May 24, 2024 09:51
@llvmbot
Member

llvmbot commented May 24, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-pgo

Author: William Junda Huang (huangjd)

Changes

Currently, if a callsite is hot as determined by the sample profile, it is unconditionally inlined barring invalid cases (such as recursion). The inline cost check should still apply, because a function's hotness and its inline cost are two different things.
For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.


Full diff: https://github.com/llvm/llvm-project/pull/93286.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/SampleProfile.cpp (+5-3)
  • (added) llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof (+3)
  • (added) llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll (+61)
  • (modified) llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll (+1-1)
  • (modified) llvm/test/Transforms/SampleProfile/remarks.ll (+2-2)
diff --git a/llvm/lib/Transforms/IPO/SampleProfile.cpp b/llvm/lib/Transforms/IPO/SampleProfile.cpp
index 0920179fb76b7..10d6e2282ac2a 100644
--- a/llvm/lib/Transforms/IPO/SampleProfile.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfile.cpp
@@ -1391,10 +1391,12 @@ SampleProfileLoader::shouldInlineCandidate(InlineCandidate &Candidate) {
       return InlineCost::getAlways("preinliner");
   }
 
-  // For old FDO inliner, we inline the call site as long as cost is not
-  // "Never". The cost-benefit check is done earlier.
+  // For old FDO inliner, we inline the call site if it is below hot threshold,
+  // even if the function is hot based on sample profile data. This is to
+  // prevent huge functions from being inlined.
   if (!CallsitePrioritizedInline) {
-    return InlineCost::get(Cost.getCost(), INT_MAX);
+    return InlineCost::get(Cost.getCost(),
+                           Params.HotCallSiteThreshold.value_or(INT_MAX));
   }
 
   // Otherwise only use the cost from call analyzer, but overwite threshold with
diff --git a/llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof b/llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof
new file mode 100644
index 0000000000000..d1c0408210f49
--- /dev/null
+++ b/llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof
@@ -0,0 +1,3 @@
+foo:100:100
+ 1: bar:100
+  1:100
diff --git a/llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll b/llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll
new file mode 100644
index 0000000000000..0584e1b5ee679
--- /dev/null
+++ b/llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll
@@ -0,0 +1,61 @@
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/inline-hot-callsite-threshold.prof -S -pass-remarks=sample-profile -hot-callsite-threshold=100 2>&1 | FileCheck %s
+
+; CHECK: remark: a.cc:6:12: 'bar' inlined into 'foo' to match profiling context with (cost={{.*}}, threshold=100)
+; CHECK:     define dso_local noundef i32 @foo(i32 noundef %0)
+; CHECK-NOT:   %2 = tail call noundef i32 @bar(i32 noundef %0)
+; CHECK-NEXT:  %2 = icmp sgt i32 %0, 1
+; CHECK-NEXT:  br i1 %2, label %3, label %bar.exit
+
+; Manually lower cost threshold for hot function inlining, so that the function
+; is not inlined even profile indicates it as hot.
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/inline-hot-callsite-threshold.prof -S -pass-remarks=sample-profile -hot-callsite-threshold=1 2>&1 | FileCheck %s --check-prefix=COST
+
+; COST-NOT:  remark
+; COST: define dso_local noundef i32 @foo(i32 noundef %0)
+; COST-NEXT: %2 = tail call noundef i32 @bar(i32 noundef %0)
+
+define dso_local noundef i32 @bar(i32 noundef %0) #0 !dbg !10 {
+  %2 = icmp sgt i32 %0, 1
+  br i1 %2, label %3, label %15
+3:                                                ; preds = %1
+  %4 = add nsw i32 %0, -2
+  %5 = mul i32 %4, %4
+  %6 = add i32 %5, %0
+  %7 = zext nneg i32 %4 to i33
+  %8 = add nsw i32 %0, -3
+  %9 = zext i32 %8 to i33
+  %10 = mul i33 %7, %9
+  %11 = lshr i33 %10, 1
+  %12 = trunc nuw i33 %11 to i32
+  %13 = xor i32 %12, -1
+  %14 = add i32 %6, %13
+  br label %15
+15:                                               ; preds = %3, %1
+  %16 = phi i32 [ 0, %1 ], [ %14, %3 ]
+  ret i32 %16
+}
+
+define dso_local noundef i32 @foo(i32 noundef %0) #1 !dbg !20 {
+  %2 = tail call noundef i32 @bar(i32 noundef %0), !dbg !24
+  ret i32 %2
+}
+
+attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable  "use-sample-profile" }
+attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable  "use-sample-profile" }
+attributes #2 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+
+!llvm.dbg.cu = !{!0}
+!llvm.module.flags = !{!2, !3}
+
+!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus, file: !1, isOptimized: true, runtimeVersion: 0, emissionKind: NoDebug)
+!1 = !DIFile(filename: "a.cc", directory: ".")
+!2 = !{i32 2, !"Dwarf Version", i32 4}
+!3 = !{i32 2, !"Debug Info Version", i32 3}
+!10 = distinct !DISubprogram(name: "bar", linkageName: "bar", scope: !1, file: !1, line: 1, type: !12, isLocal: false, isDefinition: true, scopeLine: 3, flags: DIFlagPrototyped, isOptimized: true, unit: !0)
+!11 = !DIFile(filename: "a.cc", directory: ".")
+!12 = !DISubroutineType(types: !13)
+!13 = !{!14, !14}
+!14 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
+!20 = distinct !DISubprogram(name: "foo", linkageName: "foo", scope: !11, file: !11, line: 5, type: !12, isLocal: false, isDefinition: true, scopeLine: 3, flags: DIFlagPrototyped, isOptimized: true, unit: !0)
+!23 = !DILocation(line: 0, scope: !20)
+!24 = !DILocation(line: 6, column: 12, scope: !20)
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
index 18cbd857d97bb..2cd9abf0e11e9 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
@@ -98,7 +98,7 @@ if.end:
 ;YAML-NEXT:    - String:          '(cost='
 ;YAML-NEXT:    - Cost:            '15'
 ;YAML-NEXT:    - String:          ', threshold='
-;YAML-NEXT:    - Threshold:       '2147483647'
+;YAML-NEXT:    - Threshold:       '3000'
 ;YAML-NEXT:    - String:          ')'
 ;YAML-NEXT:    - String:          ' at callsite '
 ;YAML-NEXT:    - String:          foo
diff --git a/llvm/test/Transforms/SampleProfile/remarks.ll b/llvm/test/Transforms/SampleProfile/remarks.ll
index 997e02bb5b544..9c0143ae65ca7 100644
--- a/llvm/test/Transforms/SampleProfile/remarks.ll
+++ b/llvm/test/Transforms/SampleProfile/remarks.ll
@@ -22,7 +22,7 @@
 
 ; We are expecting foo() to be inlined in main() (almost all the cycles are
 ; spent inside foo).
-; CHECK: remark: remarks.cc:13:21: '_Z3foov' inlined into 'main' to match profiling context with (cost=130, threshold=2147483647) at callsite main:0:21;
+; CHECK: remark: remarks.cc:13:21: '_Z3foov' inlined into 'main' to match profiling context with (cost=130, threshold=3000) at callsite main:0:21;
 ; CHECK: remark: remarks.cc:9:19: 'rand' inlined into 'main' to match profiling context with (cost=always): always inline attribute at callsite _Z3foov:6:19 @ main:0:21;
 
 ; The back edge for the loop is the hottest edge in the loop subgraph.
@@ -51,7 +51,7 @@
 ;YAML-NEXT:    - String:          '(cost='
 ;YAML-NEXT:    - Cost:            '130'
 ;YAML-NEXT:    - String:          ', threshold='
-;YAML-NEXT:    - Threshold:       '2147483647'
+;YAML-NEXT:    - Threshold:       '3000'
 ;YAML-NEXT:    - String:          ')'
 ;YAML-NEXT:    - String:          ' at callsite '
 ;YAML-NEXT:    - String:          main

@david-xl david-xl requested a review from WenleiHe May 24, 2024 16:02
@david-xl
Contributor

The change looks reasonable.

Or perhaps limit the inline aggressiveness check to xfdo builds only, which suffer more from the compile-time issue due to flatter profiles.

@WenleiHe
Member

The change is okay, but I'm curious what triggered it, and have you verified its effectiveness? Asking because the effect is likely small: if the sample loader sees a candidate, that means the candidate was inlined in the previous build, so it already passed the threshold/size check there. Is this change triggered by an actual pathological case where a huge callee is being inlined? If so, I'm curious how that happened.

For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.

This should not happen today; the nested callsite / inlinee profile naturally carries context sensitivity. When determining hotness for sample loader inlining, the call site's hotness is checked, not the context-less callee hotness.

cc @helloguo @wlei-llvm

@huangjd
Contributor Author

huangjd commented May 24, 2024

The change is okay, but I'm curious what triggered it, and have you verified its effectiveness? Asking because the effect is likely small: if the sample loader sees a candidate, that means the candidate was inlined in the previous build, so it already passed the threshold/size check there. Is this change triggered by an actual pathological case where a huge callee is being inlined? If so, I'm curious how that happened.

For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.

This should not happen today; the nested callsite / inlinee profile naturally carries context sensitivity. When determining hotness for sample loader inlining, the call site's hotness is checked, not the context-less callee hotness.

cc @helloguo @wlei-llvm

I encountered the issue with some internal protobuf-generated C++ compilation, and I assume this applies to other libraries dealing with serialization too. To clarify, this is what I observed (I made a mistake describing it earlier): foo calls bar multiple times, and each call site is calculated to be hot by the profile. bar itself contains many nested function calls that are inlined. bar is processed first, and after inlining it becomes very large (> 50000 instructions), yet foo still unconditionally inlines every call to bar without a cost threshold check.

The reason bar shows up as previously inlined in the profile is that another function calling bar could be processed before bar's own inlining, at which point its instruction count is still small.

@WenleiHe
Member

foo calls bar multiple times, and each call site is calculated to be hot by the profile. bar itself contains many nested function calls that are inlined. bar is processed first, and after inlining it becomes very large (> 50000 instructions), yet foo still unconditionally inlines every call to bar without a cost threshold check.

It sounds like it is about top-down runaway inlining then. Usually top-down inlining needs a size budget/cap for the inliner, to avoid inlining many individually small callees and eventually bloating size too much. In the case of the old sample loader inline, we don't have such a cap because inlining is guaranteed to be bounded by the inlining done in the previous build. If we iteratively trace back to the first build, the inlining should come from non-PGO CGSCC inlining, which is bounded due to its bottom-up nature.

So I still don't see what's unique in your case.

The reason bar shows up as previously inlined in the profile is that another function calling bar could be processed before bar's own inlining, at which point its instruction count is still small.

So in the previous build, the inlining also happened because of top-down order? Then it goes back to my comment above.

@huangjd
Contributor Author

huangjd commented May 25, 2024

inlining is guaranteed to be bounded by the inlining done in the previous build.

That's not necessarily true: for subsequent builds we do modify the sample profile (merging, downsampling, etc.). Using a profile that was not obtained entirely from the matching binary does result in under-optimization, as expected, but it should not cause a significant compile-time increase.

@WenleiHe
Member

inlining is guaranteed to be bounded by the inlining done in the previous build.

That's not necessarily true: for subsequent builds we do modify the sample profile (merging, downsampling, etc.). Using a profile that was not obtained entirely from the matching binary does result in under-optimization, as expected, but it should not cause a significant compile-time increase.

Actually, when using a merged profile, long compile times are quite common. I know that both of us use the practice of merging a bunch of profiles from different top services to drive PGO of some long-tail services. A merged profile makes optimization less selective, which usually leads to size bloat and longer compile times. For that reason, we are moving away from that practice.

All that said, I'm ok with this change if you can measure performance on a server workload to make sure it won't regress perf.

@david-xl
Contributor

inlining is guaranteed to be bounded by the inlining done in the previous build.

That's not necessarily true: for subsequent builds we do modify the sample profile (merging, downsampling, etc.). Using a profile that was not obtained entirely from the matching binary does result in under-optimization, as expected, but it should not cause a significant compile-time increase.

Actually, when using a merged profile, long compile times are quite common. I know that both of us use the practice of merging a bunch of profiles from different top services to drive PGO of some long-tail services. A merged profile makes optimization less selective, which usually leads to size bloat and longer compile times. For that reason, we are moving away from that practice.

All that said, I'm ok with this change if you can measure performance on a server workload to make sure it won't regress perf.

Yes, I believe it is due to the merged inline behavior from the merged profile.

@huangjd
Contributor Author

huangjd commented May 25, 2024

Internal load testing showed there is no regression. I am not sure if there is a public load test that uses a sample profile (I am looking at https://llvm-compile-time-tracker.com/ but I don't think any of them use one).

@david-xl
Contributor

Internal load testing showed there is no regression. I am not sure if there is a public load test that uses a sample profile (I am looking at https://llvm-compile-time-tracker.com/ but I don't think any of them use one).

Internal test is fine.

@WenleiHe
Member

Internal load testing showed there is no regression. I am not sure if there is a public load test that uses a sample profile (I am looking at https://llvm-compile-time-tracker.com/ but I don't think any of them use one).

Internal test is fine.

Right, internal is fine, as long as it's testing a typical scenario (not a merged profile). Thanks for double-checking.

   if (!CallsitePrioritizedInline) {
-    return InlineCost::get(Cost.getCost(), INT_MAX);
+    return InlineCost::get(Cost.getCost(),
+                           Params.HotCallSiteThreshold.value_or(INT_MAX));
   }
Member

Should we use SampleThreshold here, and then simply make it unconditional, i.e., merge it with line 1404?

Member

@huangjd any reason not to merge this with line 1403 in the latest version?

Contributor Author

It requires CallsitePrioritizedInline to set SampleThreshold to SampleHotCallSiteThreshold; otherwise it uses SampleColdCallSiteThreshold. However, according to the flag description, CallsitePrioritizedInline is only used by CSSPGO, not for context-free profiles, so given a context-free FDO profile we don't want to use SampleColdCallSiteThreshold if the callsite is hot.

Contributor Author

However, I looked at the code using CallsitePrioritizedInline and couldn't find a fundamental reason why it should behave differently for CSSPGO vs. AutoFDO, so this probably needs another refactoring.

@huangjd huangjd merged commit 5a23d31 into llvm:main May 28, 2024
4 of 6 checks passed
@huangjd huangjd deleted the SampleProfileInlineThreshold branch May 28, 2024 21:18
vg0204 pushed a commit to vg0204/llvm-project that referenced this pull request May 29, 2024
Labels
llvm:transforms PGO Profile Guided Optimizations
4 participants