
[Sample Profile] Check hot callsite threshold when inlining a function with a sample profile #93286


Merged
merged 2 commits into from
May 28, 2024

Conversation

huangjd
Contributor

@huangjd huangjd commented May 24, 2024

Currently, if a callsite is hot as determined by the sample profile, it is unconditionally inlined barring invalid cases (such as recursion). The inline cost check should still apply, because a function's hotness and its inline cost are two different things.
For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.

@huangjd huangjd added the PGO Profile Guided Optimizations label May 24, 2024
@huangjd huangjd requested review from MaskRay, snehasish and david-xl May 24, 2024 09:51
@llvmbot
Member

llvmbot commented May 24, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-pgo

Author: William Junda Huang (huangjd)

Changes

Currently, if a callsite is hot as determined by the sample profile, it is unconditionally inlined barring invalid cases (such as recursion). The inline cost check should still apply, because a function's hotness and its inline cost are two different things.
For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.


Full diff: https://github.com/llvm/llvm-project/pull/93286.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/SampleProfile.cpp (+5-3)
  • (added) llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof (+3)
  • (added) llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll (+61)
  • (modified) llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll (+1-1)
  • (modified) llvm/test/Transforms/SampleProfile/remarks.ll (+2-2)
diff --git a/llvm/lib/Transforms/IPO/SampleProfile.cpp b/llvm/lib/Transforms/IPO/SampleProfile.cpp
index 0920179fb76b7..10d6e2282ac2a 100644
--- a/llvm/lib/Transforms/IPO/SampleProfile.cpp
+++ b/llvm/lib/Transforms/IPO/SampleProfile.cpp
@@ -1391,10 +1391,12 @@ SampleProfileLoader::shouldInlineCandidate(InlineCandidate &Candidate) {
       return InlineCost::getAlways("preinliner");
   }
 
-  // For old FDO inliner, we inline the call site as long as cost is not
-  // "Never". The cost-benefit check is done earlier.
+  // For old FDO inliner, we inline the call site if it is below hot threshold,
+  // even if the function is hot based on sample profile data. This is to
+  // prevent huge functions from being inlined.
   if (!CallsitePrioritizedInline) {
-    return InlineCost::get(Cost.getCost(), INT_MAX);
+    return InlineCost::get(Cost.getCost(),
+                           Params.HotCallSiteThreshold.value_or(INT_MAX));
   }
 
   // Otherwise only use the cost from call analyzer, but overwite threshold with
diff --git a/llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof b/llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof
new file mode 100644
index 0000000000000..d1c0408210f49
--- /dev/null
+++ b/llvm/test/Transforms/SampleProfile/Inputs/inline-hot-callsite-threshold.prof
@@ -0,0 +1,3 @@
+foo:100:100
+ 1: bar:100
+  1:100
diff --git a/llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll b/llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll
new file mode 100644
index 0000000000000..0584e1b5ee679
--- /dev/null
+++ b/llvm/test/Transforms/SampleProfile/inline-hot-callsite-threshold.ll
@@ -0,0 +1,61 @@
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/inline-hot-callsite-threshold.prof -S -pass-remarks=sample-profile -hot-callsite-threshold=100 2>&1 | FileCheck %s
+
+; CHECK: remark: a.cc:6:12: 'bar' inlined into 'foo' to match profiling context with (cost={{.*}}, threshold=100)
+; CHECK:     define dso_local noundef i32 @foo(i32 noundef %0)
+; CHECK-NOT:   %2 = tail call noundef i32 @bar(i32 noundef %0)
+; CHECK-NEXT:  %2 = icmp sgt i32 %0, 1
+; CHECK-NEXT:  br i1 %2, label %3, label %bar.exit
+
+; Manually lower cost threshold for hot function inlining, so that the function
+; is not inlined even profile indicates it as hot.
+; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/inline-hot-callsite-threshold.prof -S -pass-remarks=sample-profile -hot-callsite-threshold=1 2>&1 | FileCheck %s --check-prefix=COST
+
+; COST-NOT:  remark
+; COST: define dso_local noundef i32 @foo(i32 noundef %0)
+; COST-NEXT: %2 = tail call noundef i32 @bar(i32 noundef %0)
+
+define dso_local noundef i32 @bar(i32 noundef %0) #0 !dbg !10 {
+  %2 = icmp sgt i32 %0, 1
+  br i1 %2, label %3, label %15
+3:                                                ; preds = %1
+  %4 = add nsw i32 %0, -2
+  %5 = mul i32 %4, %4
+  %6 = add i32 %5, %0
+  %7 = zext nneg i32 %4 to i33
+  %8 = add nsw i32 %0, -3
+  %9 = zext i32 %8 to i33
+  %10 = mul i33 %7, %9
+  %11 = lshr i33 %10, 1
+  %12 = trunc nuw i33 %11 to i32
+  %13 = xor i32 %12, -1
+  %14 = add i32 %6, %13
+  br label %15
+15:                                               ; preds = %3, %1
+  %16 = phi i32 [ 0, %1 ], [ %14, %3 ]
+  ret i32 %16
+}
+
+define dso_local noundef i32 @foo(i32 noundef %0) #1 !dbg !20 {
+  %2 = tail call noundef i32 @bar(i32 noundef %0), !dbg !24
+  ret i32 %2
+}
+
+attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable  "use-sample-profile" }
+attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable  "use-sample-profile" }
+attributes #2 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+
+!llvm.dbg.cu = !{!0}
+!llvm.module.flags = !{!2, !3}
+
+!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus, file: !1, isOptimized: true, runtimeVersion: 0, emissionKind: NoDebug)
+!1 = !DIFile(filename: "a.cc", directory: ".")
+!2 = !{i32 2, !"Dwarf Version", i32 4}
+!3 = !{i32 2, !"Debug Info Version", i32 3}
+!10 = distinct !DISubprogram(name: "bar", linkageName: "bar", scope: !1, file: !1, line: 1, type: !12, isLocal: false, isDefinition: true, scopeLine: 3, flags: DIFlagPrototyped, isOptimized: true, unit: !0)
+!11 = !DIFile(filename: "a.cc", directory: ".")
+!12 = !DISubroutineType(types: !13)
+!13 = !{!14, !14}
+!14 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
+!20 = distinct !DISubprogram(name: "foo", linkageName: "foo", scope: !11, file: !11, line: 5, type: !12, isLocal: false, isDefinition: true, scopeLine: 3, flags: DIFlagPrototyped, isOptimized: true, unit: !0)
+!23 = !DILocation(line: 0, scope: !20)
+!24 = !DILocation(line: 6, column: 12, scope: !20)
diff --git a/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll b/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
index 18cbd857d97bb..2cd9abf0e11e9 100644
--- a/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
+++ b/llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
@@ -98,7 +98,7 @@ if.end:
 ;YAML-NEXT:    - String:          '(cost='
 ;YAML-NEXT:    - Cost:            '15'
 ;YAML-NEXT:    - String:          ', threshold='
-;YAML-NEXT:    - Threshold:       '2147483647'
+;YAML-NEXT:    - Threshold:       '3000'
 ;YAML-NEXT:    - String:          ')'
 ;YAML-NEXT:    - String:          ' at callsite '
 ;YAML-NEXT:    - String:          foo
diff --git a/llvm/test/Transforms/SampleProfile/remarks.ll b/llvm/test/Transforms/SampleProfile/remarks.ll
index 997e02bb5b544..9c0143ae65ca7 100644
--- a/llvm/test/Transforms/SampleProfile/remarks.ll
+++ b/llvm/test/Transforms/SampleProfile/remarks.ll
@@ -22,7 +22,7 @@
 
 ; We are expecting foo() to be inlined in main() (almost all the cycles are
 ; spent inside foo).
-; CHECK: remark: remarks.cc:13:21: '_Z3foov' inlined into 'main' to match profiling context with (cost=130, threshold=2147483647) at callsite main:0:21;
+; CHECK: remark: remarks.cc:13:21: '_Z3foov' inlined into 'main' to match profiling context with (cost=130, threshold=3000) at callsite main:0:21;
 ; CHECK: remark: remarks.cc:9:19: 'rand' inlined into 'main' to match profiling context with (cost=always): always inline attribute at callsite _Z3foov:6:19 @ main:0:21;
 
 ; The back edge for the loop is the hottest edge in the loop subgraph.
@@ -51,7 +51,7 @@
 ;YAML-NEXT:    - String:          '(cost='
 ;YAML-NEXT:    - Cost:            '130'
 ;YAML-NEXT:    - String:          ', threshold='
-;YAML-NEXT:    - Threshold:       '2147483647'
+;YAML-NEXT:    - Threshold:       '3000'
 ;YAML-NEXT:    - String:          ')'
 ;YAML-NEXT:    - String:          ' at callsite '
 ;YAML-NEXT:    - String:          main

@david-xl david-xl requested a review from WenleiHe May 24, 2024 16:02
@david-xl
Contributor

The change looks reasonable.

Or perhaps limit the inline aggressiveness check to xfdo builds only, which suffer more from the compile-time issue due to flatter profiles.

@WenleiHe
Member

The change is okay, but I'm curious what triggered it, and have you verified its effectiveness? Asking because the effect is likely small: if the sample loader sees a candidate, that means the candidate was inlined in the previous build, so it already passed the threshold/size check there. Is this change triggered by an actual pathological case where a huge callee is being inlined? If so, I'm curious how that happened.

For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.

This should not happen today; the nested callsite / inlinee profile naturally carries context sensitivity. When determining hotness for sample loader inlining, the call site's hotness is checked, not the context-less callee hotness.

cc @helloguo @wlei-llvm

@huangjd
Contributor Author

huangjd commented May 24, 2024

The change is okay, but I'm curious what triggered it, and have you verified its effectiveness? Asking because the effect is likely small: if the sample loader sees a candidate, that means the candidate was inlined in the previous build, so it already passed the threshold/size check there. Is this change triggered by an actual pathological case where a huge callee is being inlined? If so, I'm curious how that happened.

For example, if a function calls another very large function multiple times (at different code paths), the large function should not be inlined even if it's hot.

This should not happen today; the nested callsite / inlinee profile naturally carries context sensitivity. When determining hotness for sample loader inlining, the call site's hotness is checked, not the context-less callee hotness.

cc @helloguo @wlei-llvm

I encountered the issue with some internal protobuf-generated C++ compilation, and I assume this applies to other libraries dealing with serialization too. To clarify, this is what I observed (I made a mistake describing it earlier): foo calls bar multiple times, and each call site is calculated to be hot by the profile. bar itself contains many nested function calls that are inlined. bar is processed first, and after inlining it becomes very large (> 50000 instructions), yet foo still unconditionally inlines every call to bar without a cost threshold check.

The reason bar shows up as previously inlined in the profile is that another function calling bar could be processed before bar's own inlining, at which point its instruction count is still small.

@WenleiHe
Member

foo calls bar multiple times, and each call site is calculated to be hot by the profile. bar itself contains many nested function calls that are inlined. bar is processed first, and after inlining it becomes very large (> 50000 instructions), yet foo still unconditionally inlines every call to bar without a cost threshold check.

It sounds like it is about top-down runaway inlining then. Usually top-down inlining needs a size budget/cap for the inliner, to avoid inlining many individually small callees and eventually bloating size too much. In the case of the old sample loader inline, we don't have such a cap because inlining is guaranteed to be bounded by the inlining done in the previous build. If we iteratively trace back to the first build, the inlining should come from non-PGO CGSCC inlining, which is bounded due to its bottom-up nature.

So I still don't see what's unique in your case.

The reason bar shows up as previously inlined in the profile is that another function calling bar could be processed before bar's own inlining, at which point its instruction count is still small.

So in the previous build, the inlining also happened because of top-down order? Then it goes back to my comment above.

@huangjd
Contributor Author

huangjd commented May 25, 2024

inlining is guaranteed to be bounded by the inlining done in the previous build.

That's not necessarily true: for subsequent builds we do modify the sample profile (merging, downsampling, etc.). Using a profile that was not obtained entirely from the matching binary does result in under-optimization, as expected, but it should not cause a significant compile-time increase.

@WenleiHe
Member

inlining is guaranteed to be bounded by the inlining done in the previous build.

That's not necessarily true: for subsequent builds we do modify the sample profile (merging, downsampling, etc.). Using a profile that was not obtained entirely from the matching binary does result in under-optimization, as expected, but it should not cause a significant compile-time increase.

Actually, when using a merged profile, long compile times are quite common. I know that both of us use the practice of merging a bunch of profiles from different top services to drive PGO of some long-tail services. A merged profile makes optimization less selective, which usually leads to size bloat and longer compile times. For that reason, we are moving away from that practice.

All that said, I'm ok with this change if you can measure performance on a server workload to make sure it won't regress perf.

@david-xl
Contributor

inlining is guaranteed to be bounded by the inlining done in the previous build.

That's not necessarily true: for subsequent builds we do modify the sample profile (merging, downsampling, etc.). Using a profile that was not obtained entirely from the matching binary does result in under-optimization, as expected, but it should not cause a significant compile-time increase.

Actually, when using a merged profile, long compile times are quite common. I know that both of us use the practice of merging a bunch of profiles from different top services to drive PGO of some long-tail services. A merged profile makes optimization less selective, which usually leads to size bloat and longer compile times. For that reason, we are moving away from that practice.

All that said, I'm ok with this change if you can measure performance on a server workload to make sure it won't regress perf.

Yes, I believe it is due to the merged inline behavior from the merged profile.

@huangjd
Contributor Author

huangjd commented May 25, 2024

Internal load testing showed there is no regression. I am not sure if there is a public load test that uses a sample profile (I am looking at https://llvm-compile-time-tracker.com/ but I don't think any of them use one).

@david-xl
Contributor

Internal load testing showed there is no regression. I am not sure if there is a public load test that uses a sample profile (I am looking at https://llvm-compile-time-tracker.com/ but I don't think any of them use one).

Internal test is fine.

@WenleiHe
Member

Internal load testing showed there is no regression. I am not sure if there is a public load test that uses a sample profile (I am looking at https://llvm-compile-time-tracker.com/ but I don't think any of them use one).

Internal test is fine.

Right, internal is fine, as long as it's testing a typical scenario (not a merged profile). Thanks for double-checking.

   if (!CallsitePrioritizedInline) {
-    return InlineCost::get(Cost.getCost(), INT_MAX);
+    return InlineCost::get(Cost.getCost(),
+                           Params.HotCallSiteThreshold.value_or(INT_MAX));
   }
Member

Should we use SampleThreshold here, and then simply make it unconditional, i.e., merge it with line 1404?

Member

@huangjd any reason not to merge this with line 1403 in the latest version?

Contributor Author

It requires CallsitePrioritizedInline to set SampleThreshold to SampleHotCallSiteThreshold; otherwise it uses SampleColdCallSiteThreshold. However, according to the flag description, CallsitePrioritizedInline is only used by CSSPGO, not for context-free profiles, so given a context-free FDO profile we don't want to use SampleColdCallSiteThreshold if the callsite is hot.

Contributor Author

However, I looked at the code using CallsitePrioritizedInline and couldn't find a fundamental reason why it should behave differently for CSSPGO vs. AutoFDO, so this probably needs another refactoring.

@huangjd huangjd merged commit 5a23d31 into llvm:main May 28, 2024
4 of 6 checks passed
@huangjd huangjd deleted the SampleProfileInlineThreshold branch May 28, 2024 21:18
vg0204 pushed a commit to vg0204/llvm-project that referenced this pull request May 29, 2024
Labels
llvm:transforms PGO Profile Guided Optimizations
4 participants