[LTO][Pipelines] Add 0 hot-caller threshold for SamplePGO + FullLTO #135152


Merged: 2 commits into llvm:main on Apr 14, 2025

Conversation

tianleliu (Contributor)

If a hot callsite function is not inlined in the 1st build, inlining the hot callsite in the pre-link stage of the SPGO 2nd build may lead to its function samples not being found in the profile file at link stage, losing some profile info.
ThinLTO has already considered and dealt with this by setting HotCallSiteThreshold to 0 to stop the inline. This patch just adds the same processing for FullLTO.

@mingmingl-llvm (Contributor)

> If a hot callsite function is not inlined in the 1st build, inlining the hot callsite in the pre-link stage of the SPGO 2nd build may lead to its function samples not being found in the profile file at link stage. ThinLTO has already considered and dealt with this by setting HotCallSiteThreshold to 0 to stop the inline. This patch just adds the same processing for FullLTO.

Do you have build stats (e.g., functions missing profiles before and getting profiles after) or binary performance data for this FullLTO patch?

Does "1st build" refer to the profiled binary build here? I'm asking since the motivation for the zero inline threshold in the ThinLTO prelink pipeline (before cross-module inlining) is to make profile matching in the backend (after cross-module inlining) more accurate within one build. I'm not sure this zero inline threshold helps across builds, though.

@mingmingl-llvm (Contributor) left a comment


The change itself makes sense to me.

@tianleliu (Contributor, Author)

tianleliu commented Apr 11, 2025

> If a hot callsite function is not inlined in the 1st build, inlining the hot callsite in the pre-link stage of the SPGO 2nd build may lead to its function samples not being found in the profile file at link stage. ThinLTO has already considered and dealt with this by setting HotCallSiteThreshold to 0 to stop the inline. This patch just adds the same processing for FullLTO.

> Do you have build stats (e.g., functions missing profiles before and getting profiles after) or binary performance data for this FullLTO patch?

> Does "1st build" refer to the profiled binary build here? I'm asking since the motivation for the zero inline threshold in the ThinLTO prelink pipeline (before cross-module inlining) is to make profile matching in the backend (after cross-module inlining) more accurate within one build. I'm not sure this zero inline threshold helps across builds, though.

Hi Mingming, thanks for your review.
I observed that an interpreter-style benchmark finds the correct profiling info after the patch (this was my motivation for it). For the SPEC 2017 C/C++ cases, with the profile file unchanged, I could see that 12 of the 16 binaries changed after the patch, but none of them shows a performance change. :(
I don't know the details of ThinLTO. My understanding of FullLTO is that inlining in pre-link is in-module, while inlining at link stage is both in-module and cross-module. So setting HotCallSiteThreshold to 0 in the pre-link stage should not change the final inlining result, because the hot callsite functions are just postponed to be inlined at link stage. Right? If so, it probably means the 12 changed binaries all get more accurate profiling info that affects other optimization passes?
And as you commented in the code, a hot callsite function with below-zero cost would still be inlined in the pre-link stage. So I wonder why we don't set HotCallSiteThreshold to INT_MIN to completely stop hot-callsite inlining in pre-link?

@mingmingl-llvm (Contributor)

> I observed that an interpreter-style benchmark finds the correct profiling info after the patch (this was my motivation for it). For the SPEC 2017 C/C++ cases, with the profile file unchanged, I could see that 12 of the 16 binaries changed after the patch, but none of them shows a performance change. :(

Thanks for sharing. I'm asking mainly because the applications I work on are not built with FullLTO, and I don't have existing scripts/tools to test FullLTO.

> So setting HotCallSiteThreshold to 0 in the pre-link stage should not change the final inlining result, because the hot callsite functions are just postponed to be inlined at link stage

Given a whole-program call graph (a directed graph, possibly with cycles), the LLVM inliner works bottom-up on the graph, and function size feeds the heuristics. So theoretically, suppressing hot-callsite inlining in prelink (until postlink) may change the inline result. But I consider the different inline result a secondary effect and will not try to reason about its performance impact.

> And as you commented in the code ... why we don't set HotCallSiteThreshold to INT_MIN to completely stop hot-callsite inlining in pre-link

https://reviews.llvm.org/D31201 did this optimization for SampleFDO + ThinLTO, and the comment mainly makes it explicit, as a clarification, that a hot callsite function with zero or negative cost would still be inlined in the pre-link stage. My understanding of why HotCallSiteThreshold is not set to INT_MIN is that small functions (with zero or negative cost) will be inlined similarly in the profiled binary and in the optimized binary, so they don't cause profile mismatch; or rather, the motivation of D31201 is to help profile matching by keeping the call graphs similar at a high level.

@tianleliu (Contributor, Author)

Hi @teresajohnson, do you have comments on this patch?

@teresajohnson (Contributor) left a comment


lgtm

@tianleliu merged commit e038c54 into llvm:main on Apr 14, 2025
10 of 11 checks passed
@tianleliu (Contributor, Author)

tianleliu commented Apr 14, 2025

Hi Mingming, I really appreciate your detailed answer!
Since you mention bottom-up call-graph-based inlining, I have a question (not closely related to this patch) about inlining in SampleProfileLoaderPass.
In bottom-up call-graph-based inlining, each function runs through an optimization pipeline before it is inlined. For example, in buildModuleSimplificationPipeline, a callee function first runs the pre-inline passes, a series of optimizations, and is then inlined. Even after inlining, the caller function goes through buildFunctionSimplificationPipeline, another series of optimizations, before it is inlined into other functions.
But in SampleProfileLoaderPass, inlining (SampleProfileLoader::inlineHotFunctions) proceeds in call-graph top-down order without doing any optimization of the callee functions. Considering that many optimizations scan a BB/function top-down to find opportunities, I'm concerned that a hot short function's optimization opportunities might be lost after it is inlined into a bigger context by SampleProfileLoaderPass.
If my point above is right, I have an immature thought: we could merely mark/record the inlining contexts in SampleProfileLoaderPass but defer the actual inlining to the general inline pipelines/passes, so that all functions obey the bottom-up order and are fully optimized before being inlined.
What do you think of this idea?

@mingmingl-llvm (Contributor)

Interesting idea, thanks for sharing it! To be honest, I haven't had much experience tuning the interaction between the sample loader inliner and the SCC inliner, so please take my opinions with a grain of salt.

IMO the argument goes both ways regarding whether more inline decisions should be moved from the sample loader inliner into the SCC inliner, and a solution may need the capability to try different inline combinations in the interprocedural optimization space, plus better cost modeling for each combination (which probably requires a long compile time).

  • By "both ways": in the sample loader inliner pass, inline decisions are made with call context (especially so with CSSPGO); in the SCC inliner pass, inline decisions are made based on cost-benefit analysis and are less affected by the inlining done in the profiled binary.

@tianleliu (Contributor, Author)

Mingming, thanks for your comment!

@teresajohnson (Contributor)

> Hi Mingming, I really appreciate your detailed answer! Since you mention bottom-up call-graph-based inlining, I have a question (not closely related to this patch) about inlining in SampleProfileLoaderPass. In bottom-up call-graph-based inlining, each function runs through an optimization pipeline before it is inlined. For example, in buildModuleSimplificationPipeline, a callee function first runs the pre-inline passes, a series of optimizations, and is then inlined. Even after inlining, the caller function goes through buildFunctionSimplificationPipeline, another series of optimizations, before it is inlined into other functions.

This is in part necessary because the bottom-up inliner uses a cost-benefit analysis to decide whether to inline, and that is aided by optimizing each function before it is considered for inlining into its callers.

> But in SampleProfileLoaderPass, inlining (SampleProfileLoader::inlineHotFunctions) proceeds in call-graph top-down order without doing any optimization of the callee functions.

The inlining done in the sample loader pass is focused on getting the best matching of the profile with the context from the profiled binary. This is easier on unoptimized functions.

> Considering that many optimizations scan a BB/function top-down to find opportunities, I'm concerned that a hot short function's optimization opportunities might be lost after it is inlined into a bigger context by SampleProfileLoaderPass.

Why wouldn't the inlined version of the hot function be optimized when we go through the normal optimization pipeline?

> If my point above is right, I have an immature thought: we could merely mark/record the inlining contexts in SampleProfileLoaderPass but defer the actual inlining to the general inline pipelines/passes, so that all functions obey the bottom-up order and are fully optimized before being inlined. What do you think of this idea?

In addition to what I mentioned above, a number of optimization passes are profile guided. I'm not sure how you would get the correct (context sensitive) profile data if that inlining was deferred.

@tianleliu (Contributor, Author)

tianleliu commented Apr 16, 2025

Hi @teresajohnson, many thanks for your comment!

> Why wouldn't the inlined version of the hot function be optimized when we go through the normal optimization pipeline?

I ran into an example of this before, in https://discourse.llvm.org/t/rfc-spgo-passpipelines-adding-instcombinepass-and-simplifycfgpass-before-sampleprofileloaderpass-when-building-in-spgo-mode/83340
The example is smax(). If we don't optimize smax (by eliminating the "sext to i32") before it is inlined, the "sext+cmp+sel" pattern in smax gets broken, because after inlining the "trunc+sext" is optimized first: InstCombinePass generally walks the IR from top to bottom without giving the callee function's code any better locality or higher priority.
Though I fixed that case by adding max/min pattern recognition to InstCombine in #118932, I don't think it is a perfect solution, because I believe the root cause of the issue is that inlining happens too early in SampleProfileLoader, without enough optimization of the callee function. It is hard to cover all cases, especially more complicated ones, by adding endless pattern matches.

> The inlining done in the sample loader pass is focused on getting the best matching of the profile with the context from the profiled binary. This is easier on unoptimized functions.

Yes, doing inlining and sample loading together helps match the correct profile info efficiently.

> a number of optimization passes are profile guided.

Yes, the earlier the sample profile is loaded, the more accurate the following optimizations can be.

> I'm not sure how you would get the correct (context sensitive) profile data if that inlining was deferred.

My immature thought on how to implement a context-sensitive deferred inline:

  1. Keep the call and clone the callee function (with unique caller-stack info wrapped in its function name or metadata) instead of inlining it in SampleProfileLoaderPass.
  2. Load and record all of its profile info in the corresponding cloned function.
  3. Inline the cloned function in the general InlinePass.

For example:

```llvm
define funca() { ... }
define funcb() {
  call funca() ; funca is a hot callsite and has profile samples in it
}
```

would be translated in SampleProfileLoader to:

```llvm
define funca_b() { ... } ; has detailed !prof info in funca_b()
define funcb() {
  call funca_b() ; records the sample head count in !prof
}
```

If a function A is called by several functions B, C, ..., it would have several cloned versions A_B, A_C, ..., each with its own name (a unique call stack) and its own profile info (matching the profiled binary). This could achieve context sensitivity. And I believe most of the clones would be inlined in the general InlinePass since they are considered hot, so the cloned functions would mostly be eliminated in the end.
In this scheme, all callee functions can first be fully optimized before being inlined, and it makes the inliner and the sample profile loader more independent, with each focusing on its own job.

var-const pushed a commit to ldionne/llvm-project that referenced this pull request on Apr 17, 2025.