[LTO][Pipelines] Add 0 hot-caller threshold for SamplePGO + FullLTO #135152
Conversation
…+ FullLTO. If a hot callsite function is not inlined in the first (profiled) build, inlining that hot callsite in the pre-link stage of the second SamplePGO build can cause its function samples to not be found in the profile file at link stage, so some profile information is lost. ThinLTO already handles this by setting HotCallSiteThreshold to 0 to suppress the inline. This patch adds the same handling for FullLTO.
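For reference, here is a minimal sketch of the intended behavior, assuming the change mirrors the existing ThinLTO pre-link handling in `PassBuilder::buildInlinerPipeline` (this is a hypothetical helper, not the exact diff of this PR; header paths may vary across LLVM versions):

```cpp
#include "llvm/Analysis/InlineCost.h"        // InlineParams, getInlineParams
#include "llvm/Pass.h"                       // ThinOrFullLTOPhase
#include "llvm/Passes/OptimizationLevel.h"   // OptimizationLevel
#include "llvm/Support/PGOOptions.h"         // PGOOptions
#include <optional>

using namespace llvm;

// Hypothetical helper (not part of LLVM) sketching how the pre-link inline
// parameters would be set up under this patch's approach.
static InlineParams
getPreLinkInlineParams(OptimizationLevel Level, ThinOrFullLTOPhase Phase,
                       const std::optional<PGOOptions> &PGOOpt) {
  InlineParams IP =
      getInlineParams(Level.getSpeedupLevel(), Level.getSizeLevel());

  // For SamplePGO, suppress hot-callsite inlining in the LTO pre-link phase so
  // the link-stage sample loader can still find the callee's samples under its
  // own name. The existing code covers ThinLTOPreLink; the idea of the patch
  // is to give FullLTOPreLink the same treatment.
  if ((Phase == ThinOrFullLTOPhase::ThinLTOPreLink ||
       Phase == ThinOrFullLTOPhase::FullLTOPreLink) &&
      PGOOpt && PGOOpt->Action == PGOOptions::SampleUse)
    IP.HotCallSiteThreshold = 0;

  return IP;
}
```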
Do you have build stats (e.g., functions that missed profiles before and get them after) or binary performance data for this FullLTO patch? Does the 1st build refer to the profiled binary build here? I'm asking because the motivation for the zero inline threshold in the ThinLTO pre-link pipeline (before cross-module inlining) is to make profile matching in the backend (after cross-module inlining) more accurate within one build. I'm not sure whether this zero inline threshold helps across builds, though.
The change itself makes sense to me.
Hi Mingming, thanks for your review.
Thanks for sharing. I'm asking mainly because the applications that I work on are not built with full LTO and I don't have existing scripts/tools to test full LTO.
Given a whole-program call graph (a directed graph, possibly with cycles), the LLVM inliner works bottom-up on the graph, and function size is used by the heuristics. So theoretically, suppressing inlining of hot callsites in pre-link (deferring it to post-link) may change the inline result. But I consider the different inline result a secondary effect and won't try to reason about its performance impact.
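As a hypothetical illustration of that secondary effect (all function names below are made up):

```cpp
// Call chain a() -> b() -> hot_c(), where the call to hot_c() inside b() is a
// hot callsite. Everything here is hypothetical.
int hot_c(int x) {          // stand-in for a hot callee with a non-trivial body
  int s = 0;
  for (int i = 0; i < x; ++i)
    s += i * i;
  return s;
}

int b(int x) { return hot_c(x) + 1; }   // the hot callsite lives here

int a(int x) { return b(x) + 2; }

// Bottom-up inlining visits hot_c, then b, then a. If hot_c() is inlined into
// b() during pre-link, b() grows and the post-link inliner may decline to
// inline b() into a(). If that hot inline is deferred to post-link, b() stays
// small and may be inlined into a() first, so the final inline result can
// differ: the "secondary effect" mentioned above.
```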
https://reviews.llvm.org/D31201 did this optimization for SampleFDO + ThinLTO, and its comment mainly clarifies that a hot callsite function with zero or negative cost will still be inlined in the pre-link stage. My understanding of this (i.e., not setting HotCallSiteThreshold to INT_MIN) is that small functions (with zero or negative cost) are inlined similarly in the profiled binary and in the optimized binary, so they don't cause profile mismatch; or rather, the motivation of D31201 is to help profile matching by keeping the call graphs similar at a high level.
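For context, my understanding is that the final cost/threshold check in LLVM's inline cost analysis is roughly of the form below (a paraphrase, not the exact code in `InlineCost.cpp`), which is why a threshold of 0 still lets zero- or negative-cost callsites through:

```cpp
#include <algorithm>

// Rough paraphrase of the cost/threshold comparison used by LLVM's inline
// cost analysis; the real logic lives in llvm/lib/Analysis/InlineCost.cpp.
bool shouldInline(int Cost, int Threshold) {
  // With HotCallSiteThreshold = 0, the effective threshold for a hot callsite
  // becomes 0, so only callsites whose computed Cost is <= 0 (tiny functions
  // whose body is no larger than the call overhead) are still inlined.
  return Cost < std::max(1, Threshold);
}
```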
Hi @teresajohnson, do you have comments on this patch?
lgtm
Hi Mingming, I really appreciate your detailed answer!
Interesting idea, thanks for sharing it! To be honest, I haven't had much experience tuning inlining between the sample loader inliner and the SCC inliner, so please take my opinions with a grain of salt. IMO the argument goes both ways regarding whether more inline decisions should be moved from the sample loader inliner into the SCC inliner, and a solution may need the capability to try different inline combinations in the interprocedural optimization space, plus better cost modeling for each combination (which probably requires a long compile time).
Mingming, thanks for your comment!
This is in part necessary because the bottom-up inliner uses a cost-benefit analysis to decide whether to inline, and that is aided by optimizing each function before it is considered for inlining into its callers.
The inlining done in the sample loader pass is focused on getting the best matching of the profile with the context from the profiled binary. This is easier on unoptimized functions.
Why wouldn't the inlined version of the hot function be optimized when we go through the normal optimization pipeline?
In addition to what I mentioned above, a number of optimization passes are profile-guided. I'm not sure how you would get the correct (context-sensitive) profile data if that inlining were deferred.
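A hypothetical sketch of what that context sensitivity looks like concretely (all names, counts, and line offsets below are made up, and the nested text-profile layout shown in the comments is only approximate):

```cpp
// Hypothetical caller/callee pair; everything here is illustrative.
int hot(int x) {            // hot callee, heavily executed from caller()
  int s = 0;
  for (int i = 0; i < x; ++i)
    s += i;
  return s;
}

int caller(int x) {
  int r = x + 1;
  return hot(r);            // the hot callsite
}

// If the profiled binary inlined hot() into caller() at this callsite, the
// text-format sample profile records hot()'s samples nested under caller(),
// roughly like:
//
//   caller:12000:800
//    1: 800
//    2: hot:9500
//     1: 4000
//     2: 5500
//
// Those nested counts are only meaningful for this particular inline context.
// If a later build defers that inline to a pass that runs after the sample
// loader, the context-specific counts can no longer be attached at the
// inlined callsite; at best they get folded into a context-less profile for
// hot(), losing the context sensitivity discussed above.
```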
Hi @teresajohnson, many thanks for your comment!
Here is an example I ran into before: https://discourse.llvm.org/t/rfc-spgo-passpipelines-adding-instcombinepass-and-simplifycfgpass-before-sampleprofileloaderpass-when-building-in-spgo-mode/83340
Yes, interleaving inlining with sample loading helps match the correct profile info efficiently.
Yes, the earlier the sample profile is loaded, the more accurately the subsequent optimizations can work.
My rough thought on how a context-sensitive deferred inline could be implemented: