Skip to content

Add extended stats to GH summary table #1322

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 19, 2022
Merged

Add extended stats to GH summary table #1322

merged 5 commits into from
May 19, 2022

Conversation

Kobzol
Copy link
Contributor

@Kobzol Kobzol commented May 15, 2022

This is an attempt to improve the (currently non-existing) display of secondary performance stats (like max RSS or cycles) in GH summary reports. This corresponds to the "better UI/UX" item from the performance roadmap.

Sometimes changes in icounts are virtually just noise, but some other metric changes in significant ways (typically max RSS, but sometimes also cycles). This can be missed by onlookers, since to show these secondary steps, you have to manually switch to them in the compare page, which is sometimes neglected.

I think that it would be nice if we could show the information in the summary, if we find that the changes are reasonably large (this part will be difficult, as always). The PR currently shows the same summary table that is currently used for icounts also for Max RSS, if we find that there have been large changes (TODO: define what does this mean). We can also do something different than showing the full table, like saying "Watch out, it seems that RSS has changed significantly, check this link to find more.".

The first commit is just small cleanup that enables using multiple summary tables in a single PR without duplicating footer with the footnote definition.

@Kobzol Kobzol requested review from lqd, rylev and nnethercote May 15, 2022 11:15
@lqd
Copy link
Member

lqd commented May 15, 2022

Nice, thanks for working on this!

Since the secondary stats may have too much variance, we could even maybe hide them in <details> when we notice a reasonably large change, e.g. with a label that is tentative about the change being actually significant and not just the usual random perturbations) ?

(I wonder if the historical significance factor could also be helpful here, to smooth out the number of times these detections happen. We can iterate later in any case, if the initial steps are not super obtrusive and misleading, which is hard to predict.)

@nnethercote
Copy link
Contributor

One concern is that I've noticed max-rss can be noisy in a correlated way: often all the benchmarks will improve or regress slightly on one run, but it's not actually meaningful. I don't know why this is.

But I think it's definitely worth trying. It will likely take some experimentation to find the right settings.

Copy link
Member

@rylev rylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of this change though it's obvious we'll need to rapidly iterate as so much of the way things currently work is tuned to instruction counts and is not adjusted here.

The one thing I'd like to see changed before we merge this is a clearer indication of max-rss being a secondary stat. As noted in my comments, the way things currently are, it will be possible for us to report "no relevant results" but then show max-rss stats. We need to make this clearer to avoid a bunch of confusion.

@Kobzol
Copy link
Contributor Author

Kobzol commented May 16, 2022

Since the secondary stats may have too much variance, we could even maybe hide them in

when we notice a reasonably large change, e.g. with a label that is tentative about the change being actually significant and not just the usual random perturbations) ?

Hiding the secondary table(s) in <details> sounds like a good idea. What do others think?

One concern is that I've noticed max-rss can be noisy in a correlated way: often all the benchmarks will improve or regress slightly on one run, but it's not actually meaningful. I don't know why this is.

I have also noticed this. Maybe we can tune the change detection algorithm for this and say something like "if at least X benchmarks had large/very large change, then we consider it to be interesting".

I have refactored the code slightly. The deserves_attention naming is a bit confusing, since for icounts we use it for next steps on PR summaries and for inclusion in the triage, while for Max RSS we use it for deciding whether we should display the summary table in the PR body.. sigh :D

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

I thought about this a bit more, and I really like @lqd's suggestion. Instead of coming up with arbitrary thresholds, let's just always show MaxRSS and cycles, but at the same time:

  1. Hide them in <details>, so that they won't take that much space, but they will be "one click away".
  2. Include their mean in the <details> header, so we quickly see whether it's even worth it to click on it.

This will make the raw text content of the comment a bit larger, but I hope that won't be a problem. It will also do a few more work on the backend, but since perf. report PRs are not that frequent, I think that should be fine.

WDYT?

@rylev
Copy link
Member

rylev commented May 18, 2022

Are we always going to show max-rss and cycles even if the instruction count is not significant? So, we keep the (not quite so arbitrary but still somewhat arbitrary) threshold for instruction count but get ride of it for other stats?

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

Good point. I suggest this:

  1. Always add Max RSS and cycles table, but keep it hidden.
  2. Always add icount table. If there are relevant results, show it as before.
    If there are no relevant results, hide it by default (same as RSS/cycles) + add the current string "the run did not find any relevant results".

I think that we should treat icounts a bit specially, as they are usually the most stable/least noisy stat and one that we primarily try to optimize.

Showing the icount table (even though it will be hidden by default) if there are no results in it seems a bit wasteful, but I don't consider that to be a big problem and I would rather do that than add another special case to the code. And also we can further convince the reader that the string "no relevant results were found" indeed means that the table is "empty". I think that this mostly only happens for icounts. I don't ever recall seeing a perf. result where cycles/max RSS didn't have at least a single relevant result (it might happen of course, but I think that it's quite rare).

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

Just a reminder (I remembered this wrongly): currently there is no real dedicated heuristic for not showing the icount table in PR summaries (deserves_attention is not used for this!). The only case where the table is not shown is when the relevant result count is zero.

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

As an example, I hand-crafted it here. If there are no relevant results, there is no point in showing the empty table.

@rylev
Copy link
Member

rylev commented May 18, 2022

I think this is fine as long as we treat max-rss the same. If there are 0 results for max-rss, we should also not show an empty table.

@rylev
Copy link
Member

rylev commented May 18, 2022

I am a bit worried though that we won't be providing any assistance to folks interpreting max-rss or cycles. We've tuned instruction counts to the point where noise doesn't normally show up, but for max-rss and cycles, noise will come through more often. It would be nice if we had someone of helping people interpret these results.

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

We could create some Markdown page in the rustc-perf repository with a short "Hitchhiker's guide to interpreting perf results" and link to it from the summary.

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

I changed the code to always include Max RSS and cycles. Yes, I do realize that the instruction comparison and summaries are calculated twice, but it was just so painful to modify the code to remove the duplication that I left it like this :D Until we decide if we want to do this.

Copy link
Member

@rylev rylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small things need to be addressed, but otherwise I'm ok with merging almost as is. We can then refactor.

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

I refactored the stat summaries to remove duplication and the panic call.

@nnethercote
Copy link
Contributor

In the "Max RSS" section is it worth having a note that RSS measurements can be noisy?

@Kobzol
Copy link
Contributor Author

Kobzol commented May 18, 2022

Directly to the PR comment or to the proposed page with a guide for interpreting the results?

@nnethercote
Copy link
Contributor

Either sounds ok, as long as it's somewhere that's fairly easy to find.

@rylev rylev enabled auto-merge May 19, 2022 07:10
@rylev rylev merged commit 53f6df8 into master May 19, 2022
@rylev rylev deleted the summary-extended-stats branch May 19, 2022 07:10
@rylev
Copy link
Member

rylev commented May 19, 2022

I think maybe using ### instead of ## headers might be nicer, thoughts?

With ##

image

With ###

image

@Kobzol
Copy link
Contributor Author

Kobzol commented May 19, 2022

That looks better, indeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants