Improve GitHub comments. #1365


Closed

Conversation

nnethercote (Contributor)

  • Remove linebreak in table labels, which makes the table shorter and
    easier to read.
  • Use "-" instead of "N/A", because it's visually smaller.
  • Use ❌ and ✅ instead of 😿 and 🎉, because they're clearer, in shape,
    colour, and meaning.
  • Change "max" column to "range", which is more useful.
  • Reorder columns, to put more fundamental things first.
  • Change meaning of the "All" rows to mean all results, not just
    relevant ones. Good for gauging the overall effect, especially when
    only a small number of benchmarks are affected. Also add an "All" row
    for secondary benchmarks.
  • Reorder rows. "All" first because it's the biggest one. And
    improvements before regressions, to be optimistic.
  • Tweak alignments in the table.

@nnethercote (Contributor, Author)

Here is the old format and the new format, using data from the mixed_primary_secondary unit test.

Old format:

|  | mean¹ | max | count² |
|:---|---:|---:|---:|
| Regressions 😿 <br /> (primary) | 150.0% | 200.0% | 2 |
| Regressions 😿 <br /> (secondary) | 100.0% | 100.0% | 1 |
| Improvements 🎉 <br /> (primary) | -62.5% | -75.0% | 2 |
| Improvements 🎉 <br /> (secondary) | -66.7% | -66.7% | 1 |
| All 😿🎉 (primary) | 43.8% | 200.0% | 4 |

New format:

|  | count¹ | range | mean² |
|:---|---:|---:|---:|
| All (primary) | 4 | -75.0% to 200.0% | 43.8% |
| All (secondary) | 2 | -66.7% to 100.0% | 16.7% |
| Improvements (primary) ✅ | 2 | -75.0% to -50.0% | -62.5% |
| Improvements (secondary) ✅ | 1 | -66.7% to -66.7% | -66.7% |
| Regressions (primary) ❌ | 2 | 100.0% to 200.0% | 150.0% |
| Regressions (secondary) ❌ | 1 | 100.0% to 100.0% | 100.0% |

I wonder if splitting the primary and secondary numbers is worth it for the GitHub comment. We could get the table down to three lines of data if we got rid of the split in the comment.

Footnotes

  1. the arithmetic mean of the percent change
  2. number of relevant changes
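For reference, the "count", "range", and "mean" columns can be computed directly from the relevant percent changes. Below is a minimal Rust sketch of that arithmetic; it is illustrative only, and `Summary`/`summarize` are made-up names, not the rustc-perf code.

```rust
// Illustrative sketch only; `Summary` and `summarize` are not rustc-perf APIs.
struct Summary {
    count: usize,
    range: Option<(f64, f64)>, // (min, max) percent change
    mean: Option<f64>,         // arithmetic mean of the percent changes
}

fn summarize(changes: &[f64]) -> Summary {
    let count = changes.len();
    if count == 0 {
        return Summary { count, range: None, mean: None };
    }
    let min = changes.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = changes.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let mean = changes.iter().sum::<f64>() / count as f64;
    Summary { count, range: Some((min, max)), mean: Some(mean) }
}

fn main() {
    // The "All (primary)" row from the example above: two regressions, two improvements.
    let s = summarize(&[100.0, 200.0, -75.0, -50.0]);
    if let (Some((min, max)), Some(mean)) = (s.range, s.mean) {
        // Prints something like: 4 results, range -75.0% to 200.0%, mean 43.8%
        println!("{} results, range {min:.1}% to {max:.1}%, mean {mean:.1}%", s.count);
    }
}
```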

@nnethercote nnethercote marked this pull request as draft July 22, 2022 06:43
@Kobzol (Contributor) left a comment

Nice, I like the ranges and the inclusion of the "All" rows split by category.

Some comments:

  • When I was last changing the order, I think @Mark-Simulacrum wanted to keep regressions first, since they are more important. I tend to agree: regressions, particularly unexpected ones, should be very visible. (Unexpected) improvements, on the other hand, are just "nice to have" in this setting.
  • One reason why I included the linebreak in the label is the reduced width. If you open the page on mobile (or in a phone simulator, e.g. in Chrome), you'll see why 😅 The new design is quite a bit wider and thus wraps very poorly on mobile devices.

```diff
-pub fn number_of_regressions(&self) -> usize {
-    self.num_regressions
+pub fn number_of_relevant_regressions(&self) -> usize {
+    self.num_relevant_regressions
 }

 /// Arithmetic mean of all improvements as a percent
```
@Kobzol (Contributor) commented Jul 22, 2022

Now the comment should probably be changed.

Suggested change:

```diff
-/// Arithmetic mean of all improvements as a percent
+/// Arithmetic mean of relevant improvements as a percent
```

```
| Improvements 🎉 <br /> (primary) | -16.7% | -16.7% | 1 |
| Improvements 🎉 <br /> (secondary) | N/A | N/A | 0 |
| All 😿🎉 (primary) | 41.7% | 100.0% | 2 |
|  | count[^1] | range | mean[^2] |
```

We should really use https://github.com/mitsuhiko/insta for these tests 😅 It's so annoying to change them manually.
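For illustration, a snapshot test with insta could look roughly like this. It is only a sketch: `render_summary_table` is a made-up stand-in for whatever function builds the markdown summary, and `insta` would be added as a dev-dependency.

```rust
// Illustrative stand-in for the function that builds the markdown summary.
fn render_summary_table(changes: &[f64]) -> String {
    changes
        .iter()
        .map(|c| format!("| {c:.1}% |"))
        .collect::<Vec<_>>()
        .join("\n")
}

#[cfg(test)]
mod tests {
    use super::render_summary_table;

    #[test]
    fn summary_table_snapshot() {
        // insta stores the expected output in a snapshot file and lets you
        // review/update it with `cargo insta review`, instead of hand-editing
        // the expected strings whenever the format changes.
        insta::assert_snapshot!(render_summary_table(&[100.0, -75.0]));
    }
}
```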

@nnethercote (Contributor, Author)

Sigh, I don't ever look at GitHub on mobile, so I prefer shorter but wider tables. On Firefox on my phone the first column has its labels split across two lines even without the hard linebreak. (Unless I flip my phone to landscape, and then everything fits just fine.) So the linebreak is seemingly unnecessary on mobile, but is definitely annoying on a larger screen or on a phone in landscape mode.

Maybe this is all just more reason to not show the primary/secondary split.

@Kobzol (Contributor) commented Jul 22, 2022

Indeed even the current design wraps on many devices. Although the wrapping is a bit nicer than with the wide layout, it's still not ideal. And we can't make it ideal, since we don't control the layout.

@nnethercote (Contributor, Author)

BTW, this was partly inspired by looking at the summary on the compare page, which is very terse, but quite useful once you learn how to read it:

[Screenshot: rustc-perf wall-time compare summary, 2022-04-12 to 2022-07-19]

@Kobzol (Contributor) commented Jul 22, 2022

We could try to have such a terse description in the comments too; no one says that it has to be a table (actually, the table has brought some UI problems, from mobile devices to plaintext e-mails).

For example, something like this:

  • Primary:
    • ✔️ 23 improvements, [-0.1, -8.5] %, mean -4 %
    • ✖️ 8 regressions, [2.0, 10.0] %, mean 5 %

But then we lose the alignment (unless we emulate it with spaces).

(Btw I was actually planning on changing the compare page summary to include much more information, but some terse description like that should also definitely stay there.)

@rylev (Member) commented Jul 22, 2022

Thanks for taking a look at this! Unfortunately, we've had a really hard time finding a solution everyone likes.

We received a lot of feedback about wide tables and a desire for the information to be laid out more vertically. I think we'll definitely want to try to keep that.

I like @Kobzol's comment that there's no need for using tables. They have brought a lot of downsides since their rendering is so hard to control.

Some additional notes:

  • "improvements" and "regressions" is so long - I wish there were more terse words we could use
  • I think perhaps more important than this summary table is a high-level summary about whether the information should be acted upon. Many have commented that they desire something that says "REGRESSION" or "NO REGRESSION" so they can quickly see whether they need to care or not. We currently signal this through a label, but this doesn't seem to do the job fully.

@nnethercote (Contributor, Author)

I agree that text may be better than a table. I was playing around with labels like "All", "Unchanged", "Better", "Worse", as replacements for "Improvements" and "Regressions". I'll experiment more on Monday, including the acceptable/unacceptable messaging.

@nnethercote nnethercote force-pushed the improve-github-comments branch from 2801933 to 6e92c88 Compare July 25, 2022 07:06
@nnethercote (Contributor, Author)

I have updated the code, though it is still a bit rough. Below is what I am aiming to produce.


Finished benchmarking commit (c80dde43f992f3eb419899a34551b84c6301f8e8): comparison url.

Overall result: ❌ regressions found ❌ and ✔️ improvements found ✔️

Next steps: See these docs for more information on how to understand and verify the accuracy of this finding.

If you can justify the regressions found in this perf run, please put @rustbot label: +perf-regression-triaged in a comment along with sufficient written justification. Otherwise, please open an issue or create a new PR that fixes the regressions, and then put @rustbot label: +perf-regression-triaged in a comment along with a link to the newly created issue or PR.

@rustbot label: +perf-regression

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Primary benchmarks: ❌ regressions found ❌ and ✔️ improvements found ✔️

  • All: 100 test results, changes range from 0.0% to 4.5%, with a mean of 0.5%
  • Worse: 7 test results, range 0.5% to 4.5%, mean 2.6%
  • Better: 0 test results
  • Unchanged: 93 test results

Secondary benchmarks: ❌ regressions found ❌

  • All: 110 test results, range 0.0% to 4.1%, mean 0.6%
  • Worse: 55 test results, range 0.9% to 4.1%, mean 1.6%
  • Better: 0 test results
  • Unchanged: 55 test results

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Primary benchmarks: ...

  • All: ...
  • Worse: ...
  • Better: ...
  • Unchanged: ...

Secondary benchmarks: ❌ regressions found ❌

  • ...
  • ...

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

...similar to above...

@rust-lang rust-lang deleted a comment from rustbot Jul 25, 2022
@nnethercote (Contributor, Author)

And it occurs to me that it might be useful to CC @rust-lang/wg-compiler-performance on any regressing PR that lands on master, to get more attention, and earlier.

@Kobzol (Contributor) commented Jul 25, 2022

Quick feedback for this layout:
I like the condensed amount of information. I wonder if we even need better/worse, since looking at just the improved data or just the regressed data in isolation seems somehow wrong; we should be looking at all the data at once to get more meaningful numbers. It's great that we have a -3.5 % improvement on average, but what if there are also 20 regressions and the total mean among relevant results is -0.1 %? Maybe we should just show statistics about all of the data combined, without splitting it into improvements/regressions; otherwise it leads to thinking like "Oh, if I just ignore these 5 regressions, this PR has a -3.5 % improvement on average, nice".

I wonder if we could also show the underlying statistics in some nicer format, like rendering a small histogram of the changes.
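As a rough sketch of that histogram idea (purely illustrative, not part of this PR; the bucket boundaries and bar rendering are arbitrary choices for the example):

```rust
// Bucket the percent changes and render each bucket as a bar of block characters.
fn text_histogram(changes: &[f64]) -> String {
    let buckets = [(-10.0, -2.0), (-2.0, 0.0), (0.0, 2.0), (2.0, 10.0)];
    buckets
        .iter()
        .map(|&(lo, hi)| {
            let n = changes.iter().filter(|&&c| c >= lo && c < hi).count();
            format!("[{lo:>5.1}%, {hi:>5.1}%) {}", "█".repeat(n))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    println!("{}", text_histogram(&[-3.5, -0.4, 0.1, 0.2, 2.5, 4.1]));
}
```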

Two issues I have with the latest layout:

  • Using two emojis per regression/improvement is quite confusing to me; I couldn't make sense of it and kept wondering how it relates to the summary (is the first ✖️ a marker for the whole commit, or is it just the first of the two emojis surrounding a regression?).
  • The important information in this layout seems too easy to miss. In the table, it was quite obvious where to look, and the numbers were clearly separated from the rest of the information (which is basically just noise). Here it blends together with the extra information (like "test results", "changes range from", etc.). This is made worse by a relatively unimportant piece of information (the number of test results) coming first on each row, and by the numbers not being aligned (unlike in the table, where they were aligned).
    I think that if we had the option to use some kind of "super-bold" font or a colourful font for the numbers, it would be much better, but here on GH the options are quite limited. I think this would mostly be solved if we just removed the "better" and "worse" rows, though.

@nnethercote (Contributor, Author)

The "Better"/"Worse" numbers are useful in cases where you have e.g. 100 improvements ranging from -3% to -1% and 5 regressions ranging from 1% to 3%. Or you just have 6 improvements but they're in the range -20% to -10%. I think removing them would be bad.

As for the "harder to read" issue: that's an unavoidable consequence of switching from text to tables. Unless we use a fixed width font which I think would be a bad idea. Though we could choose line labels that are more similar lengths, something like:

  • Overall: 100 test results, range 0.0% to 4.5%, mean 0.5%
  • Worse: 7 test results, range 0.5% to 4.5%, mean 2.6%
  • Better: 0 test results
  • Same: 93 test results

As for verbosity, I was experimenting and the comment above is inconsistent in its wording:

  • All: 100 test results, changes range from 0.0% to 4.5%, with a mean of 0.5%
  • Worse: 7 test results, range 0.5% to 4.5%, mean 2.6%

The terse form is better for us, but the verbose form may be better for non-rustc-perf experts. I used "test results" to be consistent with the terminology in the glossary, but just "results" might be better.

Also, I think the result counts are highly useful, e.g. it's good to know when there are 5x as many improvements as regressions.

@Kobzol (Contributor) commented Jul 25, 2022

The counts are indeed useful, but I'm not sure about the separate means and ranges. If there are regressions of 1% and 0.5% and improvements of -0.5% and -1%, it seems weird to state that the average improvement is -0.75% without including the regression values in that mean, even though they may be numerically quite close to the improvement values. I consider it better to tell the "whole truth", i.e. the range is [1%, -1%] and the mean amongst relevant results is ~0%.

To sum up, what makes sense to me is to see the number of improvements, the number of regressions, the total number of changes, the total range ([worst regression, best improvement]) and the total mean (the mean amongst all regressions and improvements). Showing means and ranges separately for improvements and regressions might introduce bias into interpreting the results.

But this is probably not strictly relevant to this PR, it was just an idea.
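To make the concern concrete, here is a tiny sketch using the numbers above (plain Rust for illustration, not rustc-perf code):

```rust
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

fn main() {
    let regressions = [1.0, 0.5];
    let improvements = [-0.5, -1.0];
    let all: Vec<f64> = regressions.iter().chain(improvements.iter()).copied().collect();

    // Split view: "a -0.75% improvement on average" sounds better than the data is.
    println!("mean of improvements:         {:.2}%", mean(&improvements)); // -0.75%
    // Combined view: the relevant results cancel out almost exactly.
    println!("mean of all relevant results: {:.2}%", mean(&all)); // 0.00%
}
```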

@nnethercote (Contributor, Author)

The current comment lacks the "all" numbers and I think we are in strong agreement that adding them in this PR is a good thing.

> The counts are indeed useful, but I'm not sure about the separated mean and ranges. If there are regressions 1%, 0.5% and improvements -0.5%, -1%, it seems weird to state that the average improvement is -0.75%

I would say "among the results that improved, the average change is -0.75%".

I think it's fine if you find the overall numbers more interesting. But I find the worse/better numbers to also be interesting, so I'd like to keep them. Indeed, the numbers in this PR are now pretty much what I personally think of as the ideal set when I'm trying to get a quick picture.

@Mark-Simulacrum (Member)

This is somewhat true of existing comments, but I think the new one doesn't really improve on this: the comment is really long; on my (not small) display, it takes up essentially a full page. I think if we're expecting most folks to action the information in some way, we need to cut back drastically. I'm wondering the extent to which it's helpful to actually have all the statistics etc embedded in the PR comment -- do folks feel those are driving meaningful action from contributors?

I think I've heard several times now that folks either ignored or didn't know what to do with the perf comments, which to me suggests that delivering more information about the regression (metrics, etc.) isn't actually going to drive them to act on it -- the problem is the "ok, what now?" for those folks. I think for those of us working closely with perf those metrics are a lot more useful (typically I can make an instinctive call from a few benchmark names and the metrics in the current tables / perf page), but if you're less frequently working with this information, I don't think I'd know what to do when confronted with the comment, and it's long enough that I would probably give up pretty quickly, especially for a regression that looks "somewhat small" (read: not >10%, probably).

I would suggest we try to dial back on the information in the actual comment; we can push tables, text, etc. into dedicated pages on perf.rust-lang.org -- not necessarily the comparison page! It's relatively easy for us to host template-generated content for arbitrary comparisons. As a concrete suggestion for a short comment, cutting down @nnethercote's current proposal:

Finished benchmarking commit (c80dde43f992f3eb419899a34551b84c6301f8e8): detailed comparison url.

Overall result: regressions found, though some improvements too

If not yet merged:
Next steps for the PR author or reviewer: Read [this checklist] (1) to understand what is expected from either the PR author or the reviewer prior to merging this PR. In all cases, you should reach a point that you leave a comment with @rustbot label +perf-regression-triaged and either a justification for the regression or rationale for it being spurious/unrelated to your changes.

If merged:
Next steps for the PR author or reviewer: This PR has already been merged despite this regression, but please read [this checklist] (2) to take action to follow up on the regression.

If merged, and a rollup:
Next steps for the rollup creator: This rollup resulted in a performance regression. Please follow [this checklist] (3) to evaluate whether you should investigate and what steps you should take.

@rustbot label: +perf-regression


Where I'd probably put together the checklists in separate pages/files in this repo, something like:

Not yet merged checklist

Your PR resulted in at least some regressions in compiler performance. In order to understand whether you should follow up, please follow one of these paths:

  1. If you expected a regression (e.g., your PR intentionally hurts compiler performance due to addition of extra checks), then you are likely done. Please leave a comment with @rustbot label +perf-regression-triaged and an explanation for why the performance regression is justified.
  2. If you believe that your PR should not have changed compiler performance, please follow [these steps] to investigate the regressions reported by the automated test suite.
    • After investigating, if you believe the performance change reported is spurious (e.g., noise, unrelated to your change), then leave a comment indicating that with @rustbot label +perf-regression-triaged.
    • After investigating, if the change looks to be real, please either fix it and re-run perf, or, per point (1), you may be able to justify it.

In all cases, you should reach a point that you leave a comment with @rustbot label +perf-regression-triaged and either a justification for the regression or rationale for it being spurious/unrelated to your changes.

(Something similar for the other two).

The notable thing here is that I think we should always push for the user to take some action, make an evaluation, etc. -- even if that evaluation is wrong. Today I think it's pretty atypical for users to actually leave a comment following up on regressions (before or after merge), which seems bad. If we can get to a point where people are leaving comments (maybe with incorrect justification), we can pursue improvements to tooling to correct wrong assumptions. As-is, we're flying blind, since users aren't really engaging in a way that's visible to us (or at least me).


I also think we should probably invest in blocking merge on perf-regression labels unless perf-regression-triaged is applied. It should be feasible to make bors do that if we have agreement on it, and it would fit with this push toward the label being applied in a mechanized way. (But I would wait on doing that until we feel it's reasonable to expect someone to have applied that label, and ideally in practice we see it applied "most" of the time).


- Use lists instead of tables, which cause lots of layout problems.
- Show "range" instead of just "max".
- Include mean and range of all results, as well as those numbers for
  the improvements and the regressions.
- Put all the text at the top.
- Make clearer the overall ok/not-ok evaluation, and which metrics
  contributed to that (i.e. instructions, yes; RSS and cycles, no).
- Use ❌ and ✅ instead of 😿 and 🎉, because they're clearer, in shape,
  colour, and meaning.
- Include a link to a document with some explanation of how to evaluate
  the overall result, for the rare cases where it is misleading or wrong.
- CC `@rust-lang/wg-compiler-performance` on regressions that occur on
  master.
@nnethercote nnethercote force-pushed the improve-github-comments branch from 6e92c88 to 69da09d Compare July 26, 2022 02:45
@nnethercote (Contributor, Author)

See below for an updated example of a comment with this PR.

Notes:

  • I agree that pushing PR authors more towards action is a good thing. I have made the language here stronger, and that's why all the "here's what to do" text is at the top, and the numbers are at the bottom.
  • I think we definitely should include numbers in the comment. Forcing people to look at the full comparison page without any preliminary summary could easily overwhelm them. The numbers summary is also used in triage and compiler team meetings.
  • I think most of the "here's what to do" text should be in the comment, not in a checklist on another page. (In contrast, I think the info on "how to understand and verify the accuracy of this finding" does belong on a separate page, because it's more auxiliary.) I don't think the existing comment length is a problem.
  • We all clearly have very different ideas of what this comment should contain. It's a perfect bikeshed.
  • Having said that, I think what I have is a significant improvement over the existing comment.

Finished benchmarking commit (c80dde43f992f3eb419899a34551b84c6301f8e8): comparison url.

Overall result: ❌ ✅ (regressions and improvements found) - ACTION NEEDED

See these docs for information on how to understand and verify the accuracy of this finding.

Next steps: If you can justify the regressions found in this perf run, please put @rustbot label: +perf-regression-triaged in a comment along with sufficient written justification. Otherwise, please open an issue or create a new PR that fixes the regressions, and then put @rustbot label: +perf-regression-triaged in a comment along with a link to the newly created issue or PR.

@rustbot label: +perf-regression

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Primary benchmarks: ❌ ✅ (regressions and improvements found)

  • Overall: 4 results, range -75.0% to 200.0%, mean 43.8%
  • Worse: 2 results, range 100.0% to 200.0%, mean 150.0%
  • Better: 2 results, range -75.0% to -50.0%, mean -62.5%
  • Same: 0 results

Secondary benchmarks: no notable results

  • Overall: 0 results
  • Worse: 0 results
  • Better: 0 results
  • Same: 0 results

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

...similar to above...

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

...similar to above...

@rustbot commented Jul 26, 2022

Error: This repository is not enabled to use triagebot.
Add a triagebot.toml in the root of the master branch to enable it.

Please file an issue on GitHub at triagebot if there's a problem with this bot, or reach out on #t-infra on Zulip.

@rylev (Member) commented Jul 26, 2022

I like the overall direction. Here are some small notes on things we could improve more:

  • Make the git sha shorter. The full sha isn't necessary, and it just adds to the noise.
  • Make the next steps section shorter and more direct. Perhaps we could have bullet points stating exactly what to do in a more terse fashion. Right now the wording is a bit long and takes a bit to read through.
  • Move the "If you disagree with this performance assessment, " and rustbot notes to the bottom. Those should only be necessary rarely.
  • I don't understand the utility of the "same" row. Do we need that?

@Mark-Simulacrum (Member)

> I think we definitely should include numbers in the comment. Forcing people to look at the full comparison page without any preliminary summary could easily overwhelm them. The numbers summary is also used in triage and compiler team meetings.

FWIW, I think maybe this didn't come across in the comment I posted (went through a few revisions), but my thought was that we would move the summary tables/section to a new page or a heading on the comparison pages -- i.e., instead of dumping folks directly into the comparison page, they get put onto the current "tail" of the proposed comment, whatever we decide that should look like.

In terms of triage & compiler team meetings, I think those are distinct "views" and should indeed keep the summary in some form. Triage has a primary audience of the triagers, who I think we can call "experts" -- the posting to compiler team meetings and TWIR is more of a nice-to-have, and I think the top-line summary is the more critical aspect there, as well as any human call outs (e.g., "this PR needs investigation").

> I think most of the "here's what to do" text should be in the comment, not in a checklist on another page. (In contrast, I think the info on "how to understand and verify the accuracy of this finding" does belong on a separate page, because it's more auxiliary.)

I don't have a strong opinion here. I think if we can fit it into the main comment, great, but if we have 3 different versions (as I suggested) then it might be nice to have a place where all three are in roughly the same place, so users can fall back on a different section more easily if needed.

> I don't think the existing comment length is a problem.

I would disagree :) But I have a proposal: let's try to agree on a couple of metrics we can start tracking for the effectiveness of the comments, so that we can iterate and try to improve on those. Obviously we should keep in mind our instinct and anecdotal evidence too, not just numbers, but I think I'd start by having us:

  • Put into triage reports the number and percent of "any" reactions: i.e., where the PR had a perf-regression-triaged label before triage arrived at it, and that label was added alongside justification of some kind. This metric shouldn't care about whether we think that justification was "right", just that we actually got it.
  • A second metric for "triage success rate": how often do we (the performance wg) disagree with the assessment made that led to a -triaged label. (Probably also a % and numeric value).

If we can put some automation in place to track this (should be pretty easy, we probably have the necessary info from the existing queries we do to get PR titles?), then I think I'm happy with basically any format for the comment, though I might suggest landing metrics first, waiting a full triage cycle, then landing a new comment format. Gives us a little time to calibrate.

Basically: I don't really know that the new comment will be a net negative; I think it may be (I would prioritize cutting length), but rather than argue about it, let's just find out -- I think we can.

@Mark-Simulacrum (Member)

Move the "If you disagree with this performance assessment, " and rustbot notes to the bottom. Those should only be necessary rarely.

FWIW, I somewhat strongly disagree with this in the sense that I think we do want to drive a reaction from folks even if they disagree, rather than just ignoring the comment. It should be rarely necessary, but in practice we don't really know, and I'd much rather get some comment than think they just ignored the assessment.

@nnethercote (Contributor, Author)

Some of the changes here have landed in other PRs. This PR isn't going anywhere.

@nnethercote nnethercote deleted the improve-github-comments branch August 24, 2022 03:50