Improve GitHub comments. #1365
Conversation
Here is the old format and the new format, using data from the
I wonder if splitting the primary and secondary numbers is worth it for the GitHub comment. We could get the table down to three lines of data if we got rid of the split in the comment.
Nice, I like the ranges and the inclusion of "all" results split by category.
Some comments:
- When I was last changing the order, I think that @Mark-Simulacrum wanted to keep regressions first, since they are more important. I tend to agree: regressions, particularly unexpected ones, should be very visible. (Unexpected) improvements, on the other hand, are just "nice to have" in this setting.
- One reason why I included the linebreak in the label is the reduced width. If you open the page on mobile (or in the phone simulator in e.g. Chrome), you'll see why 😅 The new design is quite a bit wider and thus wraps very poorly on mobile devices.
```diff
-    pub fn number_of_regressions(&self) -> usize {
-        self.num_regressions
+    pub fn number_of_relevant_regressions(&self) -> usize {
+        self.num_relevant_regressions
     }
```
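For context on what the `relevant` naming is about, here is a rough illustrative sketch (my own simplification, not the actual comparison.rs logic, which uses a more involved significance test) of counting all regressions versus only the relevant ones:

```rust
/// Illustrative only: "relevant" stands in for "significant enough to report";
/// the real significance logic in rustc-perf is more involved than a fixed
/// threshold.
struct RegressionCounts {
    num_regressions: usize,
    num_relevant_regressions: usize,
}

fn count_regressions(changes_pct: &[f64], relevance_threshold_pct: f64) -> RegressionCounts {
    // Positive percentage changes are regressions (slower / more instructions).
    let regressions: Vec<f64> = changes_pct.iter().copied().filter(|c| *c > 0.0).collect();
    let num_relevant_regressions = regressions
        .iter()
        .filter(|c| **c >= relevance_threshold_pct)
        .count();
    RegressionCounts {
        num_regressions: regressions.len(),
        num_relevant_regressions,
    }
}

fn main() {
    let counts = count_regressions(&[-2.0, 0.3, 1.8, 6.5], 1.0);
    assert_eq!(counts.num_regressions, 3);
    assert_eq!(counts.num_relevant_regressions, 2);
}
```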
```rust
/// Arithmetic mean of all improvements as a percent
```
Now the comment should probably be changed.
```diff
-/// Arithmetic mean of all improvements as a percent
+/// Arithmetic mean of relevant improvements as a percent
```
site/src/comparison.rs (outdated)
```
| Improvements 🎉 <br /> (primary) | -16.7% | -16.7% | 1 |
| Improvements 🎉 <br /> (secondary) | N/A | N/A | 0 |
| All 😿🎉 (primary) | 41.7% | 100.0% | 2 |
| | count[^1] | range | mean[^2] |
```
We should really use https://github.com/mitsuhiko/insta for these tests 😅 It's so annoying to change them manually.
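For illustration, a minimal sketch of what an insta snapshot test could look like here, assuming `insta` were added as a dev-dependency (the fixture string and test name are made up; the real tests would feed in a rendered summary instead of a literal):

```rust
#[cfg(test)]
mod snapshot_tests {
    #[test]
    fn summary_table_snapshot() {
        // Stand-in for the rendered GitHub-comment summary.
        let body = "\
| | count | range | mean |
|:---:|:---:|:---:|:---:|
| Regressions ❌ (primary) | 1 | 41.7% | 41.7% |
";
        // The first run records a .snap file; later changes are reviewed and
        // accepted with `cargo insta review` instead of hand-editing expected
        // strings in the test source.
        insta::assert_snapshot!(body);
    }
}
```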
Sigh, I don't ever look at GitHub on mobile so I prefer shorter but wider tables. On Firefox on my phone the first column has its labels split across two lines even without the hard linebreak. (Unless I flip my phone to landscape, and then everything fits just fine.) So the linebreak is seemingly unnecessary on mobile, but is definitely annoying on a larger screen or on a phone in landscape mode. Maybe this is all just more reason to not show the primary/secondary split.
Indeed, even the current design wraps on many devices. Although the wrapping is a bit nicer than with the wide layout, it's still not ideal. And we can't make it ideal, since we don't control the layout.
We could try to have such a terse description in the comments as well; no one says that it has to be a table (actually the table brought some UI problems, from mobile devices to plaintext e-mails). For example, something like this:
But then we lose the alignment (unless we emulate it with spaces). (Btw I was actually planning on changing the compare page summary to include much more information, but some terse description like that should also definitely stay there.)
Thanks for taking a look at this! Unfortunately, we've had a really hard time finding a solution everyone likes. We received a lot of feedback about wide formatted tables and the desire for the information to be more vertically laid out. I think we'll definitely want to try to keep that. I like @Kobzol's comment that there's no need for using tables. They have brought a lot of downsides since their rendering is so hard to control. Some additional notes:
I agree that text may be better than a table. I was playing around with labels like "All", "Unchanged", "Better", "Worse", as replacements for "Improvements" and "Regressions". I'll experiment more on Monday, including the acceptable/unacceptable messaging.
(Branch force-pushed from 2801933 to 6e92c88.)
I have updated the code, though it is still a bit rough. Below is what I am aiming to produce.

Finished benchmarking commit (c80dde43f992f3eb419899a34551b84c6301f8e8): comparison url.

Overall result: ❌ regressions found ❌ and ✔️ improvements found ✔️

Next steps: See these docs for more information on how to understand and verify the accuracy of this finding. If you can justify the regressions found in this perf run, please put
@rustbot label: +perf-regression
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Primary benchmarks: ❌ regressions found ❌ and ✔️ improvements found ✔️
Secondary benchmarks: ❌ regressions found ❌
Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Primary benchmarks: ...
Secondary benchmarks: ❌ regressions found ❌
Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment. ...similar to above...
And it occurs to me that it might be useful to mark any regressing PR that lands on master with
Quick feedback for this layout: I wonder if we could also show the underlying statistics in some nicer format, like rendering a small histogram of the changes. Two issues I have with the latest layout:
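As a purely illustrative sketch of the histogram idea (the bucket boundaries and formatting are my own invention, not something this PR implements), a small text rendering could look like this:

```rust
/// Illustrative sketch only: render a tiny text histogram of per-benchmark
/// percentage changes. Bucket boundaries are arbitrary, and changes outside
/// the listed buckets are simply not shown here.
fn render_histogram(changes_pct: &[f64]) -> String {
    let buckets = [(-10.0, -5.0), (-5.0, -1.0), (-1.0, 1.0), (1.0, 5.0), (5.0, 10.0)];
    let mut out = String::new();
    for (lo, hi) in buckets {
        let count = changes_pct.iter().filter(|c| **c >= lo && **c < hi).count();
        out.push_str(&format!("{lo:>6.1}% to {hi:>5.1}% | {}\n", "#".repeat(count)));
    }
    out
}

fn main() {
    let changes = [-6.3, -2.1, -1.5, 0.2, 1.4, 2.9, 7.0];
    print!("{}", render_histogram(&changes));
}
```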
The "Better"/"Worse" numbers are useful in cases where you have e.g. 100 improvements ranging from -3% to -1% and 5 regressions ranging from 1% to 3%. Or you just have 6 improvements but they're in the range -20% to -10%. I think removing them would be bad. As for the "harder to read" issue: that's an unavoidable consequence of switching from text to tables. Unless we use a fixed width font which I think would be a bad idea. Though we could choose line labels that are more similar lengths, something like:
As for verbosity, I was experimenting and the comment above is inconsistent in its wording:
The terse form is better for us, but the verbose form may be better for non-rustc-perf experts. I used "test results" to be consistent with the terminology in the glossary, but just "results" might be better. Also, I think the results counts are highly useful. E.g. it's good to know when there are 5x as many improvements as regressions.
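To make the count/range/mean grouping concrete, here is a rough sketch (my own simplification; the real comparison.rs code computes these over relevant test results and is more careful about significance) of the statistics being discussed:

```rust
/// Illustrative sketch: count, range, and arithmetic mean for regressions,
/// improvements, and all results, given per-benchmark percentage changes.
fn group_stats(changes_pct: &[f64]) -> String {
    fn stats(label: &str, xs: &[f64]) -> String {
        if xs.is_empty() {
            return format!("{label}: 0 results\n");
        }
        let count = xs.len();
        let min = xs.iter().cloned().fold(f64::INFINITY, f64::min);
        let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let mean = xs.iter().sum::<f64>() / count as f64;
        format!("{label}: {count} results, range {min:.1}% to {max:.1}%, mean {mean:.1}%\n")
    }
    let regressions: Vec<f64> = changes_pct.iter().copied().filter(|c| *c > 0.0).collect();
    let improvements: Vec<f64> = changes_pct.iter().copied().filter(|c| *c < 0.0).collect();
    let mut out = String::new();
    out.push_str(&stats("Regressions", &regressions));
    out.push_str(&stats("Improvements", &improvements));
    out.push_str(&stats("All", changes_pct));
    out
}

fn main() {
    // Example: five improvements and two regressions, in percent.
    let changes = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 3.0];
    print!("{}", group_stats(&changes));
}
```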
The counts are indeed useful, but I'm not sure about the separated mean and ranges. If there are regressions

To sum up, what makes sense to me is to see the number of improvements, number of regressions, number of total changes, total range (

But this is probably not strictly relevant to this PR; it was just an idea.
The current comment lacks the "all" numbers and I think we are in strong agreement that adding them in this PR is a good thing.
I would say "among the results that improved, the average change is I think it's fine if you find the overall numbers more interesting. But I find the worse/better numbers to also be interesting, so I'd like to keep them. Indeed, the numbers in this PR are now pretty much what I personally think of as the ideal set when I'm trying to get a quick picture. |
This is somewhat true of existing comments, but I think the new one doesn't really improve on this: the comment is really long; on my (not small) display, it takes up essentially a full page. I think if we're expecting most folks to action the information in some way, we need to cut back drastically.

I'm wondering about the extent to which it's helpful to actually have all the statistics etc. embedded in the PR comment -- do folks feel those are driving meaningful action from contributors? I think I've heard several times now that folks either ignored or didn't know what to do with the perf comments, which to me suggests that delivering more information about the regression (metrics, etc.) isn't actually going to drive them to action it -- the problem is the "ok, what now?" for those folks. I think for those of us working closely with perf those metrics are a lot more useful (typically I can make an instinctive call from a few benchmark names and the metrics in the current tables / perf page), but if you're less frequently working with this information, I don't think I'd know what to do when confronted with the comment, and it's long enough that I would probably give up pretty quickly, especially for a regression that looks "somewhat small" (read: not >10%, probably).

I would suggest we try to dial back on the information in the actual comment; we can push tables, text, etc. into dedicated pages on perf.rust-lang.org -- not necessarily the comparison page! It's relatively easy for us to host template-generated content for arbitrary comparisons.

As a concrete suggestion for a short comment, cutting down @nnethercote's current proposal:

Finished benchmarking commit (c80dde43f992f3eb419899a34551b84c6301f8e8): detailed comparison url.
Overall result: regressions found, though some improvements too
If not yet merged:
If merged:
If merged, and a rollup:
@rustbot label: +perf-regression

Where I'd probably put together the checklists in separate pages/files in this repo, something like:

Not yet merged checklist

Your PR resulted in at least some regressions in compiler performance. In order to understand whether you should follow up, please follow one of these paths:
In all cases, you should reach a point where you leave a comment with

(Something similar for the other two.)

I think the notable thing here is that I think we should always push for the user to take some action, make an evaluation, etc. -- even if that evaluation is wrong. Today I think it's pretty atypical for users to actually leave a comment following up on regressions (before or after merge), which seems bad. If we can get to a point where people are leaving comments (maybe with incorrect justification), we can pursue improvements to tooling to correct wrong assumptions. As-is, we're flying blind, since users aren't really engaging in a way that's visible to us (or at least me).

I also think we should probably invest in blocking merge on perf-regression labels unless perf-regression-triaged is applied. It should be feasible to make bors do that if we have agreement on it, and it'll coincide with this driver toward it being applied in a mechanized way. (But I would wait on doing that until we feel it's reasonable to expect someone to have applied that label, and ideally in practice we see it applied "most" of the time.)
- Use lists instead of tables, which cause lots of layout problems.
- Show "range" instead of just "max".
- Include mean and range of all results, as well as those numbers for the improvements and the regressions.
- Put all the text at the top.
- Make clearer the overall ok/not-ok evaluation, and which metrics contributed to that (i.e. instructions, yes; RSS and cycles, no).
- Use ❌ and ✅ instead of 😿 and 🎉, because they're clearer, in shape, colour, and meaning.
- Include a link to a document with some explanation of how to evaluate the overall result, for the rare cases where it is misleading or wrong.
- CC `@rust-lang/wg-compiler-performance` on regressions that occur on master.
(Branch force-pushed from 6e92c88 to 69da09d.)
See below for an updated example of a comment with this PR. Notes:
Finished benchmarking commit (c80dde43f992f3eb419899a34551b84c6301f8e8): comparison url.

Overall result: ❌ ✅ (regressions and improvements found) - ACTION NEEDED

See these docs for information on how to understand and verify the accuracy of this finding.

Next steps: If you can justify the regressions found in this perf run, please put
@rustbot label: +perf-regression
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Primary benchmarks: ❌ ✅ (regressions and improvements found)
Secondary benchmarks: no notable results
Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment. ...similar to above...

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment. ...similar to above...
I like the overall direction. Here are some small notes on things we could improve more:
FWIW, I think maybe this didn't come across in the comment I posted (went through a few revisions), but my thought was that we would move the summary tables/section to a new page or a heading on the comparison pages -- i.e., instead of dumping folks directly into the comparison page, they get put onto the current "tail" of the proposed comment, whatever we decide that should look like. In terms of triage & compiler team meetings, I think those are distinct "views" and should indeed keep the summary in some form. Triage has a primary audience of the triagers, who I think we can call "experts" -- the posting to compiler team meetings and TWIR is more of a nice-to-have, and I think the top-line summary is the more critical aspect there, as well as any human call outs (e.g., "this PR needs investigation").
I don't have a strong opinion here. I think if we can fit it into the main comment, great, but if we have 3 different versions (as I suggested) then it might be nice to have all three in roughly the same place, so users can fall back on a different section more easily if needed.
I would disagree :) But I have a proposal: let's try to agree on a couple metrics we can start tracking for effectiveness of the comments, so that we can iterate and try to improve on those. Obviously we should keep in mind what our instinct and anecdotal evidence is too, not just numbers, but I think I'd start by having us:
If we can put some automation in place to track this (should be pretty easy; we probably have the necessary info from the existing queries we do to get PR titles?), then I think I'm happy with basically any format for the comment, though I might suggest landing metrics first, waiting a full triage cycle, then landing a new comment format. Gives us a little time to calibrate. Basically: I don't really know that the new comment will be a net-negative; I think it may be (I would prioritize cutting length), but rather than argue about it let's just find out; I think we can.
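As one possible, purely hypothetical starting point for such automation (assuming the `reqwest` crate with its blocking and json features plus `serde_json`, rather than whatever query machinery rustc-perf already has), one of the proposed metrics could be pulled from GitHub's search API:

```rust
// Hypothetical sketch, not existing rustc-perf tooling: count merged PRs that
// were labeled perf-regression but never received perf-regression-triaged.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://api.github.com/search/issues?q=repo:rust-lang/rust\
               +is:pr+is:merged+label:perf-regression+-label:perf-regression-triaged";
    let client = reqwest::blocking::Client::builder()
        // GitHub's API rejects requests without a User-Agent header.
        .user_agent("rustc-perf-comment-metrics-sketch")
        .build()?;
    let body: serde_json::Value = client.get(url).send()?.json()?;
    println!("merged PRs with untriaged perf regressions: {}", body["total_count"]);
    Ok(())
}
```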
FWIW, I somewhat strongly disagree with this in the sense that I think we do want to drive a reaction from folks even if they disagree, rather than just ignoring the comment. It should be rarely necessary, but in practice we don't really know, and I'd much rather get some comment than think they just ignored the assessment.
Some of the changes here have landed in other PRs. This PR isn't going anywhere.
- ... easier to read.
- ... relevant ones. Good for gauging the overall effect, especially when only a small number of benchmarks are affected. Also add an "All" row for secondary benchmarks.
- ... improvements before regressions, to be optimistic.