Overhaul Benchmarking pipeline to use complete sample data, not summaries #61559


Merged
merged 18 commits into swiftlang:main on Nov 9, 2022

Conversation

tbkka
Contributor

@tbkka tbkka commented Oct 12, 2022

With this change, the Swift benchmarking harness now has two distinct output formats:

  • Default: Formatted text that's intended for human consumption. Right now, this is just the minimum value, but we can augment that.

  • --json: Each output line is a JSON-encoded object that contains the raw data. This information is intended for use by Python scripts that aggregate or compare multiple independent tests.

Previously, we tried to use the same output for both purposes. This required the Python scripts to do more complex parsing of textual layouts, and also meant that they had only summary data to work with instead of full raw sample information. That in turn made it almost impossible to derive meaningful comparisons between runs or to aggregate multiple runs.

Typical output in the new JSON format looks like this:

```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```

Note that the "samples" array includes the runtime for each individual run.

This format is easy to parse in Python. Just iterate over lines and decode each one separately. Also note that the optional fields ("max_rss" above) are trivial to handle:

```
import json

for line in lines:
    record = json.loads(line)
    # "max_rss" is optional; default to 0 if not present
    max_rss = record.get("max_rss", 0)
```

Because optional fields are so much easier to handle in this form, I reworked the Python logic to translate the old formats into this JSON format for uniformity. Hopefully, we can simplify the code in a year or so by stripping out the old log formats entirely, along with some of the redundant statistical calculations. In particular, the Python logic still makes an effort to preserve mean, median, max, min, stdev, and other statistical data whenever the full set of samples is not present. Once we've reached the point where we always keep full samples, we can compute any such information on the fly as needed, eliminating the need to record it.
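To illustrate the idea of computing statistics on the fly rather than recording them: once full samples are always present, every summary statistic can be derived on demand from the raw array. A minimal sketch using Python's statistics module (the summarize helper is hypothetical, not part of the harness):

```python
import statistics

def summarize(samples):
    # Derive summary statistics on demand from raw samples instead of
    # storing precomputed mean/median/stdev alongside the log.
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }

print(summarize([1.23, 2.35]))
```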

This is a pretty big rearchitecture of the core benchmarking logic. In order to try to keep things a bit more manageable, I have not taken this opportunity to replace any of the actual statistics used in the higher level code or to change how the actual samples are measured. (But I expect this rearchitecture will make such changes simpler.) In particular, this should not actually change any benchmark results.

For the future, please keep this general principle in mind: statistical summaries (averages, medians, etc.) should as a rule be computed only for immediate output, and rarely if ever stored or used as input for other processing. Instead, aim to store and transfer raw data from which statistics can be recomputed as necessary.
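As an illustration of this principle (the log lines below are invented for the example): raw samples from independent runs can simply be concatenated, after which any statistic can be recomputed over the combined data, whereas precomputed means or medians from separate runs cannot be meaningfully merged.

```python
import json

# Invented log lines standing in for two independent benchmark runs.
run_a = ['{"number":89, "name":"PerfTest", "samples":[1.23, 2.35]}']
run_b = ['{"number":89, "name":"PerfTest", "samples":[1.30, 2.10]}']

merged = {}
for line in run_a + run_b:
    record = json.loads(line)
    # Concatenating raw samples is well-defined; statistics are then
    # recomputed over the combined data as needed.
    merged.setdefault(record["name"], []).extend(record["samples"])

print(len(merged["PerfTest"]), min(merged["PerfTest"]))
```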

@tbkka
Contributor Author

tbkka commented Oct 12, 2022

@swift-ci Please test

Contributor

@atrick atrick left a comment


Much appreciated. I can't think of any issues with this change.

@shahmishal
Member

@swift-ci smoke benchmark

@shahmishal
Member

The new json output looks great, thanks!

@shahmishal
Member

@swift-ci Python lint

@shahmishal
Member

@swift-ci smoke benchmark

@tbkka tbkka force-pushed the tbkka-benchmarking branch 2 times, most recently from 1f86a49 to 067f475 on October 20, 2022 at 00:52
@tbkka
Contributor Author

tbkka commented Oct 28, 2022

@swift-ci Please benchmark

@tbkka
Contributor Author

tbkka commented Oct 28, 2022

@swift-ci benchmark

tbkka added 7 commits November 4, 2022 14:02
…ries

The new code stores test numbers as numbers (not strings), which
requires a few adjustments. I also apparently missed a few test updates.
@tbkka tbkka force-pushed the tbkka-benchmarking branch from 806def0 to 30b3763 on November 4, 2022 at 21:03
@tbkka
Contributor Author

tbkka commented Nov 4, 2022

@swift-ci Please benchmark

@tbkka
Contributor Author

tbkka commented Nov 4, 2022

@swift-ci benchmark

@tbkka
Contributor Author

tbkka commented Nov 4, 2022

@swift-ci benchmark

@tbkka
Contributor Author

tbkka commented Nov 5, 2022

@swift-ci benchmark

@tbkka
Contributor Author

tbkka commented Nov 5, 2022

@swift-ci benchmark

@tbkka
Contributor Author

tbkka commented Nov 5, 2022

@swift-ci test

@tbkka
Contributor Author

tbkka commented Nov 6, 2022

@swift-ci Please test

tbkka added 3 commits November 7, 2022 14:45
We have to continue using the non-JSON forms
until the JSON-supporting code is universally
available.
@tbkka
Contributor Author

tbkka commented Nov 7, 2022

Some of the users of the benchmark harness rely on being able to use previously-compiled benchmark binaries. So I've switched things back over so the Python scripts request the non-JSON format. In a month or so when the JSON-supporting binaries are generally available everywhere, I'll switch it back. This way, benchmarking should continue to work throughout the transition period.

@tbkka
Contributor Author

tbkka commented Nov 7, 2022

@swift-ci Please test

@tbkka
Contributor Author

tbkka commented Nov 7, 2022

@swift-ci benchmark

@tbkka
Contributor Author

tbkka commented Nov 8, 2022

@swift-ci Please test macOS Platform

@tbkka
Contributor Author

tbkka commented Nov 8, 2022

@swift-ci Please benchmark

@tbkka
Contributor Author

tbkka commented Nov 8, 2022

Another improvement I slipped in: Sample data is now computed in floating point rather than integer microseconds, so we can now correctly handle scaled results less than 1µs. This shows up as some amusing "regressions" compared to the old benchmark harness:

```
SubstringRemoveLast1                  0.0     0.196     +19600.0%   0.01x
SubstringRemoveFirst1                 0.0     0.167     +16700.0%   0.01x (?)
StringWithCString2                    0.0     0.006     +600.0%     0.14x
```
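A toy illustration (not the harness's actual code) of why integer microseconds collapse sub-microsecond results to zero while floating point preserves them:

```python
# A scaled per-iteration time of 0.196µs cannot be represented in
# whole microseconds: truncation reports it as 0.
elapsed_us = 0.196
old_style = int(elapsed_us)   # integer microseconds truncate to 0
new_style = elapsed_us        # floating point keeps 0.196
print(old_style, new_style)
```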

@tbkka tbkka merged commit c056e63 into swiftlang:main Nov 9, 2022
@tbkka tbkka deleted the tbkka-benchmarking branch August 1, 2024 16:37