Overhaul Benchmarking pipeline to use complete sample data, not summaries #61559
Conversation
@swift-ci Please test
Much appreciated. I can't think of any issues with this change.
@swift-ci smoke benchmark
The new JSON output looks great, thanks!
@swift-ci Python lint
@swift-ci smoke benchmark
Force-pushed 1f86a49 to 067f475
@swift-ci Please benchmark
@swift-ci benchmark
Overhaul Benchmarking pipeline to use complete sample data, not summaries

The Swift benchmarking harness now has two distinct output formats:

* Default: formatted text that's intended for human consumption. Right now, this is just the minimum value, but we can augment that.
* `--json`: each output line is a JSON-encoded object that contains raw data. This information is intended for use by Python scripts that aggregate or compare multiple independent tests.

Previously, we tried to use the same output for both purposes. This required the Python scripts to do more complex parsing of textual layouts, and also meant that the Python scripts had only summary data to work with instead of full raw sample information. This in turn made it almost impossible to derive meaningful comparisons between runs or to aggregate multiple runs.

Typical output in the new JSON format looks like this:

```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```

This format is easy to parse in Python: just iterate over lines and decode each one separately. Also note that optional fields (`"max_rss"` above) are trivial to handle:

```
import json
for l in lines:
    j = json.loads(l)
    # Default to 0 if not present
    max_rss = j.get("max_rss", 0)
```

Note that the `"samples"` array includes the runtime for each individual run.

Because optional fields are so much easier to handle in this form, I reworked the Python logic to translate old formats into this JSON format for more uniformity. Hopefully, we can simplify the code in a year or so by stripping out the old log formats entirely, along with some of the redundant statistical calculations. In particular, the Python logic still makes an effort to preserve mean, median, max, min, stdev, and other statistical data whenever the full set of samples is not present. Once we've gotten to a point where we're always keeping full samples, we can compute any such information on the fly as needed, eliminating the need to record it.

This is a pretty big rearchitecture of the core benchmarking logic. In order to keep things a bit more manageable, I have not taken this opportunity to replace any of the actual statistics used in the higher-level code or to change how the actual samples are measured. (But I expect this rearchitecture will make such changes simpler.) In particular, this should not actually change any benchmark results.

For the future, please keep this general principle in mind: statistical summaries (averages, medians, etc.) should as a rule be computed for immediate output and rarely if ever stored or used as input for other processing. Instead, aim to store and transfer raw data from which statistics can be recomputed as necessary.
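As an illustration of that principle (this is a minimal sketch, not code from this patch), a downstream script could recompute any summary statistics it needs on demand from the raw `"samples"` arrays in the JSON-lines output above; the file names and helper names here are placeholders:

```
import json
import statistics

def load_results(path):
    """Parse one JSON object per line into a dict keyed by test number."""
    results = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                j = json.loads(line)
                results[j["number"]] = j
    return results

def summarize(samples):
    """Recompute summary statistics from the raw samples on the fly."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }

# Example: compare the minimum runtime of each test across two runs.
old = load_results("old.json")
new = load_results("new.json")
for number in sorted(old.keys() & new.keys()):
    o, n = summarize(old[number]["samples"]), summarize(new[number]["samples"])
    ratio = n["min"] / o["min"]
    print(f'{number} {old[number]["name"]}: {o["min"]:.3f} -> {n["min"]:.3f} ({ratio:.2f}x)')
```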
The new code stores test numbers as numbers (not strings), which requires a few adjustments. I also apparently missed a few test updates.
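For scripts that previously treated the test number as a string, the adjustment amounts to normalizing the key when reading either format; a tiny hypothetical sketch (not part of the patch):

```
def test_number_key(j):
    """Accept either the old string form or the new numeric form."""
    return int(j["number"])
```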
Force-pushed 806def0 to 30b3763
@swift-ci Please benchmark
@swift-ci benchmark
@swift-ci benchmark
@swift-ci benchmark
@swift-ci benchmark
@swift-ci test
@swift-ci Please test
We have to continue using the non-JSON forms until the JSON-supporting code is universally available.
Some of the users of the benchmark harness rely on being able to use previously-compiled benchmark binaries. So I've switched things back over so the Python scripts request the non-JSON format. In a month or so when the JSON-supporting binaries are generally available everywhere, I'll switch it back. This way, benchmarking should continue to work throughout the transition period.
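Purely as a sketch of how such a transition can be handled (the wrapper function, fallback parser, and invocation details below are assumptions for illustration, not the actual driver code), a script can probe for `--json` support and fall back to the legacy textual format:

```
import json
import subprocess

def run_benchmark(binary, args):
    """Prefer the new JSON output; fall back to the legacy textual format
    when a pre-built binary predates --json support."""
    proc = subprocess.run([binary, "--json"] + args,
                          capture_output=True, text=True)
    if proc.returncode == 0:
        try:
            return [json.loads(line)
                    for line in proc.stdout.splitlines() if line.strip()]
        except ValueError:
            pass  # Output wasn't JSON lines; treat it as an old binary.
    proc = subprocess.run([binary] + args, capture_output=True, text=True)
    return parse_legacy_output(proc.stdout)  # hypothetical legacy parser
```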
@swift-ci Please test
@swift-ci benchmark
@swift-ci Please test macOS Platform
@swift-ci Please benchmark
Another improvement I slipped in: sample data is now computed in floating point rather than integer microseconds, so we can now correctly handle scaled results less than 1µs. This shows up as some amusing "regressions" compared to the old benchmark harness.
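A small illustration, with made-up numbers rather than real harness output, of why integer microseconds lose sub-microsecond scaled results while floating point preserves them:

```
# With iteration scaling, a very fast benchmark can run in well under a
# microsecond per iteration. Integer microseconds truncate that to 0;
# floating point preserves it. (Numbers below are invented.)
elapsed_us = 742   # total elapsed time for the scaled run, in µs
num_iters = 1000   # iteration multiplier chosen by the harness

per_iter_int = elapsed_us // num_iters   # 0 -- sub-µs result is lost
per_iter_float = elapsed_us / num_iters  # 0.742 µs -- preserved
print(per_iter_int, per_iter_float)
```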