Overhaul Benchmarking pipeline to use complete sample data, not summaries #61559
Conversation
@swift-ci Please test
Much appreciated. I can't think of any issues with this change.
@swift-ci smoke benchmark
The new JSON output looks great, thanks!
@swift-ci Python lint
@swift-ci smoke benchmark
Force-pushed 1f86a49 to 067f475
@swift-ci Please benchmark
@swift-ci benchmark
Overhaul Benchmarking pipeline to use complete sample data, not summaries

The Swift benchmarking harness now has two distinct output formats:

* Default: formatted text that's intended for human consumption. Right now, this is just the minimum value, but we can augment that.
* `--json`: each output line is a JSON-encoded object that contains raw data. This information is intended for use by Python scripts that aggregate or compare multiple independent tests.

Previously, we tried to use the same output for both purposes. This required the Python scripts to do more complex parsing of textual layouts, and also meant that the Python scripts had only summary data to work with instead of full raw sample information. This in turn made it almost impossible to derive meaningful comparisons between runs or to aggregate multiple runs.

Typical output in the new JSON format looks like this:

```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```

This format is easy to parse in Python: just iterate over lines and decode each one separately. Also note that optional fields (`"max_rss"` above) are trivial to handle:

```
import json
for l in lines:
    j = json.loads(l)
    # Default to 0 if not present
    max_rss = j.get("max_rss", 0)
```

Note that the `"samples"` array includes the runtime for each individual run.

Because optional fields are so much easier to handle in this form, I reworked the Python logic to translate old formats into this JSON format for more uniformity. Hopefully, we can simplify the code in a year or so by stripping out the old log formats entirely, along with some of the redundant statistical calculations. In particular, the Python logic still makes an effort to preserve mean, median, max, min, stdev, and other statistical data whenever the full set of samples is not present. Once we've gotten to a point where we're always keeping full samples, we can compute any such information on the fly as needed, eliminating the need to record it.

This is a pretty big rearchitecture of the core benchmarking logic. In order to keep things a bit more manageable, I have not taken this opportunity to replace any of the actual statistics used in the higher-level code or to change how the actual samples are measured. (But I expect this rearchitecture will make such changes simpler.) In particular, this should not actually change any benchmark results.

For the future, please keep this general principle in mind: statistical summaries (averages, medians, etc.) should as a rule be computed for immediate output and rarely if ever stored or used as input for other processing. Instead, aim to store and transfer raw data from which statistics can be recomputed as necessary.
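As an illustration of that principle (this is a minimal sketch, not code from this patch), a downstream script could recompute any summary statistics it needs on demand from the raw `"samples"` arrays in the JSON-lines output above; the file names and helper names here are placeholders:

```
import json
import statistics

def load_results(path):
    """Parse one JSON object per line into a dict keyed by test number."""
    results = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                j = json.loads(line)
                results[j["number"]] = j
    return results

def summarize(samples):
    """Recompute summary statistics from the raw samples on the fly."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }

# Example: compare the minimum runtime of each test across two runs.
old = load_results("old.json")
new = load_results("new.json")
for number in sorted(old.keys() & new.keys()):
    o, n = summarize(old[number]["samples"]), summarize(new[number]["samples"])
    ratio = n["min"] / o["min"]
    print(f'{number} {old[number]["name"]}: {o["min"]:.3f} -> {n["min"]:.3f} ({ratio:.2f}x)')
```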
The new code stores test numbers as numbers (not strings), which requires a few adjustments. I also apparently missed a few test updates.
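For scripts that previously treated the test number as a string, the adjustment amounts to normalizing the key when reading either format; a tiny hypothetical sketch (not part of the patch):

```
def test_number_key(j):
    """Accept either the old string form or the new numeric form."""
    return int(j["number"])
```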
Force-pushed 806def0 to 30b3763
@swift-ci Please benchmark
@swift-ci benchmark
@swift-ci benchmark
@swift-ci benchmark
@swift-ci benchmark
@swift-ci test
@swift-ci Please test
We have to continue using the non-JSON forms until the JSON-supporting code is universally available.
Some of the users of the benchmark harness rely on being able to use previously-compiled benchmark binaries. So I've switched things back over so the Python scripts request the non-JSON format. In a month or so when the JSON-supporting binaries are generally available everywhere, I'll switch it back. This way, benchmarking should continue to work throughout the transition period.
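Purely as a sketch of how such a transition can be handled (the wrapper function, fallback parser, and invocation details below are assumptions for illustration, not the actual driver code), a script can probe for `--json` support and fall back to the legacy textual format:

```
import json
import subprocess

def run_benchmark(binary, args):
    """Prefer the new JSON output; fall back to the legacy textual format
    when a pre-built binary predates --json support."""
    proc = subprocess.run([binary, "--json"] + args,
                          capture_output=True, text=True)
    if proc.returncode == 0:
        try:
            return [json.loads(line)
                    for line in proc.stdout.splitlines() if line.strip()]
        except ValueError:
            pass  # Output wasn't JSON lines; treat it as an old binary.
    proc = subprocess.run([binary] + args, capture_output=True, text=True)
    return parse_legacy_output(proc.stdout)  # hypothetical legacy parser
```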
@swift-ci Please test
@swift-ci benchmark
@swift-ci Please test macOS Platform
@swift-ci Please benchmark
Another improvement I slipped in: sample data is now computed in floating point rather than integer microseconds, so we can now correctly handle scaled results less than 1µs. This shows up as some amusing "regressions" compared to the old benchmark harness.
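A small illustration, with made-up numbers rather than real harness output, of why integer microseconds lose sub-microsecond scaled results while floating point preserves them:

```
# With iteration scaling, a very fast benchmark can run in well under a
# microsecond per iteration. Integer microseconds truncate that to 0;
# floating point preserves it. (Numbers below are invented.)
elapsed_us = 742   # total elapsed time for the scaled run, in µs
num_iters = 1000   # iteration multiplier chosen by the harness

per_iter_int = elapsed_us // num_iters   # 0 -- sub-µs result is lost
per_iter_float = elapsed_us / num_iters  # 0.742 µs -- preserved
print(per_iter_int, per_iter_float)
```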