[benchmark] Report Quantiles from Benchmark_O and a TON of Gardening #19097
Improve conformance to Swift Naming Guidelines https://swift.org/documentation/api-design-guidelines/#parameter-names Removed the gross underscores and ticks from parameter names. Ticks are ectoparasites feeding on blood. We are just measuring time and there is also no [mysterious ticking noise](http://bit.ly/TickNoise) here either…
We can spare two array passes (for min and max) if we just sort first.
Turns out that both the old code in `DriverUtils` that computed the median, as well as the newer quartiles in `PerformanceTestSamples`, had an off-by-1 error. It truly is the 3rd of the 2 hard things in computer science!
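For illustration only, a minimal sketch of median indexing over a sorted sample without that off-by-one; the helper name and shape are hypothetical, not the actual `DriverUtils` code:

````swift
/// Median of an already sorted, non-empty sample (illustrative helper).
func median(of sortedSamples: [Double]) -> Double {
    let n = sortedSamples.count
    precondition(n > 0, "median is undefined for an empty sample")
    return n % 2 == 1
        ? sortedSamples[n / 2]                                   // odd count: the middle element
        : (sortedSamples[n / 2 - 1] + sortedSamples[n / 2]) / 2  // even count: mean of the two middle ones
}
````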
Type-check the command line argument to be non-negative, but store the value in the currency type `Int`.
Since the `scale` (or `numIters`) is passed to `test.runFunction` as an `Int`, the whole type-casting dance here was just silly!
Moved the adjustment of `lastSampleTime` that accounts for the `scale` (`numIters`), as well as the conversion to microseconds, into `SampleRunner`’s `measure` method.
Removed unnecessary use of `UInt64`, following the advice from the Swift Language Guide:

> Use the `Int` type for all general-purpose integer constants and variables in your code, even if they’re known to be nonnegative. Using the default integer type in everyday situations means that integer constants and variables are immediately interoperable in your code and will match the inferred type for integer literal values.

https://docs.swift.org/swift-book/LanguageGuide/TheBasics.html#ID324
To make sense of this spaghetti code, let’s first use reasonable variable names:
* `scale` -> `numIters`
* `elapsed_time` -> `time`
Clarified the need for capping `numIters` according to the discussion at swiftlang#17268 (comment). The sampling loop is a hairy piece of code, because it’s trying to reuse the calibration measurement as a regular sample, in case the computed `numIters` turns out to be 1. But that conflicts with the case when `fixedNumIters` is 1, necessitating a separate measurement in the else branch… That was a quick fix back then, but it’s hard to make it clean. More thinking is required…
Extracted sample saving into the inner func `addSample`. Used it to save the `oneIter` sample from the `numIters` calibration when it comes out as 1 and continue the for loop to the next sample. This simplified the following code, which can now always measure the sample with `numIters` and save it.
The spaghetti if-else code was untangled into a nested function that computes `iterationsPerSampleTime` and a single constant `numIters` expression that takes care of the overflow capping as well as the choice between the fixed and the computed `numIters` value. The `numIters` is now computed and logged only once per benchmark measurement instead of on every sample. The sampling loop is now just a single line. Hurrah! Modified the test to verify that the `LogParser` maintains `num-iters` derived from the `Measuring with scale` message across samples.
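A rough sketch of the shape this calibration logic could take, assuming illustrative names and an arbitrary cap value; the real `SampleRunner` code differs in detail:

````swift
/// Illustrative sketch of computing the number of iterations per sample.
/// `fixedNumIters == 0` means "auto-scale from the calibration measurement".
func computeNumIters(fixedNumIters: Int, sampleTime: Double, oneIterTime: Double) -> Int {
    /// How many iterations of the benchmark fit into the desired sample time.
    func iterationsPerSampleTime() -> Int {
        guard oneIterTime > 0 else { return 1 }
        return max(Int(sampleTime / oneIterTime), 1)
    }
    let overflowCap = 10_000  // hypothetical cap, not the real constant
    return min(fixedNumIters > 0 ? fixedNumIters : iterationsPerSampleTime(), overflowCap)
}
````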
Extracted the nested func `logVerbose` as an instance method on `SampleRunner`. Internalized the free functions `runBench` and `runBenchmarks` into `SampleRunner` as the methods `run` and `runBenchmarks`.
It is now running all the benchmarks, so it’s a `TestRunner`.
The default benchmark result reports the statistics of a normal distribution — mean and standard deviation. Unfortunately the samples from our benchmarks are *not normally distributed*. To get a better picture of the underlying probability distribution, this adds support for reporting quantiles. See https://en.wikipedia.org/wiki/Quantile This gives a better subsample of the measurements in the summary, without the need to resort to the full verbose mode, which might be unnecessarily slow.
@eeckstein Please review 🙏.
Added the `--delta` argument to print the quantiles in delta encoded format, which omits 0s. This results in machine and human readable output that highlights modes and is easily digestible, giving you a feel for the underlying probability distribution of the samples in the reported results:
````
$ ./Benchmark_O --num-iters=1 --num-samples=20 --quantile=20 --delta 170 171 184 185 198 199 418 419 432 433 619 620
#,TEST,SAMPLES,MIN(μs),𝚫V1,𝚫V2,𝚫V3,𝚫V4,𝚫V5,𝚫V6,𝚫V7,𝚫V8,𝚫V9,𝚫VA,𝚫VB,𝚫VC,𝚫VD,𝚫VE,𝚫VF,𝚫VG,𝚫VH,𝚫VI,𝚫VJ,𝚫MAX
170,DropFirstArray,20,171,,,,,,,,,,,,,,,,,,,2,29
171,DropFirstArrayLazy,20,168,,,,,,,,,,,,,,,,,,,,8
184,DropLastArray,20,55,,,,,,,,,,,,,,,,,,,,26
185,DropLastArrayLazy,20,65,,,,,,,,,,,,,,,,,,,1,90
198,DropWhileArray,20,214,1,,,,,,,,,,,,,,,,,1,27,2
199,DropWhileArrayLazy,20,464,,,,1,,,,,,,,1,1,1,4,9,1,9,113,2903
418,PrefixArray,20,132,,,,,,,,,,,,,,,,,1,1,32,394
419,PrefixArrayLazy,20,168,,,,,,,,,,,,1,,2,9,1,15,8,88,3338
432,PrefixWhileArray,20,252,1,,,,1,,,,,,,,,,,1,,,,30
433,PrefixWhileArrayLazy,20,168,,,,,,,,,,,,,1,,6,6,14,43,28,10200
619,SuffixArray,20,68,,,,,,,,,,,,,1,,,,22,1,1,4
620,SuffixArrayLazy,20,65,,,,,,,,,,,,,,,,,,1,9,340
````
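The delta encoding itself boils down to reporting differences between consecutive quantiles and leaving zero deltas blank. A minimal sketch of that idea, with illustrative names and a simplified nearest-rank pick, not the actual `DriverUtils` formatting code:

````swift
/// Delta-encode the quantiles of a sorted sample: MIN is reported absolutely,
/// the remaining columns are differences, with zero deltas left empty.
func deltaEncodedQuantiles(_ sortedSamples: [Int], q: Int) -> String {
    precondition(!sortedSamples.isEmpty && q > 0)
    let quantiles = (0...q).map { i in
        sortedSamples[(sortedSamples.count - 1) * i / q]  // simplified nearest-rank pick
    }
    let deltas = (1..<quantiles.count).map { quantiles[$0] - quantiles[$0 - 1] }
    return ([String(quantiles[0])] + deltas.map { $0 == 0 ? "" : String($0) })
        .joined(separator: ",")
}
````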
@swift-ci smoke test.
@swift-ci benchmark.
Build comment file:

Code size: -O
Code size: -Osize

How to read the data
The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%. If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview
The gardening looks good to me. Thanks for doing that. Can you elaborate (in short) why you are adding the quantile reporting? How would that be useful?
Quantiles characterize the distribution of measured samples from one benchmark run. This gives us better options than the old summary statistics. We can still use the U test to detect changes; this is about what inputs you plug into it.
Pragmatically, I would like to use the…
It sounds like you want to use --quantile as a kind of pre-filter before merging sample sets. That could make sense, although we really want the low end of the samples, not the middle, given the typical distribution curve of cpu-bound microbenchmarks. Do you have any graphs of these distribution curves?

The point I was making about statistical metrics is that most of these metrics (mean, SD, etc) are not directly relevant to clients of the benchmark suite. They're just distracting or misleading (because they only make sense for large sample sets with a normal distribution). Ultimately, I just care that the regression I'm looking at is statistically valid within something like 95% confidence.

As a broader point, we need to remember that, as an open source project, we can't accept additional features or complexity without active compiler developers who are committed to using and maintaining those features.
Small fix following the last refactoring of MAX_RSS: the `--memory` option is required to measure memory in `--verbose` mode. Added an integration test for the `check` command of Benchmark_Driver that depended on it.
@atrick Could you please trigger the smoke test? The last one failed on macOS due to a broken master... so I took advantage of that and pushed one more commit to address one issue I've overlooked before.

To answer your question: No, not a pre-filter. I want to use quantiles to characterize the distribution of the measured samples. I think we cannot start working with confidence intervals in the following steps (hypothesis testing) when we don't have a clue about the sample distribution and just pick a minimum value from several runs.

Maybe you've missed my report, which has a lot of graphs... To get a feel for the distribution of raw measurements, check the measurement methodology in the previous section and then pick any one of the series there. The distribution is best visible in the histograms on the left side and on the Runtimes chart. With the higher-load series it is helpful to check the exclude outliers option.
@palimondo the histograms for all the benchmarks I looked at are about what I would expect. No outliers on the low end, with the bulk of the samples somewhere between…

I think I understand what you're doing now. You just want to limit the textual output when the sample count is high (I think we're misusing the term sample, because that normally refers to a collection of data points taken at one time). Quantiles are just a way of picking representative data points. That seems reasonable to me.

Beyond that, I haven't seen any plan for using the data. If I try to extrapolate from this PR:
Given the distributions I've seen, and assuming the process above results in a sufficient number of data points, then basing the final result on the median is probably fine. For purely cpu-bound loops, it will presumably be close enough to the minimum, but may have the advantage that it also works for larger benchmarks with a more rounded distribution. /cc @eeckstein @gottesmm

My big concern here is that by limiting the number of data points as input to the U test, we limit its ability to decide that a result is significant (although I haven't actually looked at the math). That could actually result in us running more benchmark runs than necessary.

I don't understand why we can't just dump all the data points to a file. It doesn't have to be human readable. In the end, we only care about the difference between medians.
@swift-ci smoke test.
In a nutshell, that’s it. 👆👍
Explicitly use the round-half-to-even rounding algorithm to match the behavior of numpy's `quantile(interpolation='nearest')` and the quantile estimate type R-3, SAS-2. See: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
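A sketch of what the R-3, SAS-2 estimate (nearest rank, ties rounded half to even) looks like, assuming 1-based ranks over a sorted sample; the helper name is hypothetical and this is not the actual implementation:

````swift
/// Nearest-rank quantile estimate, type R-3, SAS-2: Q(p) = x⌊N·p⌉,
/// where ⌊·⌉ rounds to the nearest integer, halves going to the even one.
func quantileR3(_ p: Double, of sortedSamples: [Int]) -> Int {
    precondition((0...1).contains(p) && !sortedSamples.isEmpty)
    let h = Double(sortedSamples.count) * p
    let rank = Int(h.rounded(.toNearestOrEven))              // banker's rounding
    let index = min(max(rank, 1), sortedSamples.count) - 1   // clamp the 1-based rank, go 0-based
    return sortedSamples[index]
}
````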
I'm pushing one more commit to make the quantile estimation match the common type (used in numpy, SAS and R), since this PR is still hanging in limbo... The point of adding quantile output is to expand our options: to get more data out of the sample beyond the current MIN, MAX, MEAN, SD, MEDIAN. From these, only MIN was being used. MAX is an outlier fully dependent on system load, which was used for the dubious range-overlap test; MEAN and SD are useless because the distribution isn't normal.
Deciding how we'll improve the hypothesis testing is the next step. If we go with the U test, this shouldn't be a problem, as quantiles (the R-3, SAS-2 type implemented here) are just a representative subsample. If we have ventiles from 10 independent runs (which we still should do!), that's a sample size of 200.
I was just trying to stay within the current design as much as possible. So it looks like both @eeckstein and @atrick said the gardening part of the PR was OK, and I think I've addressed the motivation for adding quantile output. Is there something else that needs to be done or can we smoke test and merge?
@eeckstein Should I remove the commits that add quantile support from this PR, so that at least the gardening part can land?
FWIW, I think the quantile output is roughly as interesting as mean/SD, so I'm not really opposed to having the ability to print it as an intermediate diagnostic. I just don't think we want to use subsamples for hypothesis testing. More generally, if we decide on a concrete strategy for the benchmark harness in a swift-dev thread, then we might not get so hung up on incremental PRs like this.
The idea to use quantiles came as a way to somewhat normalize the sample size, mainly for scaling through the whole range of benchmark runtimes in a regular way. Fast benchmarks can produce up to ten thousand samples per second; I’ve been capping mine at 2000. Then you need to join several independent runs (there is a class of benchmarks that has a very stable single mode in a given run, but varies between independent executions, so you need several to get a good picture of the population distribution). When serialized into JSON it can be a few hundred KB per benchmark — see the downloadable dataset in the report. But we don’t need a sample size of 20 000 for hypothesis testing. Of course the slow benchmarks get much less. So using quantiles normalizes the sample size across the whole range.

The quantile estimation type I’ve implemented isn’t using any interpolation; it reports the real samples nearest to the requested quantile, so it basically returns a representative subsample. The value we use for the q-quantile controls the reported sample size. Next we need to parse these samples with…

This quantile reporting is about extending what kind of summary from the sample is reported. Mean was wrong, because the distribution isn’t normal and it accumulates error. Maximum is almost always an extreme outlier; quite dangerous without filtering for anomalous measurements. Minimum can be problematic if it’s a rare occurrence. Setting quantile to 2 gives you the Median. 4 gives you Q1, Q2 (Median) and Q3 — we’ll know that 50% of measurements fall between Q1 and Q3. 20 (ventiles) gives you those exact same samples plus 16 more, all in 5% increments of the underlying sample distribution. 100 gives you percentiles. These are the options I’m talking about: how detailed a picture of the measurement distribution we get from the sample.

I was toying (just theoretically) with the idea of storing all samples in binary form, thinking I might pack them into a PNG image for good lossless compression, but there is always the problem of unequal sample sizes… so I didn’t pursue this further.
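To make the mapping from the q value to reported cut points concrete, a tiny illustrative helper (the name is hypothetical), showing how the requested quantile count expands into the probabilities reported alongside MIN (p = 0) and MAX (p = 1):

````swift
/// Probabilities reported for a given --quantile value q:
/// q = 2 → median, q = 4 → quartiles, q = 20 → ventiles, q = 100 → percentiles.
func reportedProbabilities(quantile q: Int) -> [Double] {
    (0...q).map { Double($0) / Double(q) }
}
// reportedProbabilities(quantile: 4) == [0.0, 0.25, 0.5, 0.75, 1.0]
````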
Yes, if you have 20k data points you might want to filter it. And, yes, if you think that the samples within each run may be dependent, then it makes sense to filter within each run, and yes quantiles are probably the best way to filter.
That's great.
If the U test is positive (inconclusive), then we might want to rerun with more samples (depending on some measure of variance). So, wouldn't it be better to feed all the accumulated data to the U test to increase the chance of a negative result, thus decreasing the chance that we need to rerun benchmarks? I think the confidence threshold for the U test is sensitive to the number of data points. If what you're saying is true, I'd be tempted to do something like 3 independent runs, each with 5 samples. Feed all 15 data points to the U test. For any benchmarks where the U test is positive, but the "variance" is significant, accumulate another round of data. Or, maybe we just rerun a single build until we're happy with the "shape" of the data before doing any U-test. Here's where the strategy is unclear. But I suspect we want to be able to do the A/B comparison, then potentially rerun without throwing away the previous data.
I'm not sure why we would ever include min/max unless we were reporting all the data.
Yes, that all makes sense.
We're focused on microbenchmarks, not large-scale, ill-behaved, yet highly important benchmarks where accumulating thousands of samples is actually worthwhile. I could see this if you're in the business of benchmarking transaction processing. In other words, our benchmarks aren't directly representative of real performance, they are just a mechanism for finding performance bugs.

I'm not saying other use cases of the benchmark harness don't matter, but if they do matter then someone who has a stake in it needs to speak up. Whenever you propose features like this it's important to know if anyone will benefit and who will maintain it. Are you doing this for me, for you, or for someone else in the Swift community? If you want to discuss any of these things further, we should do that on swift-dev.
FWIW, this PR is fine with me if @eeckstein approves.
When working with legacy code without good test coverage, it’s recommended not to change old code just for the fun of it. The work on quantiles seemed like a good reason to pay off the remaining technical debt in…

@eeckstein What’s your final word? Should I strip the last 4 commits with quantile reporting from this PR?
Well, this is more in line with what @eeckstein's…
ok, let's keep it as is
@swift-ci smoke test and merge |
I reverted this because it broke in CI: https://ci.swift.org/job/oss-swift-incremental-RA-osx/4909/testReport/
Not sure why that was missed by PR testing, maybe we only build a subset of benchmarks in "smoke test"? |
Update: indeed, this failure is caught if you do "please test", but not with "please smoke test". |
The correct quantile estimation type for printing all measurements in the summary report when `quantile == num-samples - 1` is R-1, SAS-3. It's the inverse of the empirical distribution function. References:
* https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
* discussion in swiftlang#19097 (comment)
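For comparison with the earlier R-3 sketch, a minimal illustration of the R-1, SAS-3 estimate (inverse empirical CDF), again with a hypothetical helper name rather than the actual code; asking for `q = N - 1` quantiles of an N-element sample then reproduces every measurement:

````swift
/// Inverse empirical distribution function, type R-1, SAS-3:
/// Q(p) = x⌈N·p⌉, with x₁ returned for p = 0.
func quantileR1(_ p: Double, of sortedSamples: [Int]) -> Int {
    precondition((0...1).contains(p) && !sortedSamples.isEmpty)
    let rank = Int((Double(sortedSamples.count) * p).rounded(.up))  // ⌈N·p⌉, 1-based
    return sortedSamples[max(rank, 1) - 1]
}
````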
This PR adds support for modifying the benchmark log format to report quantiles in the test summary, specified by the `--quantile` argument.

The default benchmark result reports the statistics of a normal distribution — mean and standard deviation. Unfortunately the samples from our benchmarks are not normally distributed. To get a better picture of the underlying probability distribution, this PR adds support for reporting quantiles.

This gives a better subsample of the measurements in the summary, without the need to resort to the full verbose mode, which might be unnecessarily slow.

Example to get the five-number summary:
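For instance, a hypothetical invocation (the test name is borrowed from the earlier `--delta` output; `--quantile=4` reports the quartiles, which together with MIN and MAX form the five-number summary):

````
$ ./Benchmark_O --num-samples=20 --quantile=4 DropFirstArray
````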
Additionally the `--quantile` argument can be combined with the `--delta` option, which will apply delta encoding to the quantiles and omit `0`s from the output, resulting in a concise, human- as well as machine-readable format that gives a good picture of the underlying probability distribution of the samples.

There was a significant amount of technical debt in the main sampling loop, manifesting as `if-else` spaghetti code 🍝 and a lot of explicit conversions between various integer types with overly specific widths. The code has been cleaned up to use the default currency type `Int` where appropriate. Improved conformance with Swift Naming Guidelines.

The functionality of the pre-existing features of `Benchmark_O` remains unchanged, with the exception of the spelling of the time unit in the log header, which is changed from `(us)` to `(μs)`. The new default header is now the same as in the output from `Benchmark_Driver`:
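A plausible reconstruction of that default header, inferred from the summary columns (MIN, MAX, MEAN, SD, MEDIAN) mentioned earlier in the discussion; the exact spelling in the real log may differ:

````
#,TEST,SAMPLES,MIN(μs),MAX(μs),MEAN(μs),SD(μs),MEDIAN(μs)
````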
Note: `LogParser` is currently ignoring the header; I'm highlighting this change just to be extra careful, in case there are other (Apple-internal?) clients that parse the log output from `Benchmark_O`.

I've structured the PR as a series of small, incremental changes with detailed descriptions in the commit messages; it is therefore best reviewed sequentially by individual commits.