[benchmark] Report Quantiles from Benchmark_O and a TON of Gardening #19097
Improve conformance to Swift Naming Guidelines https://swift.org/documentation/api-design-guidelines/#parameter-names Removed the gross underscores and ticks from parameter names. Ticks are ectoparasites feeding on blood. We are just measuring time and there is also no [mysterious ticking noise](http://bit.ly/TickNoise) here either…
We can spare two array passes (for min and max) if we just sort first.
Turns out that both the old code in `DriverUtils` that computed the median, as well as the newer quartiles in `PerformanceTestSamples`, had an off-by-1 error. It truly is the 3rd of the 2 hard things in computer science!
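For illustration only, a minimal sketch of median indexing over a sorted sample without that off-by-one; the helper name and shape are hypothetical, not the actual `DriverUtils` code:

````swift
/// Median of an already sorted, non-empty sample (illustrative helper).
func median(of sortedSamples: [Double]) -> Double {
    let n = sortedSamples.count
    precondition(n > 0, "median is undefined for an empty sample")
    return n % 2 == 1
        ? sortedSamples[n / 2]                                   // odd count: the middle element
        : (sortedSamples[n / 2 - 1] + sortedSamples[n / 2]) / 2  // even count: mean of the two middle ones
}
````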
Type-check the command line argument to be non-negative, but store the value in the currency type `Int`.
Since the `scale` (or `numIters`) is passed to `test.runFunction` as an `Int`, the whole type-casting dance here was just silly!
Moved the adjustment of `lastSampleTime` that accounts for the `scale` (`numIters`), as well as the conversion to microseconds, into `SampleRunner`’s `measure` method.
Removed unnecessary use of `UInt64`, following the advice from the Swift Language Guide:

> Use the `Int` type for all general-purpose integer constants and variables in your code, even if they’re known to be nonnegative. Using the default integer type in everyday situations means that integer constants and variables are immediately interoperable in your code and will match the inferred type for integer literal values.

https://docs.swift.org/swift-book/LanguageGuide/TheBasics.html#ID324
To make sense of this spaghetti code, let’s first use reasonable variable names:
* `scale` -> `numIters`
* `elapsed_time` -> `time`
Clarified the need for capping `numIters` according to the discussion at swiftlang#17268 (comment). The sampling loop is a hairy piece of code, because it’s trying to reuse the calibration measurement as a regular sample, in case the computed `numIters` turns out to be 1. But that conflicts with the case when `fixedNumIters` is 1, necessitating a separate measurement in the else branch… That was a quick fix back then, but it’s hard to make it clean. More thinking is required…
Extracted sample saving into the inner func `addSample`. Used it to save the `oneIter` sample from the `numIters` calibration when it comes out as 1 and continue the for loop to the next sample. This simplified the following code, which can now always measure the sample with `numIters` and save it.
The spaghetti if-else code was untangled into a nested function that computes `iterationsPerSampleTime` and a single constant `numIters` expression that takes care of the overflow capping as well as the choice between the fixed and the computed `numIters` value. The `numIters` is now computed and logged only once per benchmark measurement instead of on every sample. The sampling loop is now just a single line. Hurrah! Modified the test to verify that the `LogParser` maintains `num-iters` derived from the `Measuring with scale` message across samples.
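A rough sketch of the shape this calibration logic could take, assuming illustrative names and an arbitrary cap value; the real `SampleRunner` code differs in detail:

````swift
/// Illustrative sketch of computing the number of iterations per sample.
/// `fixedNumIters == 0` means "auto-scale from the calibration measurement".
func computeNumIters(fixedNumIters: Int, sampleTime: Double, oneIterTime: Double) -> Int {
    /// How many iterations of the benchmark fit into the desired sample time.
    func iterationsPerSampleTime() -> Int {
        guard oneIterTime > 0 else { return 1 }
        return max(Int(sampleTime / oneIterTime), 1)
    }
    let overflowCap = 10_000  // hypothetical cap, not the real constant
    return min(fixedNumIters > 0 ? fixedNumIters : iterationsPerSampleTime(), overflowCap)
}
````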
Extracted the nested func `logVerbose` as an instance method on `SampleRunner`. Internalized the free functions `runBench` and `runBenchmarks` into `SampleRunner` as the methods `run` and `runBenchmarks`.
It is now running all the benchmarks, so it’s a `TestRunner`.
The default benchmark result reports the statistics of a normal distribution — mean and standard deviation. Unfortunately the samples from our benchmarks are *not normally distributed*. To get a better picture of the underlying probability distribution, this adds support for reporting quantiles. See https://en.wikipedia.org/wiki/Quantile This gives a better subsample of the measurements in the summary, without the need to resort to the full verbose mode, which might be unnecessarily slow.
@eeckstein Please review 🙏.
Added the `--delta` argument to print the quantiles in delta encoded format, which omits 0s. This results in machine and human readable output that highlights modes and is easily digestible, giving you a feel for the underlying probability distribution of the samples in the reported results:
````
$ ./Benchmark_O --num-iters=1 --num-samples=20 --quantile=20 --delta 170 171 184 185 198 199 418 419 432 433 619 620
#,TEST,SAMPLES,MIN(μs),𝚫V1,𝚫V2,𝚫V3,𝚫V4,𝚫V5,𝚫V6,𝚫V7,𝚫V8,𝚫V9,𝚫VA,𝚫VB,𝚫VC,𝚫VD,𝚫VE,𝚫VF,𝚫VG,𝚫VH,𝚫VI,𝚫VJ,𝚫MAX
170,DropFirstArray,20,171,,,,,,,,,,,,,,,,,,,2,29
171,DropFirstArrayLazy,20,168,,,,,,,,,,,,,,,,,,,,8
184,DropLastArray,20,55,,,,,,,,,,,,,,,,,,,,26
185,DropLastArrayLazy,20,65,,,,,,,,,,,,,,,,,,,1,90
198,DropWhileArray,20,214,1,,,,,,,,,,,,,,,,,1,27,2
199,DropWhileArrayLazy,20,464,,,,1,,,,,,,,1,1,1,4,9,1,9,113,2903
418,PrefixArray,20,132,,,,,,,,,,,,,,,,,1,1,32,394
419,PrefixArrayLazy,20,168,,,,,,,,,,,,1,,2,9,1,15,8,88,3338
432,PrefixWhileArray,20,252,1,,,,1,,,,,,,,,,,1,,,,30
433,PrefixWhileArrayLazy,20,168,,,,,,,,,,,,,1,,6,6,14,43,28,10200
619,SuffixArray,20,68,,,,,,,,,,,,,1,,,,22,1,1,4
620,SuffixArrayLazy,20,65,,,,,,,,,,,,,,,,,,1,9,340
````
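The delta encoding itself boils down to reporting differences between consecutive quantiles and leaving zero deltas blank. A minimal sketch of that idea, with illustrative names and a simplified nearest-rank pick, not the actual `DriverUtils` formatting code:

````swift
/// Delta-encode the quantiles of a sorted sample: MIN is reported absolutely,
/// the remaining columns are differences, with zero deltas left empty.
func deltaEncodedQuantiles(_ sortedSamples: [Int], q: Int) -> String {
    precondition(!sortedSamples.isEmpty && q > 0)
    let quantiles = (0...q).map { i in
        sortedSamples[(sortedSamples.count - 1) * i / q]  // simplified nearest-rank pick
    }
    let deltas = (1..<quantiles.count).map { quantiles[$0] - quantiles[$0 - 1] }
    return ([String(quantiles[0])] + deltas.map { $0 == 0 ? "" : String($0) })
        .joined(separator: ",")
}
````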
@swift-ci smoke test.
@swift-ci benchmark.
Build comment file:

Code size: -O
Code size: -Osize

How to read the data
The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%. If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview
The gardening looks good to me. Thanks for doing that. Can you elaborate (in short) why you are adding the quantile reporting? How would that be useful?
Quantiles characterize the distribution of measured samples from one benchmark run. This gives us better options than the old summary statistics. We can still use the U test to detect changes; this is about what inputs you plug into it.
Pragmatically, I would like to use the…
It sounds like you want to use --quantile as a kind of pre-filter before merging sample sets. That could make sense, although we really want the low end of the samples, not the middle, given the typical distribution curve of cpu-bound microbenchmarks. Do you have any graphs of these distribution curves?

The point I was making about statistical metrics is that most of these metrics (mean, SD, etc) are not directly relevant to clients of the benchmark suite. They're just distracting or misleading (because they only make sense for large sample sets with a normal distribution). Ultimately, I just care that the regression I'm looking at is statistically valid within something like 95% confidence.

As a broader point, we need to remember that, as an open source project, we can't accept additional features or complexity without active compiler developers who are committed to using and maintaining those features.
Small fix following the last refactoring of MAX_RSS: the `--memory` option is required to measure memory in `--verbose` mode. Added an integration test for the `check` command of Benchmark_Driver that depended on it.
@atrick Could you please trigger the smoke test? The last one failed on macOS due to a broken master... so I took advantage of that and pushed one more commit to address one issue I've overlooked before.

To answer your question: No, not a pre-filter. I want to use quantiles to characterize the distribution of the measured samples. I think we cannot start working with confidence intervals in the following steps (hypothesis testing) when we don't have a clue about the sample distribution and just pick a minimum value from several runs.

Maybe you've missed my report, which has a lot of graphs... To get a feel for the distribution of raw measurements, check the measurement methodology in the previous section and then pick any one of the series there. The distribution is best visible in the histograms on the left side and on the Runtimes chart. With the higher-load series it is helpful to check the exclude outliers option.
@palimondo the histograms for all the benchmarks I looked at are about what I would expect. No outliers on the low end, with the bulk of the samples somewhere between…

I think I understand what you're doing now. You just want to limit the textual output when the sample count is high (I think we're misusing the term sample, because that normally refers to a collection of data points taken at one time). Quantiles are just a way of picking representative data points. That seems reasonable to me.

Beyond that, I haven't seen any plan for using the data. If I try to extrapolate from this PR:
Given the distributions I've seen, and assuming the process above results in a sufficient number of data points, then basing the final result on the median is probably fine. For purely cpu-bound loops, it will presumably be close enough to the minimum, but may have the advantage that it also works for larger benchmarks with a more rounded distribution. /cc @eeckstein @gottesmm

My big concern here is that by limiting the number of data points as input to the U test, we limit its ability to decide that a result is significant (although I haven't actually looked at the math). That could actually result in us running more benchmark runs than necessary.

I don't understand why we can't just dump all the data points to a file. It doesn't have to be human readable. In the end, we only care about the difference between medians.
@swift-ci smoke test.
In a nutshell, that’s it. 👆👍
Explicitly use the round-half-to-even rounding algorithm to match the behavior of numpy's `quantile(interpolation='nearest')` and the quantile estimate type R-3, SAS-2. See: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
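A sketch of what the R-3, SAS-2 estimate (nearest rank, ties rounded half to even) looks like, assuming 1-based ranks over a sorted sample; the helper name is hypothetical and this is not the actual implementation:

````swift
/// Nearest-rank quantile estimate, type R-3, SAS-2: Q(p) = x⌊N·p⌉,
/// where ⌊·⌉ rounds to the nearest integer, halves going to the even one.
func quantileR3(_ p: Double, of sortedSamples: [Int]) -> Int {
    precondition((0...1).contains(p) && !sortedSamples.isEmpty)
    let h = Double(sortedSamples.count) * p
    let rank = Int(h.rounded(.toNearestOrEven))              // banker's rounding
    let index = min(max(rank, 1), sortedSamples.count) - 1   // clamp the 1-based rank, go 0-based
    return sortedSamples[index]
}
````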
I'm pushing one more commit to make the quantile estimation match the common type (used in numpy, SAS and R), since this PR is still hanging in limbo... The point of adding quantile output is to expand our options: to get more data out of the sample beyond the current MIN, MAX, MEAN, SD, MEDIAN. From these, only MIN was being used. MAX is an outlier fully dependent on system load, which was used for the dubious range-overlap test; MEAN and SD are useless because the distribution isn't normal.
Deciding how we'll improve the hypothesis testing is the next step. If we go with the U test, this shouldn't be a problem, as quantiles (the R-3, SAS-2 type implemented here) are just a representative subsample. If we have ventiles from 10 independent runs (which we still should do!), that's a sample size of 200.
I was just trying to stay within the current design as much as possible. So it looks like both @eeckstein and @atrick said the gardening part of the PR was OK, and I think I've addressed the motivation for adding quantile output. Is there something else that needs to be done or can we smoke test and merge?
@eeckstein Should I remove the commits that add quantile support from this PR, so that at least the gardening part can land?
FWIW, I think the quantile output is roughly as interesting as mean/SD, so I'm not really opposed to having the ability to print it as an intermediate diagnostic. I just don't think we want to use subsamples for hypothesis testing. More generally, if we decide on a concrete strategy for the benchmark harness in a swift-dev thread, then we might not get so hung up on incremental PRs like this.
The idea to use quantiles came as a way to somewhat normalize the sample size, mainly for scaling through the whole range of benchmark runtimes in a regular way. Fast benchmarks can produce up to ten thousand samples per second; I’ve been capping mine at 2000. Then you need to join several independent runs (there is a class of benchmarks that has a very stable single mode in a given run, but varies between independent executions, so you need several to get a good picture of the population distribution). When serialized into JSON it can be a few hundred KB per benchmark — see the downloadable dataset in the report. But we don’t need a sample size of 20 000 for hypothesis testing. Of course the slow benchmarks get much less. So using quantiles normalizes the sample size across the whole range.

The quantile estimation type I’ve implemented isn’t using any interpolation; it reports the real samples nearest to the requested quantile, so it basically returns a representative subsample. The value we use for the q-quantile controls the reported sample size. Next we need to parse these samples with…

This quantile reporting is about extending what kind of summary from the sample is reported. Mean was wrong, because the distribution isn’t normal and it accumulates error. Maximum is almost always an extreme outlier; quite dangerous without filtering for anomalous measurements. Minimum can be problematic if it’s a rare occurrence. Setting quantile to 2 gives you the Median. 4 gives you Q1, Q2 (Median) and Q3 — we’ll know that 50% of measurements fall between Q1 and Q3. 20 (ventiles) gives you those exact same samples plus 16 more, all in 5% increments of the underlying sample distribution. 100 gives you percentiles. These are the options I’m talking about: how detailed a picture of the measurement distribution we get from the sample.

I was toying (just theoretically) with the idea of storing all samples in binary form, thinking I might pack them into a PNG image for good lossless compression, but there is always the problem of unequal sample sizes… so I didn’t pursue this further.
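To make the mapping from the q value to reported cut points concrete, a tiny illustrative helper (the name is hypothetical), showing how the requested quantile count expands into the probabilities reported alongside MIN (p = 0) and MAX (p = 1):

````swift
/// Probabilities reported for a given --quantile value q:
/// q = 2 → median, q = 4 → quartiles, q = 20 → ventiles, q = 100 → percentiles.
func reportedProbabilities(quantile q: Int) -> [Double] {
    (0...q).map { Double($0) / Double(q) }
}
// reportedProbabilities(quantile: 4) == [0.0, 0.25, 0.5, 0.75, 1.0]
````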
Yes, if you have 20k data points you might want to filter it. And, yes, if you think that the samples within each run may be dependent, then it makes sense to filter within each run, and yes quantiles are probably the best way to filter.
That's great.
If the U test is positive (inconclusive), then we might want to rerun with more samples (depending on some measure of variance). So, wouldn't it be better to feed all the accumulated data to the U test to increase the chance of a negative result, thus decreasing the chance that we need to rerun benchmarks? I think the confidence threshold for the U test is sensitive to the number of data points. If what you're saying is true, I'd be tempted to do something like 3 independent runs, each with 5 samples. Feed all 15 data points to the U test. For any benchmarks where the U test is positive, but the "variance" is significant, accumulate another round of data. Or, maybe we just rerun a single build until we're happy with the "shape" of the data before doing any U-test. Here's where the strategy is unclear. But I suspect we want to be able to do the A/B comparison, then potentially rerun without throwing away the previous data.
I'm not sure why we would ever include min/max unless we were reporting all the data.
Yes, that all makes sense.
We're focused on microbenchmarks, not large-scale, ill-behaved, yet highly important benchmarks where accumulating thousands of samples is actually worthwhile. I could see this if you're in the business of benchmarking transaction processing. In other words, our benchmarks aren't directly representative of real performance, they are just a mechanism for finding performance bugs.

I'm not saying other use cases of the benchmark harness don't matter, but if they do matter then someone who has a stake in it needs to speak up. Whenever you propose features like this it's important to know if anyone will benefit and who will maintain it. Are you doing this for me, for you, or for someone else in the Swift community? If you want to discuss any of these things further, we should do that on swift-dev.
FWIW, this PR is fine with me if @eeckstein approves.
When working with legacy code without good test coverage, it’s recommended not to change old code just for the fun of it. The work on quantiles seemed like a good reason to pay off the remaining technical debt in…

@eeckstein What’s your final word? Should I strip the last 4 commits with quantile reporting from this PR?
Well, this is more in line with what @eeckstein's…
ok, let's keep it as is
@swift-ci smoke test and merge |
I reverted this because it broke in CI: https://ci.swift.org/job/oss-swift-incremental-RA-osx/4909/testReport/
Not sure why that was missed by PR testing, maybe we only build a subset of benchmarks in "smoke test"? |
Update: indeed, this failure is caught if you do "please test", but not with "please smoke test". |
The correct quantile estimation type for printing all measurements in the summary report when `quantile == num-samples - 1` is R-1, SAS-3. It's the inverse of the empirical distribution function. References:
* https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
* discussion in swiftlang#19097 (comment)
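For comparison with the earlier R-3 sketch, a minimal illustration of the R-1, SAS-3 estimate (inverse empirical CDF), again with a hypothetical helper name rather than the actual code; asking for `q = N - 1` quantiles of an N-element sample then reproduces every measurement:

````swift
/// Inverse empirical distribution function, type R-1, SAS-3:
/// Q(p) = x⌈N·p⌉, with x₁ returned for p = 0.
func quantileR1(_ p: Double, of sortedSamples: [Int]) -> Int {
    precondition((0...1).contains(p) && !sortedSamples.isEmpty)
    let rank = Int((Double(sortedSamples.count) * p).rounded(.up))  // ⌈N·p⌉, 1-based
    return sortedSamples[max(rank, 1) - 1]
}
````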
This PR adds support for modifying the benchmark log format to report quantiles in the test summary, specified by the `--quantile` argument.

The default benchmark result reports the statistics of a normal distribution — mean and standard deviation. Unfortunately the samples from our benchmarks are not normally distributed. To get a better picture of the underlying probability distribution, this PR adds support for reporting quantiles.

This gives a better subsample of the measurements in the summary, without the need to resort to the full verbose mode, which might be unnecessarily slow.

Example to get the five-number summary:
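For instance, a hypothetical invocation (the test name is borrowed from the earlier `--delta` output; `--quantile=4` reports the quartiles, which together with MIN and MAX form the five-number summary):

````
$ ./Benchmark_O --num-samples=20 --quantile=4 DropFirstArray
````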
Additionally the `--quantile` argument can be combined with the `--delta` option, which will apply delta encoding to the quantiles and omit `0`s from the output, resulting in a concise, human- as well as machine-readable format that gives a good picture of the underlying probability distribution of the samples.

There was a significant amount of technical debt in the main sampling loop, manifesting as `if-else` spaghetti code 🍝 and a lot of explicit conversions between various integer types with overly specific widths. The code has been cleaned up to use the default currency type `Int` where appropriate. Improved conformance with Swift Naming Guidelines.

The functionality of the pre-existing features of `Benchmark_O` remains unchanged, with the exception of the spelling of the time unit in the log header, which is changed from `(us)` to `(μs)`. The new default header is now the same as in the output from `Benchmark_Driver`:
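A plausible reconstruction of that default header, inferred from the summary columns (MIN, MAX, MEAN, SD, MEDIAN) mentioned earlier in the discussion; the exact spelling in the real log may differ:

````
#,TEST,SAMPLES,MIN(μs),MAX(μs),MEAN(μs),SD(μs),MEDIAN(μs)
````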
Note: `LogParser` is currently ignoring the header; I'm highlighting this change just to be extra careful, in case there are other (Apple-internal?) clients that parse the log output from `Benchmark_O`.

I've structured the PR as a series of small, incremental changes with detailed descriptions in the commit messages; it is therefore best reviewed sequentially by individual commits.