[benchmark] Extract Setup from Benchmarks #20048
Conversation
ArrayAppendStrings had setup overhead of 10 ms (42%). ArrayAppendLazyMap had setup overhead of 24 μs (1%). ArrayAppendOptionals and ArrayAppendArrayOfInt also had a barely visible overhead of ~18 μs, mostly hidden in measurement noise, but I’ve extracted the setup from all places that had 10 000-element array initializations, in preparation for more precise measurement in the future.
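For reference, the extraction pattern used throughout these commits looks roughly like the sketch below (hypothetical names, assuming the `BenchmarkInfo`/`setUpFunction`/`blackHole` API from the suite’s `TestsUtils`): the expensive input is a lazily initialized global constant, and the setup function forces it before measurement so its cost is not attributed to the first sample.

```swift
import TestsUtils

// Hypothetical benchmark; not the actual ArrayAppendStrings source.
public let ArrayAppendStringsSketch = BenchmarkInfo(
  name: "ArrayAppendStringsSketch",
  runFunction: run_ArrayAppendStringsSketch,
  tags: [.validation, .api, .Array],
  setUpFunction: { blackHole(appendedStrings) })

// 10 000-element input, built once outside the measured run function.
let appendedStrings: [String] = (0..<10_000).map { String($0) }

public func run_ArrayAppendStringsSketch(_ N: Int) {
  for _ in 1...N {
    var strings: [String] = []
    strings += appendedStrings
    blackHole(strings)
  }
}
```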
ArrayInClass had setup overhead of 88 μs (17%).
DataCount had setup overhead of 18 μs (20%). DataSubscript had setup overhead of 18 μs (2%). A `setUpFunction` wasn’t necessary, because the initialization is short (18 μs for `sampleData(.medium)`) and will inflate only the initial measurement. Other benchmarks hide the sampleData initialization in their artificially high runtimes (most use an internal multiplier of 10 000 iterations), but they were changed to use the same constant data, since it was already available. The overhead will thus already be extracted if we go for more precise measurement with lower multipliers in the future.
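A minimal sketch of the shared-constant variant described above (assumed names and a stand-in payload, not the actual DataBenchmarks code): the data is a global constant whose one-time initialization only inflates the very first sample, so no setup function is needed.

```swift
import Foundation

// Stand-in for the suite's sampleData(.medium); the real helper builds a
// specific ~1 KB byte pattern, this just produces some payload once.
let mediumData = Data(repeating: 0xFE, count: 1_033)

public func run_DataCountSketch(_ N: Int) {
  var total = 0
  for _ in 1...10_000 * N {
    total &+= mediumData.count   // reuses the shared constant on every iteration
  }
  precondition(total > 0)
}
```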
Dictionary had setup overhead of 136 μs (6%). DictionaryOfObjects had setup overhead of 616 μs (7%). Also fixed the variable naming convention (lowerCamelCase).
DistinctClassFieldAccesses had setup overhead of 4 μs (14%). Plus a cosmetic code formatting fix.
IterateData had setup overhead of 480 μs (10%). A strange setup overhead remained after extracting the data into the setUpFunction, caused by an off-by-one error in the main loop. It should be either `for _ in 1...10*N` or `for _ in 0..<10*N`. It’s an error to use `0...m*N`, because this results in `m*N + 1` iterations, which are then divided by N in the reported measurement. The extra iteration manifests as a mysterious setup overhead!
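A small sketch of the off-by-one (plain Swift, not the IterateData source): with a multiplier of 10 and N = 3, the closed range starting at 0 performs one extra iteration, and that extra work surfaces as phantom setup overhead once the total is divided by N.

```swift
let N = 3, multiplier = 10

var a = 0, b = 0, c = 0
for _ in 0...multiplier * N { a += 1 }   // 31 iterations: m*N + 1 (the bug)
for _ in 1...multiplier * N { b += 1 }   // 30 iterations: m*N
for _ in 0..<multiplier * N { c += 1 }   // 30 iterations: m*N

print(a, b, c)   // 31 30 30
```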
Phonebook had setup overhead of 1266 μs (7%).
PolymorphicCalls had setup overhead of 4 μs (7%).
RandomShuffleLCG2 had setup overhead of 902 μs (17%), even though it already used the setUpFunction. It turns out that copying a 100k-element array is measurably costly. The only way I could think of to eliminate this overhead from the measurement is to let the numbersLCG array linger around (800 kB), because shuffling the IOU version had different performance.
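A hedged sketch of the lingering-array approach (hypothetical names, plain `shuffle()` instead of the benchmark’s seeded LCG generator): the 100k-element array is created once in the setup function, kept alive between samples, and shuffled in place, so no 100k-element copy ends up inside the measured run.

```swift
// Lingers between samples (~800 kB for 100k Ints); never re-created per run.
var lingeringNumbers: [Int] = []

func setUpShuffleSketch() {
  lingeringNumbers = Array(0..<100_000)   // paid once, outside measurement
}

public func run_ShuffleSketch(_ N: Int) {
  for _ in 1...N {
    // Shuffling in place avoids copying the array on every sample,
    // which was the source of the ~900 μs of residual setup overhead.
    lingeringNumbers.shuffle()
  }
}
```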
SortSortedStrings had setup overhead of 914 μs (30%). Renamed the [String] constants to be shorter and more descriptive. Extracted the lazy initialization of all these constants into `setUpFunction`, for cleaner measurements.
SubstringComparable had setup overhead of 58 μs (26%). This was a tricky modification: extracting the `substrings` and `comparison` constants out of the run function surprisingly resulted in decreased performance. For some reason this configuration causes a significant increase in retain/release traffic. Aliasing the constants in the run function somehow works around this deoptimization. Also, the initial split of the string into 8 substrings takes 44 ms!!! (I’m suspecting some kind of one-time ICU initialization?)
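A sketch of the aliasing workaround (assumed shapes, not the actual SubstringComparable source): the inputs live in globals filled by the setup function, and the run function rebinds them to shadowing local constants before the loop, which in the real benchmark avoided the extra retain/release traffic.

```swift
var substringsSketch: [Substring] = []
var comparisonSketch: [Substring] = []

func setUpSubstringComparableSketch() {
  let s = String(repeating: "abcdefgh", count: 1_000)
  substringsSketch = s.split(separator: "a")   // the costly one-time split
  comparisonSketch = substringsSketch
}

public func run_SubstringComparableSketch(_ N: Int) {
  // Local aliases of the globals; this rebinding is what worked around
  // the deoptimization that increased retain/release traffic.
  let substrings = substringsSketch
  let comparison = comparisonSketch
  for _ in 1...N {
    precondition(substrings == comparison)
  }
}
```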
Sequence benchmarks that test operations on Arrays have setup overhead of 14 μs. (Up from 4 μs a year ago!) That’s just the creation of an [Int] with 2k elements from a range… This array is now extracted into a constant. This commit also removes the .unstable tag from some CountableRange benchmarks, restoring them to the commit set of the Swift Benchmark Suite.
MapReduceClass had setup overhead of 868 μs (7%). The setup overhead of MapReduceClassShort was practically lost in the measurement noise from its artificially high base load, but it was there. Extracting the decimal array initialization into the `setUpFunction` also takes out the cost of releasing the [NSDecimalNumber], which turns out to be about half of the measured runtime in the case of the MapReduceClass benchmark. This significantly changes the reported runtimes (to about half); therefore the modified benchmarks get a new name with the suffix `2`.
@eeckstein Please review 🙏
Unrelated to this PR, I'm a bit concerned about the lower precision changes detected by
Here's the comparison report:
I've also compared the logs from the modified (more robust) Benchmark_Driver that included #19910:
Once these changes land, I'll look at improving the sensitivity of the smoke benchmark using the improvements from #19910. EDIT: Thinking about this more, the smaller benchmarks could still be able to amortize the setup overhead in the 2500 μs
@swift-ci benchmark
!!! Couldn't read commit file !!!
@swift-ci benchmark
Build comment file: Performance: -O / Code size: -O / Performance: -Osize / Code size: -Osize / Performance: -Onone (report tables omitted).
How to read the data: the tables contain performance differences larger than 8% and code size differences larger than 1%. Performance results (not code size) can contain false alarms; unexpected regressions marked with '(?)' are probably noise. If unexplained regressions persist across re-runs, consult the performance team (@eeckstein). Hardware Overview omitted.
@airspeedswift Do those (re-)added benchmarks add any value for stdlib benchmarking?
As their original author, I’d like to point out that the goal of expanding the Sequence API benchmark coverage was to demonstrate its interaction with various concrete underlying Sequence and Collection types. It shows that the optimizer can (or cannot) deal with all the combinations that can arise from stdlib types. Also note how it demonstrates the extremely poor performance of most parts of the subsystem. The benchmarks in question were removed because they were deemed unstable.
Note: The changes reported here by
Since this benchmark has been significantly modified and needs to be renamed, we can also lower the workload by a factor of 10, to keep up with best practices. The old benchmark that uses `NSDecimalNumber` as the tested class is renamed to `MapReduceNSDecimalNumber`, and the renamed `MapReduceClass2` now measures a Swift class `Box` that wraps an `Int`. Short versions were modified analogously.
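A hedged sketch of what the reworked class benchmark might look like (the `Box` shape is inferred from the description above, the rest of the names are hypothetical): a trivial Swift class wrapping an `Int`, with the boxed array built once in the setup function and mapped/reduced over in the run function.

```swift
// Assumed shape of the Box class mentioned above; not the exact PR source.
final class Box {
  var value: Int
  init(_ value: Int) { self.value = value }
}

var boxedNumbers: [Box] = []

func setUpMapReduceClass2Sketch() {
  boxedNumbers = (0..<1_000).map(Box.init)   // built once, outside measurement
}

public func run_MapReduceClass2Sketch(_ N: Int) {
  var result = 0
  for _ in 1...N {
    result &+= boxedNumbers.map { $0.value &+ 1 }.reduce(0, &+)
  }
  precondition(result > 0)
}
```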
Based on the comments by @airspeedswift in the forums, I've appropriated the
@swift-ci benchmark
@swift-ci benchmark
Build comment file: Performance: -O / Code size: -O / Performance: -Osize / Code size: -Osize / Performance: -Onone (report tables omitted; same "How to read the data" guidance and Hardware Overview as above).
@eeckstein are you still waiting for @airspeedswift to chime in (re: re-added tests), or can we merge this as is?
The `legacyFactor` is a great idea. Does it make sense to implement that first and use that to compensate for these benchmark changes?
No. As I said above, the changes reported here are just artifacts of low
If you restored the old
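For readers unfamiliar with the legacyFactor idea discussed above, here is a hypothetical illustration (this is only a proposal at this point, not an existing harness API): the benchmark runs a workload reduced by some factor and the reported time is multiplied back up, so results stay comparable with the historical series.

```swift
// Hypothetical illustration of the legacyFactor idea; not actual harness code.
struct ScaledResult {
  let name: String
  let measuredMicroseconds: Double
  let legacyFactor: Int   // the factor by which the workload was reduced

  // Reporting on the old scale keeps existing baselines and thresholds valid.
  var reportedMicroseconds: Double { measuredMicroseconds * Double(legacyFactor) }
}

let sample = ScaledResult(name: "MapReduceClass2",
                          measuredMicroseconds: 42,
                          legacyFactor: 10)
print(sample.reportedMicroseconds)   // 420, comparable with the pre-change series
```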
@swift-ci smoke test
For now it seems that the re-added benchmarks are not unstable. I'm merging this. If it turns out that some of those benchmarks are unstable, we can disable them again.
This PR should land in tandem with #19910, which changes the measurement method used in `Benchmark_Driver` to run `Benchmark_O` with `--num-iters=1`. That change would appear to increase the reported runtime for benchmarks that have measurable setup overhead, which would no longer be amortized across multiple iterations.

Where appropriate, benchmarks were modified to use `setUpFunction` for costlier workload initialization. In some cases, extracting parts of the workload into constants was a sufficient change that has only a minor impact on the first measured sample. `MapReduceClass` and `MapReduceClassShort` were the only two benchmarks where the changes to extract setup resulted in significantly changed runtimes that do not resemble the original values (compared to the setup amortized across multiple iterations inside the run function).

Running the benchmarks locally, it looks like the unmodified benchmark `RecursiveOwnedParameter` is reporting a change (20% improvement). Do we still have an issue with spurious performance changes when adding/removing benchmarks, caused by code alignment?

The PR is structured as a series of small, incremental changes with detailed descriptions in the commit messages; it is therefore best reviewed sequentially, by individual commits.