Skip to content

[WIP][stdlib] Minor Sequence Optimizations #20758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

palimondo
Copy link
Contributor

Following up on discussion from #20221 (review), these are minor tweaks to Sequence implementation:

  • In-place rotation of ringBuffer used in suffix, avoiding new array copy.
  • Internal use of ContiguousArray in dropLast, prefix and suffix (modeled after map and _filter implementations).

Rotate the ringBuffer in-place instead of copying to new array for lower memory footprint.
Nano-optimization:
Following the template used in methods `map` and `_filter`, internally use ContiguousArray before converting to Array on return.
@palimondo
Copy link
Contributor Author

palimondo commented Nov 26, 2018

When measuring locally, I saw no performance difference. But I don't think our benchmarks, slicing only short sequences of Ints, cover the affected area very thoroughly…

@airspeedswift Can this be accepted only on theoretical memory-saving grounds? Or do we need to devise better benchmark that showcases the tradeoffs between in-place rotation and array copy?

@palimondo
Copy link
Contributor Author

@swift-ci please benchmark

@palimondo
Copy link
Contributor Author

@swift-ci please smoke test

@palimondo
Copy link
Contributor Author

@airspeedswift Please review 🙏

@swift-ci

This comment has been minimized.

@palimondo
Copy link
Contributor Author

Hmm… I don't recall seeing any regressions with SuffixSequence when I ran this locally 2 weeks ago. 🤨

@palimondo
Copy link
Contributor Author

@airspeedswift Any idea why that ^^ might be?

@palimondo
Copy link
Contributor Author

Let's try again…
@swift-ci please benchmark

@swift-ci
Copy link
Contributor

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
SuffixSequenceLazy 569 2242 +294.0% 0.25x
SuffixAnySequence 570 2236 +292.3% 0.25x
SuffixSequence 569 2147 +277.3% 0.27x
DictionaryOfAnyHashableStrings_insert 5164 5918 +14.6% 0.87x
Improvement
PrefixWhileAnySeqCntRange 362 323 -10.8% 1.12x
PrefixWhileAnySeqCRangeIter 346 323 -6.6% 1.07x

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
Suffix.o 25873 37761 +45.9% 0.69x
Improvement
PrefixWhile.o 22878 21422 -6.4% 1.07x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
SuffixSequenceLazy 948 2541 +168.0% 0.37x
SuffixAnySequence 948 2502 +163.9% 0.38x
SuffixSequence 948 2411 +154.3% 0.39x
PrefixWhileAnySeqCntRangeLazy 159 176 +10.7% 0.90x
SuffixCountableRangeLazy 11 12 +9.1% 0.92x
Improvement
PrefixWhileAnyCollectionLazy 176 160 -9.1% 1.10x
DropLastCountableRangeLazy 12 11 -8.3% 1.09x
PrefixWhileAnyCollection 253 235 -7.1% 1.08x (?)

Code size: -Osize

TEST OLD NEW DELTA RATIO
Regression
Suffix.o 24657 37073 +50.4% 0.67x
Improvement
PrefixWhile.o 21366 20118 -5.8% 1.06x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Improvement
DropLastSequence 27911 18712 -33.0% 1.49x
DropLastAnySequence 27690 18679 -32.5% 1.48x
DropLastAnySequenceLazy 28728 19780 -31.1% 1.45x
DropLastSequenceLazy 27753 19366 -30.2% 1.43x
SuffixAnySequence 22602 17760 -21.4% 1.27x
SuffixSequence 22524 17728 -21.3% 1.27x
SuffixAnySequenceLazy 23169 18757 -19.0% 1.24x
SuffixSequenceLazy 22526 18305 -18.7% 1.23x
DropLastAnySeqCRangeIterLazy 63674 54343 -14.7% 1.17x
DropLastAnySeqCRangeIter 63187 54141 -14.3% 1.17x
DropLastAnySeqCntRangeLazy 63519 54535 -14.1% 1.16x
PrefixWhileAnySeqCRangeIter 61305 54217 -11.6% 1.13x
PrefixWhileSequence 33282 29723 -10.7% 1.12x
PrefixWhileAnySequence 33055 29742 -10.0% 1.11x
DropLastAnySeqCntRange 62930 57704 -8.3% 1.09x
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@palimondo
Copy link
Contributor Author

palimondo commented Dec 19, 2018

I cannot reproduce the regression on SuffixSequence locally. Let's try remote performance debugging… First, reverting all changes to suffix for a clean baseline.

@palimondo
Copy link
Contributor Author

@swift-ci please benchmark

@swift-ci

This comment has been minimized.

@palimondo
Copy link
Contributor Author

palimondo commented Dec 19, 2018

Now just the in-place Array rotation...

@swift-ci

This comment has been minimized.

@palimondo
Copy link
Contributor Author

OK, so the in-place rotation is causing the de-optimization in SuffixSequence case. Let's try if we can stop the optimizer from tripping over itself if this isn't inlined...

@swiftlang swiftlang deleted a comment from swift-ci Dec 19, 2018
@swift-ci
Copy link
Contributor

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Improvement
PrefixWhileAnySeqCntRange 362 324 -10.5% 1.12x

Code size: -O

TEST OLD NEW DELTA RATIO
Improvement
PrefixWhile.o 22878 21422 -6.4% 1.07x
Suffix.o 25873 24513 -5.3% 1.06x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
PrefixWhileAnySeqCntRangeLazy 159 176 +10.7% 0.90x
Improvement
PrefixWhileAnyCollectionLazy 176 160 -9.1% 1.10x
DropLastCountableRangeLazy 12 11 -8.3% 1.09x
PrefixWhileAnyCollection 252 235 -6.7% 1.07x

Code size: -Osize

TEST OLD NEW DELTA RATIO
Improvement
Suffix.o 24657 22961 -6.9% 1.07x
PrefixWhile.o 21366 20118 -5.8% 1.06x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Improvement
DropLastAnySequenceLazy 29419 19290 -34.4% 1.53x
DropLastAnySequence 27777 18640 -32.9% 1.49x
DropLastSequenceLazy 27720 18678 -32.6% 1.48x
DropLastSequence 27899 18882 -32.3% 1.48x
SuffixAnySequenceLazy 23745 18011 -24.1% 1.32x
SuffixAnySequence 22630 17529 -22.5% 1.29x
SuffixSequenceLazy 22440 17476 -22.1% 1.28x
SuffixSequence 22714 17758 -21.8% 1.28x
PrefixWhileSequence 34700 27716 -20.1% 1.25x
PrefixWhileAnySequence 34567 27671 -19.9% 1.25x
DropLastAnySeqCntRangeLazy 67280 54253 -19.4% 1.24x
DropLastAnySeqCRangeIterLazy 64172 54601 -14.9% 1.18x
DropLastAnySeqCRangeIter 62879 54022 -14.1% 1.16x
DropLastAnySeqCntRange 63356 55511 -12.4% 1.14x
SuffixAnySeqCRangeIterLazy 60113 53409 -11.2% 1.13x
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@palimondo
Copy link
Contributor Author

palimondo commented Dec 19, 2018

@airspeedswift I think I have this down now… can you please have a look, because in order to work around the weird de-optimization in SuffixSequence, I had to take the array rotation out of the suffix method itself. What would be a proper way to declare a non-public utility extension like I did here for ContiguousArray.rotate(left:) inside the stdlib? I can't quickly find other precedent.

@palimondo
Copy link
Contributor Author

CC @dabrahams @lorentey

@palimondo palimondo changed the title [stdlib] Minor Sequence Optimizations [WIP][stdlib] Minor Sequence Optimizations Dec 19, 2018
@usableFromInline
mutating func _rotate(left distance: Int) {
guard distance > 0 else { return }
let i = index(startIndex, offsetBy: distance)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily the best way to rotate, and is probably not the API we want to commit to supporting even internally. Please try to use what's already written in the algorithms prototype, and especially read this comment and test accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's better than to split this and do just the consistent use of ContiguousArray in this PR, and tackle the in-place rotation in a separate PR, right?

@palimondo palimondo closed this Jan 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants