Skip to content

[stdlib] Review and fix some problems with unsafe buffer and Range initialization #34961

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

lorentey
Copy link
Member

@lorentey lorentey commented Dec 4, 2020

The goal here is to make sure there is no performance argument against using UnsafeBufferPointer instead of passing around naked pointer/count values, while also making sure obvious errors are caught in debug builds at least.

  1. Make U*BP.init(start:count:) only check for invalid input in debug builds (by @karwa in [stdlib] Make UBP nil/count checks debug-mode only #34747)
  2. Add a missing debug mode check in [Closed]Range.init(uncheckedBounds:) (from [stdlib] Add a missing debug precondition to Range.init(uncheckedBounds:) #34951)
  3. Add a missing debug mode check in U*BP.init(rebasing:) (followup to [stdlib] Simplify buffer pointer initialization. #34879 based on discussions therein)
  4. Eliminate a signed overflow check in U[M]BP.distance(from:to:) (ditto)
  5. Allow the compiler the assume a nonnegative result in U[M]RBP.count (bonus change)

Putting on my pedantic hat, the new _debugPreconditions are technically source-breaking changes. However, for point (2), Range.init(uncheckedBounds:) was documented to strictly require these conditions, and ranges that violate the invariant aren't likely to be usable. (Code that consumes ranges invariably assumes that lowerBound <= upperBound.)

As for point (3), U*BP.init(rebasing:) can only ever trap if someone explicitly invoked the Slice(base:bounds:) initializer with a bogus index range -- which is highly unlikely, since passing that to .init(rebasing:) is far more work than just calling .init(start:count:).

ABI compatibility can only be an issue for binaries built in debug mode. (Which may be a significant problem.)

If compatibility concerns prevent us from adding (2) and (3) above, then we should remove these but keep the rest.

karwa and others added 8 commits December 4, 2020 02:06
Every other Unsafe(Mutable)BufferPointer precondition in this file is debug-mode only.

This information is already part of the documentation for this function, and it's really hard to get rid of these branches and traps otherwise.
…ckedBounds:)

Unchecked APIs must still perform checks in debug builds to ensure that invariants aren’t violated.
(I.e., these aren’t a license to perform invalid operations — they just let us get rid of the checks when we know for sure they are unnecessary.)
`Slice` does not technically guarantee that its indices are valid in its base, so these initializers accidentally allowed the creation of obviously out-of-bounds buffers.
@lorentey
Copy link
Member Author

lorentey commented Dec 4, 2020

Cc @karwa @Lukasa

@lorentey
Copy link
Member Author

lorentey commented Dec 4, 2020

Let's see how much havoc this will wreak in our benchmarks. 🙈

@swift-ci benchmark

@lorentey

This comment has been minimized.

@Lukasa
Copy link
Contributor

Lukasa commented Dec 4, 2020

Let's see how much havoc this will wreak in our benchmarks. 🙈

I assume it will wreak all the havoc. No benchmark will be safe.

Copy link
Contributor

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shiny, I love this patch.

@lorentey
Copy link
Member Author

lorentey commented Dec 4, 2020

@swift-ci benchmark

@swift-ci

This comment has been minimized.

@swift-ci
Copy link
Contributor

swift-ci commented Dec 4, 2020

Build failed
Swift Test OS X Platform
Git Sha - 2607428

@lorentey
Copy link
Member Author

lorentey commented Dec 4, 2020

OK, the egregious code size regressions in #34951 are gone now. I'm assuming the time regressions are due to changes in inlining decisions; they don't seem too bad, surprisingly enough.

Performance: -Osize

Regression OLD NEW DELTA RATIO
SubstringFromLongString 11 22 +100.0% 0.50x
FlattenListFlatMap 3932 6740 +71.4% 0.58x (?)
SubstringFromLongStringGeneric 17 27 +58.8% 0.63x
FlattenListLoop 1630 2509 +53.9% 0.65x (?)
ObjectiveCBridgeStubDataAppend 2760 3640 +31.9% 0.76x
ArrayAppendSequence 810 1040 +28.4% 0.78x (?)
ObjectiveCBridgeStringHash 110 140 +27.3% 0.79x
Data.append.Sequence.64kB.Count0.RE 295 344 +16.6% 0.86x (?)
Data.append.Sequence.64kB.Count0.RE.I 295 343 +16.3% 0.86x
ObjectiveCBridgeStringGetASCIIContents 444 513 +15.5% 0.87x (?)

SubstringFromLongString is just calling Substring(string) with a long native string. The Generic variant does the same in a function that's generic over RangeReplaceableCollection. There is no reason that these should have become slower, so that's definitely worth a look. (Although I'm guessing the test is probably unduly emphasizing inlining decisions in a particular context that may not be a good indication of code performance in general.)

FlattenListFlatMap looks like an array creation & append(contentsOf:) microbenchmark. I'm surprised it isn't dominated by allocation overhead (do the intermediates get stack-allocated?), and it's difficult to see where the regressions may have come from. Again worth a look, if only to understand what's going on. It looks like this is in the same group as FlattenListLoop and ArrayAppendSequence; these are all benchmarking Array.append()/Array.append(contentsOf:). Appending to an array got much faster in some cases in -O, so this smells like an optimizer instability. Not looking forward to diagnosing this.

The ObjectiveCBridge* regressions are mysterious to me as they don't seem to be dealing with ranges or buffer pointers at all. They are also not that large. I can revisit them once I know more about the above two groups.

@lorentey
Copy link
Member Author

lorentey commented Dec 4, 2020

macOS test failure is unrelated.

14:25:48 ******************** TEST 'Swift(iphonesimulator-i386) :: IRGen/async/run-switch-executor.swift' FAILED ********************
14:25:48 Child process terminated with signal 10: Bus error

@karwa
Copy link
Contributor

karwa commented Dec 4, 2020

Fantastic, thanks for taking this on @lorentey. I wasn’t sure how to debug the benchmark regressions, so I was worried I wouldn’t be able to land the other patch on my own. I’ve heard that seemingly unrelated regressions can sometimes be down to code alignment, but I don’t know how to check that.

Anyway, I’m really happy to see these improvements.

This eliminates a couple of _debugPreconditions which seem to change inlining enough to interfere with some benchmarks.
@lorentey

This comment has been minimized.

@lorentey

This comment has been minimized.

@swift-ci

This comment has been minimized.

@lorentey

This comment has been minimized.

@swift-ci

This comment has been minimized.

@lorentey
Copy link
Member Author

lorentey commented Dec 9, 2020

Regarding that FlattenListFlatMap regression in -O: it looks like the intermediate four-element arrays below are optimized away completely on main, but they are stack-allocated on this branch.

func flattenFlatMap(_ input: [(Int, Int, Int, Int)]) -> [Int] {
  return input.flatMap { [$0.0, $0.1, $0.2, $0.3] }
}

This is somehow triggered by the compiler losing evidence about count being nonnegative by the change of _precondition to _debugPrecondition below.

  public init(
    @_nonEphemeral start: Unsafe${Mutable}Pointer<Element>?, count: Int
  ) {
    _debugPrecondition(
      count >= 0, "Unsafe${Mutable}BufferPointer with negative count")
    _debugPrecondition(
      count == 0 || start != nil,
      "Unsafe${Mutable}BufferPointer has a nil start and nonzero count")
    _position = start
    self.count = count
  }

Replacing the assignment with self.count = _assumeNonNegative(count) seems to let the optimizer get its mojo back.

@lorentey
Copy link
Member Author

lorentey commented Dec 9, 2020

@swift-ci benchmark

@lorentey
Copy link
Member Author

lorentey commented Dec 9, 2020

@swift-ci test

@swift-ci

This comment has been minimized.

@swift-ci
Copy link
Contributor

swift-ci commented Dec 9, 2020

Build failed
Swift Test OS X Platform
Git Sha - 282a140

@lorentey
Copy link
Member Author

lorentey commented Dec 10, 2020

Remaining perf regressions seem to be due to _assumeNonnegative(count) not being enough -- the count >= 0 precondition evidently enables optimizations that this is currently missing. (FWIW, _uncheckedUnsafeAssume slows down FlattenListFlatMap by another factor of 2.) Or for all I know it could just be an inlining instability -- I have no idea what I'm doing. ;-) (The weird part is that the SIL/disassembly looks the same before/after; the difference may be hiding in one of the stdlib dylib entry points.)

The macOS test regression is another case of utf8_decoding_fastpath.swift:

/Users/buildnode/jenkins/workspace/swift-PR-osx/branch-main/swift/test/SILOptimizer/utf8_decoding_fastpath.swift:84:15: error: CHECK-DAG: expected string not found in input
// CHECK-DAG: function_ref {{.*}}_fromUTF8Repairing
              ^
<stdin>:690:84: note: scanning from here
sil @$s22utf8_decoding_fastpath022decodeStringUTF8ViewAsF0yS2S0fG0VF : $@convention(thin) (@guaranteed String.UTF8View) -> @owned String {
                                                                                   ^
<stdin>:698:7: note: possible intended match here
 %5 = function_ref @$sSS8UTF8ViewV32withContiguousStorageIfAvailableyxSgxSRys5UInt8VGKXEKlFSS_Tg50104$sSRy7ElementSTQzGSSs5Error_pIgyozo_ACSSsAD_pIegyrzo_SlRzs16_UnicodeEncodingR_8CodeUnitQy_AARtzr0_lTRSS8ab42V_s0C0O0G0OTG5033$sSS8decoding2asSSx_q_mtcst10_cd3R_8ef4Y58_7a26Rtzr0_lufcSSSRyAGGXEfU_SS8gH14V_s0C0O0H0OTg5Tf3nnpf_nTf1cn_n : $@convention(thin) (@guaranteed String.UTF8View) -> (@owned Optional<String>, @error Error) // user: %7
      ^

The tiny addition of the _assumeNonnegative call made the compiler decide not to inline either decoding closure this time, which is Just Great. This time I'm resolving it by adding an alternate UBP initializer, like with Range.init(_uncheckedBounds:).

@lorentey
Copy link
Member Author

@swift-ci test

@lorentey
Copy link
Member Author

@swift-ci benchmark

@swift-ci
Copy link
Contributor

Performance: -O

Regression OLD NEW DELTA RATIO
ObjectiveCBridgeStubDataAppend 2340 3320 +41.9% 0.70x
ObjectiveCBridgeStringHash 99 118 +19.2% 0.84x
Data.append.Sequence.64kB.Count0.RE 249 294 +18.1% 0.85x
Data.append.Sequence.64kB.Count0.RE.I 249 293 +17.7% 0.85x
Data.init.Sequence.64kB.Count0.RE.I 250 294 +17.6% 0.85x
Data.init.Sequence.64kB.Count0.RE 250 294 +17.6% 0.85x
Data.init.Sequence.64kB.Count0.I 249 292 +17.3% 0.85x
Data.init.Sequence.64kB.Count0 249 291 +16.9% 0.86x
Data.init.Sequence.809B.Count0.RE 403 469 +16.4% 0.86x
Data.append.Sequence.64kB.Count0.I 248 288 +16.1% 0.86x
Data.init.Sequence.809B.Count0.RE.I 404 469 +16.1% 0.86x (?)
Data.init.Sequence.2047B.Count0.I 451 523 +16.0% 0.86x
Data.append.Sequence.64kB.Count0 249 288 +15.7% 0.86x
Data.init.Sequence.809B.Count0 404 467 +15.6% 0.87x (?)
Data.init.Sequence.809B.Count0.I 405 468 +15.6% 0.87x (?)
Data.append.Sequence.809B.Count0.RE 349 403 +15.5% 0.87x
Data.init.Sequence.2049B.Count0.I 454 523 +15.2% 0.87x
Data.append.Sequence.809B.Count0.RE.I 351 404 +15.1% 0.87x
Data.append.Sequence.809B.Count0.I 347 398 +14.7% 0.87x
Data.init.Sequence.511B.Count0.I 424 486 +14.6% 0.87x (?)
Data.init.Sequence.513B.Count0.I 424 486 +14.6% 0.87x (?)
Data.append.Sequence.809B.Count0 348 397 +14.1% 0.88x
ObjectiveCBridgeStringGetASCIIContents 396 440 +11.1% 0.90x (?)
DataAppendBytesSmall 234 257 +9.8% 0.91x (?)
DataAppendDataMediumToSmall 3680 4040 +9.8% 0.91x (?)
Breadcrumbs.IdxToUTF16Range.longASCII 31 34 +9.7% 0.91x
ObjectiveCBridgeStringCStringUsingEncoding 669 731 +9.3% 0.92x (?)
DataSubscriptMedium 56 61 +8.9% 0.92x
Breadcrumbs.UTF16ToIdxRange.longASCII 26 28 +7.7% 0.93x
 
Improvement OLD NEW DELTA RATIO
ArrayAppendLazyMap 2320 1110 -52.2% 2.09x
ArrayAppendUTF16Substring 23544 15120 -35.8% 1.56x
ArrayAppendLatin1Substring 24012 15444 -35.7% 1.55x
ArrayAppendAsciiSubstring 23508 15156 -35.5% 1.55x
Data.init.Sequence.64kB.Count.I 52 35 -32.7% 1.49x
Data.init.Sequence.64kB.Count 52 35 -32.7% 1.49x
Data.init.Sequence.2047B.Count.I 91 65 -28.6% 1.40x
Data.init.Sequence.2049B.Count.I 91 65 -28.6% 1.40x
Data.init.Sequence.809B.Count 86 65 -24.4% 1.32x
Data.init.Sequence.809B.Count.I 86 65 -24.4% 1.32x
Data.init.Sequence.513B.Count.I 100 80 -20.0% 1.25x
Data.init.Sequence.511B.Count.I 99 81 -18.2% 1.22x
ObjectiveCBridgeStubFromNSDateRef 4700 3960 -15.7% 1.19x (?)
DataCreateSmallArray 2900 2450 -15.5% 1.18x
DropWhileSequence 20 17 -15.0% 1.18x
DataAccessBytesMedium 74 67 -9.5% 1.10x
LessSubstringSubstring 38 35 -7.9% 1.09x (?)
EqualSubstringSubstring 38 35 -7.9% 1.09x (?)
EqualSubstringString 38 35 -7.9% 1.09x
LessSubstringSubstringGenericComparable 38 35 -7.9% 1.09x
EqualStringSubstring 39 36 -7.7% 1.08x (?)
EqualSubstringSubstringGenericEquatable 39 36 -7.7% 1.08x (?)
LazilyFilteredArrayContains 14800 13700 -7.4% 1.08x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
LazyFilter.o 7688 7864 +2.3% 0.98x
DropLast.o 20158 20558 +2.0% 0.98x
CSVParsing.o 54339 55027 +1.3% 0.99x
SetTests.o 121053 122421 +1.1% 0.99x
 
Improvement OLD NEW DELTA RATIO
ObjectiveCBridgingStubs.o 16165 14181 -12.3% 1.14x
RemoveWhere.o 15643 14315 -8.5% 1.09x
RangeOverlaps.o 6038 5798 -4.0% 1.04x
Combos.o 5966 5822 -2.4% 1.02x
RomanNumbers.o 5836 5740 -1.6% 1.02x

Performance: -Osize

Regression OLD NEW DELTA RATIO
ObjectiveCBridgeStubDataAppend 2360 3320 +40.7% 0.71x
DistinctClassFieldAccesses 292 398 +36.3% 0.73x (?)
ArrayAppendSequence 700 910 +30.0% 0.77x (?)
DataCopyBytesSmall 84 105 +25.0% 0.80x
ObjectiveCBridgeStringHash 99 118 +19.2% 0.84x
Data.append.Sequence.64kB.Count0.RE 263 301 +14.4% 0.87x
Data.init.Sequence.64kB.Count0.RE 265 301 +13.6% 0.88x
Data.append.Sequence.64kB.Count0.I 268 304 +13.4% 0.88x
Data.init.Sequence.809B.Count0.RE 430 487 +13.3% 0.88x (?)
Data.append.Sequence.64kB.Count0.RE.I 263 297 +12.9% 0.89x
Breadcrumbs.IdxToUTF16Range.longASCII 31 35 +12.9% 0.89x
Data.append.Sequence.809B.Count0 373 421 +12.9% 0.89x (?)
Data.init.Sequence.64kB.Count0.RE.I 265 299 +12.8% 0.89x
Data.init.Sequence.809B.Count0.RE.I 429 484 +12.8% 0.89x (?)
Data.append.Sequence.809B.Count0.RE 369 416 +12.7% 0.89x
Data.init.Sequence.2047B.Count0.I 488 550 +12.7% 0.89x
Data.append.Sequence.64kB.Count0 268 302 +12.7% 0.89x
Data.init.Sequence.64kB.Count0.I 269 303 +12.6% 0.89x
Data.init.Sequence.809B.Count0.I 436 491 +12.6% 0.89x (?)
Data.append.Sequence.809B.Count0.I 373 420 +12.6% 0.89x
DataAppendDataSmallToSmall 3680 4140 +12.5% 0.89x (?)
Data.init.Sequence.64kB.Count0 270 303 +12.2% 0.89x
Data.append.Sequence.809B.Count0.RE.I 369 412 +11.7% 0.90x (?)
Data.init.Sequence.511B.Count0.I 455 508 +11.6% 0.90x (?)
Data.init.Sequence.809B.Count0 436 486 +11.5% 0.90x (?)
DropWhileAnyCollectionLazy 227 253 +11.5% 0.90x
Data.init.Sequence.2049B.Count0.I 487 542 +11.3% 0.90x (?)
ObjectiveCBridgeStringGetASCIIContents 397 441 +11.1% 0.90x (?)
UTF8Decode_InitFromCustom_contiguous_ascii_as_ascii 360 398 +10.6% 0.90x (?)
PrefixWhileAnyCollectionLazy 143 158 +10.5% 0.91x (?)
PrefixAnySeqCRangeIterLazy 143 158 +10.5% 0.91x (?)
PrefixAnySeqCntRangeLazy 143 158 +10.5% 0.91x (?)
PrefixWhileAnySeqCRangeIterLazy 143 158 +10.5% 0.91x (?)
PrefixWhileAnySeqCntRangeLazy 143 158 +10.5% 0.91x (?)
CharIndexing_tweet_unicodeScalars 9680 10640 +9.9% 0.91x
CharIndexing_ascii_unicodeScalars 4880 5360 +9.8% 0.91x
SuffixAnyCollection 53 58 +9.4% 0.91x
ObjectiveCBridgeStringCStringUsingEncoding 670 731 +9.1% 0.92x (?)
PrefixWhileAnyCollection 195 211 +8.2% 0.92x
DataAppendBytesSmall 234 253 +8.1% 0.92x
UTF8Decode_InitFromBytes_ascii_as_ascii 412 444 +7.8% 0.93x (?)
Breadcrumbs.UTF16ToIdxRange.longASCII 26 28 +7.7% 0.93x
 
Improvement OLD NEW DELTA RATIO
CharIteration_tweet_unicodeScalars 6360 4560 -28.3% 1.39x
CharIteration_ascii_unicodeScalars 3040 2360 -22.4% 1.29x
DataCreateSmallArray 3250 2650 -18.5% 1.23x
DropWhileAnySeqCRangeIter 196 165 -15.8% 1.19x
CharIteration_punctuatedJapanese_unicodeScalars 800 680 -15.0% 1.18x
CharIteration_utf16_unicodeScalars 4760 4080 -14.3% 1.17x
CharIteration_japanese_unicodeScalars 5480 4720 -13.9% 1.16x
CharIteration_korean_unicodeScalars 4040 3480 -13.9% 1.16x
CharIteration_russian_unicodeScalars 3840 3320 -13.5% 1.16x
DataAccessBytesMedium 92 80 -13.0% 1.15x
CharIteration_chinese_unicodeScalars 3240 2840 -12.3% 1.14x
DropFirstSequence 60 53 -11.7% 1.13x
DropFirstSequenceLazy 60 53 -11.7% 1.13x
CharIteration_punctuated_unicodeScalars 760 680 -10.5% 1.12x
ObjectiveCBridgeStubDateMutation 256 230 -10.2% 1.11x
LessSubstringSubstring 38 35 -7.9% 1.09x
EqualSubstringSubstring 38 35 -7.9% 1.09x
EqualStringSubstring 38 35 -7.9% 1.09x (?)
EqualSubstringSubstringGenericEquatable 38 35 -7.9% 1.09x
EqualSubstringString 38 35 -7.9% 1.09x
LessSubstringSubstringGenericComparable 38 35 -7.9% 1.09x
DropFirstAnySeqCRangeIterLazy 205 189 -7.8% 1.08x
DropFirstAnySeqCntRangeLazy 205 189 -7.8% 1.08x (?)
Array2D 7280 6736 -7.5% 1.08x
ObjectiveCBridgeFromNSSetAnyObjectToStringForced 81000 75000 -7.4% 1.08x (?)
RandomShuffleLCG2 432 400 -7.4% 1.08x

Code size: -Osize

Regression OLD NEW DELTA RATIO
UTF8Decode.o 22231 22516 +1.3% 0.99x
CSVParsing.o 51591 52135 +1.1% 0.99x
 
Improvement OLD NEW DELTA RATIO
ObjectiveCBridgingStubs.o 15254 13868 -9.1% 1.10x
RemoveWhere.o 14270 13245 -7.2% 1.08x
Combos.o 6287 6077 -3.3% 1.03x
StringMatch.o 4140 4027 -2.7% 1.03x
RomanNumbers.o 5509 5406 -1.9% 1.02x
LazyFilter.o 7437 7314 -1.7% 1.02x

Performance: -Onone

Regression OLD NEW DELTA RATIO
DataCopyBytesSmall 1455 2666 +83.2% 0.55x
DataCopyBytesMedium 1570 2786 +77.5% 0.56x
ArrayValueProp3 5978 8613 +44.1% 0.69x
ArrayValueProp4 5924 8358 +41.1% 0.71x
DataAppendDataSmallToSmall 5560 7260 +30.6% 0.77x (?)
DataAppendBytesMedium 5540 6860 +23.8% 0.81x (?)
DataAppendArray 5500 6800 +23.6% 0.81x (?)
DataAppendDataSmallToMedium 5540 6820 +23.1% 0.81x (?)
StackPromo 51700 63100 +22.1% 0.82x (?)
RandomIntegersDef 57600 69900 +21.4% 0.82x (?)
StringToDataEmpty 3450 4150 +20.3% 0.83x (?)
DevirtualizeProtocolComposition 6852 8232 +20.1% 0.83x (?)
ObjectiveCBridgeStubDataAppend 3900 4680 +20.0% 0.83x
DataAppendDataMediumToSmall 5460 6520 +19.4% 0.84x (?)
ObjectiveCBridgeStringHash 99 118 +19.2% 0.84x
ObjectiveCBridgeStubFromNSDateRef 4800 5710 +19.0% 0.84x (?)
Walsh 7272 8630 +18.7% 0.84x
StringToDataSmall 3600 4250 +18.1% 0.85x (?)
DataReplaceMedium 6000 7000 +16.7% 0.86x (?)
ObjectiveCBridgeStubFromNSDate 5450 6310 +15.8% 0.86x (?)
NSStringConversion.Rebridge.Mutable 1594 1828 +14.7% 0.87x (?)
DataAppendDataMediumToMedium 5800 6640 +14.5% 0.87x (?)
StringBuilderLong 2290 2600 +13.5% 0.88x (?)
NSStringConversion.Mutable 3360 3810 +13.4% 0.88x (?)
NSStringConversion.UTF8 2367 2665 +12.6% 0.89x (?)
Data.init.Sequence.809B.Count0.RE.I 29779 33513 +12.5% 0.89x (?)
Data.append.Sequence.809B.Count0.RE 29367 33047 +12.5% 0.89x (?)
Data.append.Sequence.64kB.Count0.RE.I 23655 26539 +12.2% 0.89x (?)
Data.init.Sequence.809B.Count.RE 29553 33155 +12.2% 0.89x (?)
DataReplaceMediumBuffer 7400 8300 +12.2% 0.89x (?)
Data.append.Sequence.64kB.Count0.RE 23814 26707 +12.1% 0.89x (?)
Data.init.Sequence.64kB.Count0.RE.I 23916 26813 +12.1% 0.89x (?)
Data.init.Sequence.809B.Count0.RE 29818 33412 +12.1% 0.89x (?)
Data.append.Sequence.809B.Count0.RE.I 29463 32976 +11.9% 0.89x (?)
Data.init.Sequence.64kB.Count0.RE 23923 26753 +11.8% 0.89x (?)
Data.append.Sequence.64kB.Count.RE 23945 26759 +11.8% 0.89x (?)
DataAppendSequence 2976900 3321600 +11.6% 0.90x (?)
Data.append.Sequence.64kB.Count.RE.I 23888 26632 +11.5% 0.90x (?)
Data.init.Sequence.809B.Count.RE.I 29559 32800 +11.0% 0.90x (?)
Data.init.Sequence.64kB.Count.RE 23970 26573 +10.9% 0.90x (?)
ObjectiveCBridgeStringGetASCIIContents 397 440 +10.8% 0.90x (?)
Data.append.Sequence.809B.Count.RE.I 29771 32979 +10.8% 0.90x (?)
DataCreateSmall 72150 79890 +10.7% 0.90x (?)
DictionaryCompactMapValuesOfCastValue 52920 58428 +10.4% 0.91x (?)
Data.append.Sequence.809B.Count.RE 29809 32911 +10.4% 0.91x (?)
Data.init.Sequence.64kB.Count.RE.I 24122 26554 +10.1% 0.91x (?)
DataAppendDataLargeToSmall 20000 22000 +10.0% 0.91x (?)
ObjectiveCBridgeStringCStringUsingEncoding 668 732 +9.6% 0.91x (?)
Radix2CooleyTukeyf 79680 87264 +9.5% 0.91x (?)
Sim2DArray 12685 13859 +9.3% 0.92x
CharIndexing_tweet_unicodeScalars_Backwards 599880 653840 +9.0% 0.92x (?)
RomanNumbers2 7651 8336 +9.0% 0.92x (?)
CharIndexing_punctuated_unicodeScalars_Backwards 68800 74760 +8.7% 0.92x (?)
Radix2CooleyTukey 84144 90528 +7.6% 0.93x (?)
LuhnAlgoLazy 4657 5010 +7.6% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
DataCreateSmallArray 36250 33150 -8.6% 1.09x (?)
ArrayAppend 6850 6360 -7.2% 1.08x (?)
Integrate 3648 3388 -7.1% 1.08x (?)

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm in favor of this change! I would like to see the optimizer be smarter about this code, what's the best way to follow up on that?

_position = start
self.count = count
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be @inlinable? I don't think AEIC implies that, but I don't recall the details.

What is @_nonEphemeral?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@_aeic does imply @inlinable -- indeed, it is a stronger variant of it.

The @_nonEphemeral attribute indicates that the associated function escapes the pointer. The compiler emits a warning when a temporary pointer (such as the ones resulting from the implicit inout-to-pointer conversion feature) is passed for a parameter that has this attribute. It's a nice (if limited) safety feature for a common issue.

@lorentey
Copy link
Member Author

lorentey commented Dec 11, 2020

@milseman I have not looked at the implementation of the optimizer, but based the behavior exhibited here and similar cases (as well as second/third-hand info), it seems that one major practical source of performance instability is that the presence/absence of a _debugPrecondition/assert can affect inlining decisions in optimized builds. (I assume because they count towards some sort of a function size metric that is collected before constant folding.) This isn't right -- AFAICT these calls are ~guaranteed to get completely eliminated without a trace, so they should have no effect on optimization decisions.

If this diagnosis is correct (which is a big if -- recall that I know very little about how the optimizer works), then one piece of duck tape we could apply is to explicitly tell the compiler about this fact. I.e., we could introduce a semantic attribute that entirely removes these calls at a very early stage in optimized mode (or just changes any code size metric to not count their projection). This would be an extremely limited half-solution, but it may still be worth it, given how much we rely on these assertions. @atrick @aschwaighofer @eeckstein @gottesmm What do you think? As a Swift engineer, I expect to be able to add assert calls with wild, reckless abandon, without worrying about their potential effect on production builds.

@lorentey
Copy link
Member Author

Yep, #35065 demonstrates that _debugPrecondition has a measurable effect on optimized code. I think it should not.

@usableFromInline @_transparent
internal func _debugPrecondition(
  _ condition: @autoclosure () -> Bool, _ message: StaticString = StaticString(),
  file: StaticString = #file, line: UInt = #line
) {
  // Only check in debug mode.
  if _slowPath(_isDebugAssertConfiguration()) {
    if !_fastPath(condition()) {
      _fatalErrorMessage("Fatal error", message, file: file, line: line,
        flags: _fatalErrorFlags())
    }
  }

@lorentey
Copy link
Member Author

@swift-ci test

@lorentey lorentey merged commit 8082b58 into swiftlang:main Dec 13, 2020
@lorentey lorentey deleted the buffers-need-to-be-fast-but-not-too-fast branch December 13, 2020 03:13
@eeckstein
Copy link
Contributor

@lorentey Unfortunately the situation is more complicated.

We have to distinguish between debug/release build of the stdlib itself and debug/release build of client code.
It's not the case the uses of assert/_debugPrecondition in the stdlib are removed in release builds (of the client program). It cannot, because the same release-built stdlib is used to for debug and release builds of client programs.

assert/_debugPrecondition is only removed if

  • used in the client code (and the client is built in release)
  • a stdlib function is specialized in the client module. But this is not something one can rely on.

Now, I'm not sure what you are asking for. In case you are looking for an "assert" which is removed in the release build of the stdlib: it's _internalInvariant (as you probably know).

Or do you mean that specialized stdlib functions should result in the same code regardless if an assert/_debugPrecondition is in the function's implementation? I'm not sure why this should be a goal, because we cannot guarantee this in general, i.e. in the case a function is not specialized and called in the stdlib dylib.

All this is quite confusing and it always takes me a while until I fully understand it.

@lorentey
Copy link
Member Author

That makes sense! However, at least some of the issues I've been fighting was about compiler behavior for code that isn't calling into the stdlib dylib -- e.g., the utf8_decoding_fastpath.swift test that was the most troublesome in #34747 is looking entirely at code that's emitted into the client module. (Then again, for all I know, changing _precondition to _debugPrecondition may have somehow caused less code getting inlined. E.g., eliminating a >= 0 check up front may have induced more code emitted to handle the < 0 case later.)

I also don't recall seeing any calls into the swift parts of libswiftCore in the FlattenListFlatMap hot loop that I needed to investigate above -- and that one is listed as (weakly) impacted in #35065.

I can easily believe how the AnySequence/AnyCollection benchmarks (that got a 2x speedup in #35065) may call into libswiftCore though -- their code is fully inlinable but I don't expect their boxes would get specialized. The 2x factor seems a bit harsh; we should probably track down the exact cause and try to fix it somehow.

@lorentey
Copy link
Member Author

I also don't recall seeing any calls into the swift parts of libswiftCore in the FlattenListFlatMap hot loop that I needed to investigate above -- and that one is listed as (weakly) impacted in #35065.

Hm, I may be misremembering; I explicitly speculated on dylib changes triggering the issue above. I'll need to double check.

@lorentey
Copy link
Member Author

lorentey commented Dec 14, 2020

Yeah, FlattenListFlatMap only calls runtime entry points and a specialized version of _ArrayBuffer<Int>._consumeAndCreateNew that is within the same module. The generated code is, as far as I can tell, exactly the same before/after removing _debugPrecondition's body.

The FlattenListFlatMap benchmark results in #35065 seem to be entirely due to benchmark flakiness -- locally measured numbers seem to be roughly similar.

Phew! 😌

@lorentey
Copy link
Member Author

SequenceAlgosAnySequence seems to be a good laboratory sample for the 2x speedups in #35065. It is calling this function with s == AnySequence(0 ..< 100_000):

func benchmarkSequenceAlgos<S: Sequence>(s: S, n: Int) where S.Element == Int {
  CheckResults(s.reduce(0, &+) == (n*(n-1))/2)
  let mn = s.min()
  let mx = s.max()
  CheckResults(mn == 0 && mx == n-1)
  CheckResults(s.starts(with: s))
}

This gets correctly specialized for AnySequence<Int>, and it essentially turns into a benchmark for _AnyIteratorBoxBase.next(), which lives in libswiftCore.dylib. Unfortunately that calls Range.subscript that includes a _debugPrecondition that regretfully it cannot be eliminated within the stdlib itself. (ಥ﹏ಥ)

I think the large speedups in #35065 are all because of this wrinkle. (I don't think they are cause for alarm, though. Who uses collection existentials on ranges in practice?)

I have not looked into the ObjectiveCBridge* regressions in the same PR; I expect they might be depending on count >= 0 invariants that the compiler cannot prove once the preconditions are gone, like with the utf8_decoding_fastpath test.

This leaves the FlattenListLoop, InsertCharacterEndIndex and NSStringConversion.MutableCopy.Rebridge.LongUTF8 regressions which are (like FlattenListFlatMap) marked as unreliable in the benchmark report, and I don't see why we shouldn't blame them on flakiness.

So I think I have convinced myself everything is "fine" -- _debugPrecondition works as designed.

Unfortunately we'll likely need to stop relying on the collection conformances of [Closed]Range and Unsafe[Mutable][Raw]BufferPointer within the internals of the stdlib -- for example, having bounds checks has serious consequences for any generic algorithm that calls withContiguous[Mutable]StorageIfAvailable within the stdlib as an optimization. (As, say, MutableCollection.partition(by:) does in its random-access implementation.)

@milseman
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants