[stdlib] Review and fix some problems with unsafe buffer and Range initialization #34961

lorentey · 2020-12-04T10:29:33Z

The goal here is to make sure there is no performance argument against using UnsafeBufferPointer instead of passing around naked pointer/count values, while also making sure obvious errors are caught in debug builds at least.

Make U*BP.init(start:count:) only check for invalid input in debug builds (by @karwa in [stdlib] Make UBP nil/count checks debug-mode only #34747)
Add a missing debug mode check in [Closed]Range.init(uncheckedBounds:) (from [stdlib] Add a missing debug precondition to Range.init(uncheckedBounds:) #34951)
Add a missing debug mode check in U*BP.init(rebasing:) (followup to [stdlib] Simplify buffer pointer initialization. #34879 based on discussions therein)
Eliminate a signed overflow check in U[M]BP.distance(from:to:) (ditto)
Allow the compiler the assume a nonnegative result in U[M]RBP.count (bonus change)

Putting on my pedantic hat, the new _debugPreconditions are technically source-breaking changes. However, for point (2), Range.init(uncheckedBounds:) was documented to strictly require these conditions, and ranges that violate the invariant aren't likely to be usable. (Code that consumes ranges invariably assumes that lowerBound <= upperBound.)

As for point (3), U*BP.init(rebasing:) can only ever trap if someone explicitly invoked the Slice(base:bounds:) initializer with a bogus index range -- which is highly unlikely, since passing that to .init(rebasing:) is far more work than just calling .init(start:count:).

ABI compatibility can only be an issue for binaries built in debug mode. (Which may be a significant problem.)

If compatibility concerns prevent us from adding (2) and (3) above, then we should remove these but keep the rest.

Every other Unsafe(Mutable)BufferPointer precondition in this file is debug-mode only. This information is already part of the documentation for this function, and it's really hard to get rid of these branches and traps otherwise.

…uilds

…ckedBounds:) Unchecked APIs must still perform checks in debug builds to ensure that invariants aren’t violated. (I.e., these aren’t a license to perform invalid operations — they just let us get rid of the checks when we know for sure they are unnecessary.)

`Slice` does not technically guarantee that its indices are valid in its base, so these initializers accidentally allowed the creation of obviously out-of-bounds buffers.

…nnegative

lorentey · 2020-12-04T10:30:06Z

Cc @karwa @Lukasa

lorentey · 2020-12-04T10:31:43Z

Let's see how much havoc this will wreak in our benchmarks. 🙈

@swift-ci benchmark

Lukasa · 2020-12-04T10:33:04Z

Let's see how much havoc this will wreak in our benchmarks. 🙈

I assume it will wreak all the havoc. No benchmark will be safe.

Lukasa

Shiny, I love this patch.

lorentey · 2020-12-04T20:48:00Z

@swift-ci benchmark

swift-ci · 2020-12-04T22:29:41Z

Build failed
Swift Test OS X Platform
Git Sha - 2607428

lorentey · 2020-12-04T22:32:41Z

OK, the egregious code size regressions in #34951 are gone now. I'm assuming the time regressions are due to changes in inlining decisions; they don't seem too bad, surprisingly enough.

Performance: -Osize

Regression OLD NEW DELTA RATIO

SubstringFromLongString 11 22 +100.0% 0.50x

FlattenListFlatMap 3932 6740 +71.4% 0.58x (?)

SubstringFromLongStringGeneric 17 27 +58.8% 0.63x

FlattenListLoop 1630 2509 +53.9% 0.65x (?)

ObjectiveCBridgeStubDataAppend 2760 3640 +31.9% 0.76x

ArrayAppendSequence 810 1040 +28.4% 0.78x (?)

ObjectiveCBridgeStringHash 110 140 +27.3% 0.79x

Data.append.Sequence.64kB.Count0.RE 295 344 +16.6% 0.86x (?)

Data.append.Sequence.64kB.Count0.RE.I 295 343 +16.3% 0.86x

ObjectiveCBridgeStringGetASCIIContents 444 513 +15.5% 0.87x (?)

SubstringFromLongString is just calling Substring(string) with a long native string. The Generic variant does the same in a function that's generic over RangeReplaceableCollection. There is no reason that these should have become slower, so that's definitely worth a look. (Although I'm guessing the test is probably unduly emphasizing inlining decisions in a particular context that may not be a good indication of code performance in general.)

FlattenListFlatMap looks like an array creation & append(contentsOf:) microbenchmark. I'm surprised it isn't dominated by allocation overhead (do the intermediates get stack-allocated?), and it's difficult to see where the regressions may have come from. Again worth a look, if only to understand what's going on. It looks like this is in the same group as FlattenListLoop and ArrayAppendSequence; these are all benchmarking Array.append()/Array.append(contentsOf:). Appending to an array got much faster in some cases in -O, so this smells like an optimizer instability. Not looking forward to diagnosing this.

The ObjectiveCBridge* regressions are mysterious to me as they don't seem to be dealing with ranges or buffer pointers at all. They are also not that large. I can revisit them once I know more about the above two groups.

lorentey · 2020-12-04T22:35:40Z

macOS test failure is unrelated.

14:25:48 ******************** TEST 'Swift(iphonesimulator-i386) :: IRGen/async/run-switch-executor.swift' FAILED ********************
14:25:48 Child process terminated with signal 10: Bus error

karwa · 2020-12-04T23:07:18Z

Fantastic, thanks for taking this on @lorentey. I wasn’t sure how to debug the benchmark regressions, so I was worried I wouldn’t be able to land the other patch on my own. I’ve heard that seemingly unrelated regressions can sometimes be down to code alignment, but I don’t know how to check that.

Anyway, I’m really happy to see these improvements.

This eliminates a couple of _debugPreconditions which seem to change inlining enough to interfere with some benchmarks.

…condition

lorentey · 2020-12-09T06:33:24Z

Regarding that FlattenListFlatMap regression in -O: it looks like the intermediate four-element arrays below are optimized away completely on main, but they are stack-allocated on this branch.

func flattenFlatMap(_ input: [(Int, Int, Int, Int)]) -> [Int] {
  return input.flatMap { [$0.0, $0.1, $0.2, $0.3] }
}

This is somehow triggered by the compiler losing evidence about count being nonnegative by the change of _precondition to _debugPrecondition below.

  public init(
    @_nonEphemeral start: Unsafe${Mutable}Pointer<Element>?, count: Int
  ) {
    _debugPrecondition(
      count >= 0, "Unsafe${Mutable}BufferPointer with negative count")
    _debugPrecondition(
      count == 0 || start != nil,
      "Unsafe${Mutable}BufferPointer has a nil start and nonzero count")
    _position = start
    self.count = count
  }

Replacing the assignment with self.count = _assumeNonNegative(count) seems to let the optimizer get its mojo back.

lorentey · 2020-12-09T06:33:49Z

@swift-ci benchmark

lorentey · 2020-12-09T06:33:57Z

@swift-ci test

swift-ci · 2020-12-09T07:58:35Z

Build failed
Swift Test OS X Platform
Git Sha - 282a140

lorentey · 2020-12-10T03:27:29Z

Remaining perf regressions seem to be due to _assumeNonnegative(count) not being enough -- the count >= 0 precondition evidently enables optimizations that this is currently missing. (FWIW, _uncheckedUnsafeAssume slows down FlattenListFlatMap by another factor of 2.) Or for all I know it could just be an inlining instability -- I have no idea what I'm doing. ;-) (The weird part is that the SIL/disassembly looks the same before/after; the difference may be hiding in one of the stdlib dylib entry points.)

The macOS test regression is another case of utf8_decoding_fastpath.swift:

/Users/buildnode/jenkins/workspace/swift-PR-osx/branch-main/swift/test/SILOptimizer/utf8_decoding_fastpath.swift:84:15: error: CHECK-DAG: expected string not found in input
// CHECK-DAG: function_ref {{.*}}_fromUTF8Repairing
              ^
<stdin>:690:84: note: scanning from here
sil @$s22utf8_decoding_fastpath022decodeStringUTF8ViewAsF0yS2S0fG0VF : $@convention(thin) (@guaranteed String.UTF8View) -> @owned String {
                                                                                   ^
<stdin>:698:7: note: possible intended match here
 %5 = function_ref @$sSS8UTF8ViewV32withContiguousStorageIfAvailableyxSgxSRys5UInt8VGKXEKlFSS_Tg50104$sSRy7ElementSTQzGSSs5Error_pIgyozo_ACSSsAD_pIegyrzo_SlRzs16_UnicodeEncodingR_8CodeUnitQy_AARtzr0_lTRSS8ab42V_s0C0O0G0OTG5033$sSS8decoding2asSSx_q_mtcst10_cd3R_8ef4Y58_7a26Rtzr0_lufcSSSRyAGGXEfU_SS8gH14V_s0C0O0H0OTg5Tf3nnpf_nTf1cn_n : $@convention(thin) (@guaranteed String.UTF8View) -> (@owned Optional<String>, @error Error) // user: %7
      ^

The tiny addition of the _assumeNonnegative call made the compiler decide not to inline either decoding closure this time, which is Just Great. This time I'm resolving it by adding an alternate UBP initializer, like with Range.init(_uncheckedBounds:).

… test failures

lorentey · 2020-12-10T03:32:18Z

@swift-ci test

lorentey · 2020-12-10T03:32:25Z

@swift-ci benchmark

swift-ci · 2020-12-10T04:38:06Z

Performance: -O

Regression	OLD	NEW	DELTA	RATIO
ObjectiveCBridgeStubDataAppend	2340	3320	+41.9%	0.70x
ObjectiveCBridgeStringHash	99	118	+19.2%	0.84x
Data.append.Sequence.64kB.Count0.RE	249	294	+18.1%	0.85x
Data.append.Sequence.64kB.Count0.RE.I	249	293	+17.7%	0.85x
Data.init.Sequence.64kB.Count0.RE.I	250	294	+17.6%	0.85x
Data.init.Sequence.64kB.Count0.RE	250	294	+17.6%	0.85x
Data.init.Sequence.64kB.Count0.I	249	292	+17.3%	0.85x
Data.init.Sequence.64kB.Count0	249	291	+16.9%	0.86x
Data.init.Sequence.809B.Count0.RE	403	469	+16.4%	0.86x
Data.append.Sequence.64kB.Count0.I	248	288	+16.1%	0.86x
Data.init.Sequence.809B.Count0.RE.I	404	469	+16.1%	0.86x (?)
Data.init.Sequence.2047B.Count0.I	451	523	+16.0%	0.86x
Data.append.Sequence.64kB.Count0	249	288	+15.7%	0.86x
Data.init.Sequence.809B.Count0	404	467	+15.6%	0.87x (?)
Data.init.Sequence.809B.Count0.I	405	468	+15.6%	0.87x (?)
Data.append.Sequence.809B.Count0.RE	349	403	+15.5%	0.87x
Data.init.Sequence.2049B.Count0.I	454	523	+15.2%	0.87x
Data.append.Sequence.809B.Count0.RE.I	351	404	+15.1%	0.87x
Data.append.Sequence.809B.Count0.I	347	398	+14.7%	0.87x
Data.init.Sequence.511B.Count0.I	424	486	+14.6%	0.87x (?)
Data.init.Sequence.513B.Count0.I	424	486	+14.6%	0.87x (?)
Data.append.Sequence.809B.Count0	348	397	+14.1%	0.88x
ObjectiveCBridgeStringGetASCIIContents	396	440	+11.1%	0.90x (?)
DataAppendBytesSmall	234	257	+9.8%	0.91x (?)
DataAppendDataMediumToSmall	3680	4040	+9.8%	0.91x (?)
Breadcrumbs.IdxToUTF16Range.longASCII	31	34	+9.7%	0.91x
ObjectiveCBridgeStringCStringUsingEncoding	669	731	+9.3%	0.92x (?)
DataSubscriptMedium	56	61	+8.9%	0.92x
Breadcrumbs.UTF16ToIdxRange.longASCII	26	28	+7.7%	0.93x

Improvement	OLD	NEW	DELTA	RATIO
ArrayAppendLazyMap	2320	1110	-52.2%	2.09x
ArrayAppendUTF16Substring	23544	15120	-35.8%	1.56x
ArrayAppendLatin1Substring	24012	15444	-35.7%	1.55x
ArrayAppendAsciiSubstring	23508	15156	-35.5%	1.55x
Data.init.Sequence.64kB.Count.I	52	35	-32.7%	1.49x
Data.init.Sequence.64kB.Count	52	35	-32.7%	1.49x
Data.init.Sequence.2047B.Count.I	91	65	-28.6%	1.40x
Data.init.Sequence.2049B.Count.I	91	65	-28.6%	1.40x
Data.init.Sequence.809B.Count	86	65	-24.4%	1.32x
Data.init.Sequence.809B.Count.I	86	65	-24.4%	1.32x
Data.init.Sequence.513B.Count.I	100	80	-20.0%	1.25x
Data.init.Sequence.511B.Count.I	99	81	-18.2%	1.22x
ObjectiveCBridgeStubFromNSDateRef	4700	3960	-15.7%	1.19x (?)
DataCreateSmallArray	2900	2450	-15.5%	1.18x
DropWhileSequence	20	17	-15.0%	1.18x
DataAccessBytesMedium	74	67	-9.5%	1.10x
LessSubstringSubstring	38	35	-7.9%	1.09x (?)
EqualSubstringSubstring	38	35	-7.9%	1.09x (?)
EqualSubstringString	38	35	-7.9%	1.09x
LessSubstringSubstringGenericComparable	38	35	-7.9%	1.09x
EqualStringSubstring	39	36	-7.7%	1.08x (?)
EqualSubstringSubstringGenericEquatable	39	36	-7.7%	1.08x (?)
LazilyFilteredArrayContains	14800	13700	-7.4%	1.08x (?)

Code size: -O

Regression	OLD	NEW	DELTA	RATIO
LazyFilter.o	7688	7864	+2.3%	0.98x
DropLast.o	20158	20558	+2.0%	0.98x
CSVParsing.o	54339	55027	+1.3%	0.99x
SetTests.o	121053	122421	+1.1%	0.99x

Improvement	OLD	NEW	DELTA	RATIO
ObjectiveCBridgingStubs.o	16165	14181	-12.3%	1.14x
RemoveWhere.o	15643	14315	-8.5%	1.09x
RangeOverlaps.o	6038	5798	-4.0%	1.04x
Combos.o	5966	5822	-2.4%	1.02x
RomanNumbers.o	5836	5740	-1.6%	1.02x

Performance: -Osize

Regression	OLD	NEW	DELTA	RATIO
ObjectiveCBridgeStubDataAppend	2360	3320	+40.7%	0.71x
DistinctClassFieldAccesses	292	398	+36.3%	0.73x (?)
ArrayAppendSequence	700	910	+30.0%	0.77x (?)
DataCopyBytesSmall	84	105	+25.0%	0.80x
ObjectiveCBridgeStringHash	99	118	+19.2%	0.84x
Data.append.Sequence.64kB.Count0.RE	263	301	+14.4%	0.87x
Data.init.Sequence.64kB.Count0.RE	265	301	+13.6%	0.88x
Data.append.Sequence.64kB.Count0.I	268	304	+13.4%	0.88x
Data.init.Sequence.809B.Count0.RE	430	487	+13.3%	0.88x (?)
Data.append.Sequence.64kB.Count0.RE.I	263	297	+12.9%	0.89x
Breadcrumbs.IdxToUTF16Range.longASCII	31	35	+12.9%	0.89x
Data.append.Sequence.809B.Count0	373	421	+12.9%	0.89x (?)
Data.init.Sequence.64kB.Count0.RE.I	265	299	+12.8%	0.89x
Data.init.Sequence.809B.Count0.RE.I	429	484	+12.8%	0.89x (?)
Data.append.Sequence.809B.Count0.RE	369	416	+12.7%	0.89x
Data.init.Sequence.2047B.Count0.I	488	550	+12.7%	0.89x
Data.append.Sequence.64kB.Count0	268	302	+12.7%	0.89x
Data.init.Sequence.64kB.Count0.I	269	303	+12.6%	0.89x
Data.init.Sequence.809B.Count0.I	436	491	+12.6%	0.89x (?)
Data.append.Sequence.809B.Count0.I	373	420	+12.6%	0.89x
DataAppendDataSmallToSmall	3680	4140	+12.5%	0.89x (?)
Data.init.Sequence.64kB.Count0	270	303	+12.2%	0.89x
Data.append.Sequence.809B.Count0.RE.I	369	412	+11.7%	0.90x (?)
Data.init.Sequence.511B.Count0.I	455	508	+11.6%	0.90x (?)
Data.init.Sequence.809B.Count0	436	486	+11.5%	0.90x (?)
DropWhileAnyCollectionLazy	227	253	+11.5%	0.90x
Data.init.Sequence.2049B.Count0.I	487	542	+11.3%	0.90x (?)
ObjectiveCBridgeStringGetASCIIContents	397	441	+11.1%	0.90x (?)
UTF8Decode_InitFromCustom_contiguous_ascii_as_ascii	360	398	+10.6%	0.90x (?)
PrefixWhileAnyCollectionLazy	143	158	+10.5%	0.91x (?)
PrefixAnySeqCRangeIterLazy	143	158	+10.5%	0.91x (?)
PrefixAnySeqCntRangeLazy	143	158	+10.5%	0.91x (?)
PrefixWhileAnySeqCRangeIterLazy	143	158	+10.5%	0.91x (?)
PrefixWhileAnySeqCntRangeLazy	143	158	+10.5%	0.91x (?)
CharIndexing_tweet_unicodeScalars	9680	10640	+9.9%	0.91x
CharIndexing_ascii_unicodeScalars	4880	5360	+9.8%	0.91x
SuffixAnyCollection	53	58	+9.4%	0.91x
ObjectiveCBridgeStringCStringUsingEncoding	670	731	+9.1%	0.92x (?)
PrefixWhileAnyCollection	195	211	+8.2%	0.92x
DataAppendBytesSmall	234	253	+8.1%	0.92x
UTF8Decode_InitFromBytes_ascii_as_ascii	412	444	+7.8%	0.93x (?)
Breadcrumbs.UTF16ToIdxRange.longASCII	26	28	+7.7%	0.93x

Improvement	OLD	NEW	DELTA	RATIO
CharIteration_tweet_unicodeScalars	6360	4560	-28.3%	1.39x
CharIteration_ascii_unicodeScalars	3040	2360	-22.4%	1.29x
DataCreateSmallArray	3250	2650	-18.5%	1.23x
DropWhileAnySeqCRangeIter	196	165	-15.8%	1.19x
CharIteration_punctuatedJapanese_unicodeScalars	800	680	-15.0%	1.18x
CharIteration_utf16_unicodeScalars	4760	4080	-14.3%	1.17x
CharIteration_japanese_unicodeScalars	5480	4720	-13.9%	1.16x
CharIteration_korean_unicodeScalars	4040	3480	-13.9%	1.16x
CharIteration_russian_unicodeScalars	3840	3320	-13.5%	1.16x
DataAccessBytesMedium	92	80	-13.0%	1.15x
CharIteration_chinese_unicodeScalars	3240	2840	-12.3%	1.14x
DropFirstSequence	60	53	-11.7%	1.13x
DropFirstSequenceLazy	60	53	-11.7%	1.13x
CharIteration_punctuated_unicodeScalars	760	680	-10.5%	1.12x
ObjectiveCBridgeStubDateMutation	256	230	-10.2%	1.11x
LessSubstringSubstring	38	35	-7.9%	1.09x
EqualSubstringSubstring	38	35	-7.9%	1.09x
EqualStringSubstring	38	35	-7.9%	1.09x (?)
EqualSubstringSubstringGenericEquatable	38	35	-7.9%	1.09x
EqualSubstringString	38	35	-7.9%	1.09x
LessSubstringSubstringGenericComparable	38	35	-7.9%	1.09x
DropFirstAnySeqCRangeIterLazy	205	189	-7.8%	1.08x
DropFirstAnySeqCntRangeLazy	205	189	-7.8%	1.08x (?)
Array2D	7280	6736	-7.5%	1.08x
ObjectiveCBridgeFromNSSetAnyObjectToStringForced	81000	75000	-7.4%	1.08x (?)
RandomShuffleLCG2	432	400	-7.4%	1.08x

Code size: -Osize

Regression	OLD	NEW	DELTA	RATIO
UTF8Decode.o	22231	22516	+1.3%	0.99x
CSVParsing.o	51591	52135	+1.1%	0.99x

Improvement	OLD	NEW	DELTA	RATIO
ObjectiveCBridgingStubs.o	15254	13868	-9.1%	1.10x
RemoveWhere.o	14270	13245	-7.2%	1.08x
Combos.o	6287	6077	-3.3%	1.03x
StringMatch.o	4140	4027	-2.7%	1.03x
RomanNumbers.o	5509	5406	-1.9%	1.02x
LazyFilter.o	7437	7314	-1.7%	1.02x

Performance: -Onone

Regression	OLD	NEW	DELTA	RATIO
DataCopyBytesSmall	1455	2666	+83.2%	0.55x
DataCopyBytesMedium	1570	2786	+77.5%	0.56x
ArrayValueProp3	5978	8613	+44.1%	0.69x
ArrayValueProp4	5924	8358	+41.1%	0.71x
DataAppendDataSmallToSmall	5560	7260	+30.6%	0.77x (?)
DataAppendBytesMedium	5540	6860	+23.8%	0.81x (?)
DataAppendArray	5500	6800	+23.6%	0.81x (?)
DataAppendDataSmallToMedium	5540	6820	+23.1%	0.81x (?)
StackPromo	51700	63100	+22.1%	0.82x (?)
RandomIntegersDef	57600	69900	+21.4%	0.82x (?)
StringToDataEmpty	3450	4150	+20.3%	0.83x (?)
DevirtualizeProtocolComposition	6852	8232	+20.1%	0.83x (?)
ObjectiveCBridgeStubDataAppend	3900	4680	+20.0%	0.83x
DataAppendDataMediumToSmall	5460	6520	+19.4%	0.84x (?)
ObjectiveCBridgeStringHash	99	118	+19.2%	0.84x
ObjectiveCBridgeStubFromNSDateRef	4800	5710	+19.0%	0.84x (?)
Walsh	7272	8630	+18.7%	0.84x
StringToDataSmall	3600	4250	+18.1%	0.85x (?)
DataReplaceMedium	6000	7000	+16.7%	0.86x (?)
ObjectiveCBridgeStubFromNSDate	5450	6310	+15.8%	0.86x (?)
NSStringConversion.Rebridge.Mutable	1594	1828	+14.7%	0.87x (?)
DataAppendDataMediumToMedium	5800	6640	+14.5%	0.87x (?)
StringBuilderLong	2290	2600	+13.5%	0.88x (?)
NSStringConversion.Mutable	3360	3810	+13.4%	0.88x (?)
NSStringConversion.UTF8	2367	2665	+12.6%	0.89x (?)
Data.init.Sequence.809B.Count0.RE.I	29779	33513	+12.5%	0.89x (?)
Data.append.Sequence.809B.Count0.RE	29367	33047	+12.5%	0.89x (?)
Data.append.Sequence.64kB.Count0.RE.I	23655	26539	+12.2%	0.89x (?)
Data.init.Sequence.809B.Count.RE	29553	33155	+12.2%	0.89x (?)
DataReplaceMediumBuffer	7400	8300	+12.2%	0.89x (?)
Data.append.Sequence.64kB.Count0.RE	23814	26707	+12.1%	0.89x (?)
Data.init.Sequence.64kB.Count0.RE.I	23916	26813	+12.1%	0.89x (?)
Data.init.Sequence.809B.Count0.RE	29818	33412	+12.1%	0.89x (?)
Data.append.Sequence.809B.Count0.RE.I	29463	32976	+11.9%	0.89x (?)
Data.init.Sequence.64kB.Count0.RE	23923	26753	+11.8%	0.89x (?)
Data.append.Sequence.64kB.Count.RE	23945	26759	+11.8%	0.89x (?)
DataAppendSequence	2976900	3321600	+11.6%	0.90x (?)
Data.append.Sequence.64kB.Count.RE.I	23888	26632	+11.5%	0.90x (?)
Data.init.Sequence.809B.Count.RE.I	29559	32800	+11.0%	0.90x (?)
Data.init.Sequence.64kB.Count.RE	23970	26573	+10.9%	0.90x (?)
ObjectiveCBridgeStringGetASCIIContents	397	440	+10.8%	0.90x (?)
Data.append.Sequence.809B.Count.RE.I	29771	32979	+10.8%	0.90x (?)
DataCreateSmall	72150	79890	+10.7%	0.90x (?)
DictionaryCompactMapValuesOfCastValue	52920	58428	+10.4%	0.91x (?)
Data.append.Sequence.809B.Count.RE	29809	32911	+10.4%	0.91x (?)
Data.init.Sequence.64kB.Count.RE.I	24122	26554	+10.1%	0.91x (?)
DataAppendDataLargeToSmall	20000	22000	+10.0%	0.91x (?)
ObjectiveCBridgeStringCStringUsingEncoding	668	732	+9.6%	0.91x (?)
Radix2CooleyTukeyf	79680	87264	+9.5%	0.91x (?)
Sim2DArray	12685	13859	+9.3%	0.92x
CharIndexing_tweet_unicodeScalars_Backwards	599880	653840	+9.0%	0.92x (?)
RomanNumbers2	7651	8336	+9.0%	0.92x (?)
CharIndexing_punctuated_unicodeScalars_Backwards	68800	74760	+8.7%	0.92x (?)
Radix2CooleyTukey	84144	90528	+7.6%	0.93x (?)
LuhnAlgoLazy	4657	5010	+7.6%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
DataCreateSmallArray	36250	33150	-8.6%	1.09x (?)
ArrayAppend	6850	6360	-7.2%	1.08x (?)
Integrate	3648	3388	-7.1%	1.08x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB

milseman

Overall I'm in favor of this change! I would like to see the optimizer be smarter about this code, what's the best way to follow up on that?

milseman · 2020-12-11T17:26:53Z

stdlib/public/core/UnsafeBufferPointer.swift.gyb

+    _position = start
+    self.count = count
+  }
+


Should this also be @inlinable? I don't think AEIC implies that, but I don't recall the details.

What is @_nonEphemeral?

@_aeic does imply @inlinable -- indeed, it is a stronger variant of it.

The @_nonEphemeral attribute indicates that the associated function escapes the pointer. The compiler emits a warning when a temporary pointer (such as the ones resulting from the implicit inout-to-pointer conversion feature) is passed for a parameter that has this attribute. It's a nice (if limited) safety feature for a common issue.

lorentey · 2020-12-11T20:42:22Z

@milseman I have not looked at the implementation of the optimizer, but based the behavior exhibited here and similar cases (as well as second/third-hand info), it seems that one major practical source of performance instability is that the presence/absence of a _debugPrecondition/assert can affect inlining decisions in optimized builds. (I assume because they count towards some sort of a function size metric that is collected before constant folding.) This isn't right -- AFAICT these calls are ~guaranteed to get completely eliminated without a trace, so they should have no effect on optimization decisions.

If this diagnosis is correct (which is a big if -- recall that I know very little about how the optimizer works), then one piece of duck tape we could apply is to explicitly tell the compiler about this fact. I.e., we could introduce a semantic attribute that entirely removes these calls at a very early stage in optimized mode (or just changes any code size metric to not count their projection). This would be an extremely limited half-solution, but it may still be worth it, given how much we rely on these assertions. @atrick @aschwaighofer @eeckstein @gottesmm What do you think? As a Swift engineer, I expect to be able to add assert calls with wild, reckless abandon, without worrying about their potential effect on production builds.

lorentey · 2020-12-11T22:27:54Z

Yep, #35065 demonstrates that _debugPrecondition has a measurable effect on optimized code. I think it should not.

@usableFromInline @_transparent
internal func _debugPrecondition(
  _ condition: @autoclosure () -> Bool, _ message: StaticString = StaticString(),
  file: StaticString = #file, line: UInt = #line
) {
  // Only check in debug mode.
  if _slowPath(_isDebugAssertConfiguration()) {
    if !_fastPath(condition()) {
      _fatalErrorMessage("Fatal error", message, file: file, line: line,
        flags: _fatalErrorFlags())
    }
  }

lorentey · 2020-12-12T02:32:55Z

@swift-ci test

eeckstein · 2020-12-14T15:33:15Z

@lorentey Unfortunately the situation is more complicated.

We have to distinguish between debug/release build of the stdlib itself and debug/release build of client code.
It's not the case the uses of assert/_debugPrecondition in the stdlib are removed in release builds (of the client program). It cannot, because the same release-built stdlib is used to for debug and release builds of client programs.

assert/_debugPrecondition is only removed if

used in the client code (and the client is built in release)
a stdlib function is specialized in the client module. But this is not something one can rely on.

Now, I'm not sure what you are asking for. In case you are looking for an "assert" which is removed in the release build of the stdlib: it's _internalInvariant (as you probably know).

Or do you mean that specialized stdlib functions should result in the same code regardless if an assert/_debugPrecondition is in the function's implementation? I'm not sure why this should be a goal, because we cannot guarantee this in general, i.e. in the case a function is not specialized and called in the stdlib dylib.

All this is quite confusing and it always takes me a while until I fully understand it.

lorentey · 2020-12-14T22:01:01Z

That makes sense! However, at least some of the issues I've been fighting was about compiler behavior for code that isn't calling into the stdlib dylib -- e.g., the utf8_decoding_fastpath.swift test that was the most troublesome in #34747 is looking entirely at code that's emitted into the client module. (Then again, for all I know, changing _precondition to _debugPrecondition may have somehow caused less code getting inlined. E.g., eliminating a >= 0 check up front may have induced more code emitted to handle the < 0 case later.)

I also don't recall seeing any calls into the swift parts of libswiftCore in the FlattenListFlatMap hot loop that I needed to investigate above -- and that one is listed as (weakly) impacted in #35065.

I can easily believe how the AnySequence/AnyCollection benchmarks (that got a 2x speedup in #35065) may call into libswiftCore though -- their code is fully inlinable but I don't expect their boxes would get specialized. The 2x factor seems a bit harsh; we should probably track down the exact cause and try to fix it somehow.

lorentey · 2020-12-14T22:23:51Z

I also don't recall seeing any calls into the swift parts of libswiftCore in the FlattenListFlatMap hot loop that I needed to investigate above -- and that one is listed as (weakly) impacted in #35065.

Hm, I may be misremembering; I explicitly speculated on dylib changes triggering the issue above. I'll need to double check.

lorentey · 2020-12-14T23:39:55Z

Yeah, FlattenListFlatMap only calls runtime entry points and a specialized version of _ArrayBuffer<Int>._consumeAndCreateNew that is within the same module. The generated code is, as far as I can tell, exactly the same before/after removing _debugPrecondition's body.

The FlattenListFlatMap benchmark results in #35065 seem to be entirely due to benchmark flakiness -- locally measured numbers seem to be roughly similar.

Phew! 😌

lorentey · 2020-12-15T00:18:53Z

SequenceAlgosAnySequence seems to be a good laboratory sample for the 2x speedups in #35065. It is calling this function with s == AnySequence(0 ..< 100_000):

func benchmarkSequenceAlgos<S: Sequence>(s: S, n: Int) where S.Element == Int {
  CheckResults(s.reduce(0, &+) == (n*(n-1))/2)
  let mn = s.min()
  let mx = s.max()
  CheckResults(mn == 0 && mx == n-1)
  CheckResults(s.starts(with: s))
}

This gets correctly specialized for AnySequence<Int>, and it essentially turns into a benchmark for _AnyIteratorBoxBase.next(), which lives in libswiftCore.dylib. Unfortunately that calls Range.subscript that includes a _debugPrecondition that regretfully it cannot be eliminated within the stdlib itself. (ಥ﹏ಥ)

I think the large speedups in #35065 are all because of this wrinkle. (I don't think they are cause for alarm, though. Who uses collection existentials on ranges in practice?)

I have not looked into the ObjectiveCBridge* regressions in the same PR; I expect they might be depending on count >= 0 invariants that the compiler cannot prove once the preconditions are gone, like with the utf8_decoding_fastpath test.

This leaves the FlattenListLoop, InsertCharacterEndIndex and NSStringConversion.MutableCopy.Rebridge.LongUTF8 regressions which are (like FlattenListFlatMap) marked as unreliable in the benchmark report, and I don't see why we shouldn't blame them on flakiness.

So I think I have convinced myself everything is "fine" -- _debugPrecondition works as designed.

Unfortunately we'll likely need to stop relying on the collection conformances of [Closed]Range and Unsafe[Mutable][Raw]BufferPointer within the internals of the stdlib -- for example, having bounds checks has serious consequences for any generic algorithm that calls withContiguous[Mutable]StorageIfAvailable within the stdlib as an optimization. (As, say, MutableCollection.partition(by:) does in its random-access implementation.)

milseman · 2020-12-21T18:07:05Z

Precondition/debugPrecondition/internalInvariant is mentioned in https://github.com/apple/swift/blob/main/docs/StandardLibraryProgrammersManual.md#_precondition-_debugprecondition-and-_internalinvariant.

karwa and others added 8 commits December 4, 2020 02:06

Make UBP nil/count checks debug-mode only

74a8254

Every other Unsafe(Mutable)BufferPointer precondition in this file is debug-mode only. This information is already part of the documentation for this function, and it's really hard to get rid of these branches and traps otherwise.

Make raw buffer nil/count checks debug-mode only

83c14bc

[stdlib] Encourage more inlining on String’s UTF-8 decoding fast path

90d58fe

[test] ArrayTraps: Skip trap test on negative UBP counts in release b…

7f63b42

…uilds

[stdlib] Add a missing debug mode check to U*BP.init(rebasing:)

d52cd82

`Slice` does not technically guarantee that its indices are valid in its base, so these initializers accidentally allowed the creation of obviously out-of-bounds buffers.

[stdlib] Eliminate an overflow check in U[M]BP.distance

27978d1

[stdlib] U[M]RBP.count: Let the compiler assume that the result is no…

2607428

…nnegative

lorentey requested review from milseman, kylemacomber, atrick, natecook1000 and airspeedswift December 4, 2020 10:29

This comment has been minimized.

Sign in to view

lorentey mentioned this pull request Dec 4, 2020

[stdlib] Add a missing debug precondition to Range.init(uncheckedBounds:) #34951

Closed

Lukasa approved these changes Dec 4, 2020

View reviewed changes

lorentey mentioned this pull request Dec 4, 2020

[stdlib] Make UBP nil/count checks debug-mode only #34747

Closed

This comment has been minimized.

Sign in to view

[stdlib] Substring: Restore original Range paths

513a7d9

This eliminates a couple of _debugPreconditions which seem to change inlining enough to interfere with some benchmarks.

This comment has been minimized.

Sign in to view

[stdlib] U[R]BP: Restore compile-time condition lost with removed pre…

282a140

…condition

This comment has been minimized.

Sign in to view

[stdlib] Add an unsafe U[M]BP initializer to work around some inliner…

1f92df0

… test failures

lorentey requested a review from stephentyrone December 10, 2020 20:48

milseman reviewed Dec 11, 2020

View reviewed changes

lorentey mentioned this pull request Dec 11, 2020

[DO NOT MERGE][stdlib] Make assert(), _debugPreconditions() noops #35065

Closed

[stdlib] Document sources of code bloat

94a7eee

lorentey mentioned this pull request Dec 12, 2020

[DO NOT MERGE] More experiments with UnsafeBufferPointer.init #35075

Closed

lorentey merged commit 8082b58 into swiftlang:main Dec 13, 2020

lorentey deleted the buffers-need-to-be-fast-but-not-too-fast branch December 13, 2020 03:13

lorentey mentioned this pull request Jun 3, 2021

Remove as much checked math as possible from buffer pointers. #37424

Merged

[stdlib] Review and fix some problems with unsafe buffer and Range initialization #34961

[stdlib] Review and fix some problems with unsafe buffer and Range initialization #34961

Uh oh!

Conversation

lorentey commented Dec 4, 2020

Uh oh!

lorentey commented Dec 4, 2020

Uh oh!

lorentey commented Dec 4, 2020

Uh oh!

This comment has been minimized.

Lukasa commented Dec 4, 2020

Uh oh!

Lukasa left a comment

Choose a reason for hiding this comment

Uh oh!

lorentey commented Dec 4, 2020

Uh oh!

This comment has been minimized.

swift-ci commented Dec 4, 2020

Uh oh!

lorentey commented Dec 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance: -Osize

Uh oh!

lorentey commented Dec 4, 2020

Uh oh!

karwa commented Dec 4, 2020

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

lorentey commented Dec 9, 2020

Uh oh!

lorentey commented Dec 9, 2020

Uh oh!

lorentey commented Dec 9, 2020

Uh oh!

This comment has been minimized.

swift-ci commented Dec 9, 2020

Uh oh!

lorentey commented Dec 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentey commented Dec 10, 2020

Uh oh!

lorentey commented Dec 10, 2020

Uh oh!

swift-ci commented Dec 10, 2020

Performance: -O

Code size: -O

Performance: -Osize

Code size: -Osize

Performance: -Onone

Code size: -swiftlibs

Uh oh!

milseman left a comment

Choose a reason for hiding this comment

Uh oh!

milseman Dec 11, 2020

Choose a reason for hiding this comment

Uh oh!

lorentey Dec 11, 2020

Choose a reason for hiding this comment

Uh oh!

lorentey commented Dec 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentey commented Dec 11, 2020

Uh oh!

lorentey commented Dec 12, 2020

Uh oh!

eeckstein commented Dec 14, 2020

Uh oh!

lorentey commented Dec 14, 2020

Uh oh!

lorentey commented Dec 14, 2020

Uh oh!

lorentey commented Dec 4, 2020 •

edited

Loading

lorentey commented Dec 10, 2020 •

edited

Loading

lorentey commented Dec 11, 2020 •

edited

Loading

lorentey commented Dec 14, 2020 •

edited

Loading