Remove as much checked math as possible from buffer pointers. #37424

Merged: 3 commits merged into swiftlang:main from the cb-fewer-checks-in-ubp branch on Feb 4, 2022

Conversation

Lukasa
Contributor

@Lukasa Lukasa commented May 14, 2021

This patch removes as much checked math as I was able to find from the
unsafe buffer pointer types. In particular, I removed two kinds of
checked math:

  1. Checked math that was provably unable to trap due to prior operations
    done in the same function, or in a function that must have been
    called before the current one. For example, in
    index(_:offsetBy:limitedBy:), when we do the final index addition we
    have already validated that it must be in-bounds. Therefore, it
    cannot overflow.
  2. Checked math that can only fail if the user is misusing the indices.
    As previously discussed with Karoy and Andrew, the unsafe raw buffer
    types are not obligated to crash if you misuse their indices; they
    are free to invoke undefined behaviour. In these cases, I added
    defensive overflow checks in debug mode.

The result of this change should be reductions in code size in almost
all usage sites of the raw buffer types.
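
To make the two kinds of removal concrete, here is a minimal standalone sketch of their shape (a hypothetical, simplified type; assert stands in for the stdlib's debug-only _debugPrecondition; illustrative only, not the actual diff):

struct TinyBuffer {
    var count: Int

    // Kind 1: indices live in 0...count, so the subtraction cannot wrap,
    // and once we know n fits within the distance to `limit`, the final
    // addition is provably in bounds. Wrapping arithmetic (&-, &+) skips
    // a checked-math trap that the test above already made unreachable.
    func index(_ i: Int, offsetBy n: Int, limitedBy limit: Int) -> Int? {
        let distance = limit &- i
        if n >= 0 ? distance < n : distance > n { return nil }
        return i &+ n
    }

    // Kind 2: misusing indices is undefined behaviour for the unsafe
    // buffer types, so overflow is checked only in debug builds; the
    // wrapping addition emits no trap in release builds.
    func index(after i: Int) -> Int {
        assert(i < Int.max, "Can't advance past Int.max")
        return i &+ 1
    }
}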

@Lukasa
Contributor Author

Lukasa commented May 14, 2021

@swift-ci please test

@Lukasa
Contributor Author

Lukasa commented May 14, 2021

cc @atrick @lorentey

@swift-ci
Contributor

Build failed
Swift Test OS X Platform
Git Sha - ea3e1c5199dd2579643037d410ea40ffaa3cc97e

Contributor

@atrick atrick left a comment

Thanks. I agree with the premise. UnsafeBufferPointer indices are only checked in debug builds, so this change makes the implementation more consistent. And these are inlinable, critical code paths where we don't want extra traps.

UnsafeRawBufferPointer may have the same issue, at least in the iterator.

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - ea3e1c5199dd2579643037d410ea40ffaa3cc97e

@Lukasa
Contributor Author

Lukasa commented May 14, 2021

@swift-ci please test

@Lukasa
Contributor Author

Lukasa commented May 14, 2021

UnsafeRawBufferPointer definitely also needs cleaning up, I’ll tackle that as well.

@swift-ci
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 0fedd0b9cfe3dcac2543dddec3f789d6d3b1016f

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 0fedd0b9cfe3dcac2543dddec3f789d6d3b1016f

@Lukasa Lukasa force-pushed the cb-fewer-checks-in-ubp branch from 0fedd0b to b5d176b on May 15, 2021 06:51
@Lukasa
Contributor Author

Lukasa commented May 15, 2021

@swift-ci please test

@Lukasa Lukasa force-pushed the cb-fewer-checks-in-ubp branch from b5d176b to a38d9f6 on May 15, 2021 07:08
@Lukasa
Contributor Author

Lukasa commented May 15, 2021

@swift-ci please test

@Lukasa
Contributor Author

Lukasa commented May 15, 2021

@atrick I've added URBP to this as well; the same kinds of changes, really, though I also removed a division by replacing it with a series of unchecked additions in a loop we were going to execute anyway.
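
A hypothetical before/after sketch of that kind of division removal (made-up function names, assuming a suitably aligned buffer; not the actual stdlib change):

// Before: divide up front to get an element count, then loop over it.
func sumWithDivision(_ bytes: UnsafeRawBufferPointer) -> Int {
    let stride = MemoryLayout<UInt32>.stride
    let count = bytes.count / stride   // the division we want to avoid
    var total = 0
    for i in 0..<count {
        total &+= Int(bytes.load(fromByteOffset: i * stride, as: UInt32.self))
    }
    return total
}

// After: step a byte offset with unchecked additions inside the loop we
// were going to execute anyway -- no division, and no checked math.
func sumWithoutDivision(_ bytes: UnsafeRawBufferPointer) -> Int {
    let stride = MemoryLayout<UInt32>.stride
    var total = 0
    var offset = 0
    while offset &+ stride <= bytes.count {
        total &+= Int(bytes.load(fromByteOffset: offset, as: UInt32.self))
        offset &+= stride
    }
    return total
}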

@swift-ci
Contributor

Build failed
Swift Test OS X Platform
Git Sha - a38d9f6eb63de9ee3ba0f39a9debad1f81943a7c

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - a38d9f6eb63de9ee3ba0f39a9debad1f81943a7c

@Lukasa Lukasa force-pushed the cb-fewer-checks-in-ubp branch from a38d9f6 to bb36d84 on May 15, 2021 09:32
@Lukasa
Contributor Author

Lukasa commented May 15, 2021

@swift-ci please test

@swift-ci
Contributor

Build failed
Swift Test OS X Platform
Git Sha - bb36d84f3aec5bd681d3a4b58fce1f095edb3daf

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - bb36d84f3aec5bd681d3a4b58fce1f095edb3daf

@Lukasa Lukasa force-pushed the cb-fewer-checks-in-ubp branch from bb36d84 to 7e98c9c on May 15, 2021 11:31
@Lukasa
Contributor Author

Lukasa commented May 15, 2021

@swift-ci please test

@karwa
Contributor

karwa commented May 15, 2021

Fantastic! I've noticed something like a 20% performance improvement in WebURL by using a custom UBP which doesn't trap on overflow.

As discussed on the forums, I think this is actually a general improvement that applies to all Collections - including Array and ContiguousArray, and the default implementation for all collections with integer indexes.

The comments in this patch reason that it is okay to ignore overflow because this is an unsafe buffer; that is not entirely correct, or at least not the whole picture. It is okay to ignore overflow here because a Collection's indexing operations are not arithmetic operations, so arithmetic overflow is a meaningless signal, even in debug builds. If there were a boundary possibly worth raising a debug-mode error for, it would be incrementing past endIndex, not incrementing past Int.max.

The thing that gives safe types their memory safety is bounds-checking on access, and the reason UBP is unsafe is that it doesn't do that. It has nothing to do with whether its indexing operations trap on overflow.

The only thing trapping on overflow gives us is slightly earlier termination if somebody increments beyond endIndex, doesn't notice, continues all the way to Int.max (a process I estimate would take literally decades if I left my 2.6 GHz Intel MBP running 24/7), wraps around to Int.min, and increments all the way back (more decades) to coincidentally arrive at another valid index -- and even in that extreme case, memory safety would be preserved by bounds-checking the eventual access.

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 7e98c9c5745971969d70deb24456ec414d05b0cb

@swift-ci
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 7e98c9c5745971969d70deb24456ec414d05b0cb

@Lukasa
Contributor Author

Lukasa commented May 15, 2021

@gottesmm Any idea what's going on with the SILOptimizer pass here? I think this is just the result of the loop being restructured, but I haven't gotten around to rebuilding this toolchain locally to be sure.

@Lukasa
Contributor Author

Lukasa commented May 18, 2021

Ok, here are my comparative LLVM IR outputs. Firstly, from the existing 5.4-RELEASE toolchain:

; Function Attrs: minsize nounwind optsize
define protected swiftcc i64 @"$s4test0A9Subscriptys5Int64VSWF"(i64 %0, i64 %1) #0 {
entry:
  %2 = icmp eq i64 %0, 0
  br i1 %2, label %.thread3, label %3

3:                                                ; preds = %entry
  %4 = icmp eq i64 %1, 0
  br i1 %4, label %5, label %6

5:                                                ; preds = %3
  tail call void asm sideeffect "", "n"(i32 2) #2
  tail call void @llvm.trap()
  unreachable

6:                                                ; preds = %3
  %7 = sub i64 %1, %0
  %8 = icmp slt i64 %7, 0
  br i1 %8, label %23, label %9, !prof !7, !misexpect !8

9:                                                ; preds = %6
  %10 = icmp eq i64 %7, 0
  br i1 %10, label %.thread3, label %11

11:                                               ; preds = %9
  %12 = inttoptr i64 %0 to i8*
  br label %13

13:                                               ; preds = %11, %13
  %14 = phi i64 [ 0, %11 ], [ %20, %13 ]
  %15 = phi i64 [ 0, %11 ], [ %16, %13 ]
  %16 = add nuw i64 %15, 1
  %17 = getelementptr inbounds i8, i8* %12, i64 %15
  %18 = load i8, i8* %17, align 1
  %19 = zext i8 %18 to i64
  %20 = add i64 %14, %19
  %21 = icmp eq i64 %16, %7
  br i1 %21, label %.thread3, label %13

.thread3:                                         ; preds = %13, %entry, %9
  %22 = phi i64 [ 0, %9 ], [ 0, %entry ], [ %20, %13 ]
  ret i64 %22

23:                                               ; preds = %6
  tail call void asm sideeffect "", "n"(i32 0) #2
  tail call void @llvm.trap()
  unreachable
}

Then, from my new toolchain:

; Function Attrs: minsize nounwind optsize
define protected swiftcc i64 @"$s4test0A9Subscriptys5Int64VSWF"(i64 %0, i64 %1) #0 {
entry:
  %2 = icmp eq i64 %0, 0
  %3 = sub i64 %1, %0
  %4 = select i1 %2, i64 0, i64 %3
  %5 = icmp slt i64 %4, 0
  br i1 %5, label %20, label %6, !prof !7

6:                                                ; preds = %entry
  %7 = icmp eq i64 %4, 0
  br i1 %7, label %.loopexit, label %8

8:                                                ; preds = %6
  %9 = inttoptr i64 %0 to i8*
  br label %10

10:                                               ; preds = %8, %10
  %11 = phi i64 [ 0, %8 ], [ %17, %10 ]
  %12 = phi i64 [ 0, %8 ], [ %13, %10 ]
  %13 = add nuw i64 %12, 1
  %14 = getelementptr inbounds i8, i8* %9, i64 %12
  %15 = load i8, i8* %14, align 1
  %16 = zext i8 %15 to i64
  %17 = add i64 %11, %16
  %18 = icmp eq i64 %13, %4
  br i1 %18, label %.loopexit, label %10

.loopexit:                                        ; preds = %10, %6
  %19 = phi i64 [ 0, %6 ], [ %17, %10 ]
  ret i64 %19

20:                                               ; preds = %entry
  tail call void asm sideeffect "", "n"(i32 0) #2
  tail call void @llvm.trap()
  unreachable
}

@Lukasa
Contributor Author

Lukasa commented May 18, 2021

I think I've convinced myself that this IR is strictly better. We don't do an immediate jump to the end of the function if the base pointer is nil, which is a bit sad: instead we do a sub/cmov/cmp chain. I don't love that (I'm not sure of the relative cost of cmov in this context).

Might need to chat to @gottesmm to work out what we think is better here. LLVM seems to have a strong preference for the cmov though.

@atrick
Contributor

atrick commented May 19, 2021

In the interest of code size, LLVM should be able to reorder traps (if the frontend says it's ok). That's probably worth an LLVM radar. Although this case might be fixable in SIL. Would need to see -emit-sil and -emit-ir -disable-llvm-optzns.

@Lukasa
Contributor Author

Lukasa commented May 19, 2021

To convince myself, I replicated this code in C, compiled it to LLVM IR, and then stripped out the debug noise. My code was:

#include <stddef.h>

long testSubscript(char *base, char *end) {
    size_t count;
    long accumulator = 0;

    if (base == NULL) {
        count = 0;
    } else {
        count = end - base;
    }

    for (size_t i = 0; i < count; i++) {
        accumulator += base[i];
    }

    return accumulator;
}

This includes the slightly weird pattern of NULL leading to a zero count, which is unavoidable due to the way we have constructed URBP.

The resulting IR is:

define dso_local i64 @_Z13testSubscriptPcS_(i8* %0, i8* %1) local_unnamed_addr #0 !dbg !8 {
  %3 = icmp eq i8* %0, null
  %4 = ptrtoint i8* %1 to i64
  %5 = ptrtoint i8* %0 to i64
  %6 = sub i64 %4, %5
  %7 = icmp eq i64 %6, 0
  %8 = select i1 %3, i1 true, i1 %7
  br i1 %8, label %9, label %11

9:                                                ; preds = %11, %2
  %10 = phi i64 [ 0, %2 ], [ %17, %11 ]
  ret i64 %10

11:                                               ; preds = %2, %11
  %12 = phi i64 [ %18, %11 ], [ 0, %2 ]
  %13 = phi i64 [ %17, %11 ], [ 0, %2 ]
  %14 = getelementptr inbounds i8, i8* %0, i64 %12
  %15 = load i8, i8* %14, align 1
  %16 = sext i8 %15 to i64
  %17 = add nsw i64 %13, %16
  %18 = add nuw i64 %12, 1
  %19 = icmp eq i64 %18, %6
  br i1 %19, label %9, label %11
}

This is, modulo some differences in signedness and the extra precondition branch on the end pointer, basically the same as what we have now, including the loop preamble. So I've convinced myself this is fine.

@lorentey
Member

lorentey commented Jun 2, 2021

@swift-ci Apple Silicon benchmark

@swift-ci
Contributor

swift-ci commented Jun 2, 2021

Performance (x86_64): -O

Regression OLD NEW DELTA RATIO
DictionaryOfAnyHashableStrings_insert 2058 3696 +79.6% 0.56x
Set.isDisjoint.Box25 276 397 +43.8% 0.70x (?)
EqualSubstringSubstring 22 30 +36.4% 0.73x
LessSubstringSubstring 22 29 +31.8% 0.76x
EqualStringSubstring 22 29 +31.8% 0.76x
EqualSubstringSubstringGenericEquatable 22 29 +31.8% 0.76x
EqualSubstringString 22 29 +31.8% 0.76x
LessSubstringSubstringGenericComparable 22 29 +31.8% 0.76x
DataCreateSmallArray 1450 1850 +27.6% 0.78x (?)
DataCountSmall 15 19 +26.7% 0.79x
Set.isDisjoint.Int50 196 233 +18.9% 0.84x (?)
DictionaryKeysContainsCocoa 16 19 +18.7% 0.84x (?)
ObjectiveCBridgeStubDateAccess 130 152 +16.9% 0.86x (?)
DataCountMedium 15 17 +13.3% 0.88x
StringComparison_longSharedPrefix 325 358 +10.2% 0.91x (?)
 
Improvement OLD NEW DELTA RATIO
StaticArray 2 1 -50.0% 2.00x
FlattenListLoop 1565 1026 -34.4% 1.53x (?)
SIMDReduce.Int8x16.Cast 72 48 -33.3% 1.50x
ObjectiveCBridgeStubDataAppend 2420 1840 -24.0% 1.32x (?)
SIMDReduce.Int32x4.Initializer 71 56 -21.1% 1.27x
Dictionary4 186 148 -20.4% 1.26x
SortIntPyramid 515 420 -18.4% 1.23x
SortAdjacentIntPyramids 815 705 -13.5% 1.16x (?)
Data.init.Sequence.64kB.Count0.RE 183 159 -13.1% 1.15x (?)
Data.init.Sequence.64kB.Count0 178 157 -11.8% 1.13x (?)
Data.init.Sequence.809B.Count 55 49 -10.9% 1.12x (?)
Data.init.Sequence.64kB.Count0.I 176 157 -10.8% 1.12x (?)
ObjectiveCBridgeStubFromNSDate 3820 3440 -9.9% 1.11x (?)
Data.append.Sequence.64kB.Count0.RE.I 180 163 -9.4% 1.10x (?)
Data.append.Sequence.64kB.Count0 173 157 -9.2% 1.10x (?)
Data.append.Sequence.64kB.Count0.I 173 157 -9.2% 1.10x (?)
Data.append.Sequence.809B.Count0.I 246 225 -8.5% 1.09x (?)
Data.append.Sequence.809B.Count0 245 225 -8.2% 1.09x (?)
RGBHistogram 1610 1480 -8.1% 1.09x (?)
Phonebook 1113 1029 -7.5% 1.08x (?)
Data.append.Sequence.64kB.Count0.RE 176 164 -6.8% 1.07x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
ObjectiveCBridgingStubs.o 15237 17189 +12.8% 0.89x
 
Improvement OLD NEW DELTA RATIO
SortLettersInPlace.o 8290 8162 -1.5% 1.02x

Performance (x86_64): -Osize

Regression OLD NEW DELTA RATIO
LessSubstringSubstring 22 30 +36.4% 0.73x
EqualSubstringSubstring 22 30 +36.4% 0.73x
EqualStringSubstring 22 30 +36.4% 0.73x
EqualSubstringString 22 30 +36.4% 0.73x
LessSubstringSubstringGenericComparable 22 30 +36.4% 0.73x
EqualSubstringSubstringGenericEquatable 22 29 +31.8% 0.76x
DataCountSmall 15 19 +26.7% 0.79x
Data.init.Sequence.809B.Count.RE.I 40 48 +20.0% 0.83x
FlattenListFlatMap 2725 3248 +19.2% 0.84x (?)
Data.init.Sequence.809B.Count.RE 40 46 +15.0% 0.87x (?)
DataCountMedium 15 17 +13.3% 0.88x (?)
Data.append.Sequence.64kB.Count 29 32 +10.3% 0.91x (?)
Data.hash.Empty 45 49 +8.9% 0.92x (?)
DataCreateSmallArray 1850 2000 +8.1% 0.93x (?)
DataCreateMedium 7600 8200 +7.9% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
Data.init.Sequence.64kB.Count.I 58 29 -50.0% 2.00x
Data.init.Sequence.64kB.Count 58 29 -50.0% 2.00x
Data.init.Sequence.2047B.Count.I 98 52 -46.9% 1.88x
Data.init.Sequence.2049B.Count.I 98 52 -46.9% 1.88x
Data.init.Sequence.809B.Count 85 50 -41.2% 1.70x
Data.init.Sequence.809B.Count.I 85 50 -41.2% 1.70x
Data.init.Sequence.511B.Count.I 93 60 -35.5% 1.55x
Data.init.Sequence.513B.Count.I 93 62 -33.3% 1.50x
Data.init.Sequence.64kB.Count0 212 155 -26.9% 1.37x
Data.init.Sequence.64kB.Count0.I 212 156 -26.4% 1.36x
ObjectiveCBridgeStubDataAppend 2420 1900 -21.5% 1.27x (?)
Data.init.Sequence.2047B.Count0.I 370 297 -19.7% 1.25x
Data.init.Sequence.2049B.Count0.I 366 295 -19.4% 1.24x
DataSubscriptSmall 19 16 -15.8% 1.19x
FlattenListLoop 1065 936 -12.1% 1.14x (?)
Data.append.Sequence.64kB.Count0 171 152 -11.1% 1.12x (?)
SortIntPyramid 820 740 -9.8% 1.11x (?)
WordCountHistogramASCII 2800 2600 -7.1% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
ObjectiveCBridgingStubs.o 13827 15212 +10.0% 0.91x

Performance (x86_64): -Onone

Regression OLD NEW DELTA RATIO
LessSubstringSubstringGenericComparable 49 56 +14.3% 0.88x (?)
EqualSubstringSubstringGenericEquatable 50 57 +14.0% 0.88x
LessSubstringSubstring 51 58 +13.7% 0.88x (?)
EqualSubstringSubstring 51 58 +13.7% 0.88x (?)
EqualStringSubstring 51 58 +13.7% 0.88x (?)
EqualSubstringString 51 58 +13.7% 0.88x
 
Improvement OLD NEW DELTA RATIO
ObjectiveCBridgeStubDataAppend 3260 2980 -8.6% 1.09x (?)

Code size: -swiftlibs

How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: 6-Core Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@swift-ci
Contributor

swift-ci commented Jun 2, 2021

Performance (arm64): -O

Regression OLD NEW DELTA RATIO
DictionaryOfAnyHashableStrings_insert 1316 2324 +76.6% 0.57x
Set.isDisjoint.Box25 117 199 +70.1% 0.59x (?)
Set.isDisjoint.Int25 89 135 +51.7% 0.66x (?)
Set.isDisjoint.Int50 89 131 +47.2% 0.68x (?)
StringRemoveDupes 121 147 +21.5% 0.82x (?)
ArrayAppend 380 460 +21.1% 0.83x (?)
DictionaryGroupOfObjects 459 549 +19.6% 0.84x (?)
DictionaryKeysContainsNative 12 14 +16.7% 0.86x (?)
SequenceAlgosUnfoldSequence 310 360 +16.1% 0.86x
DataCreateSmallArray 1300 1500 +15.4% 0.87x
DictionaryRemove 1750 2000 +14.3% 0.88x (?)
SetIsSubsetBox25 86 98 +14.0% 0.88x (?)
DictionaryRemoveOfObjects 5800 6600 +13.8% 0.88x (?)
Set.isStrictSubset.Box25 87 98 +12.6% 0.89x (?)
DictionarySwap 528 592 +12.1% 0.89x (?)
DictionaryGroup 117 131 +12.0% 0.89x (?)
Histogram 252 280 +11.1% 0.90x (?)
DictionarySwapAt 424 456 +7.5% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
ObjectiveCBridgeStubDataAppend 1720 1220 -29.1% 1.41x
DataAccessBytesMedium 45 32 -28.9% 1.41x
DataAppendBytesSmall 128 93 -27.3% 1.38x
DataAccessBytesSmall 48 35 -27.1% 1.37x
StringComparison_ascii 356 288 -19.1% 1.24x
Phonebook 896 742 -17.2% 1.21x
LessSubstringSubstring 22 19 -13.6% 1.16x
EqualSubstringSubstring 22 19 -13.6% 1.16x
EqualStringSubstring 22 19 -13.6% 1.16x
EqualSubstringSubstringGenericEquatable 22 19 -13.6% 1.16x
EqualSubstringString 22 19 -13.6% 1.16x
LessSubstringSubstringGenericComparable 22 19 -13.6% 1.16x
SIMDReduce.Int32x4.Initializer 49 44 -10.2% 1.11x
CharacterPropertiesStashed 610 550 -9.8% 1.11x (?)
DataCreateMediumArray 880 800 -9.1% 1.10x (?)
Chars2 2600 2400 -7.7% 1.08x
Set.filter.Int100.24k 874 808 -7.6% 1.08x
Dict.CopyKeyValue.24k 888 822 -7.4% 1.08x
Dict.CopyKeyValue.20k 768 715 -6.9% 1.07x
Calculator 135 126 -6.7% 1.07x
SetSymmetricDifferenceBox25 243 227 -6.6% 1.07x

Code size: -O

Regression OLD NEW DELTA RATIO
ObjectiveCBridgingStubs.o 14006 15502 +10.7% 0.90x
 
Improvement OLD NEW DELTA RATIO
SortLettersInPlace.o 7529 7337 -2.6% 1.03x
SortStrings.o 23949 23693 -1.1% 1.01x

Performance (arm64): -Osize

Regression OLD NEW DELTA RATIO
DataCreateSmallArray 1550 1900 +22.6% 0.82x
Data.append.Sequence.64kB.Count.I 37 42 +13.5% 0.88x
Data.append.Sequence.64kB.Count 37 42 +13.5% 0.88x
Data.append.Sequence.809B.Count.I 61 68 +11.5% 0.90x
Data.append.Sequence.809B.Count 61 67 +9.8% 0.91x
 
Improvement OLD NEW DELTA RATIO
ObjectiveCBridgeStubDataAppend 1720 1220 -29.1% 1.41x
DataAppendBytesSmall 138 106 -23.2% 1.30x
Phonebook 903 742 -17.8% 1.22x
EqualStringSubstring 22 19 -13.6% 1.16x
EqualSubstringString 22 19 -13.6% 1.16x
StringComparison_ascii 521 451 -13.4% 1.16x (?)
LessSubstringSubstring 23 20 -13.0% 1.15x
EqualSubstringSubstring 23 20 -13.0% 1.15x
EqualSubstringSubstringGenericEquatable 23 20 -13.0% 1.15x
LessSubstringSubstringGenericComparable 23 20 -13.0% 1.15x
SortStrings 419 378 -9.8% 1.11x
Chars2 3150 2900 -7.9% 1.09x
CharIndexing_punctuated_unicodeScalars 560 520 -7.1% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
ObjectiveCBridgingStubs.o 13718 15006 +9.4% 0.91x
 
Improvement OLD NEW DELTA RATIO
DataBenchmarks.o 47449 46977 -1.0% 1.01x

Performance (arm64): -Onone

Regression OLD NEW DELTA RATIO
XorLoop 4745 5551 +17.0% 0.85x
IterateData 1469 1665 +13.3% 0.88x
StringToDataEmpty 1200 1300 +8.3% 0.92x (?)
NSArray.bridged.mutableCopy.objectAtIndex 597 643 +7.7% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
ObjectiveCBridgeStubDataAppend 2180 1840 -15.6% 1.18x
LessSubstringSubstringGenericComparable 40 36 -10.0% 1.11x
ParseInt.UInt64.Decimal 1380 1275 -7.6% 1.08x
LessSubstringSubstring 40 37 -7.5% 1.08x
EqualSubstringSubstring 40 37 -7.5% 1.08x
EqualStringSubstring 40 37 -7.5% 1.08x
EqualSubstringSubstringGenericEquatable 40 37 -7.5% 1.08x

Code size: -swiftlibs

How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini9,1
  Total Number of Cores: 8 (4 performance and 4 efficiency)
  Memory: 16 GB

@Lukasa
Contributor Author

Lukasa commented Jun 3, 2021

Thoughts on those results @lorentey? DictionaryOfAnyHashableStrings_insert is a particularly worrying result. I have a suspicion that some of these preconditions were load-bearing.

@eeckstein
Contributor

IIRC, _debugPrecondition is only completely eliminated if the function is inlined into client code. If it's called within the stdlib, it performs a dynamic check for the debug configuration. This may be even worse than an overflow check.
Maybe you want to use _internalInvariant instead? But this would not do any checks in debug builds (of the client code).

@Lukasa
Contributor Author

Lukasa commented Jun 3, 2021

I think the performance of this object relies fairly critically on the idea that the function calls are being inlined into client code.

@eeckstein
Contributor

eeckstein commented Jun 3, 2021

That's true, but if those functions are called and inlined within the stdlib core, the debug-configuration check is still not eliminated.

That could, e.g., explain the perf regressions in the string benchmarks.

@Lukasa
Contributor Author

Lukasa commented Jun 3, 2021

Hmm, so we don't really have an appropriate kind of precondition for this?

@eeckstein
Contributor

I don't think so. Is _internalInvariant an option for this?

@Lukasa
Contributor Author

Lukasa commented Jun 3, 2021

I think that question is above my pay grade. @lorentey and @atrick are the right people to answer that.

@lorentey
Member

lorentey commented Jun 3, 2021

The benchmarks are largely in line with my expectations -- absolutely everything about these low-level types is load bearing, and I don't expect large performance improvements from eliminating highly predictable branches. (#34961 wasn't exactly worth the effort it took to land it, either, when we only look at the performance results. I consider this stuff to be mostly cosmetic.)

I was hoping for a somewhat larger impact on code size, though.

DictionaryOfAnyHashableStrings_insert is surprising -- I can't immediately think of a reason why a buffer pointer change would affect Dictionary or AnyHashable. Perhaps it's the same root cause as the String regressions -- String has a lot more non-inlinable code than usual for the stdlib.

I don't think _debugPrecondition does any runtime check for debug config (or at least not in the usual case) -- in the cases I've seen, in non-inlinable code it effectively turns into a regular precondition.

The next step towards landing this would probably be to look at codegen changes that could explain the regressions -- for that we'd need to pinpoint what exactly uses UBPs on the affected benchmarks.

If it turns out to be a precondition thing, then the usual approach to mitigate it is to create internal-only variants of the affected entry points that use _internalInvariant. The public entry points must continue to use _debugPrecondition though (or something with the same inlinable behavior).
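
For illustration, that pattern might look roughly like this (hypothetical names; assert stands in for _debugPrecondition, and the comments note where the stdlib would use _internalInvariant, which compiles away outside special internal-checks builds):

// Public entry point: keeps the debug-build check, which inlines into
// client code and is eliminated there in optimized builds.
@inlinable
public func formIndex(after i: inout Int) {
    assert(i < Int.max, "Can't advance past Int.max")  // _debugPrecondition
    _uncheckedFormIndex(after: &i)
}

// Hypothetical internal-only variant for stdlib-internal call sites:
// no dynamic configuration check; the stdlib would put an
// _internalInvariant here, which only fires in internal-checks builds.
@usableFromInline
internal func _uncheckedFormIndex(after i: inout Int) {
    i &+= 1
}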

@karwa
Contributor

karwa commented Jun 4, 2021

The benchmarks are largely in line with my expectations -- absolutely everything about these low-level types is load bearing, and I don't expect large performance improvements from eliminating highly predictable branches. (#34961 wasn't exactly worth the effort it took to land it, either, when we only look at the performance results. I consider this stuff to be mostly cosmetic.)

@lorentey

The branches may be highly predictable, but that is a runtime consideration. It definitely blocks the compiler from performing other optimisations, although the performance impact will of course vary based on the kind of code you're benchmarking. This isn't theoretical; it is practical experience. WebURL uses a custom UnsafeBufferPointer for exactly this reason. You can try it for yourself:

This is the command I use to run the benchmarks (from the repo's Benchmarks folder):

# To generate a baseline
xcrun -toolchain swift swift build -c release && ./.build/release/WebURLBenchmark --warmup-iterations 30 --format json > results/main.json

# And again, outputting to a different file and using the python script to compare results.
xcrun -toolchain swift swift build -c release && ./.build/release/WebURLBenchmark --warmup-iterations 30 --format json > results/latest.json && python compare.py results/main.json results/latest.json

Also, I run each command twice. I don't trust the benchmark results directly after a build.

The simplest way to test the impact of indexing overflow checks would be to change this property and have it return Self. All the code that uses it is generic, so everything should work and you can get a direct comparison.

If the only thing to consider were how predictable the branches are, and eliminating them had no significant performance impact, there would be ~no change (within the margin of error). That is demonstrably not the case. Run the benchmarks a few times and you'll see the Google Benchmark library is very good at getting consistent results for the same code. They are reliable indicators of the code's performance; it's not a fluke.

I had already optimised this code to death before I switched to the custom buffer pointer, and switching gave me about a 20% increase in performance. It's huge -- the best change I ever made, and I'm not sure I could reach that level of performance without it.

By all means, copy the type and test it in your own projects. It's very easy to adopt and to compare results. If you have code which uses withContiguousStorageIfAvailable (wCSIA), just tack .withoutTrappingOnIndexOverflow on the end of the given buffer.

If it turns out to be a precondition thing, then the usual approach to mitigate it is to create internal-only variants of the affected entry points that use _internalInvariant. The public entry points must continue to use _debugPrecondition though (or something with the same inlinable behavior).

I would like to (respectfully) challenge this. Why does it need to trap on overflow? What is it actually protecting against?

As I mentioned in my earlier comment, if there were salient indexes which the developer should not increment past, they would be the bounds of the collection, not the bounds of the integer type. I've asked a number of times now, but haven't had a direct answer to these questions, and I'm quite disappointed about that. This is an area that I care about and have spent a lot of time digging through, and my experience shows that it can yield significant benefits for a great number of Swift developers.

It would be a terrible shame if I had to advise people that the standard library has known performance issues which won't be fixed for nebulous reasons. I know that I am not owed answers to any of these questions, but I would really appreciate some. I think we're missing a trick here that could unlock better performance for other developers.

Thanks.

@lorentey
Member

lorentey commented Jun 5, 2021

This is a gentle reminder that the single best way to convince us stdlib engineers that a potential change would be a performance improvement is to contribute a self-contained benchmark that demonstrates the issue. 😉

As it stands, this PR does not look like much of a performance improvement, and neither did #34961. (Which is not to say I'm arguing against landing it! I just don't think it's fair to push for this as a performance fix when the available data indicates it's more of a regression.)

Clearly, you have some specific algorithms that make this a measurable performance win -- have you considered porting & reducing some of them and adding them to the benchmarking suite? (If you aren't able to do that, I'd be more than happy to take on the work of understanding your project myself, but please understand that it will have to be on my own schedule.)

@lorentey
Member

lorentey commented Jun 5, 2021

It would be a terrible shame if I had to advise people that the standard library has known performance issues which won't be fixed for nebulous reasons.

Nice. To be extremely clear -- removing stdlib preconditions is an extraordinarily risky proposition. We are treading on extremely thin ice here.

I don't think it's fair to characterize the desire to check, double check, and triple check these changes as "nebulous". If we make a mistake here, it could result in irreparable damage to the language. (Re-adding a precondition we've previously removed would be an ABI breaking change, and because this is in inlinable code, merely updating libswiftCore.dylib wouldn't be enough to patch up a potential issue.)

Searching for some evidence that this is actually going to be a beneficial change (and investigating actually measurable regressions) is simply due diligence. The way you seem to be arguing against doing this work (up to and including intimidation tactics like the quote above) makes me want to look through this change again in case there is a hidden issue I'm not seeing -- even though I've gone through it many times now, and I've already indicated I'd like to merge this.

I would like to (respectfully) challenge this. Why does it need to trap on overflow? What is it actually protecting against?

By default, we like Swift code to perform overflow checks on arithmetic operations, despite the obvious fact that this can sometimes have a performance impact. These unsafe constructs are, by definition, unable to fully check their preconditions, but we keep widening this safety hole -- we decided to systematically weaken conditions that we can check, starting from our original sin: disabling the range check on the UBP subscript in optimized code. This is a risky process, but we still have one safety net left: we are still leaving the precondition checks enabled in debug builds.

I dislike the idea of removing even that, unless we can prove that the precondition cannot possibly trigger. I think we can prove this in the case of the range initializer calls (which is why I'm fine with using the unchecked Range initializer). We obviously can't prove this in the case of the indexing operations -- these can trivially overflow.

Disabling [debug mode -- ed] overflow checks in index(_:offsetBy:) is particularly suspect to me, as it can be used to generate logic errors, where incrementing an index by a positive value turns out to actually decrement it*. How confident are we that this will not lead to some clever exploit down the road? (That said, there is important precedent here, as IIRC, UnsafePointer.advanced() doesn't check for overflow either -- and UBP indices are sort of like pointers. Disabling the checks altogether is not entirely off the table; but it's a last resort thing.)

* Edit: Actually, this requires multiple index(_:offsetBy:) calls -- this does weaken my point, but it doesn't entirely negate it.
* Edit^2: added clarification in brackets above
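
To make the wrap-around concern concrete, here is a tiny standalone example with plain Ints standing in for unchecked index arithmetic (not the actual stdlib entry points):

// Wrapping addition lets a positive offset move an index backwards.
let i = Int.max - 5
let j = i &+ 10   // wraps past Int.max to Int.min + 4
print(j < i)      // true: "advancing" by +10 produced a smaller index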

@Lukasa
Contributor Author

Lukasa commented Jun 5, 2021

Ok, I think we need to take a step back for a moment.

Based on @lorentey’s comment above, it is clear that I have been labouring under a misapprehension. Specifically, my understanding was that the position of the core standard library maintainers is that URBP should not be responsible for policing misuse of its indices: that given that we did not consider it an error to subscript URBP with an invalid index, and given that URBP’s index is an Int and so can trivially have unchecked operations performed on it in user code, there was simply no point adding costs to the (more verbose) Collection-based equivalents.

This understanding is clearly not true! @lorentey my read of what you’re saying is that in an ideal world you’d prefer URBP to have more checks, not fewer. This is definitely against the tide of this PR, and I’d like to better understand the situation before we do anything, including merging this.

Importantly, for me, I’d like to try to come to an understanding of what we think URBP is and is for. This understanding must necessarily be holistic: it needs to take into account our similar views about URP and the other stdlib Collection classes. Who should be using URBP? When? How does this differ from our ideal world?

In the absence of a clear, shared understanding of what this data type is for we are going to go around in circles on this kind of issue. I was more than happy to go digging to try to find out why the benchmarks got slower, but if this PR represents a direction the stdlib team aren’t happy with then I’d rather not waste my time.

@lorentey
Member

lorentey commented Jun 5, 2021

This understanding is clearly not true! @lorentey my read of what you’re saying is that in an ideal world you’d prefer URBP to have more checks, not fewer. This is definitely against the tide of this PR, and I’d like to better understand the situation before we do anything, including merging this.

This isn't what I'm arguing. I'm arguing that we need to preserve assertions in unoptimized debug builds. That's the contract we ended up with for these unsafe buffer pointer types.

(My personal opinion is indeed that eliminating the bounds check in the subscript was a mistake we'll eventually regret -- in hindsight, I would've preferred if UBP had a separate, explicitly unchecked subscript instead, leaving the default subscript safe. But this opinion doesn't matter the slightest -- in the stdlib we have, UBP types are completely unsafe in optimized code, and I want to make sure we implement this the best we can.)

In the absence of a clear, shared understanding of what this data type is for we are going to go around in circles on this kind of issue. I was more than happy to go digging to try to find out why the benchmarks got slower, but if this PR represents a direction the stdlib team aren’t happy with then I’d rather not waste my time.

This is not my position. Have I not been actively working on landing this?

What I do have an issue with is @karwa's push to merge this quickly by violating the debug mode contract, or by taking other shortcuts. I don't feel there is any reason to land this PR as soon as possible, but there is every reason to tread carefully. We should investigate, understand, and fix issues, and make sure that this isn't going to actually make things worse.

These are fundamental abstractions; absolutely any change we make to them will show up as a regression in some benchmark. This is okay! -- we can fix the egregious ones without affecting the goals of this PR. But is it really too much to ask to add a benchmark that is dedicated to exercising these code paths? UBP operations are woefully underrepresented in our benchmarking suite. (If the answer is yes, that's fine -- I'm more than willing to do it myself! In my own time, of course.)

@atrick
Contributor

atrick commented Jun 7, 2021

I'm about to throw more mud into the water, so apologies for that. I personally think the Swift compiler has always been over-optimizing trap operations. Generally reordering traps and hoisting them above memory side effects is just fine. Hoisting traps above potential program exits, I/O, and memory barriers leads to a mystifying programming model and ultimately systems that can't be easily debugged. If we ever hope to improve that without too much pushback from performance watchdogs, it will help if critical paths are not laden with traps in -O code.

Add to that the fact that inlining heuristics are impossible to predict, and I think it's fair to write contrived benchmarks using @inline(never) to shut down heroic optimization.

As a rule though, there needs to be some way of checking UB assumptions. For bounds/overflow, the best way to do that is testing with -Onone. So let's not remove any checks from debug builds. If these critical paths are slowing down code within the stdlib itself, rather than being inlined at least into the application code, then I think we'll need to address that as a separate stdlib performance problem.

@lorentey
Member

Hoisting traps above potential program exits, I/O, and memory barriers leads to a mystifying programming model and ultimately systems that can't be easily debugged.

I honestly had no idea that was a thing swiftc was doing! 🤯

@atrick
Contributor

atrick commented Jul 19, 2021

@Lukasa you can ignore the DictionaryOfAnyHashableStrings_insert benchmark regression. I'll be fixing that if someone else doesn't get to it first: rdar://80746149 (Fix 90% performance regression in benchmark DictionaryOfAnyHashableStrings_insert)

@Lukasa
Contributor Author

Lukasa commented Feb 3, 2022

Just a quick reminder that this patch exists.

@karwa
Contributor

karwa commented Feb 3, 2022

Thanks for bumping this. It's an important thing to address.

What I do have an issue with is @karwa's push to merge this quickly by violating the debug mode contract, or by taking other shortcuts. I don't feel there is any reason to land this PR as soon as possible, but there is every reason to tread carefully. We should investigate, understand, and fix issues, and make sure that this isn't going to actually make things worse.

I have an even bigger issue with this mischaracterisation, @lorentey. I'm not trying to push to merge this quickly - indeed, I've filed many, many bug reports and started forum topics (and you were pinged in many/all of them, but did not respond), trying to actually have a discussion about whether it is meaningful to trap on overflow for indexing operations which do not actually read from/write to memory. Precisely because I believe it is important to be careful, investigate, understand, and fix issues. This characterisation that I'm being lazy or hasty or "taking shortcuts" is grossly inaccurate and I'm deeply offended by the suggestion.

My understanding is that, in general, for bounds-checked collections, indexing operations which do not touch memory do not need to trap on overflow. Overflow is only possible in those situations if the developer makes a mistake using the collection, and bounds-checking will catch that when they actually try to touch memory.
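
As a concrete example of that point, using Array, whose accesses are always bounds-checked (a sketch, not code from this patch):

let a = [1, 2, 3]
var i = a.endIndex       // 3: a valid "past the end" position
i = a.index(after: i)    // 4: pure arithmetic, touches no memory, no trap
// _ = a[i]              // only this access needs to trap -- and it does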

For the unsafe buffers, bounds-checking is disabled in release mode. That means that -- even without overflow -- there are many, many more mistakes a developer can make which will allow them to touch memory outside of the buffer's bounds. When using unsafe buffers, we rely on the developer to perform their own bounds-checking (in whichever form that takes), and if they do so correctly, that will also cover integer overflows. If they don't, they will see those problems long before they even come close to overflowing a 64-bit integer.

But - again - I'm very open to hearing alternative opinions. Preferably on the forum topic I linked to previously, because I think it is an important discussion for the entire Swift community.

@lorentey
Member

lorentey commented Feb 3, 2022

@swift-ci test

@lorentey
Member

lorentey commented Feb 3, 2022

@swift-ci benchmark

@swift-ci
Contributor

swift-ci commented Feb 3, 2022

Performance (x86_64): -O

Regression OLD NEW DELTA RATIO
Data.append.Sequence.64kB.Count.RE.I 20 29 +45.0% 0.69x
Data.append.Sequence.64kB.Count.RE 20 29 +45.0% 0.69x
Data.append.Sequence.809B.Count.RE.I 55 70 +27.3% 0.79x
Data.append.Sequence.809B.Count.RE 55 69 +25.5% 0.80x
DataCreateSmallArray 1400 1700 +21.4% 0.82x
DataAppendSequence 5700 6900 +21.1% 0.83x
Breadcrumbs.MutatedIdxToUTF16.Mixed 190 223 +17.4% 0.85x (?)
Set.isStrictSubset.Int0 72 80 +11.1% 0.90x (?)
Set.isDisjoint.Empty.Int 79 87 +10.1% 0.91x (?)
Set.isSuperset.Seq.Int.Empty 80 88 +10.0% 0.91x (?)
Set.isDisjoint.Seq.Empty.Int 74 81 +9.5% 0.91x (?)
 
Improvement OLD NEW DELTA RATIO
RawBufferInitializeMemory 130 87 -33.1% 1.49x
ObjectiveCBridgeStubDataAppend 2520 1880 -25.4% 1.34x
DataCountSmall 19 15 -21.1% 1.27x
DataCountMedium 19 15 -21.1% 1.27x
FlattenListFlatMap 4313 3428 -20.5% 1.26x (?)
Data.init.Sequence.64kB.Count.RE 20 16 -20.0% 1.25x
Data.init.Sequence.64kB.Count.RE.I 19 16 -15.8% 1.19x
SortIntPyramid 540 460 -14.8% 1.17x (?)
PrefixWhileAnySequence 215 188 -12.6% 1.14x
PrefixWhileSequence 203 178 -12.3% 1.14x
Data.init.Sequence.809B.Count.RE 43 38 -11.6% 1.13x (?)
Data.init.Sequence.809B.Count.RE.I 43 38 -11.6% 1.13x
StringHasSuffixAscii 1650 1470 -10.9% 1.12x
RandomShuffleLCG2 320 288 -10.0% 1.11x
DataSubscriptMedium 41 37 -9.8% 1.11x
Array2D 4624 4176 -9.7% 1.11x
Data.hash.Empty 52 47 -9.6% 1.11x (?)
MapReduceString 46 42 -8.7% 1.10x (?)
ArrayInClass 1190 1090 -8.4% 1.09x (?)
DropLastAnySequence 331 304 -8.2% 1.09x
RGBHistogram 1790 1660 -7.3% 1.08x (?)
WordCountHistogramASCII 2800 2600 -7.1% 1.08x (?)
ConvertFloatingPoint.MockFloat64Exactly2 15 14 -6.7% 1.07x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
ObjectiveCBridgingStubs.o 19069 21049 +10.4% 0.91x
 
Improvement OLD NEW DELTA RATIO
SortLettersInPlace.o 8720 8496 -2.6% 1.03x
SortIntPyramids.o 9299 9203 -1.0% 1.01x

Performance (x86_64): -Osize

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedIdxToUTF16.Mixed 191 224 +17.3% 0.85x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 187 219 +17.1% 0.85x
Set.isStrictSubset.Int0 72 80 +11.1% 0.90x (?)
Set.subtracting.Empty.Int 28 31 +10.7% 0.90x
Set.isDisjoint.Seq.Empty.Box 81 89 +9.9% 0.91x (?)
Set.isStrictSubset.Box0 75 82 +9.3% 0.91x (?)
ObjectiveCBridgeStubToNSStringRef 89 97 +9.0% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
Data.append.Sequence.64kB.Count.I 58 30 -48.3% 1.93x
Data.append.Sequence.64kB.Count 58 30 -48.3% 1.93x
RawBufferInitializeMemory 152 87 -42.8% 1.75x
Data.append.Sequence.809B.Count.I 99 64 -35.4% 1.55x
Data.append.Sequence.809B.Count 98 64 -34.7% 1.53x
Dictionary4 196 155 -20.9% 1.26x
ObjectiveCBridgeStubDateAccess 152 130 -14.5% 1.17x
NibbleSort 2120 1850 -12.7% 1.15x (?)
DataCountSmall 17 15 -11.8% 1.13x (?)
DataCountMedium 17 15 -11.8% 1.13x
SortStrings 555 498 -10.3% 1.11x
Data.hash.Empty 52 47 -9.6% 1.11x
RemoveWhereSwapInts 23 21 -8.7% 1.10x (?)
RemoveWhereFilterInts 25 23 -8.0% 1.09x (?)
DataAccessBytesMedium 56 52 -7.1% 1.08x (?)
PrefixWhileCountableRange 14 13 -7.1% 1.08x (?)

Code size: -Osize

Performance (x86_64): -Onone

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedUTF16ToIdx.Mixed 188 221 +17.6% 0.85x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 199 231 +16.1% 0.86x (?)
ArrayOfPOD 667 733 +9.9% 0.91x (?)
SIMDReduce.Int8 6201 6737 +8.6% 0.92x (?)
ArrayPlusEqualThreeElements 4780 5150 +7.7% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
DataReplaceLargeBuffer 12 11 -8.3% 1.09x (?)
DropFirstSequence 8735 8116 -7.1% 1.08x
DropFirstSequenceLazy 8740 8121 -7.1% 1.08x
DropFirstAnySequence 9465 8821 -6.8% 1.07x (?)
DropFirstAnySequenceLazy 8968 8379 -6.6% 1.07x

Code size: -swiftlibs

How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: 6-Core Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@lorentey lorentey merged commit 8533b42 into swiftlang:main Feb 4, 2022
@swiftlang swiftlang deleted a comment Feb 4, 2022