[stdlib] Improve performance of heterogeneous binary integer `==` and `<` #63034

xwu · 2023-01-14T04:38:11Z

As it turns out, we can simplify this logic quite a bit and get more optimal generated code in the process.

In the first iteration of this PR, we unconditionally apply the "roundtripping" logic used in the existing code to widen a comparand without checking the instance property bitWidth:

- If rhs can be represented as a value of type Self, or if lhs can be represented as a value of type Other, then the comparison is straightforward.
- If neither is the case, then we can actually know which of lhs and rhs is smaller just by knowing which of Self or Other is a signed type (see comments in the code).
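The two cases above can be sketched as follows. This is a minimal illustration, not the PR's implementation: it uses the failable `init(exactly:)` initializer in place of the roundtripping check, and the function name `lessByRepresentability` is hypothetical. It is only reasoned through here for fixed-width types.

```swift
// Hypothetical sketch: heterogeneous `<` decided by representability.
// `init?(exactly:)` stands in for the roundtripping check described above.
func lessByRepresentability<L: BinaryInteger, R: BinaryInteger>(
    _ lhs: L, _ rhs: R
) -> Bool {
    // Case 1: rhs is representable as L, so compare two L values.
    if let rhsAsL = L(exactly: rhs) { return lhs < rhsAsL }
    // Case 2: lhs is representable as R, so compare two R values.
    if let lhsAsR = R(exactly: lhs) { return lhsAsR < rhs }
    // Case 3: neither fits, which for fixed-width types means the two
    // types differ in signedness.
    // If L is signed, rhs exceeds every L value, so lhs < rhs.
    // If L is unsigned, lhs exceeds every R value, so lhs >= rhs.
    return L.isSigned
}
```

For instance, `lessByRepresentability(Int8(-1), UInt8(200))` reaches the third case: 200 does not fit in Int8 and -1 does not fit in UInt8, so the answer follows from Int8 being signed.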

This change allows Swift to generate the same, optimal code (at -O) for the following input as reported in #62260, which isn't the case today:

func compare_direct(_ x: Int, _ y: UInt32) -> Bool {
    x < y
}

func compare_widened(_ x: Int, _ y: UInt32) -> Bool {
    x < Int(y)
}

However, because the roundtripping logic has to start with either the left-hand side or the right-hand side, the same optimal output can't be obtained when we swap the parameters above—i.e., when x is of type UInt32 and y is of type Int. The generated code is still better than the status quo, but it's not the best possible.
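For reference, the swapped case would look like the pair below. The function names are hypothetical and simply mirror the original pair, and the widening `Int(x)` assumes a platform where Int is wider than UInt32. Both functions return the same results; the limitation described above concerns only the generated code.

```swift
// Hypothetical counterparts to compare_direct and compare_widened,
// with the parameter types swapped.
func compare_direct_swapped(_ x: UInt32, _ y: Int) -> Bool {
    x < y
}

// Assumes Int is 64 bits wide, so every UInt32 value converts without trapping.
func compare_widened_swapped(_ x: UInt32, _ y: Int) -> Bool {
    Int(x) < y
}
```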

To address this issue, in the second iteration of this PR, we compare comparand bit widths explicitly, but only once along each code path in case it's expensive to compute for values of a hypothetical arbitrary-width type. (I'd expect this should be negligible as compared to the widening operation that follows anyway.)

For fixed-width integer types, this approach produces optimal code generation for heterogeneous comparison of any two types. This is because we defer comparison to zero until we reach the only scenario where it's required (when a signed and therefore possibly negative value is being compared to an unsigned value of greater bit width) and otherwise stick to operations that can be optimized away in the concrete case.
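A rough sketch of this strategy, written as a free function with the hypothetical name `lessByBitWidth`, might look like the following. It follows the description above rather than reproducing the standard library source: bit widths are compared once per code path, and the comparison to zero appears only where a signed value meets an unsigned value that is at least as wide.

```swift
// Sketch only: mirrors the strategy described in the text,
// not the standard library implementation.
func lessByBitWidth<L: BinaryInteger, R: BinaryInteger>(
    _ lhs: L, _ rhs: R
) -> Bool {
    if L.isSigned == R.isSigned {
        // Same signedness: widen whichever comparand is narrower.
        return lhs.bitWidth >= rhs.bitWidth
            ? lhs < L(truncatingIfNeeded: rhs)
            : R(truncatingIfNeeded: lhs) < rhs
    }
    if L.isSigned {
        // Signed lhs, unsigned rhs. A strictly wider signed type can hold
        // every rhs value; otherwise a negative lhs is smaller outright,
        // and a non-negative lhs can be widened to R.
        return lhs.bitWidth > rhs.bitWidth
            ? lhs < L(truncatingIfNeeded: rhs)
            : (lhs < (0 as L) || R(truncatingIfNeeded: lhs) < rhs)
    }
    // Unsigned lhs, signed rhs: the mirror image of the branch above.
    return lhs.bitWidth < rhs.bitWidth
        ? R(truncatingIfNeeded: lhs) < rhs
        : (rhs > (0 as R) && lhs < L(truncatingIfNeeded: rhs))
}
```

With concrete fixed-width types, `isSigned` and `bitWidth` are constants, so an optimizer can in principle fold every branch except the one that applies.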

This change apparently speeds up generic floating-point conversion severalfold, improves code size for a number of integer and floating-point algorithms, and even shrinks libswiftCore.dylib code size by 2%. It also revealed brittleness in an IRGen test which has been modified as a result to improve robustness.

Resolves #62260.

benrimmington · 2023-01-14T15:58:24Z

In the implementation comments, I'm not sure that "bitcasting" or "bitcast operation" are the correct terms.

Could the (0 as Self) and (0 as Other) expressions be replaced by Self.zero and Other.zero?
AdditiveArithmetic.zero is a requirement which could be implemented by larger integer types.

xwu · 2023-01-14T16:40:43Z

> In the implementation comments, I'm not sure that "bitcasting" or "bitcast operation" are the correct terms.

What would you suggest? I’d like something more succinct than “constructing a value of Foo type from the bit pattern obtained by truncating or sign-extending as necessary the bit pattern of bar.” To do so, I’m adapting the terminology used in the existing implementation comment but I’m certainly open to other suggestions.
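To make the operation under discussion concrete, here is a small illustration using the standard `init(truncatingIfNeeded:)` initializer. The constant names are chosen only for this example.

```swift
// Narrowing keeps only the low-order bits of the source.
let truncated = UInt8(truncatingIfNeeded: 0x1FF as Int)     // 255

// Same width, different signedness: the bit pattern 0xC8 is reinterpreted.
let reinterpreted = Int8(truncatingIfNeeded: 200 as UInt8)  // -56

// Widening a signed source sign-extends its bit pattern.
let signExtended = UInt16(truncatingIfNeeded: -1 as Int8)   // 65535

// Widening an unsigned source zero-extends its bit pattern.
let zeroExtended = Int16(truncatingIfNeeded: 200 as UInt8)  // 200
```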

> Could the (0 as Self) and (0 as Other) expressions be replaced by Self.zero and Other.zero?

It could, just as most if not all uses of literal 0 could be replaced by .zero, but we’ve always said that 0 is the preferred canonical spelling for integers. That there is an explicit type here is a reflection of an extant bug that causes < in generic contexts (but not concrete contexts) not to favor homogeneous comparison; this ought to be fixed because it is a footgun, and at such point when it is, this code can be rewritten as lhs < 0, dropping the explicit type coercion.

I’d rather the workaround until then be (a) visually thorny rather than potentially appear like a stylistic preference; (b) generally applicable for all values—we can’t very well write lhs == Self.fortyTwo—so that authors who need to work on this code can see both the need for this workaround throughout the implementation and how to apply it no matter the comparand.
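As a small illustration of the spelling being defended, the sketch below pins the literal's type inside a generic context. The property name `isStrictlyNegative` is hypothetical and not part of the standard library.

```swift
extension BinaryInteger {
    // Hypothetical helper for illustration only.
    // `(0 as Self)` gives the literal the same type as `self`, so both
    // operands of `<` are explicitly of type Self.
    var isStrictlyNegative: Bool {
        Self.isSigned && self < (0 as Self)
    }
}
```

The same pattern extends to any literal comparand, for example `self == (42 as Self)`, which is the general applicability the comment asks for.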

benrimmington · 2023-01-14T17:08:17Z

> What would you suggest?

"bitcast operation" → "conversion"
"bitcasting" → "converting"

https://github.com/apple/swift/blob/release/5.7.0/stdlib/public/core/Integers.swift#L419-L425

xwu · 2023-01-15T18:36:14Z

@swift-ci benchmark

xwu · 2023-01-15T20:38:44Z

~~Irksome is the apparent benchmark regression in string–substring comparison in all optimization modes; not sure how to explain that one.~~

This PR also apparently affects the temporary allocation codegen test on macOS, which I'll need to update I guess.

------- Performance (x86_64): -O -------

REGRESSION                                   OLD       NEW        DELTA    RATIO    
FlattenListLoop                              976.0     1615.0     +65.5%   **0.60x (?)**
EqualSubstringString                         24.824    30.719     +23.7%   **0.81x**
LessSubstringSubstringGenericComparable      24.824    30.594     +23.2%   **0.81x (?)**
LessSubstringSubstring                       24.824    30.59      +23.2%   **0.81x**
EqualSubstringSubstringGenericEquatable      24.844    30.577     +23.1%   **0.81x (?)**
StringComparison_longSharedPrefix            208.75    239.9      +14.9%   **0.87x (?)**
EqualSubstringSubstring                      26.923    30.8       +14.4%   **0.87x (?)**
EqualStringSubstring                         26.909    30.682     +14.0%   **0.88x (?)**
FlattenListFlatMap                           3764.0    4135.0     +9.9%    **0.91x (?)**
Data.hash.Empty                              50.083    54.435     +8.7%    **0.92x (?)**
StringBuilder                                214.556   232.75     +8.5%    **0.92x (?)**
SortSortedStrings                            44.735    48.435     +8.3%    **0.92x (?)**
StringBuilderSmallReservingCapacity          222.5     240.75     +8.2%    **0.92x (?)**
ArraySetElement                              284.5     306.833    +7.8%    **0.93x (?)**

IMPROVEMENT                                  OLD       NEW        DELTA    RATIO    
ConvertFloatingPoint.MockFloat64Exactly2     16.065    5.596      -65.2%   **2.87x**
DataCountSmall                               19.588    15.234     -22.2%   **1.29x (?)**
ObjectiveCBridgeStringHash                   74.103    58.556     -21.0%   **1.27x**
CreateObjects                                13.067    10.884     -16.7%   **1.20x (?)**
Data.hash.Medium                             29.188    24.406     -16.4%   **1.20x (?)**
DataSubscriptSmall                           15.233    13.059     -14.3%   **1.17x (?)**
BridgeString.find.native.longNonASCII        450.5     393.25     -12.7%   **1.15x (?)**
DataCountMedium                              17.417    15.234     -12.5%   **1.14x (?)**
ObjectiveCBridgeStringGetASCIIContents       268.889   238.2      -11.4%   **1.13x (?)**
Calculator                                   151.417   134.692    -11.0%   **1.12x (?)**
NormalizedIterator_nonBMPSlowestPrenormal    479.787   428.148    -10.8%   **1.12x (?)**
LuhnAlgoEager                                182.5     163.125    -10.6%   **1.12x (?)**
ObjectiveCBridgeStringCStringUsingEncoding   489.0     439.0      -10.2%   **1.11x (?)**
Chars2                                       3373.81   3095.455   -8.3%    **1.09x (?)**
ObjectiveCBridgeStringCompare2               628.0     577.0      -8.1%    **1.09x (?)**
OpenClose                                    54.429    50.08      -8.0%    **1.09x (?)**
ObjectiveCBridgeStringCompare                650.333   600.75     -7.6%    **1.08x (?)**
StringComparison_ascii                       200.909   185.75     -7.5%    **1.08x (?)**
NSStringConversion.InlineBuffer.UTF8         786.0     728.5      -7.3%    **1.08x (?)**
ObjectiveCBridgeStringIsEqual                152.5     142.25     -6.7%    **1.07x (?)**

------- Code size: -O -------

REGRESSION                                         OLD     NEW     DELTA   RATIO  
LuhnAlgoLazy.o                                     12079   12794   +5.9%   **0.94x**
LuhnAlgoEager.o                                    12079   12794   +5.9%   **0.94x**
MonteCarloE.o                                      2778    2826    +1.7%   **0.98x**
CSVParsing.o                                       60865   61537   +1.1%   **0.99x**

IMPROVEMENT                                        OLD     NEW     DELTA   RATIO  
CreateObjects.o                                    991     927     -6.5%   **1.07x**
RandomShuffle.o                                    3210    3034    -5.5%   **1.06x**
BinaryFloatingPointConversionFromBinaryInteger.o   29320   27784   -5.2%   **1.06x**
NibbleSort.o                                       13750   13334   -3.0%   **1.03x**
RomanNumbers.o                                     4405    4357    -1.1%   **1.01x**
IntegerParsing.o                                   57453   56877   -1.0%   **1.01x**

------- Performance (x86_64): -Osize -------

REGRESSION                                   OLD        NEW        DELTA     RATIO    
ArrayAppendGenericStructs                    595.0      1424.0     +139.3%   **0.42x (?)**
RandomShuffleLCG2                            132.0      175.036    +32.6%    **0.75x**
EqualSubstringString                         24.824     30.919     +24.6%    **0.80x**
EqualSubstringSubstringGenericEquatable      24.618     30.594     +24.3%    **0.80x (?)**
ParseInt.UInt64.Decimal                      91.923     111.636    +21.4%    **0.82x (?)**
LessSubstringSubstring                       25.488     30.586     +20.0%    **0.83x (?)**
ParseInt.UInt64.Hex                          243.778    291.0      +19.4%    **0.84x (?)**
EqualSubstringSubstring                      26.061     30.571     +17.3%    **0.85x (?)**
EqualStringSubstring                         26.036     30.538     +17.3%    **0.85x (?)**
ParseInt.UIntSmall.Binary                    373.0      433.8      +16.3%    **0.86x (?)**
LessSubstringSubstringGenericComparable      26.353     30.594     +16.1%    **0.86x (?)**
StringComparison_longSharedPrefix            207.6      239.9      +15.6%    **0.87x (?)**
StringDistance.scalars.ascii                 428.6      475.5      +10.9%    **0.90x (?)**
ArrayLiteral2                                90.4       98.474     +8.9%     **0.92x (?)**
ParseInt.IntSmall.UncommonRadix              214.3      233.3      +8.9%     **0.92x (?)**
ParseInt.IntSmall.Decimal                    197.667    213.182    +7.8%     **0.93x (?)**
SortSortedStrings                            44.633     48.133     +7.8%     **0.93x (?)**

IMPROVEMENT                                  OLD        NEW        DELTA     RATIO    
Data.init.Sequence.64kB.Count.RE.I           29.538     19.222     -34.9%    **1.54x (?)**
Data.init.Sequence.64kB.Count.RE             29.531     19.235     -34.9%    **1.54x (?)**
Data.init.Sequence.64kB.Count.I              43.741     29.462     -32.6%    **1.48x**
Data.init.Sequence.64kB.Count                43.737     29.476     -32.6%    **1.48x (?)**
Data.append.Sequence.64kB.Count.I            44.185     29.906     -32.3%    **1.48x**
Data.append.Sequence.64kB.Count              44.154     29.9       -32.3%    **1.48x**
Data.init.Sequence.2049B.Count.I             76.25      52.72      -30.9%    **1.45x**
Data.append.Sequence.64kB.Count.RE.I         30.0       20.743     -30.9%    **1.45x**
Data.init.Sequence.2047B.Count.I             76.158     52.692     -30.8%    **1.45x**
Data.append.Sequence.64kB.Count.RE           29.958     20.765     -30.7%    **1.44x**
Data.init.Sequence.809B.Count                68.095     50.708     -25.5%    **1.34x (?)**
Data.init.Sequence.809B.Count.I              68.111     50.72      -25.5%    **1.34x (?)**
Data.init.Sequence.809B.Count.RE.I           56.217     43.917     -21.9%    **1.28x (?)**
Data.init.Sequence.809B.Count.RE             56.261     44.227     -21.4%    **1.27x (?)**
ObjectiveCBridgeStringHash                   74.08      58.514     -21.0%    **1.27x**
Data.append.Sequence.809B.Count              80.421     63.591     -20.9%    **1.26x (?)**
Data.append.Sequence.809B.Count.I            80.412     63.636     -20.9%    **1.26x (?)**
Data.init.Sequence.511B.Count.I              77.1       61.053     -20.8%    **1.26x (?)**
Data.init.Sequence.513B.Count.I              78.118     61.947     -20.7%    **1.26x (?)**
CreateObjects                                11.973     9.699      -19.0%    **1.23x (?)**
Data.append.Sequence.809B.Count.RE.I         70.619     57.478     -18.6%    **1.23x (?)**
Data.append.Sequence.809B.Count.RE           70.222     57.35      -18.3%    **1.22x (?)**
Set.filter.Int100.20k                        32.093     26.354     -17.9%    **1.22x (?)**
DataAppendSequence                           7047.619   5848.148   -17.0%    **1.21x (?)**
Data.hash.Medium                             28.733     24.394     -15.1%    **1.18x (?)**
ObjectiveCBridgeStubDateAccess               152.538    130.765    -14.3%    **1.17x (?)**
ObjectiveCBridgeStubFromNSDate               3260.0     2814.286   -13.7%    **1.16x (?)**
Set.filter.Int100.24k                        36.172     31.325     -13.4%    **1.15x (?)**
DataCountMedium                              17.417     15.234     -12.5%    **1.14x (?)**
BridgeString.find.native.longNonASCII        450.25     394.0      -12.5%    **1.14x (?)**
StrComplexWalk                               3198.333   2801.429   -12.4%    **1.14x (?)**
Dictionary4OfObjects                         271.5      238.0      -12.3%    **1.14x (?)**
Dictionary4                                  232.333    204.143    -12.1%    **1.14x (?)**
Set.filter.Int100.28k                        42.896     38.077     -11.2%    **1.13x (?)**
ObjectiveCBridgeStringGetASCIIContents       269.125    238.9      -11.2%    **1.13x (?)**
Set.filter.Int100.16k                        24.133     21.481     -11.0%    **1.12x (?)**
NibbleSort                                   1883.333   1684.615   -10.6%    **1.12x (?)**
Chars2                                       3564.286   3193.182   -10.4%    **1.12x (?)**
StringWalk                                   1533.559   1380.0     -10.0%    **1.11x (?)**
ObjectiveCBridgeStringCStringUsingEncoding   486.0      437.75     -9.9%     **1.11x (?)**
Set.subtracting.Seq.Empty.Int                161.786    145.929    -9.8%     **1.11x (?)**
ObjectiveCBridgeStringCompare2               628.333    577.0      -8.2%     **1.09x (?)**
FlattenListFlatMap                           2575.0     2369.0     -8.0%     **1.09x (?)**
StringHasSuffixAscii                         1503.333   1383.333   -8.0%     **1.09x (?)**
ObjectiveCBridgeStringCompare                648.667    598.0      -7.8%     **1.08x (?)**

------- Code size: -Osize -------

REGRESSION                                         OLD     NEW     DELTA   RATIO  
LuhnAlgoLazy.o                                     12853   13063   +1.6%   **0.98x**
LuhnAlgoEager.o                                    12853   13063   +1.6%   **0.98x**
MonteCarloE.o                                      2644    2673    +1.1%   **0.99x**

IMPROVEMENT                                        OLD     NEW     DELTA   RATIO  
CreateObjects.o                                    936     865     -7.6%   **1.08x**
BinaryFloatingPointConversionFromBinaryInteger.o   28471   26790   -5.9%   **1.06x**
NibbleSort.o                                       13402   13221   -1.4%   **1.01x**
RandomShuffle.o                                    3097    3061    -1.2%   **1.01x**

------- Performance (x86_64): -Onone -------

REGRESSION                                          OLD         NEW         DELTA    RATIO    
LessSubstringSubstringGenericComparable             27.406      33.6        +22.6%   **0.82x (?)**
EqualSubstringSubstringGenericEquatable             27.463      33.645      +22.5%   **0.82x (?)**
LessSubstringSubstring                              30.6        35.241      +15.2%   **0.87x (?)**
EqualSubstringSubstring                             30.897      35.241      +14.1%   **0.88x**
EqualStringSubstring                                31.75       35.727      +12.5%   **0.89x (?)**
EqualSubstringString                                32.2        35.528      +10.3%   **0.91x (?)**

IMPROVEMENT                                         OLD         NEW         DELTA    RATIO    
CreateObjects                                       1271.0      661.333     -48.0%   **1.92x (?)**
RangeIterationSigned                                9682.0      6235.0      -35.6%   **1.55x**
FloatingPointPrinting_Float_description_uniform     13394.118   9075.0      -32.2%   **1.48x**
ArraySubscript                                      80536.0     59560.0     -26.0%   **1.35x**
RandomInt8LCG                                       30458.0     22557.0     -25.9%   **1.35x**
MonteCarloE                                         874740.0    656680.0    -24.9%   **1.33x**
RandomInt64LCG                                      31019.0     23343.0     -24.7%   **1.33x**
MonteCarloPi                                        3934625.0   2969125.0   -24.5%   **1.33x**
ChaCha                                              32293.0     24843.0     -23.1%   **1.30x**
ObjectiveCBridgeStringHash                          74.276      58.676      -21.0%   **1.27x (?)**
FloatingPointPrinting_Float80_description_uniform   35516.667   28428.571   -20.0%   **1.25x (?)**
RandomInt64Def                                      40260.0     32683.333   -18.8%   **1.23x (?)**
RandomIntegersLCG                                   21107.0     17176.0     -18.6%   **1.23x**
ConvertFloatingPoint.MockFloat64ToInt64             41675.0     34077.0     -18.2%   **1.22x**
RandomInt8Def                                       39360.0     32583.333   -17.2%   **1.21x**
RandomDoubleLCG                                     29538.0     24569.0     -16.8%   **1.20x**
RandomDoubleDef                                     34500.0     28700.0     -16.8%   **1.20x**
RandomDoubleOpaqueLCG                               29870.0     24861.0     -16.8%   **1.20x**
BitCount                                            3650.0      3053.0      -16.4%   **1.20x (?)**
RandomDoubleOpaqueDef                               34716.667   29057.143   -16.3%   **1.19x**
FloatingPointPrinting_Double_description_uniform    19970.0     16725.0     -16.2%   **1.19x (?)**
RandomIntegersDef                                   26742.857   22466.667   -16.0%   **1.19x**
RC4                                                 12362.0     10706.0     -13.4%   **1.15x (?)**
FloatingPointPrinting_Float80_interpolated          60066.667   52050.0     -13.3%   **1.15x (?)**
NSStringConversion.InlineBuffer.ASCII               5451.0      4735.0      -13.1%   **1.15x (?)**
BridgeString.find.native.longNonASCII               450.667     394.0       -12.6%   **1.14x (?)**
ArrayAppendUTF16Substring                           25572.0     22428.0     -12.3%   **1.14x (?)**
FloatingPointPrinting_Float_interpolated            42755.556   37540.0     -12.2%   **1.14x (?)**
ConvertFloatingPoint.MockFloat64Exactly2            696.667     612.333     -12.1%   **1.14x (?)**
Data.hash.Medium                                    33.136      29.167      -12.0%   **1.14x**
ArrayAppendAsciiSubstring                           25524.0     22488.0     -11.9%   **1.14x (?)**
ObjectiveCBridgeStringGetASCIIContents              268.857     237.25      -11.8%   **1.13x (?)**
DataCreateMedium                                    157000.0    138800.0    -11.6%   **1.13x (?)**
ArrayAppendLatin1Substring                          25872.0     22968.0     -11.2%   **1.13x (?)**
NSStringConversion.InlineBuffer.UTF8                3439.0      3060.0      -11.0%   **1.12x (?)**
DataCreateSmall                                     21670.0     19500.0     -10.0%   **1.11x (?)**
FloatingPointPrinting_Double_interpolated           44050.0     39955.556   -9.3%    **1.10x (?)**
RandomDouble01LCG                                   19713.0     17884.0     -9.3%    **1.10x (?)**
ByteSwap                                            3568.0      3294.0      -7.7%    **1.08x (?)**
RandomDouble01Def                                   24928.571   23062.5     -7.5%    **1.08x (?)**

------- Code size: -swiftlibs -------

IMPROVEMENT          OLD       NEW       DELTA   RATIO  
libswiftCore.dylib   4653056   4554752   -2.1%   **1.02x**
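
For readers checking the tables, the DELTA and RATIO columns appear to follow from OLD and NEW as sketched below. The formulas are inferred from the numbers in the report rather than taken from the benchmark tooling, and the function names are hypothetical.

```swift
// Inferred from the tables above, not from the benchmark driver's source.
// DELTA: relative change from OLD to NEW, in percent.
func deltaPercent(old: Double, new: Double) -> Double {
    (new - old) / old * 100
}

// RATIO: OLD divided by NEW, so values below 1 indicate a regression.
func ratio(old: Double, new: Double) -> Double {
    old / new
}

// FlattenListLoop at -O: OLD 976.0, NEW 1615.0.
let flattenDelta = deltaPercent(old: 976.0, new: 1615.0)
let flattenRatio = ratio(old: 976.0, new: 1615.0)
```

Rounded to the precision used in the report, these reproduce the +65.5% and 0.60x shown for that row.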

benrimmington · 2023-01-15T21:01:20Z

Most of the benchmark regressions are marked with (?) — "false alarms" or "noise". Can any of the results be trusted?

xwu · 2023-01-15T23:23:03Z

> Most of the benchmark regressions are marked with (?) — "false alarms" or "noise". Can any of the results be trusted?

The string comparison benchmarks are consistently showing up, so it'll take some manual inspection of the generated code to know if it's noise or real.

xwu · 2023-01-15T23:55:12Z

@swift-ci please build toolchain

xwu · 2023-01-16T06:09:54Z

Can confirm identical SIL is generated for all benchmarks in SubstringTest using the PR toolchain versus the nightly, which answers that concern.

… less brittle.

xwu · 2023-01-17T03:45:41Z

@grynspan I've had to make edits to the temporary allocation codegen test to keep CI happy.

In the IR, the relative order in which basic blocks for stack versus heap allocation are emitted for the large allocation test is arbitrary, but the existing test only accepts one specific order. With this PR, the basic blocks are now emitted in a different order on macOS because the generated code (rightly) has some different inlining choices earlier in the file with respect to a comparison operation with a randomly generated fixed-width integer.

To make the test robust to arbitrary changes in the order of basic blocks, I've changed certain "CHECK"s to "CHECK-DAG"s. At some point in the past decade, folks proposed a "CHECK-LABEL-DAG" feature to be added to FileCheck, but unfortunately it didn't gain traction and it doesn't appear possible to check entire basic blocks being reordered relative to one another in one invocation of FileCheck, just single lines. Therefore, I've also changed the test to invoke FileCheck three times, with the first run testing for presence of the expected branch instructions, and the latter two specifically testing either the heap allocation IR or the stack allocation IR.

Note that the test as it currently exists on main has a RUN line that actually only tells FileCheck to check the platform-specific lines, as far as I can tell, so the vast majority of the "CHECK"s aren't being tested; I've changed the flags on the first invocation so that all "CHECK"s are actually tested--and fortunately it all seems to work fine.

Let me know if this change looks at least reasonable to you. I'd like to defer any detailed tinkering to you or others as it's not salient to this PR, but some degree of change was required to unblock this work and I figured I'd at least make a real effort to leave it better than I found it.

xwu · 2023-01-17T03:49:08Z

@swift-ci test

grynspan

I don't object to the changes to the temp-alloc unit tests, so long as they continue to pass. ;)

stephentyrone · 2023-01-17T20:57:18Z

I'm planning to rework these further, but this seems like a reasonable start...

stdlib/public/core/Integers.swift

stephentyrone · 2023-01-17T21:00:27Z

One style nit, otherwise LGTM.

xwu · 2023-01-18T00:43:19Z

@swift-ci test and merge

shahmishal · 2023-01-18T21:18:35Z

We are seeing multiple aarch64 bots failing:

Failed Tests (1):
  Swift(linux-aarch64) :: IRGen/temporary_allocation/codegen.swift

https://ci.swift.org/job/oss-swift-package-ubuntu-20_04-aarch64//1442/console

cc: @xwu @stephentyrone

xwu · 2023-01-18T21:26:38Z

🫤 this is what I get for trying to unblock things by fixing rather than just disabling the test.

Most of the tests weren't actually being checked and my good deed was properly enabling them, but it looks like zeroext on that call is applicable only for x86_64; I think the best way to unblock is to disable this test again for arm64.

cc @grynspan

xwu · 2023-01-18T21:45:04Z

#63093

xwu · 2023-01-18T21:46:27Z

Incidentally, @shahmishal, given CI passed multiple times, was there any way to spot this issue pre-emptively?

xwu changed the title ~~[stdlib] Improve performance of generic binary integer == and <~~ [stdlib] Improve performance of generic binary integer heterogeneous == and < Jan 14, 2023