Skip to content

[stdlib]String normalization functions #21026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Jan 8, 2019
Merged

Conversation

lancep
Copy link
Contributor

@lancep lancep commented Dec 5, 2018

rdar://problem/45443666

@lancep
Copy link
Contributor Author

lancep commented Dec 5, 2018

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Dec 5, 2018

Build failed
Swift Test Linux Platform
Git Sha - a3a723aadec2d36e511fa6ccd44fa710c1b988d6

@swift-ci
Copy link
Contributor

swift-ci commented Dec 5, 2018

Build failed
Swift Test OS X Platform
Git Sha - a3a723aadec2d36e511fa6ccd44fa710c1b988d6

@lancep
Copy link
Contributor Author

lancep commented Dec 5, 2018

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Dec 5, 2018

Build failed
Swift Test OS X Platform
Git Sha - a3a723aadec2d36e511fa6ccd44fa710c1b988d6

@lancep
Copy link
Contributor Author

lancep commented Dec 5, 2018

@swift-ci please test

@lancep
Copy link
Contributor Author

lancep commented Dec 5, 2018

@swift-ci please test Linux platform

@swift-ci
Copy link
Contributor

swift-ci commented Dec 5, 2018

Build failed
Swift Test Linux Platform
Git Sha - 42abea9edc9122df7d37072bf88efbdebddcee4b

@lancep
Copy link
Contributor Author

lancep commented Dec 5, 2018

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Dec 5, 2018

Build failed
Swift Test Linux Platform
Git Sha - 9c26ca4

@milseman
Copy link
Member

milseman commented Dec 5, 2018

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

swift-ci commented Dec 5, 2018

Build failed
Swift Test OS X Platform
Git Sha - 9c26ca4

@swift-ci
Copy link
Contributor

swift-ci commented Dec 5, 2018

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
StringUTF16Builder 400 432 +8.0% 0.93x
Improvement
UTF8Decode_InitDecoding_ascii 725 560 -22.8% 1.29x
UTF8Decode_InitDecoding 231 201 -13.0% 1.15x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
InsertCharacterEndIndex 147 162 +10.2% 0.91x (?)
NSStringConversion 539 588 +9.1% 0.92x
Improvement
Breadcrumbs.CopyUTF16CodeUnits.ASCII 20 18 -10.0% 1.11x
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 16 GB
--------------

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick, mostly cosmetic review. I'll do a more in-depth one after you're done with some of your in-flight changes

_ outputBuffer: UnsafeMutableBufferPointer<UInt8>
) -> (Int, Int) {
// Quick check if a scalar is NFC and a segment starter
@inline(__always) func isNFCStarter(_ scalar: Unicode.Scalar) -> Bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already exists in StringNormalization.swift

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still exists in StringNormalization.swift


var outputBufferThreshold: Int {
return outputBuffer.count - 4
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this computed? It's not going to change.

return (inputCount, outputCount)
}

internal func transcodeToUTF16(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is specifically transcoding a segment. Please adjust the name to reflect that and add a comment, otherwise it looks like a general purpose transcoding function.

outputBuffer: UnsafeMutableBufferPointer<UInt8>,
icuInputBuffer: UnsafeMutableBufferPointer<UInt16>,
icuOutputBuffer: UnsafeMutableBufferPointer<UInt16>
) -> NormalizationResult {

This comment was marked as outdated.

return (index.encodedOffset - readIndex.encodedOffset, writeIndex)
}

private func sharedNormalize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shared meaning shared strings, or shared code, or what? Please add comments to these methods. If this is a core common implementation method, Impl or is probably better than Shared

@@ -23,6 +23,7 @@ internal func _isTrailingSurrogate(_ cu: UInt16) -> Bool {
return cu & _surrogateMask == _trailingSurrogateBias
}
@inline(__always)
@usableFromInline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

@@ -84,6 +85,23 @@ internal func _decodeUTF8(
return Unicode.Scalar(_unchecked: value)
}

@inlinable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this inlinable?

@@ -148,6 +166,7 @@ internal func _continuationPayload(_ x: UInt8) -> UInt32 {
}

@inline(__always)
@usableFromInline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

@lancep
Copy link
Contributor Author

lancep commented Dec 5, 2018

@swift-ci please benchmark

@lancep
Copy link
Contributor Author

lancep commented Dec 5, 2018

@swift-ci please test

@lancep
Copy link
Contributor Author

lancep commented Dec 6, 2018

@swift-ci please benchmark

@lancep
Copy link
Contributor Author

lancep commented Dec 6, 2018

@swift-ci please test OS X platform

@swift-ci
Copy link
Contributor

swift-ci commented Dec 6, 2018

Build failed
Swift Test OS X Platform
Git Sha - 21455e3

@swift-ci
Copy link
Contributor

swift-ci commented Dec 6, 2018

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
StringHashing_latin1 366 782 +113.7% 0.47x
StringHashing_fastPrenormal 1042 1691 +62.3% 0.62x
StringHashing_slowerPrenormal 1038 1607 +54.8% 0.65x
StringHashing_emoji 1258 1697 +34.9% 0.74x
WordCountUniqueUTF16 2766 3674 +32.8% 0.75x
StringHashing_nonBMPSlowestPrenormal 1624 2139 +31.7% 0.76x
WordCountHistogramUTF16 6406 7375 +15.1% 0.87x
Improvement
StringHashing_zalgo 4086 3601 -11.9% 1.13x
UTF8Decode_InitFromBytes_ascii 534 490 -8.2% 1.09x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
StringHashing_latin1 367 792 +115.8% 0.46x
StringHashing_fastPrenormal 1042 1698 +63.0% 0.61x
StringHashing_slowerPrenormal 1038 1552 +49.5% 0.67x
WordCountUniqueUTF16 2762 3749 +35.7% 0.74x
StringHashing_nonBMPSlowestPrenormal 1620 2162 +33.5% 0.75x
StringHashing_emoji 1258 1651 +31.2% 0.76x
WordCountHistogramUTF16 6576 7553 +14.9% 0.87x
Improvement
StringHashing_zalgo 4072 3607 -11.4% 1.13x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
StringHashing_latin1 822 1374 +67.2% 0.60x
StringHashing_fastPrenormal 1544 2545 +64.8% 0.61x
StringHashing_slowerPrenormal 1417 2103 +48.4% 0.67x
StringHashing_emoji 1514 2026 +33.8% 0.75x
StringHashing_nonBMPSlowestPrenormal 1976 2606 +31.9% 0.76x
Improvement
StringHashing_zalgo 4301 3788 -11.9% 1.14x
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

normalize = fastNormalize
} else {
normalize = foreignNormalize
var bufferCleanup: (() -> ())? = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, why not just a Bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call

}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like all this logic could be refactored into a helper function that took the buffers inout and told the caller if they're responsible for calling deallocate.

@@ -219,7 +218,7 @@ internal func transcodeToUTF16(
return (readIndex, writeIndex)
}

internal func transcodeToUTF8(
private func transcodeToUTF8(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment?

_ sourceBuffer: UnsafeBufferPointer<UInt8>,
_ outputBuffer: UnsafeMutableBufferPointer<UInt8>,
_ icuInputBuffer: UnsafeMutableBufferPointer<UInt16>,
_ icuOutputBuffer: UnsafeMutableBufferPointer<UInt16>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think argument labels are necessary here, otherwise it's hard to know what buffer to pass where.

E.g. _fastNormalized(startingAt:from:into:icuInputBuffer:icuOutputBuffer:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

let sourceCount = sourceBuffer.count - readIndex.encodedOffset
let start = readIndex.encodedOffset
let rebasedSourceBuffer = UnsafeBufferPointer(rebasing: sourceBuffer[start...])
do {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scopes the returns for fastFill so I don't have to come up with some other name for read/filled. Also prevents me from accidentally using the wrong variable later on.

@lancep
Copy link
Contributor Author

lancep commented Dec 6, 2018

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

swift-ci commented Dec 6, 2018

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
StringHashing_nonBMPSlowestPrenormal 1603 1746 +8.9% 0.92x
StringHashing_fastPrenormal 1040 1122 +7.9% 0.93x
Improvement
StringHashing_zalgo 4062 3414 -16.0% 1.19x
StringHashing_abnormal 1681 1452 -13.6% 1.16x
UTF8Decode_InitDecoding 230 201 -12.6% 1.14x
WordCountUniqueUTF16 2690 2504 -6.9% 1.07x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
StringHashing_fastPrenormal 1042 1123 +7.8% 0.93x
Improvement
StringHashing_abnormal 1852 1443 -22.1% 1.28x
StringHashing_zalgo 4208 3381 -19.7% 1.24x
WordCountUniqueUTF16 2783 2523 -9.3% 1.10x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Improvement
StringHashing_abnormal 2135 1681 -21.3% 1.27x
StringHashing_zalgo 4412 3575 -19.0% 1.23x
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@lancep
Copy link
Contributor Author

lancep commented Dec 6, 2018

@swift-ci please benchmark

@lancep
Copy link
Contributor Author

lancep commented Dec 6, 2018

@swift-ci please test

@lancep
Copy link
Contributor Author

lancep commented Dec 6, 2018

@swift-ci please benchmark

@lancep
Copy link
Contributor Author

lancep commented Dec 6, 2018

@swift-ci please test

@lancep
Copy link
Contributor Author

lancep commented Jan 3, 2019

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Jan 3, 2019

Build failed
Swift Test Linux Platform
Git Sha - ad584dcfd471aaeea1d01eff7b05d2891d8036db

@swift-ci
Copy link
Contributor

swift-ci commented Jan 3, 2019

Build failed
Swift Test OS X Platform
Git Sha - ad584dcfd471aaeea1d01eff7b05d2891d8036db

@swift-ci
Copy link
Contributor

swift-ci commented Jan 3, 2019

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
StringHashing_nonBMPSlowestPrenormal 1620 2040 +25.9% 0.79x
StringHashing_fastPrenormal 1090 1360 +24.8% 0.80x
StringHashing_emoji 1252 1452 +16.0% 0.86x
StringHashing_latin1 380 436 +14.7% 0.87x
InsertCharacterEndIndex 147 166 +12.9% 0.89x
EqualSubstringString 39 44 +12.8% 0.89x
StringHashing_slowerPrenormal 1030 1140 +10.7% 0.90x
LessSubstringSubstring 39 43 +10.3% 0.91x
EqualStringSubstring 39 43 +10.3% 0.91x (?)
EqualSubstringSubstringGenericEquatable 39 43 +10.3% 0.91x
LessSubstringSubstringGenericComparable 39 43 +10.3% 0.91x
Improvement
StringComparison_abnormal 8980 1100 -87.8% 8.16x
StringComparison_zalgo 98000 63500 -35.2% 1.54x
StringHashing_abnormal 1800 1460 -18.9% 1.23x
StringHashing_zalgo 4050 3350 -17.3% 1.21x
RemoveWhereQuadraticString 423 378 -10.6% 1.12x
Added
StringNormalization_abnormal 1460 1500 1473
StringNormalization_ascii 164 167 165
StringNormalization_emoji 1552 1552 1552
StringNormalization_fastPrenormal 1460 1490 1470
StringNormalization_latin1 476 480 477
StringNormalization_nonBMPSlowestPrenormal 1980 2000 1990
StringNormalization_slowerPrenormal 1190 1220 1200
StringNormalization_zalgo 3425 3475 3442
Removed
NormalizedIterator_abnormal 1780 1820 1793
NormalizedIterator_ascii 141 143 142
NormalizedIterator_emoji 1328 1328 1328
NormalizedIterator_fastPrenormal 1210 1230 1217
NormalizedIterator_latin1 414 420 416
NormalizedIterator_nonBMPSlowestPrenormal 1720 1720 1720
NormalizedIterator_slowerPrenormal 1060 1070 1063
NormalizedIterator_zalgo 4075 4150 4108

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
CSVParsing.o 30353 31233 +2.9% 0.97x
RemoveWhere.o 23767 24119 +1.5% 0.99x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
StringHashing_fastPrenormal 1090 1340 +22.9% 0.81x
StringHashing_nonBMPSlowestPrenormal 1630 1990 +22.1% 0.82x
StringHashing_emoji 1260 1492 +18.4% 0.84x
StringHashing_latin1 380 434 +14.2% 0.88x
EqualStringSubstring 39 44 +12.8% 0.89x
EqualSubstringString 39 44 +12.8% 0.89x
LessSubstringSubstringGenericComparable 39 44 +12.8% 0.89x
EqualSubstringSubstring 39 43 +10.3% 0.91x
StringHashing_slowerPrenormal 1040 1140 +9.6% 0.91x
Improvement
StringComparison_abnormal 8880 1100 -87.6% 8.07x
StringWalk 3840 1520 -60.4% 2.53x
StrComplexWalk 5540 3530 -36.3% 1.57x
StringComparison_zalgo 97425 63550 -34.8% 1.53x
CSVParsingAltIndices2 1441 1078 -25.2% 1.34x
StringHashing_zalgo 4075 3375 -17.2% 1.21x
WordCountUniqueUTF16 2966 2557 -13.8% 1.16x
StringComparison_slowerPrenormal 1730 1590 -8.1% 1.09x
StringMatch 8800 8100 -8.0% 1.09x
HashTest 1394 1293 -7.2% 1.08x (?)
Added
StringNormalization_abnormal 1480 1520 1493
StringNormalization_ascii 164 167 165
StringNormalization_emoji 1564 1564 1564
StringNormalization_fastPrenormal 1460 1500 1473
StringNormalization_latin1 480 484 481
StringNormalization_nonBMPSlowestPrenormal 2000 2150 2060
StringNormalization_slowerPrenormal 1210 1230 1217
StringNormalization_zalgo 3400 3475 3425
Removed
NormalizedIterator_abnormal 1760 1780 1767
NormalizedIterator_ascii 142 153 146
NormalizedIterator_emoji 1328 1328 1328
NormalizedIterator_fastPrenormal 1210 1240 1220
NormalizedIterator_latin1 416 426 419
NormalizedIterator_nonBMPSlowestPrenormal 1720 1720 1720
NormalizedIterator_slowerPrenormal 1060 1090 1070
NormalizedIterator_zalgo 4125 4125 4125

Code size: -Osize

TEST OLD NEW DELTA RATIO
Regression
CSVParsing.o 30433 30977 +1.8% 0.98x
Hash.o 19873 20193 +1.6% 0.98x
RemoveWhere.o 23710 24062 +1.5% 0.99x
ReversedCollections.o 9610 9746 +1.4% 0.99x
Improvement
RangeAssignment.o 5101 5018 -1.6% 1.02x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
StringHashing_fastPrenormal 1600 2000 +25.0% 0.80x
StringHashing_nonBMPSlowestPrenormal 2000 2460 +23.0% 0.81x
ReversedDictionary2 30740 37101 +20.7% 0.83x
StringHashing_emoji 1560 1800 +15.4% 0.87x
Improvement
StringComparison_abnormal 10740 2420 -77.5% 4.44x
StringComparison_zalgo 102250 66900 -34.6% 1.53x
StringHashing_abnormal 2100 1700 -19.0% 1.24x
StringHashing_zalgo 4300 3525 -18.0% 1.22x
QueueConcrete 18510 15779 -14.8% 1.17x
QueueGeneric 24309 21251 -12.6% 1.14x
ArrayOfPOD 858 779 -9.2% 1.10x
StrComplexWalk 7980 7340 -8.0% 1.09x
Added
StringNormalization_abnormal 1680 1720 1693
StringNormalization_ascii 526 526 526
StringNormalization_emoji 1808 1808 1808
StringNormalization_fastPrenormal 1850 1880 1867
StringNormalization_latin1 896 916 903
StringNormalization_nonBMPSlowestPrenormal 2350 2400 2373
StringNormalization_slowerPrenormal 1640 1660 1647
StringNormalization_zalgo 3550 3625 3575
Removed
NormalizedIterator_abnormal 2100 2100 2100
NormalizedIterator_ascii 582 583 582
NormalizedIterator_emoji 1588 1588 1588
NormalizedIterator_fastPrenormal 1740 1760 1747
NormalizedIterator_latin1 934 936 935
NormalizedIterator_nonBMPSlowestPrenormal 2090 2090 2090
NormalizedIterator_slowerPrenormal 1480 1520 1493
NormalizedIterator_zalgo 4325 4325 4325
Benchmark Check Report
⛔️🔤 StringNormalization_abnormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_fastPrenormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_slowerPrenormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_nonBMPSlowestPrenormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_nonBMPSlowestPrenormal name is 42 characters long.
Benchmark name should not be longer than 40 characters.
⛔️🔤 StringNormalization_emoji name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_zalgo name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_ascii name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_latin1 name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@lancep
Copy link
Contributor Author

lancep commented Jan 4, 2019

@swift-ci please benchmark

@lancep
Copy link
Contributor Author

lancep commented Jan 4, 2019

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Jan 5, 2019

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
StringHashing_nonBMPSlowestPrenormal 1450 2090 +44.1% 0.69x
StringHashing_emoji 1132 1628 +43.8% 0.70x
StringHashing_slowerPrenormal 920 1300 +41.3% 0.71x
StringHashing_fastPrenormal 930 1190 +28.0% 0.78x
StringHashing_latin1 330 394 +19.4% 0.84x
DataSetCountSmall 118 133 +12.7% 0.89x
Improvement
StringComparison_abnormal 7620 960 -87.4% 7.94x
StringComparison_zalgo 87050 57475 -34.0% 1.51x
StringHashing_zalgo 3575 3075 -14.0% 1.16x
Added
StringNormalization_abnormal 1280 1320 1293
StringNormalization_ascii 147 151 148
StringNormalization_emoji 1404 1416 1408
StringNormalization_fastPrenormal 1290 1310 1297
StringNormalization_latin1 426 432 428
StringNormalization_nonBMPSlowestPrenormal 1820 1830 1823
StringNormalization_slowerPrenormal 1080 1100 1087
StringNormalization_zalgo 3050 3125 3075
Removed
NormalizedIterator_abnormal 1480 1500 1487
NormalizedIterator_ascii 126 128 127
NormalizedIterator_emoji 1200 1204 1201
NormalizedIterator_fastPrenormal 1040 1060 1047
NormalizedIterator_latin1 358 362 359
NormalizedIterator_nonBMPSlowestPrenormal 1540 1540 1540
NormalizedIterator_slowerPrenormal 940 950 943
NormalizedIterator_zalgo 3625 3650 3633

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
CSVParsing.o 30353 31233 +2.9% 0.97x
RemoveWhere.o 23767 24119 +1.5% 0.99x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
StringHashing_fastPrenormal 930 1190 +28.0% 0.78x
StringHashing_emoji 1132 1360 +20.1% 0.83x
StringHashing_latin1 330 396 +20.0% 0.83x
StringHashing_nonBMPSlowestPrenormal 1460 1730 +18.5% 0.84x
StringHashing_slowerPrenormal 920 1050 +14.1% 0.88x
DataSetCountSmall 120 135 +12.5% 0.89x
InsertCharacterEndIndex 131 144 +9.9% 0.91x (?)
StringComparison_slowerPrenormal 1440 1550 +7.6% 0.93x
Improvement
StringComparison_abnormal 7480 980 -86.9% 7.63x
StringWalk 3440 1360 -60.5% 2.53x
StrComplexWalk 5000 3140 -37.2% 1.59x
StringComparison_zalgo 87100 57450 -34.0% 1.52x
CSVParsingAltIndices2 1298 979 -24.6% 1.33x
StringHashing_zalgo 3600 3025 -16.0% 1.19x
StringMatch 7800 7100 -9.0% 1.10x
StringHashing_abnormal 1440 1320 -8.3% 1.09x
Added
StringNormalization_abnormal 1300 1360 1327
StringNormalization_ascii 152 156 153
StringNormalization_emoji 1452 1456 1453
StringNormalization_fastPrenormal 1330 1360 1340
StringNormalization_latin1 438 446 441
StringNormalization_nonBMPSlowestPrenormal 1860 1900 1873
StringNormalization_slowerPrenormal 1110 1150 1127
StringNormalization_zalgo 3125 3225 3158
Removed
NormalizedIterator_abnormal 1500 1500 1500
NormalizedIterator_ascii 126 128 127
NormalizedIterator_emoji 1192 1196 1193
NormalizedIterator_fastPrenormal 1040 1050 1043
NormalizedIterator_latin1 358 362 359
NormalizedIterator_nonBMPSlowestPrenormal 1540 1540 1540
NormalizedIterator_slowerPrenormal 940 970 950
NormalizedIterator_zalgo 3675 3675 3675

Code size: -Osize

TEST OLD NEW DELTA RATIO
Regression
CSVParsing.o 30433 30977 +1.8% 0.98x
Hash.o 19873 20193 +1.6% 0.98x
RemoveWhere.o 23710 24062 +1.5% 0.99x
ReversedCollections.o 9610 9746 +1.4% 0.99x
Improvement
RangeAssignment.o 5101 5018 -1.6% 1.02x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
StringHashing_fastPrenormal 1380 1700 +23.2% 0.81x
PopFrontArrayGeneric 4749 5361 +12.9% 0.89x
Improvement
StringComparison_abnormal 9220 2060 -77.7% 4.48x
StringComparison_zalgo 91575 60450 -34.0% 1.51x
StringHashing_zalgo 3825 3200 -16.3% 1.20x
CharIndexing_chinese_unicodeScalars 449600 378440 -15.8% 1.19x
StringHashing_abnormal 1740 1480 -14.9% 1.18x
CharIndexing_russian_unicodeScalars 486920 417080 -14.3% 1.17x
StrComplexWalk 7290 6610 -9.3% 1.10x
WordCountUniqueASCII 11761 10686 -9.1% 1.10x
WordCountUniqueUTF16 12486 11465 -8.2% 1.09x
Added
StringNormalization_abnormal 1440 1500 1460
StringNormalization_ascii 479 480 479
StringNormalization_emoji 1580 1580 1580
StringNormalization_fastPrenormal 1640 1670 1650
StringNormalization_latin1 804 842 817
StringNormalization_nonBMPSlowestPrenormal 2040 2080 2053
StringNormalization_slowerPrenormal 1380 1410 1390
StringNormalization_zalgo 3150 3225 3175
Removed
NormalizedIterator_abnormal 1720 1720 1720
NormalizedIterator_ascii 451 459 456
NormalizedIterator_emoji 1376 1376 1376
NormalizedIterator_fastPrenormal 1400 1410 1403
NormalizedIterator_latin1 738 738 738
NormalizedIterator_nonBMPSlowestPrenormal 1790 1790 1790
NormalizedIterator_slowerPrenormal 1260 1290 1273
NormalizedIterator_zalgo 3800 3825 3808
Benchmark Check Report
⛔️🔤 StringNormalization_abnormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_fastPrenormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_slowerPrenormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_nonBMPSlowestPrenormal name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_nonBMPSlowestPrenormal name is 42 characters long.
Benchmark name should not be longer than 40 characters.
⛔️🔤 StringNormalization_emoji name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_zalgo name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_ascii name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
⛔️🔤 StringNormalization_latin1 name doesn`t conform to benchmark naming convention.
See http://bit.ly/BenchmarkNaming
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB
--------------

@lancep
Copy link
Contributor Author

lancep commented Jan 5, 2019

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

swift-ci commented Jan 5, 2019

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
DataSetCountSmall 117 133 +13.7% 0.88x
StringComparison_slowerPrenormal 1440 1570 +9.0% 0.92x
Improvement
StringComparison_abnormal 7500 960 -87.2% 7.81x
StringComparison_zalgo 87650 57075 -34.9% 1.54x
StringHashing_zalgo 3600 3225 -10.4% 1.12x
NormalizedIterator_zalgo 3625 3275 -9.7% 1.11x
NormalizedIterator_ascii 126 114 -9.5% 1.11x
StringHashing_abnormal 1500 1360 -9.3% 1.10x
NormalizedIterator_latin1 356 332 -6.7% 1.07x

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
CSVParsing.o 30353 31233 +2.9% 0.97x
RemoveWhere.o 23767 24119 +1.5% 0.99x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
DataSetCountSmall 120 135 +12.5% 0.89x
InsertCharacterEndIndex 131 147 +12.2% 0.89x
StringComparison_slowerPrenormal 1440 1580 +9.7% 0.91x
InsertCharacterEndIndexNonASCII 50 54 +8.0% 0.93x (?)
Improvement
StringComparison_abnormal 7540 960 -87.3% 7.85x
StringWalk 3440 1360 -60.5% 2.53x
StrComplexWalk 4990 3120 -37.5% 1.60x
StringComparison_zalgo 87450 57000 -34.8% 1.53x
CSVParsingAltIndices2 1298 990 -23.7% 1.31x
NormalizedIterator_ascii 126 112 -11.1% 1.12x
StringHashing_zalgo 3600 3225 -10.4% 1.12x
NormalizedIterator_zalgo 3625 3275 -9.7% 1.11x
RemoveWhereMoveInts 11 10 -9.1% 1.10x
StringHashing_latin1 330 302 -8.5% 1.09x
StringHashing_abnormal 1480 1360 -8.1% 1.09x
NormalizedIterator_latin1 358 332 -7.3% 1.08x
NormalizedIterator_abnormal 1440 1340 -6.9% 1.07x

Code size: -Osize

TEST OLD NEW DELTA RATIO
Regression
CSVParsing.o 30433 30977 +1.8% 0.98x
Hash.o 19873 20193 +1.6% 0.98x
RemoveWhere.o 23710 24062 +1.5% 0.99x
ReversedCollections.o 9610 9746 +1.4% 0.99x
Improvement
RangeAssignment.o 5101 5018 -1.6% 1.02x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
InsertCharacterEndIndex 195 213 +9.2% 0.92x
InsertCharacterEndIndexNonASCII 70 76 +8.6% 0.92x (?)
Improvement
StringComparison_abnormal 9060 2160 -76.2% 4.19x
StringComparison_zalgo 91300 60150 -34.1% 1.52x
StrComplexWalk 7260 6110 -15.8% 1.19x
CharIteration_ascii_unicodeScalars_Backwards 376520 320520 -14.9% 1.17x
CharIndexing_punctuatedJapanese_unicodeScalars 99720 87680 -12.1% 1.14x
RemoveWhereFilterStrings 4191 3690 -12.0% 1.14x
WordCountUniqueUTF16 12265 10871 -11.4% 1.13x
NormalizedIterator_zalgo 3825 3400 -11.1% 1.12x
StringHashing_zalgo 3825 3400 -11.1% 1.12x
NormalizedIterator_abnormal 1720 1540 -10.5% 1.12x
StringHashing_abnormal 1740 1560 -10.3% 1.12x
CharIteration_punctuated_unicodeScalars_Backwards 79240 71440 -9.8% 1.11x
ArrayOfPOD 763 693 -9.2% 1.10x
WordCountUniqueASCII 11775 10764 -8.6% 1.09x
StringWalk 12520 11640 -7.0% 1.08x
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB
--------------

@lancep
Copy link
Contributor Author

lancep commented Jan 5, 2019

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Jan 5, 2019

Build failed
Swift Test Linux Platform
Git Sha - 280e002

@swift-ci
Copy link
Contributor

swift-ci commented Jan 5, 2019

Build failed
Swift Test OS X Platform
Git Sha - 280e002

@lancep
Copy link
Contributor Author

lancep commented Jan 7, 2019

@swift-ci please test OS X platform

@swift-ci
Copy link
Contributor

swift-ci commented Jan 7, 2019

Build failed
Swift Test OS X Platform
Git Sha - cef0fb8

@lancep
Copy link
Contributor Author

lancep commented Jan 7, 2019

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Jan 7, 2019

Build failed
Swift Test OS X Platform
Git Sha - cef0fb8

@swift-ci
Copy link
Contributor

swift-ci commented Jan 7, 2019

Build failed
Swift Test Linux Platform
Git Sha - cef0fb8

@@ -114,11 +114,11 @@ let codeUnitNormalizationTests: [([UInt8], String)] = [
"몌̴ᆮ닮돈a֮𖬴̀̕b됆섨⽱쀪ﺏ탅"
),
(
[0xc3, 0xa0, 0xd6, 0xae, 0xf0, 0x91, 0x8d, 0xb3, 0xcc, 0x95, 0x62, 0xe1, 0xb6, 0xb6, 0xec, 0x93, 0xa1, 0xed, 0x83, 0x97, 0xec, 0xa4, 0x80, 0xec, 0xa0, 0x95, 0xef, 0xbb, 0xaa, 0xec, 0x82, 0x90, 0xcc, 0xb4, 0xe1, 0x86, 0xb5, 0xed, 0x84, 0xa3, 0xed, 0x8b, 0xad, 0xeb, 0x9c, 0xaf, 0xef, 0xb6, 0x8e, 0xe3, 0x8e, 0xa1, 0xef, 0xb6, 0x94, 0xe3, 0x8e, 0x9a, 0xe2, 0x85, 0xef, 0xb0, 0x92, 0xeb, 0x83, 0xa8, 0xed, 0x9d, 0x94, 0xf0, 0x9f, 0x88, 0xb9],
[0xc3, 0xa0, 0xd6, 0xae, 0xf0, 0x91, 0x8d, 0xb3, 0xcc, 0x95, 0x62, 0xe1, 0xb6, 0xb6, 0xec, 0x93, 0xa1, 0xed, 0x83, 0x97, 0xec, 0xa4, 0x80, 0xec, 0xa0, 0x95, 0xef, 0xbb, 0xaa, 0xec, 0x82, 0x90, 0xcc, 0xb4, 0xe1, 0x86, 0xb5, 0xed, 0x84, 0xa3, 0xed, 0x8b, 0xad, 0xeb, 0x9c, 0xaf, 0xef, 0xb6, 0x8e, 0xe3, 0x8e, 0xa1, 0xef, 0xb6, 0x94, 0xe3, 0x8e, 0x9a, 0xe2, 0x85, 0xb3, 0xef, 0xb0, 0x92, 0xeb, 0x83, 0xa8, 0xed, 0x9d, 0x94, 0xf0, 0x9f, 0x88, 0xb9],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?

"à֮𑍳̕bᶶ쓡탗준정ﻪ삐̴ᆵ턣틭뜯ﶎ㎡ﶔ㎚ⅳﰒ냨흔🈹"
),
(
[0xeb, 0xb3, 0x95, 0xf0, 0xa6, 0xbc, 0xac, 0xef, 0xbc, 0xba, 0xce, 0xa9, 0xec, 0xb2, 0x84, 0xe7, 0xb3, 0x96, 0xed, 0x91, 0xa0, 0xc3, 0xa0, 0xd6, 0xae, 0xe2, 0xb7, 0xb9, 0xcc, 0x95, 0x62, 0xec, 0xbc, 0x98, 0xe5, 0x8b, 0x9e, 0xec, 0x82, 0x9a, 0xed, 0x88, 0x92, 0xed, 0x87, 0x91, 0xec, 0xaf, 0x99, 0xe6, 0x86, 0xaf, 0xe2, 0xb5, 0xaf, 0xec, 0x9d, 0x8c, 0xe9, 0xb6, 0xe2, 0x93, 0xa2, 0xe2, 0x82, 0x84],
[0xeb, 0xb3, 0x95, 0xf0, 0xa6, 0xbc, 0xac, 0xef, 0xbc, 0xba, 0xce, 0xa9, 0xec, 0xb2, 0x84, 0xe7, 0xb3, 0x96, 0xed, 0x91, 0xa0, 0xc3, 0xa0, 0xd6, 0xae, 0xe2, 0xb7, 0xb9, 0xcc, 0x95, 0x62, 0xec, 0xbc, 0x98, 0xe5, 0x8b, 0x9e, 0xec, 0x82, 0x9a, 0xed, 0x88, 0x92, 0xed, 0x87, 0x91, 0xec, 0xaf, 0x99, 0xe6, 0x86, 0xaf, 0xe2, 0xb5, 0xaf, 0xec, 0x9d, 0x8c, 0xe9, 0xb6, 0xb4, 0xe2, 0x93, 0xa2, 0xe2, 0x82, 0x84],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?

@@ -134,323 +134,323 @@ let codeUnitNormalizationTests: [([UInt8], String)] = [
"쁏音MW삢썍곛뻏쩄뻍겔ᄇ̴ᅪ킾a〪᳟̖֚b죆퉆Ṑﰆ뒳킟저̴ᆼ"
),
(
[0xec, 0x8c, 0x9d, 0xe6, 0x86, 0x8e, 0xec, 0xb0, 0x9a, 0xec, 0xba, 0xb0, 0xe1, 0xb6, 0xb0, 0xc3, 0xa0, 0xd6, 0xae, 0xe1, 0xaa, 0xb4, 0xcc, 0x95, 0x62, 0xec, 0xac, 0x9e, 0xec, 0xa7, 0xb9, 0xea, 0xbe, 0x89, 0xef, 0xad, 0xab, 0xd7, 0x9c, 0xd6, 0xbc, 0xe2, 0x85, 0x85, 0xeb, 0x94, 0x93, 0xe1, 0xbd, 0x92, 0xec, 0xaa, 0xa3, 0xeb, 0xab, 0x85, 0xec, 0xb7, 0x8c, 0xea, 0xb9, 0xa9, 0xec, 0xab, 0x8d, 0xec, 0x8c, 0x84, 0xe5, 0x8f, 0xa5, 0xeb, 0x89, 0x85, 0xeb, 0xed, 0x91, 0xa4, 0xeb, 0x9b, 0x95, 0xe3, 0x89, 0xa5, 0x61, 0xd6, 0xae, 0xf0, 0x9e, 0x80, 0x9d, 0xcc, 0x80, 0xcc, 0x95, 0x62, 0xe4, 0x84, 0xaf, 0xc8, 0x9f, 0xec, 0xba, 0x9b],
[0xec, 0x8c, 0x9d, 0xe6, 0x86, 0x8e, 0xec, 0xb0, 0x9a, 0xec, 0xba, 0xb0, 0xe1, 0xb6, 0xb0, 0xc3, 0xa0, 0xd6, 0xae, 0xe1, 0xaa, 0xb4, 0xcc, 0x95, 0x62, 0xec, 0xac, 0x9e, 0xec, 0xa7, 0xb9, 0xea, 0xbe, 0x89, 0xef, 0xad, 0xab, 0xd7, 0x9c, 0xd6, 0xbc, 0xe2, 0x85, 0x85, 0xeb, 0x94, 0x93, 0xe1, 0xbd, 0x92, 0xec, 0xaa, 0xa3, 0xeb, 0xab, 0x85, 0xec, 0xb7, 0x8c, 0xea, 0xb9, 0xa9, 0xec, 0xab, 0x8d, 0xec, 0x8c, 0x84, 0xe5, 0x8f, 0xa5, 0xeb, 0x89, 0x85, 0xeb, 0x8c, 0xa6, 0xed, 0x91, 0xa4, 0xeb, 0x9b, 0x95, 0xe3, 0x89, 0xa5, 0x61, 0xd6, 0xae, 0xf0, 0x9e, 0x80, 0x9d, 0xcc, 0x80, 0xcc, 0x95, 0x62, 0xe4, 0x84, 0xaf, 0xc8, 0x9f, 0xec, 0xba, 0x9b],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the rest whitespace changes?

@lancep lancep merged commit 15aaa1e into swiftlang:master Jan 8, 2019
@lancep lancep deleted the no-mo-iterato branch January 8, 2019 21:55
@lancep
Copy link
Contributor Author

lancep commented Jan 8, 2019

Michael approved in person

lancep pushed a commit to lancep/swift that referenced this pull request Jan 10, 2019
airspeedswift pushed a commit that referenced this pull request Jan 11, 2019
* [stdlib]String normalization functions (#21026)

* fast/foreignNormalize functions

* fix bad cherry-pick
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants