Skip to content

[stdlib] Implement the Indic grapheme breaking rules #40746

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 11, 2022

Conversation

Azoy
Copy link
Contributor

@Azoy Azoy commented Jan 6, 2022

This implements the CLDR Indic grapheme sequences rule in the stdlib that was also handled by ICU. The rule can be found here: https://github.com/unicode-org/cldr/tree/main/common/properties/segments

@Azoy Azoy requested a review from milseman January 6, 2022 00:22
@Azoy
Copy link
Contributor Author

Azoy commented Jan 6, 2022

@swift-ci please smoke test

@Azoy
Copy link
Contributor Author

Azoy commented Jan 6, 2022

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

swift-ci commented Jan 6, 2022

Performance (x86_64): -O

Regression OLD NEW DELTA RATIO
FlattenListFlatMap 3555 5817 +63.6% 0.61x (?)
FlattenListLoop 1480 2206 +49.1% 0.67x (?)
StringHasPrefixUnicode 45000 50000 +11.1% 0.90x (?)
StringMatch 5700 6300 +10.5% 0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed 245 268 +9.4% 0.91x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 242 262 +8.3% 0.92x (?)
RemoveWhereQuadraticString 244 264 +8.2% 0.92x
 
Improvement OLD NEW DELTA RATIO
Set.subtracting.Int.Empty 36 33 -8.3% 1.09x (?)
NSStringConversion.MutableCopy.Rebridge.UTF8 759 699 -7.9% 1.09x (?)
Array2D 6736 6208 -7.8% 1.09x (?)
RandomShuffleLCG2 432 400 -7.4% 1.08x (?)
StringBuilder 297 275 -7.4% 1.08x (?)
NSStringConversion 550 510 -7.3% 1.08x (?)

Code size: -O

Performance (x86_64): -Osize

Regression OLD NEW DELTA RATIO
String.data.LargeUnicode 91 104 +14.3% 0.88x (?)
DictionaryBridgeToObjC_Access 883 995 +12.7% 0.89x (?)
FlattenListFlatMap 5005 5630 +12.5% 0.89x (?)
StringHasPrefixUnicode 45000 50000 +11.1% 0.90x
FrequenciesUsingReduce 4800 5300 +10.4% 0.91x
Breadcrumbs.MutatedUTF16ToIdx.Mixed 242 263 +8.7% 0.92x (?)
StringMatch 6300 6800 +7.9% 0.93x
 
Improvement OLD NEW DELTA RATIO
DataToStringLargeUnicode 6150 5500 -10.6% 1.12x (?)
StringBuilder 296 274 -7.4% 1.08x (?)
Array2D 6736 6240 -7.4% 1.08x (?)
RemoveWhereFilterInts 41 38 -7.3% 1.08x (?)
StringComparison_fastPrenormal 880 820 -6.8% 1.07x

Code size: -Osize

Performance (x86_64): -Onone

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedUTF16ToIdx.ASCII 6 8 +33.3% 0.75x (?)
StringMatch 38600 50500 +30.8% 0.76x
StringRemoveDupes 712 887 +24.6% 0.80x
RandomDoubleLCG 39603 47022 +18.7% 0.84x (?)
RandomDoubleOpaqueLCG 40007 46880 +17.2% 0.85x (?)
SevenBoom 1259 1450 +15.2% 0.87x (?)
RangeReplaceableCollectionPlusDefault 5460 6156 +12.7% 0.89x (?)
LuhnAlgoEager 3916 4363 +11.4% 0.90x (?)
SubstringTrimmingASCIIWhitespace 874 973 +11.3% 0.90x (?)
DictionaryBridgeToObjC_Access 1044 1152 +10.3% 0.91x (?)
StringHasPrefixUnicode 49000 54000 +10.2% 0.91x (?)
RandomInt64LCG 42139 46432 +10.2% 0.91x (?)
StringEdits 167000 182900 +9.5% 0.91x (?)
NSStringConversion.Rebridge.Mutable 1415 1546 +9.3% 0.92x (?)
Breadcrumbs.MutatedIdxToUTF16.ASCII 11 12 +9.1% 0.92x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 248 270 +8.9% 0.92x (?)
ErrorHandling 3830 4160 +8.6% 0.92x (?)
FrequenciesUsingReduce 7830 8500 +8.6% 0.92x (?)
RandomDoubleDef 46200 50100 +8.4% 0.92x (?)
RandomDoubleOpaqueDef 46600 50500 +8.4% 0.92x (?)
RandomInt64Def 53700 57800 +7.6% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
ArrayAppendLatin1 20196 14926 -26.1% 1.35x (?)
ArrayAppendUTF16 19924 14790 -25.8% 1.35x (?)
ArrayAppendAscii 19992 14892 -25.5% 1.34x (?)
DataReplaceMediumBuffer 6500 5800 -10.8% 1.12x (?)
UTF8Decode_InitFromCustom_contiguous_ascii_as_ascii 453 416 -8.2% 1.09x (?)
FloatingPointPrinting_Float_interpolated 61400 56600 -7.8% 1.08x (?)
ArrayOfPOD 1005 933 -7.2% 1.08x (?)

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick pass over LGTM, but I haven't looked too deeply at checkIfInIndicSequence. IIRC, that's similar to other kinds of scan-backwards checks we have for regional indicators. Is the logic sufficiently similar that we could share the algorithm, or is it fairly distinct?

@Azoy Azoy merged commit 6ca7366 into swiftlang:main Jan 11, 2022
@Azoy Azoy deleted the indic-grapheme-clusters branch January 11, 2022 20:50
@Azoy
Copy link
Contributor Author

Azoy commented Jan 11, 2022

They're both similar in that they need to walk backwards, but the logic afterwards is fairly distinct. We could parameterize the logic and share a backwards walker if we wanted.

Azoy added a commit to Azoy/swift that referenced this pull request Jan 11, 2022
[stdlib] Implement the Indic grapheme breaking rules
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants