[stdlib] Implement the Indic grapheme breaking rules #40746

Azoy · 2022-01-06T00:22:53Z

This implements the CLDR Indic grapheme sequences rule in the stdlib that was also handled by ICU. The rule can be found here: https://github.com/unicode-org/cldr/tree/main/common/properties/segments

Azoy · 2022-01-06T00:23:06Z

@swift-ci please smoke test

Azoy · 2022-01-06T00:23:15Z

@swift-ci please benchmark

swift-ci · 2022-01-06T01:18:45Z

Performance (x86_64): -O

Regression	OLD	NEW	DELTA	RATIO
FlattenListFlatMap	3555	5817	+63.6%	0.61x (?)
FlattenListLoop	1480	2206	+49.1%	0.67x (?)
StringHasPrefixUnicode	45000	50000	+11.1%	0.90x (?)
StringMatch	5700	6300	+10.5%	0.90x (?)
Breadcrumbs.MutatedIdxToUTF16.Mixed	245	268	+9.4%	0.91x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed	242	262	+8.3%	0.92x (?)
RemoveWhereQuadraticString	244	264	+8.2%	0.92x

Improvement	OLD	NEW	DELTA	RATIO
Set.subtracting.Int.Empty	36	33	-8.3%	1.09x (?)
NSStringConversion.MutableCopy.Rebridge.UTF8	759	699	-7.9%	1.09x (?)
Array2D	6736	6208	-7.8%	1.09x (?)
RandomShuffleLCG2	432	400	-7.4%	1.08x (?)
StringBuilder	297	275	-7.4%	1.08x (?)
NSStringConversion	550	510	-7.3%	1.08x (?)

Code size: -O

Performance (x86_64): -Osize

Regression	OLD	NEW	DELTA	RATIO
String.data.LargeUnicode	91	104	+14.3%	0.88x (?)
DictionaryBridgeToObjC_Access	883	995	+12.7%	0.89x (?)
FlattenListFlatMap	5005	5630	+12.5%	0.89x (?)
StringHasPrefixUnicode	45000	50000	+11.1%	0.90x
FrequenciesUsingReduce	4800	5300	+10.4%	0.91x
Breadcrumbs.MutatedUTF16ToIdx.Mixed	242	263	+8.7%	0.92x (?)
StringMatch	6300	6800	+7.9%	0.93x

Improvement	OLD	NEW	DELTA	RATIO
DataToStringLargeUnicode	6150	5500	-10.6%	1.12x (?)
StringBuilder	296	274	-7.4%	1.08x (?)
Array2D	6736	6240	-7.4%	1.08x (?)
RemoveWhereFilterInts	41	38	-7.3%	1.08x (?)
StringComparison_fastPrenormal	880	820	-6.8%	1.07x

Code size: -Osize

Performance (x86_64): -Onone

Regression	OLD	NEW	DELTA	RATIO
Breadcrumbs.MutatedUTF16ToIdx.ASCII	6	8	+33.3%	0.75x (?)
StringMatch	38600	50500	+30.8%	0.76x
StringRemoveDupes	712	887	+24.6%	0.80x
RandomDoubleLCG	39603	47022	+18.7%	0.84x (?)
RandomDoubleOpaqueLCG	40007	46880	+17.2%	0.85x (?)
SevenBoom	1259	1450	+15.2%	0.87x (?)
RangeReplaceableCollectionPlusDefault	5460	6156	+12.7%	0.89x (?)
LuhnAlgoEager	3916	4363	+11.4%	0.90x (?)
SubstringTrimmingASCIIWhitespace	874	973	+11.3%	0.90x (?)
DictionaryBridgeToObjC_Access	1044	1152	+10.3%	0.91x (?)
StringHasPrefixUnicode	49000	54000	+10.2%	0.91x (?)
RandomInt64LCG	42139	46432	+10.2%	0.91x (?)
StringEdits	167000	182900	+9.5%	0.91x (?)
NSStringConversion.Rebridge.Mutable	1415	1546	+9.3%	0.92x (?)
Breadcrumbs.MutatedIdxToUTF16.ASCII	11	12	+9.1%	0.92x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed	248	270	+8.9%	0.92x (?)
ErrorHandling	3830	4160	+8.6%	0.92x (?)
FrequenciesUsingReduce	7830	8500	+8.6%	0.92x (?)
RandomDoubleDef	46200	50100	+8.4%	0.92x (?)
RandomDoubleOpaqueDef	46600	50500	+8.4%	0.92x (?)
RandomInt64Def	53700	57800	+7.6%	0.93x (?)

Improvement	OLD	NEW	DELTA	RATIO
ArrayAppendLatin1	20196	14926	-26.1%	1.35x (?)
ArrayAppendUTF16	19924	14790	-25.8%	1.35x (?)
ArrayAppendAscii	19992	14892	-25.5%	1.34x (?)
DataReplaceMediumBuffer	6500	5800	-10.8%	1.12x (?)
UTF8Decode_InitFromCustom_contiguous_ascii_as_ascii	453	416	-8.2%	1.09x (?)
FloatingPointPrinting_Float_interpolated	61400	56600	-7.8%	1.08x (?)
ArrayOfPOD	1005	933	-7.2%	1.08x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 8-Core Intel Xeon E5
  Processor Speed: 3 GHz
  Number of Processors: 1
  Total Number of Cores: 8
  L2 Cache (per Core): 256 KB
  L3 Cache: 25 MB
  Memory: 64 GB

milseman

Quick pass over LGTM, but I haven't looked too deeply at checkIfInIndicSequence. IIRC, that's similar to other kinds of scan-backwards checks we have for regional indicators. Is the logic sufficiently similar that we could share the algorithm, or is it fairly distinct?

Azoy · 2022-01-11T20:51:19Z

They're both similar in that they need to walk backwards, but the logic afterwards is fairly distinct. We could parameterize the logic and share a backwards walker if we wanted.

[stdlib] Implement the Indic grapheme breaking rules

Implement the Indic grapheme breaking rules

4a45182

Azoy requested a review from milseman January 6, 2022 00:22

milseman reviewed Jan 10, 2022

View reviewed changes

Azoy merged commit 6ca7366 into swiftlang:main Jan 11, 2022

Azoy deleted the indic-grapheme-clusters branch January 11, 2022 20:50

Azoy added a commit to Azoy/swift that referenced this pull request Jan 11, 2022

Merge pull request swiftlang#40746 from Azoy/indic-grapheme-clusters

3b48522

[stdlib] Implement the Indic grapheme breaking rules

Azoy mentioned this pull request Jan 11, 2022

[5.6] [stdlib] Implement the Indic grapheme breaking rules #40803

Merged

benrimmington mentioned this pull request Jan 14, 2022

[stdlib] Use const for graphemeBreakProperties #40711

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[stdlib] Implement the Indic grapheme breaking rules #40746

[stdlib] Implement the Indic grapheme breaking rules #40746

Uh oh!

Azoy commented Jan 6, 2022

Uh oh!

Azoy commented Jan 6, 2022

Uh oh!

Azoy commented Jan 6, 2022

Uh oh!

swift-ci commented Jan 6, 2022

Uh oh!

milseman left a comment

Uh oh!

Azoy commented Jan 11, 2022

Uh oh!

Uh oh!

[stdlib] Implement the Indic grapheme breaking rules #40746

[stdlib] Implement the Indic grapheme breaking rules #40746

Uh oh!

Conversation

Azoy commented Jan 6, 2022

Uh oh!

Azoy commented Jan 6, 2022

Uh oh!

Azoy commented Jan 6, 2022

Uh oh!

swift-ci commented Jan 6, 2022

Performance (x86_64): -O

Code size: -O

Performance (x86_64): -Osize

Code size: -Osize

Performance (x86_64): -Onone

Code size: -swiftlibs

Uh oh!

milseman left a comment

Choose a reason for hiding this comment

Uh oh!

Azoy commented Jan 11, 2022

Uh oh!

Uh oh!