Commit 0a88de5

[stdlib] Grapheme break fast-paths for Cyrillic, Arabic, Hangul
Add more grapheme break fast paths for scripts based on Cyrillic, Arabic, or Hangul. This generates significant performance wins, similar to those for the unihan fast paths. While every extra check does slow down the runtime of _internalExtraCheckGraphemeBreakBetween as currently implemented, I've not found the performance cost to be relevant for workloads with occasional mixed emoji contents, nor for workloads that hit the earlier checks. A pure Korean workload (currently the last check) does pay a rather noticeable price for the previous checks, but this is only because the workload is now so greatly improved. Optimizing this implementation is interesting future work, but not urgent.
1 parent 784ccb2 commit 0a88de5
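
As an illustration of why these scripts are candidates for a fast path, here is a minimal sketch (not part of the commit) that uses only the public String API. The sample strings are assumptions chosen for illustration: plain text in these ranges maps one UTF-16 code unit to one grapheme cluster, whereas a combining sequence or a CR-LF pair does not, which is why such code units cannot take the fast path.

```swift
// Illustrative sample strings (assumptions, not taken from the commit).
let cyrillic = "Привет"   // scalars within 0x0400-0x0482
let arabic = "سلام"       // scalars within 0x061D-0x064A
let hangul = "한국어"      // precomposed syllables within 0xAC00-0xD7AF

// For these strings every UTF-16 code unit is its own grapheme cluster,
// so there is a grapheme break between every adjacent pair of code units.
assert(cyrillic.count == cyrillic.utf16.count)
assert(arabic.count == arabic.utf16.count)
assert(hangul.count == hangul.utf16.count)

// Counter-examples: these pairs of code units form a single grapheme,
// so a pairwise "always breaks" shortcut must not apply to them.
let combining = "e\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT
let crlf = "\r\n"            // CR followed by LF

assert(combining.utf16.count == 2 && combining.count == 1)
assert(crlf.utf16.count == 2 && crlf.count == 1)
```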

1 file changed: +33 -15 lines changed
stdlib/public/core/StringCharacterView.swift

Lines changed: 33 additions & 15 deletions
@@ -301,29 +301,47 @@ extension String.CharacterView : BidirectionalCollection {
     // satisfying this property, has a grapheme break between it and the other
     // scalar.
     func hasBreakWhenPaired(_ x: UInt16) -> Bool {
-      // TODO: This doesn't generate optimal code, tune/re-write at a lower level.
-
+      // TODO: This doesn't generate optimal code, tune/re-write at a lower
+      // level.
+      //
+      // NOTE: Order of case ranges affects codegen, and thus performance. All
+      // things being equal, keep existing order below.
+      switch x {
       // Unified CJK Han ideographs, common and some supplemental, amongst
       // others:
       // 0x3400-0xA4CF
-      if 0x3400 <= x && x <= 0xa4cf {
-        return true
-      }
+      case 0x3400...0xa4cf: return true
+      // TODO: CJK punctuation
+
+      // Repeat sub-300 check, this is beneficial for common cases of Latin
+      // characters embedded within non-Latin script (e.g. newlines, spaces,
+      // proper nouns and/or jargon, punctuation).
+      case 0x0000...0x02ff:
+        // Conservatively exclude CR, though this might not be necessary from
+        // previous checks.
+        return x != _CR
+      // TODO: general punctuation
 
-      //
       // Non-combining kana:
       // 0x3041-0x3096
       // 0x30A1-0x30FA
-      //
-      // TODO: may be faster to verify whether only 3099 and 309A don't have
-      // this property, and compare not-equal rather than using two ranges.
-      if 0x3041 <= x && x <= 0x3096 || 0x30a1 <= x && x <= 0x30fa {
-        return true
-      }
+      case 0x3041...0x3096: return true
+      case 0x30a1...0x30fa: return true
+
+      // Non-combining modern (and some archaic) Cyrillic:
+      // 0x0400-0x0482 (first half of Cyrillic block)
+      case 0x0400...0x0482: return true
+
+      // Modern Arabic, excluding extenders and prependers:
+      // 0x061D-0x064A
+      case 0x061d...0x064a: return true
 
-      // TODO: sub-300 check would also be valuable, e.g. when breaking at the
-      // boundary between English embedded in Chinese.
-      return false
+      // Precomposed Hangul syllables:
+      // 0xAC00–0xD7AF
+      case 0xac00...0xd7af: return true
+
+      default: return false
+      }
     }
     return hasBreakWhenPaired(lhs) && hasBreakWhenPaired(rhs)
   }
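
For readers who want to experiment outside the standard library, the following is a minimal standalone sketch of the switch-based predicate after this change. It is an approximation under stated assumptions: `cr` stands in for the stdlib-internal `_CR`, and `hasFastPathBreak(between:and:)` is a hypothetical wrapper name that does not appear in the commit. A `false` result is best read as "no fast-path answer, fall back to the full grapheme breaking logic", not as "no break".

```swift
// Assumption: stand-in for the stdlib-internal `_CR` constant (carriage return).
let cr: UInt16 = 0x000D

// Standalone copy of the predicate; case order mirrors the commit.
func hasBreakWhenPaired(_ x: UInt16) -> Bool {
  switch x {
  case 0x3400...0xa4cf: return true     // Unified CJK Han ideographs and others
  case 0x0000...0x02ff: return x != cr  // sub-300 code units, excluding CR
  case 0x3041...0x3096: return true     // non-combining hiragana
  case 0x30a1...0x30fa: return true     // non-combining katakana
  case 0x0400...0x0482: return true     // first half of the Cyrillic block
  case 0x061d...0x064a: return true     // modern Arabic
  case 0xac00...0xd7af: return true     // precomposed Hangul syllables
  default: return false
  }
}

// Hypothetical wrapper: true only when both code units are known to break
// against any neighbour that also satisfies the property.
func hasFastPathBreak(between lhs: UInt16, and rhs: UInt16) -> Bool {
  return hasBreakWhenPaired(lhs) && hasBreakWhenPaired(rhs)
}

// Usage: U+044F (Cyrillic "я") next to U+D55C (Hangul "한") takes the fast path.
assert(hasFastPathBreak(between: 0x044F, and: 0xD55C))
// U+0301 (combining acute accent) falls through to `default`.
assert(!hasFastPathBreak(between: 0x0065, and: 0x0301))
// CR is excluded, so CR-LF is left to the slow path.
assert(!hasFastPathBreak(between: cr, and: 0x000A))
```

Because the cases are tested in order in this sketch, a code unit matching a later case (such as Hangul) passes through every earlier range check first, which is the cost the commit message describes for a pure Korean workload.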
