[stdlib] String: Walk Chinese/Japanese faster: 2x/4x forwards/backwards #9575
This adds more fast-path checks for grapheme breaks between BMP scalars. The most notable is the rather vast range 0x3400–0xA4CF, which includes the common unified Han ideographs as well as the first extension to the unified Han ideographs. It also happens to pick up various Yijing and Yi symbols/radicals. Additionally, the narrow hiragana/katakana ranges 0x3041-0x3096 and 0x30A1-0x30FA (including the pre-composed semi-voiced characters but excluding the combining semi-voiced marks) have fast paths. The net effect is that the vast majority of modern Chinese and Japanese text should be fast-pathed. This is especially important, as adopting Unicode 9 might otherwise pessimize performance here relative to the tries.
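As a rough illustration of the range checks described above, here is a minimal sketch. It is not the code from this PR: the function names `isFastPathCJK` and `hasFastPathBreak` are hypothetical, and the sketch assumes the check operates on single UTF-16 code units. It relies on the scalars in these ranges having no special grapheme-break behavior, so a boundary always falls between any two of them.

```swift
// Hypothetical helper (not the stdlib's actual code): true when a UTF-16
// code unit lies in one of the ranges named in the description.
func isFastPathCJK(_ unit: UInt16) -> Bool {
    switch unit {
    case 0x3041...0x3096: return true  // hiragana, without combining marks
    case 0x30A1...0x30FA: return true  // katakana
    case 0x3400...0xA4CF: return true  // Han ideographs, Extension A, Yijing, Yi
    default: return false
    }
}

// Hypothetical helper: when both neighbors are in the fast-path ranges,
// a grapheme break is assumed between them without any table lookup.
// A false result means "take the slow path", not "no break".
func hasFastPathBreak(between lhs: UInt16, and rhs: UInt16) -> Bool {
    return isFastPathCJK(lhs) && isFastPathCJK(rhs)
}
```

Note that the combining voiced and semi-voiced sound marks (0x3099 and 0x309A) fall outside both kana ranges, so a pair involving them is left to the general grapheme-breaking logic.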
Gyb up the StringWalk benchmark to avoid the code explosion. Add benchmarks for walking Chinese, Japanese, and Korean text.
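For readers unfamiliar with what "walking" means here, a minimal sketch follows. It is not the StringWalk benchmark itself: the sample string and the function names `walkForwards` and `walkBackwards` are hypothetical, and it assumes Swift 4 or later, where `String` is a collection of `Character`.

```swift
// Hypothetical sample text; the benchmark's actual strings are not shown here.
let sample = "日本語のテキスト"

// Walk forwards: visit each Character (grapheme cluster) front to back.
func walkForwards(_ s: String) -> Int {
    var count = 0
    for _ in s { count += 1 }
    return count
}

// Walk backwards: visit each Character back to front, which exercises
// the reverse grapheme-breaking path.
func walkBackwards(_ s: String) -> Int {
    var count = 0
    for _ in s.reversed() { count += 1 }
    return count
}
```

Both directions have to find grapheme boundaries for every step, which is why a cheap range check on neighboring code units can pay off for text like this.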
@swift-ci please test
@swift-ci please benchmark
Thank you! (谢谢你!)
Build comment file:
Optimized (O): Regression (11), Improvement (9), No Changes (253)
Regression (9), Improvement (3), No Changes (261)
I'm guessing the StringWalk speedup comes from the benchmark refactoring; it seems unlikely that making Chinese faster also happened to give ASCII a 5x speedup...
(I'm pretty sure the other benchmarks are noise, especially StringHasPrefix, which seems to have gotten constant-folded recently and so needs refactoring...)
@swift-ci please smoke benchmark
Build comment file:
Optimized (O): Regression (10), Improvement (6), No Changes (257)
Regression (8), Improvement (2), No Changes (263)