[String] Grapheme fast paths for punctuation: 5-8x speedup. #10648
Many strings contain punctuation scalars at or above 0x300 (e.g. Unicode hyphens, CJK quotes, etc). These can cause grapheme breaking to switch back and forth between the fast and slow paths. Add fast paths for general punctuation characters and for CJK punctuation and symbol characters. This results in about a 5-8x speedup for Latin-like and CJK-like workloads that are heavily punctuated with non-ASCII Unicode punctuation.
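For readers unfamiliar with the approach, here is a minimal sketch of what such a fast path could look like. The names `hasBreakOnEitherSide` and `fastPathHasBreak` are hypothetical and are not the standard library's actual API. The range `0x2010...0x2029` comes from the diff in this PR; the CJK range `0x3000...0x3029` is an assumption chosen to stop just before the extender U+302A.

```swift
// Hypothetical helper: true if a scalar with this value always has a
// grapheme break both before and after it, whatever its neighbors are.
func hasBreakOnEitherSide(_ value: UInt32) -> Bool {
    switch value {
    // CR is excluded so that CR-LF is still handled by the slow path.
    case 0x000D: return false
    // Scalars below 0x300 (the pre-existing fast path).
    case 0x0000..<0x0300: return true
    // General punctuation, as in the diff under review.
    case 0x2010...0x2029: return true
    // CJK punctuation and symbols, stopping before U+302A (assumed range).
    case 0x3000...0x3029: return true
    default: return false
    }
}

// Returns true when the fast path can prove there is a break between the
// two scalars, or nil when the caller must fall back to the slow path.
func fastPathHasBreak(between lhs: Unicode.Scalar,
                      and rhs: Unicode.Scalar) -> Bool? {
    if hasBreakOnEitherSide(lhs.value) && hasBreakOnEitherSide(rhs.value) {
        return true
    }
    return nil
}
```

Returning `nil` rather than `false` keeps the sketch conservative: anything the fast path cannot prove is deferred to the full grapheme breaking rules.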
056567b to bd5189c
@swift-ci please test
Build failed
Build failed
@shahmishal Bots are having issues:
@swift-ci please smoke test
@swift-ci please benchmark
Nit: the TODO comment here and the one above about CJK punctuation can go now, yes?
Yup. Edit: I'll wait until I have some testing and benchmarking results; I don't want to invalidate the bot runs with a push.
We had a network outage.
Build comment file: Build failed before running benchmark.
@swift-ci please test
@swift-ci please smoke benchmark
Build failed
Build failed
Build comment file:
Optimized (O): Regression (5), Improvement (3), No Changes (302), Added (8)
Unoptimized (Onone): Regression (2), Improvement (7), No Changes (301), Added (8)
// 0x2010-0x2029
case 0x2010...0x2029: return true

// CJK punctuation characters, excluding extenders:
What do you mean by "extenders"?
The characters (i.e. scalars) that have the "Extend" property for the purposes of grapheme breaking. Such scalars usually don't have a grapheme break before them, e.g. 0x302A.
See:
http://unicode.org/reports/tr29/#GB9
http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
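As a small illustration of why extenders have to stay out of the fast path, the sketch below compares scalar counts with `Character` counts using only `String` APIs. The sample strings are chosen for illustration and are not taken from the PR.

```swift
// U+3001 and U+3002 are ordinary CJK punctuation: each is its own grapheme.
let punctuation = "\u{3001}\u{3002}"
// U+302A has the Extend property, so it joins the preceding scalar (GB9).
let extended = "\u{4E00}\u{302A}"

print(punctuation.unicodeScalars.count, punctuation.count) // 2 2
print(extended.unicodeScalars.count, extended.count)       // 2 1
```

A fast path that claimed a break before U+302A would split the second string into two characters, which is why the CJK range in the diff excludes extenders.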