[5.8][stdlib] Speed up short UTF-16 distance calculations #62823
Merged: lorentey merged 12 commits into swiftlang:release/5.8 from lorentey:string-utf16-speedup-5.8 on Jan 4, 2023
Previously we insisted on using breadcrumbs even if we only needed to travel a very short way. This could be as much as ten times slower than the naive algorithm of simply visiting all the Unicode scalars in between the start and the end. (Using breadcrumbs generally means that we need to walk to both endpoints from their nearest breadcrumb, which on average requires walking half the distance between breadcrumbs — and this can mean visiting vastly more Unicode scalars than the ones that are simply lying in between the endpoints themselves.) (cherry picked from commit 483087a)
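For reference, the "naive algorithm" mentioned in this commit message can be sketched with public API only. This is an illustration, not the standard library's internal implementation; the function name `naiveUTF16Distance` is made up for this example, and it assumes both indices are scalar-aligned and non-decreasing.

```swift
// Hypothetical helper (not part of the stdlib): measure a UTF-16 distance by
// visiting every Unicode scalar between two indices and summing how many
// UTF-16 code units each scalar encodes to.
func naiveUTF16Distance(
    in string: String,
    from start: String.Index,
    to end: String.Index
) -> Int {
    // Assumption: indices are scalar-aligned and start does not follow end.
    precondition(start <= end, "expected non-decreasing indices")
    var distance = 0
    for scalar in string.unicodeScalars[start..<end] {
        // 1 for scalars in the BMP, 2 for scalars that need a surrogate pair.
        distance += UTF16.width(scalar)
    }
    return distance
}

let text = "a\u{1F600}b"
let measured = naiveUTF16Distance(in: text, from: text.startIndex, to: text.endIndex)
```

The cost of this walk is proportional to the number of scalars between the two endpoints, which is why it tends to win when the endpoints are close together.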
(cherry picked from commit f3a9305)
… ranges Instead of calling `_toUTF16Index` twice, call it once and then use `index(_:offsetBy:)` to potentially avoid another breadcrumbs lookup. (cherry picked from commit 2423b8b)
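The same idea, expressed with public `String.UTF16View` API rather than the internal `_toUTF16Index`, might look like the sketch below. The function name `indexRange(in:forUTF16Offsets:)` is hypothetical; the point is only that the upper bound is derived from the lower bound instead of being looked up from the start a second time.

```swift
// Hypothetical helper: convert a range of UTF-16 offsets into a range of
// string indices with one absolute lookup and one relative step.
func indexRange(
    in string: String,
    forUTF16Offsets offsets: Range<Int>
) -> Range<String.Index> {
    let utf16 = string.utf16
    // Absolute calculation: find the lower bound from the start of the string.
    let lower = utf16.index(utf16.startIndex, offsetBy: offsets.lowerBound)
    // Relative calculation: only travel the length of the range from there.
    let upper = utf16.index(lower, offsetBy: offsets.count)
    return lower..<upper
}

let greeting = "hello \u{1F600} world"
let emojiRange = indexRange(in: greeting, forUTF16Offsets: 6..<8)
let emoji = String(greeting[emojiRange])
```

For a short range far from the start of a long string, the second step only needs to cover the range itself.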
(cherry picked from commit 6fee1b3)
We commonly start from the `startIndex`, in which case `_nativeGetOffset` is essentially free. Consider this case when calculating the threshold for using breadcrumbs. (cherry picked from commit ec35728)
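A rough cost model for this decision could be sketched as follows. This is an assumption-laden illustration, not the threshold the standard library actually uses: the function name, its parameters, and the stride value in the usage lines are all made up for the example.

```swift
// Hypothetical cost model: prefer breadcrumbs only when walking directly
// between the endpoints would visit more scalars than the breadcrumb path.
func prefersBreadcrumbs(
    scalarsBetween distance: Int,
    breadcrumbStride stride: Int,
    startsAtStartIndex: Bool
) -> Bool {
    // An endpoint at startIndex needs no walk at all, so only the other
    // endpoint pays for walking from its nearest breadcrumb.
    let endpointsNeedingWalk = startsAtStartIndex ? 1 : 2
    // On average each such endpoint is half a stride from its breadcrumb.
    let breadcrumbCost = endpointsNeedingWalk * stride / 2
    return distance > breadcrumbCost
}

// Illustrative stride of 64 scalars between breadcrumbs (an assumed value).
let shortHop = prefersBreadcrumbs(scalarsBetween: 10, breadcrumbStride: 64, startsAtStartIndex: false)
let longHop = prefersBreadcrumbs(scalarsBetween: 100, breadcrumbStride: 64, startsAtStartIndex: false)
let fromStart = prefersBreadcrumbs(scalarsBetween: 40, breadcrumbStride: 64, startsAtStartIndex: true)
```

Under this model, starting from `startIndex` halves the breadcrumb cost, so breadcrumbs become worthwhile at a shorter distance.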
Speed up conversion between UTF-16 offset ranges and string index ranges, by carefully switching between absolute and relative index calculations, depending on the distance we need to go. It is a surprisingly tricky puzzle to do this correctly while avoiding redundant calculations. Offset ranges within substrings add the additional complication of having to bias offset values with the absolute offset of the substring’s start index. (cherry picked from commit d00f8ed)
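To illustrate the substring bias mentioned here: offsets given relative to a substring have to be shifted by the absolute UTF-16 offset of the substring's start before they can be compared with offsets in the base string. A minimal sketch using public API, with a hypothetical function name:

```swift
// Hypothetical helper: translate UTF-16 offsets that are relative to a
// substring into offsets that are relative to the start of its base string.
func absoluteUTF16Offsets(
    of relative: Range<Int>,
    in substring: Substring
) -> Range<Int> {
    let base = substring.base.utf16
    // The bias is the absolute UTF-16 offset of the substring's start index.
    let bias = base.distance(from: base.startIndex, to: substring.startIndex)
    return (relative.lowerBound + bias)..<(relative.upperBound + bias)
}

let sentence = "hello \u{1F600} world"
let tail = sentence.dropFirst(6)  // starts at the emoji
let absolute = absoluteUTF16Offsets(of: 0..<2, in: tail)
```

The sketch always computes the bias from scratch; the tricky part described in the commit message is avoiding that work when a relative calculation is cheaper.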
…checks (cherry picked from commit 7d89d62)
(cherry picked from commit fce428e)
Evidently we did not have any tests that exercised `distance(from:to:)` and `index(_:offsetBy:)`. :-O (cherry picked from commit 051f9ed)
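A test in this spirit could be sketched as an exhaustive round trip over every pair of UTF-16 indices in a small string. This is not the test added by the commit; the function name is hypothetical and it only uses public `String.UTF16View` API.

```swift
// Hypothetical check: for every pair of UTF-16 indices, distance(from:to:)
// and index(_:offsetBy:) must agree with the indices' known offsets.
func checkUTF16RoundTrip(_ string: String) -> Int {
    let utf16 = string.utf16
    // All valid positions in the view, including endIndex.
    let indices = Array(utf16.indices) + [utf16.endIndex]
    var checked = 0
    for (i, a) in indices.enumerated() {
        for (j, b) in indices.enumerated() {
            // Covers forward, backward, and zero-length travel.
            precondition(utf16.distance(from: a, to: b) == j - i)
            precondition(utf16.index(a, offsetBy: j - i) == b)
            checked += 1
        }
    }
    return checked
}

let pairsChecked = checkUTF16RoundTrip("a\u{1F600}b")
```

Because every pair is visited, the check covers both very short hops and travel across the whole string.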
- Align input indices to scalar boundaries
- Don’t pass decreasing indices to `_utf16Distance`
(cherry picked from commit 5d354ce)
(cherry picked from commit cd55016)
…ithms [Bidirectional]Collection’s default index manipulation methods (as well as _utf16Distance) do not expect to be given unreachable indices, and they tend to fail when operating on them. Round indices down to the nearest scalar boundary before calling these. (cherry picked from commit e46f8f8)
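Rounding down to a scalar boundary can be approximated with public API as in the sketch below. The standard library uses internal helpers for this; the function here is a hypothetical stand-in that only handles indices taken from the string's UTF-16 view, such as one pointing at a trailing surrogate.

```swift
// Hypothetical helper: step a UTF-16 view index backwards until it sits on
// a Unicode scalar boundary.
func roundedDownToScalarBoundary(
    _ index: String.Index,
    in string: String
) -> String.Index {
    let utf16 = string.utf16
    var result = index
    // samePosition(in:) returns nil when the index is not scalar-aligned,
    // for example when it points into the middle of a surrogate pair.
    while result > utf16.startIndex,
          result.samePosition(in: string.unicodeScalars) == nil {
        result = utf16.index(before: result)
    }
    return result
}

let sample = "a\u{1F600}"
// UTF-16 offset 2 is the trailing surrogate of the emoji.
let midScalar = sample.utf16.index(sample.utf16.startIndex, offsetBy: 2)
let aligned = roundedDownToScalarBoundary(midScalar, in: sample)
let alignedOffset = sample.utf16.distance(from: sample.utf16.startIndex, to: aligned)
```

After rounding, the index is reachable from the unicode scalar view, so generic index manipulation methods can operate on it.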
@swift-ci test
stephentyrone approved these changes on Jan 4, 2023
Cherry pick of #62717 for 5.8.
Previously we insisted on using UTF-16 breadcrumbs even if we only needed to travel a very short way. This could be as much as forty times slower than the naive algorithm of simply visiting all the Unicode scalars in between the start and the end. (Using breadcrumbs generally means that we need to walk to both endpoints from their nearest breadcrumb, which on average requires walking half the distance between breadcrumbs, twice. This can mean visiting vastly more Unicode scalars than if we simply walked through the ones that lie in between the endpoints themselves.)
To put it another way, when we want to measure how long it takes to walk between two trees within a nearby park, it probably isn't a great idea to start by separately measuring each of their distances from the nearest airport. 😛
rdar://103575481