
[5.8][stdlib] Speed up short UTF-16 distance calculations #62823


Merged
merged 12 commits into release/5.8 from string-utf16-speedup-5.8
Jan 4, 2023


lorentey
Member

@lorentey lorentey commented Jan 4, 2023

Cherry pick of #62717 for 5.8.

Previously we insisted on using UTF-16 breadcrumbs even if we only needed to travel a very short way. This could be as much as forty times slower than the naive algorithm of simply visiting all the Unicode scalars in between the start and the end.

(Using breadcrumbs generally means that we need to walk to both endpoints from their nearest breadcrumb, which on average requires walking half the distance between breadcrumbs, twice — and this can mean visiting vastly more Unicode scalars than if we simply walked through the ones that are lying in between the endpoints themselves.)

To put it another way, when we want to measure how long it takes to walk between two trees within a nearby park, it probably isn't a great idea to start by separately measuring each of their distances from the nearest airport. 😛
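
To make the "naive algorithm" concrete, here is a minimal sketch that uses only public API. `naiveUTF16Distance` is a hypothetical helper written for illustration, not the standard library's internal implementation; it assumes both indices are scalar-aligned and that `start <= end`.

```swift
// Hypothetical helper for illustration; not the stdlib's internal code.
// Assumes `start` and `end` are scalar-aligned indices with start <= end.
func naiveUTF16Distance(
    in string: String,
    from start: String.Index,
    to end: String.Index
) -> Int {
    precondition(start <= end, "indices must not be decreasing")
    let scalars = string.unicodeScalars
    var count = 0
    var i = start
    while i < end {
        // Scalars above U+FFFF need a surrogate pair (two UTF-16 code units).
        count += scalars[i].value > 0xFFFF ? 2 : 1
        i = scalars.index(after: i)
    }
    return count
}

let text = "a😀b"
let d = naiveUTF16Distance(in: text, from: text.startIndex, to: text.endIndex)
// `d` agrees with text.utf16.count: one unit each for "a" and "b", two for the emoji.
```

The work done is proportional to the number of scalars between the two endpoints, which is why this approach can win when the endpoints are close together.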

rdar://103575481

Previously we insisted on using breadcrumbs even if we only needed to
travel a very short way. This could be as much as ten times slower
than the naive algorithm of simply visiting all the Unicode scalars
in between the start and the end.

(Using breadcrumbs generally means that we need to walk to both
endpoints from their nearest breadcrumb, which on average requires
walking half the distance between breadcrumbs — and this can mean
visiting vastly more Unicode scalars than the ones that are simply
lying in between the endpoints themselves.)

(cherry picked from commit 483087a)
… ranges

Instead of calling `_toUTF16Index` twice, call it once and then use
`index(_:offsetBy:)` to potentially avoid another breadcrumbs lookup.

(cherry picked from commit 2423b8b)
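
The same idea can be sketched with public API: resolve the lower bound once from an absolute offset, then reach the upper bound by a relative step from it. `indexRange(in:utf16Offsets:)` is a hypothetical name, and the internal `_toUTF16Index` is not called here; this only shows the shape of the calculation.

```swift
// Hypothetical helper; illustrates "one absolute lookup, then a relative step".
func indexRange(
    in string: String,
    utf16Offsets offsets: Range<Int>
) -> Range<String.Index> {
    let utf16 = string.utf16
    // Absolute lookup: the only step that may need to travel far.
    let lower = utf16.index(utf16.startIndex, offsetBy: offsets.lowerBound)
    // Relative step: for short ranges this only covers the range itself.
    let upper = utf16.index(lower, offsetBy: offsets.count)
    return lower ..< upper
}

let sample = "ab😀cd"
let emojiRange = indexRange(in: sample, utf16Offsets: 2 ..< 4)
// String(sample[emojiRange]) is "😀", which occupies UTF-16 offsets 2 and 3.
```
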
We commonly start from the `startIndex`, in which case
`_nativeGetOffset` is essentially free. Consider this
case when calculating the threshold for using breadcrumbs.

(cherry picked from commit ec35728)
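
A rough cost model may help show why the starting position matters for the threshold. Everything below is an assumption made for illustration: the function names are hypothetical, the stride value passed in is arbitrary, and the standard library's actual heuristic may differ.

```swift
// Hypothetical cost model; not the standard library's actual heuristic.
// `stride` is an assumed spacing between breadcrumbs, in UTF-16 code units.
func estimatedBreadcrumbCost(startIsStartIndex: Bool, stride: Int) -> Int {
    // Locating an endpoint from its nearest breadcrumb costs about
    // half a stride on average.
    let perEndpoint = stride / 2
    // An endpoint at startIndex has a known offset of zero, so only
    // the other endpoint needs to be located.
    return startIsStartIndex ? perEndpoint : 2 * perEndpoint
}

func prefersDirectWalk(distance: Int, startIsStartIndex: Bool, stride: Int) -> Bool {
    // Walk directly when that visits fewer code units than the lookups would.
    return distance < estimatedBreadcrumbCost(
        startIsStartIndex: startIsStartIndex, stride: stride)
}
```

Under this model, a 40-unit walk with a stride of 64 is worth doing directly from an arbitrary position, but not when starting at `startIndex`, where the breadcrumb path is already cheaper.
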
Speed up conversion between UTF-16 offset ranges
and string index ranges, by carefully switching
between absolute and relative index calculations,
depending on the distance we need to go.

It is a surprisingly tricky puzzle to do this
correctly while avoiding redundant calculations.
Offset ranges within substrings add the additional
complication of having to bias offset values with
the absolute offset of the substring’s start index.

(cherry picked from commit d00f8ed)
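
The substring bias can be illustrated with public API. In this sketch, `relativeUTF16Offsets(of:in:)` is a hypothetical helper: it measures the absolute offset in the base string, then subtracts the absolute offset of the substring's start index. It assumes `range` lies within the substring's bounds.

```swift
// Hypothetical helper; shows the bias applied to substring offsets.
// Assumes `range` lies within the bounds of `sub`.
func relativeUTF16Offsets(
    of range: Range<String.Index>,
    in sub: Substring
) -> Range<Int> {
    let base = sub.base.utf16
    // Absolute offset of the substring's start within its base string.
    let bias = base.distance(from: base.startIndex, to: sub.startIndex)
    // Absolute offset of the range's lower bound within the base string.
    let absoluteLower = base.distance(from: base.startIndex, to: range.lowerBound)
    // The length is measured relatively, between the two bounds.
    let length = base.distance(from: range.lowerBound, to: range.upperBound)
    let lower = absoluteLower - bias
    return lower ..< lower + length
}

let whole = "ab😀cd"
let slice = whole.dropFirst(2)  // "😀cd"
let cd = slice.firstIndex(of: "c")! ..< slice.endIndex
let offsets = relativeUTF16Offsets(of: cd, in: slice)
// `offsets` is 2 ..< 4: the emoji takes offsets 0 and 1 within the slice.
```
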
Evidently we did not have any tests that exercised
`distance(from:to:)` and `index(_:offsetBy:)`. :-O

(cherry picked from commit 051f9ed)
- Align input indices to scalar boundaries
- Don’t pass decreasing indices to _utf16Distance

(cherry picked from commit 5d354ce)
(cherry picked from commit cd55016)
…ithms

[Bidirectional]Collection’s default index manipulation methods (as
well as _utf16Distance) do not expect to be given unreachable
indices, and they tend to fail when operating on them. Round indices
down to the nearest scalar boundary before calling these.

(cherry picked from commit e46f8f8)
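
As an illustration of what rounding down to a scalar boundary means, consider a UTF-16 index that points at the trailing half of a surrogate pair. The sketch below uses public API only; `roundedDownToScalarBoundary` is a hypothetical helper, not the internal routine the commits use, and it assumes the input is a valid index of the string's UTF-16 view.

```swift
// Hypothetical helper; assumes `index` is a valid index of string.utf16.
func roundedDownToScalarBoundary(
    _ index: String.Index,
    in string: String
) -> String.Index {
    // samePosition(in:) returns nil when the index is not on a scalar boundary.
    if index.samePosition(in: string.unicodeScalars) != nil {
        return index
    }
    // A misaligned UTF-16 index sits on a trailing surrogate,
    // so one step back lands on the start of that scalar.
    return string.utf16.index(before: index)
}

let s = "a😀b"
let utf16 = s.utf16
let mid = utf16.index(utf16.startIndex, offsetBy: 2)  // trailing surrogate
let aligned = roundedDownToScalarBoundary(mid, in: s)
// `aligned` now addresses the start of the emoji scalar.
```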
@lorentey
Member Author

lorentey commented Jan 4, 2023

@swift-ci test

@lorentey lorentey merged commit 7497aee into swiftlang:release/5.8 Jan 4, 2023
@lorentey lorentey deleted the string-utf16-speedup-5.8 branch January 4, 2023 19:17
@AnthonyLatsis AnthonyLatsis added 🍒 release cherry pick Flag: Release branch cherry picks swift 5.8 labels Jan 9, 2023