[5.7][stdlib] Fix String indexing edge cases, anomalies & validation bugs #42402

lorentey · 2022-04-15T22:52:15Z

Cherry-picked from #41417 and #42385.

Explanation: This addresses a multitude of index validation and semantic issues in the collection implementations of String, Substring, and their Unicode Scalar, UTF-8 and UTF-16 views, by implementing additional runtime index validation, as needed; this includes tracking the expected encoding of String.Index offsets within the index value itself.
Scope: String indexing and grapheme breaking. This changes the runtime behavior of code that passes invalid indices to methods of the types above. Code that previously resulted in unexpected traps, out-of-bounds memory accesses and/or nonsensical results will now behave more consistently, either gracefully substituting a suitable valid index or generating a clear, nondeterministic trap.
Issues: rdar://89482809
Risk: Medium/high. Care was taken to ensure backward/forward compatibility whenever practical (even including code that violates documented preconditions), but this is a large change.
Testing: Additional, extensive unit tests as well as local binary compatibility testing.
Reviewer: @milseman @Azoy @stephentyrone

For more detailed list of changes, see the original description of #41417, expandable here by clicking on the sections below.

Motivation

Swift String and its views generally work very reliably as long as we keep to only applying indices to each string view that come directly from it.

This PR fixes issues that happen when we don’t. There are three major sources of such issues:

SE-0180 String Index Overhaul

Unfortunately, back in Swift 4, SE-0180 unified the index types of string views, eliminating distinct indices and explicitly allowing an index from one view to be applied to any other (of the same underlying string), without limit.

The proposal (very loosely) defined the intended behavior to organically arise as if a new string was constructed from the underlying code units that are addressed by the indices given. This was never truly implemented — while we did get the unified String.Index type, the semantics of using a “lower-level” index on a “higher-level” collection (like using an arbitrary index in the UTF-8 view on the character-level String type itself) were never formally specified, and the behavior that actually emerged from the implementation does not match what the proposal intended (or any reasonable definition of expected behavior).

In Swift 5.1, character- and scalar-based views (String, Substring, String.UnicodeScalarView, Substring.UnicodeScalarView) started rounding indices that address sub-scalar positions down to the nearest scalar boundary, which addressed the most egregious correctness issues. (Such as the ability to create strings with invalid encodings.) However, we have not yet tackled the case where sub-character indices are applied to character views: the String implementation in Swift 5.6 is still full of unexpected/undesirable/nonsensical behavior in this area.

let s = "Hello,\r\nworld!"

let i = s.startIndex
let j = s.unicodeScalars.index(s.startIndex, offsetBy: 7) // between \r and \n

print(s.distance(from: j, to: i)) // -7
print(s.distance(from: i, to: j)) // 💥 Fatal error: String index is out of bounds

let ss = s[i ..< j]
print(ss.count) // 💥 Fatal error: String index is out of bounds

As discussed on the forums, the right way to fix these issues is to either trap when these indexing operations are given unaligned indices, or to automatically round every index down to the nearest Character boundary. The latter option seems less surprising and probably more compatible with existing code, so that's what this PR implements.

Accidental use of actually invalid indices

A separate source of issues in this area (unrelated to SE-0180) is the fact that indices don’t currently keep track of whether the numerical offset that they contain counts UTF-8 or UTF-16 code units. (Some strings bridged over from Cocoa can be encoded in UTF-16, while most others are in UTF-8.) This is usually not a problem, except when people mistakenly apply a UTF-16 index to a UTF-8 string (e.g., after a call to makeContiguousUTF8). This is always an error, as the index is definitely invalid in this case; however, this error is currently undiagnosed.

var str = (some UTF-16 string bridged over from Cocoa)
let i = str.firstIndex(of: “X”)!
str.makeContiguousUTF8() // documented to invalidate all indices
// str is now UTF-8 encoded
print(str[i]) // illegal, i isn’t valid any more — but it still tends to succeed. 
              // For ASCII strings, it even returns a value that looks right!

This can stay undiagnosed for a very long time — until someone tries to run this code on a non-ASCII string, and gets surprised by getting a nonsensical result. The upshot is that there is a potentially large amount of Swift string processing code out there with dormant bugs of the form above.

The best way to improve this situation is to add bookkeeping to indices so that we record their encodings, allowing us to detect such problems and diagnose and/or recover from them. (For example, in the example above, we could either trap, or just emit a runtime warning and return the expected value by transcoding the string on the fly.)

This PR assigns two flag bits in each string index to track its encoding, and automatically transcodes UTF-16 indices if they are applied to an UTF-8 string, making the (still invalid!) code above produce consistently good results. No such conversion is done in the opposite direction, as we don't typically convert UTF-8 strings to UTF-16.

Small oversights

String has a large API surface area, and some entry points are a bit behind when it comes to index validation. For example:

String.replaceSubrange does not always fully validate/sanitize its input indices, allowing the creation of invalid strings
Substring.replaceSubrange does not always preserve the location of its startIndex/endIndex (when the substring is a UTF-16 string bridged over from Cocoa)
String.index(before:) may perform out of bounds access to memory
String.distance(from:to:) may perform out of bounds access to memory
String.UnicodeScalarView may perform out of bounds access to memory in indexing methods
too many similar issues to precisely list here

This PR fixes these by thoroughly reviewing every String API that takes an index and carefully adding missing validation steps, as needed.

Summary of the changes applied

The .index(after:), .index(before:), index(_:offsetBy:), index(_:offsetBy:limitedBy:), and distance(from:to:) methods on String and Substring now start by rounding their input indices down to the nearest Character boundary. This avoids SE-0180-related anomalies, such as distance not being well-defined on every pair of indices.
To prevent this from causing a significant performance regression, String.Index now has a flag bit that tracks whether the index is Character-aligned, i.e., whether it addresses a location that falls on an extended grapheme cluster boundary. This bit is set on indices that are returned from any method that guarantees Character alignment. (Such as the methods listed above.)
SE-0180 explicitly allows exotic Substrings whose startIndex isn't Character aligned, and (loosely) prescribes how they should behave. This PR finally implements this behavior, so exotic substrings now behave in a well-defined, predictable way: they are now always valid collections, containing characters that "arise organically by applying the usual Unicode grapheme breaking rules" to the scalars that they contain. (And only the scalars they contain -- meaning that no data outside the substring bounds is used to determine grapheme breaks within it. This means that exotic substrings sometimes do not share their characters with the base string.)
The Character-alignment bit in String.Index always reflects grapheme breaking results within the entire string. Exotic substrings cannot set or use this bit, as their grapheme breaks may not match with the base string. Therefore, exotic substrings always do a full alignment dance, which makes their indexing operations quite slow. (I expect they are currently about three times slower than regular substrings, although this can vary.) This is much preferable to random trapping or other nonsensical behavior.
String.Index now includes two extra flag bits that track whether the encoded offset value is measured in UTF-8 or UTF-16 units. This is used to detect if someone attempts to use a UTF-16 index on a UTF-8 string, or vice versa.
- When a UTF-16 index is applied to a UTF-8 string, string operations now automatically transcode the offset value to UTF-8 units. This can be quite slow, but it produces consistently good results in the case of someone keeping indices around after mutating a bridged string. This is preferable to trapping, to avoid breaking (invalid) code that appeared to work correctly before.
- We do trap if a UTF-8 index is applied to a UTF-16 string, as we don't implicitly convert strings in that direction. This case (from what I've seen so far) never happens "organically". This carries a slight risk of a binary compatibility issue -- if some Swift applications did get this wrong, then they will now crash. If this happens, we'll be able to adjust the trap to e.g. make it only happen if the app was built with a newer SDK.
- Some indices may originate from older versions of the stdlib, in which case they won't have either of these bits set. In this case, we revert to assuming the index has the correct encoding, like before.
- Some indices may have both their UTF-8 and UTF-16 flags set. This isn't an error -- some indices work in either encoding. For example, every startIndex is encoding-independent. (And so are indices in ASCII strings, although this PR doesn't actually make use of this.)
String's ABI allows, but the stdlib does not currently implement, non-contiguous UTF-8 encoded string forms. (We only have UTF-16 encoded foreign forms -- as used by some bridged NSString instances.) To continue allowing future versions of the stdlib to introduce UTF-8 foreign strings, this PR assigns a new flag bit in the large string representation that signals this potential future case. This bit is used by the (inlinable!) encoding validation logic above.

If in the future we decide to give up on the idea of UTF-8 foreign strings, we can then stop checking this bit in a future stdlib release, slightly simplifying logic. (However, this PR irrevocably assigns this bit in the ABI, so we won't be able to reclaim it for another use later.)
In each string view, methods that take indices now consistently perform the following index validation steps, in this precise order:
- For each supplied index i, ensure that i's offset uses the expected coding units. Mitigate mismatches by transcoding the index or trapping, as described in point 5 above.
- Perform bounds checking to ensure that i addresses a location within the string's storage.
- [Only in String, Substring and their UnicodeScalarViews] Round down i to the nearest Unicode scalar boundary.
- [Only in String and Substring] Round down i to the nearest Character boundary.
- In index(before:), ensure that the resulting i is above the startIndex.
Of particular note is that bounds checking must come before scalar/character alignment, or the alignment operations may end up accessing out-of-bounds memory. index(before:) is a tricky edge case, as it needs an i > startIndex check that must not be done as part of regular bounds checking -- because alignment operations may move i to the startIndex, invalidating the result of the earlier check.

Performance impact

Performance Impact

Full index validation takes 2-5 steps and at least as many branches. As is often the case, ensuring that we do things correctly (and never access out-of-bounds memory) tends to slow things down a bit. I'm pretty sure the performance regressions can be mostly/entirely mitigated. This PR already implements some mitigations, with more likely coming later:

The new character-alignment bit in string indices avoids having to do a full random-access search for a grapheme break in the >99% case where the index simply comes from the same string view.
replaceSubrange needed to be rewritten in every range-replaceable string view to fix its behavior. The new implementations contain algorithmic improvements that dramatically boost performance compared to the previous versions, despite the additional index validation. (Especially for Substring and the two UnicodeScalarViews.)
The grapheme breaking algorithm was refactored to support exotic Substrings. As part of this work, some duplicated work was eliminated and the forward & backward directions now have their own standalone code paths, simplifying logic.
Higher-level indexing operations (distance(from:to:), index(_:offsetBy:), index(_:offsetBy:limitedBy:)) no longer delegate to generic implementations in BidirectionalCollection; rather they use hand-written implementations that are tailored to strings. This fixes some edge cases that are unique to strings, as well as allowing us to avoid index validation overhead while iterating over indices; this should make the new implementations just as fast or faster as the versions we had in <=5.6.
Some care was taken not to introduce unnecessary ARC traffic, with releasenone annotations added as needed.
To shortcut the number of steps needed to validate an index, I added direct fast paths that test for multiple index flags (encoding, scalar/character alignment) at once, and only going through the full validation procedure if the flags indicate that we aren't on the happy path.

Cherry-picked from swiftlang#41417

This fixes a Windows regression triggered by swiftlang#41417. (cherry picked from commit 66a8ae0)

lorentey · 2022-04-15T22:56:46Z

@swift-ci test

… into a standalone long test Also trim down its input a bit so that this doesn’t take 20 minutes. (cherry picked from commit 318277c)

lorentey · 2022-04-16T04:42:23Z

I'll also add #42409 here to prevent slowing down CI test runs, especially in debug mode.

lorentey · 2022-04-16T04:42:54Z

@swift-ci test and merge

lorentey added 2 commits April 15, 2022 15:17

[stdlib] Fix String indexing edge cases, anomalies & validation bugs

09afcac

Cherry-picked from swiftlang#41417

[test] Move string test helper methods to StdlibUnittest

0d03f1f

This fixes a Windows regression triggered by swiftlang#41417. (cherry picked from commit 66a8ae0)

lorentey requested a review from a team as a code owner April 15, 2022 22:52

stephentyrone approved these changes Apr 16, 2022

View reviewed changes

[test] stdlib/StringIndex: Spin off O(n^4) substring replacement test…

7f34490

… into a standalone long test Also trim down its input a bit so that this doesn’t take 20 minutes. (cherry picked from commit 318277c)

swift-ci merged commit d1a9c63 into swiftlang:release/5.7 Apr 16, 2022

lorentey deleted the the-horror-of-se-0180-5.7 branch April 17, 2022 21:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[5.7][stdlib] Fix String indexing edge cases, anomalies & validation bugs #42402

[5.7][stdlib] Fix String indexing edge cases, anomalies & validation bugs #42402

Uh oh!

lorentey commented Apr 15, 2022 •

edited

Loading

Uh oh!

lorentey commented Apr 15, 2022

Uh oh!

lorentey commented Apr 16, 2022

Uh oh!

lorentey commented Apr 16, 2022

Uh oh!

Uh oh!

[5.7][stdlib] Fix String indexing edge cases, anomalies & validation bugs #42402

[5.7][stdlib] Fix String indexing edge cases, anomalies & validation bugs #42402

Uh oh!

Conversation

lorentey commented Apr 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SE-0180 String Index Overhaul

Accidental use of actually invalid indices

Small oversights

Summary of the changes applied

Performance Impact

Uh oh!

lorentey commented Apr 15, 2022

Uh oh!

lorentey commented Apr 16, 2022

Uh oh!

lorentey commented Apr 16, 2022

Uh oh!

Uh oh!

lorentey commented Apr 15, 2022 •

edited

Loading