-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[stdlib] Fix String indexing edge cases, anomalies & validation bugs #41417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
One big insight here is that if a Rounding indices down to the nearest character boundary (like we do with sub-scalar indices) doesn't currently feel to me like the best approach to deal with this. It seems better to allow such This does mean that non-aligned substrings cannot use or set grapheme caches in Testing still keeps popping up more cases to fix. I think this PR will ultimately require |
Just out of idle curiosity: @swift-ci benchmark (The new branches will definitely add some overhead, but how much? And will simplifying scalar decoding make up for it?) |
@swift-ci test |
(Ah benchmarking is currently offline; it will come back soon but meanwhile I'll just satisfy my curiosity at home) |
Does this mean the
Hmm, I wonder. Strings (and Substrings) are semantically collections of Characters, not Unicode scalars, and the requirement from Collection that they produce the same indices, and that those indices are compatible with each other, seems to suggest that we may indeed need character alignment, not just scalar alignment. If I understand what you're saying (and I'm not 100% sure that I do), character alignment would make things more predictable and help solve some of the implementation difficulties, would it not? It would be a slight change to SE-0180, but given your example of how broken the current behaviour is, personally I think an amendment would not be unreasonable. |
I don't think so -- if a The details of course depend on what exactly we mean by a "valid index", but (for the two mutually incompatible definitions below)
From my view, SE-0180 widened
I think I agree this would have been a much simpler option than what SE-0180 went for. However, that proposal clearly intended to allow sub-character slicing -- it (very loosely) defined what the intended semantics are, and even called out explicit rounding-to-character bounds operations as desirable future work. There are many reasons to dislike this proposal, but it did get accepted, and I think we should make an honest effort to fix its implementation before going back to Swift Evolution. For now, this PR tries to work out if we can fix the implementation of SE-0180 to work as as closely as possible to what was originally accepted. If at the end of this work, the resulting behavior still feels clearly wrong, or if the performance implications are unbearable, or we run into a fundamental problem, then we can start pushing for overhauling it. (Things don't seem too bad to me so far; although the situation seems terrible on (I know @milseman would be very much on board with entirely rolling SE-0180 back, restoring the original distinct index types. FWIW, I'm a bit hesitant/scared about that, because it did make the string APIs a lot less fussy when going from higher-level views down, and because we'd have to maintain ABI/source compatibility with current releases and language versions. That would take a lot of effort we could spend on something more interesting, as well as potentially canceling out most of the performance benefits of undoing SE-0180.) |
One argument that could potentially become decisive is that (For all strings let s = "Hello,\r\nworld!"
let i = s.startIndex
let j = s.unicodeScalars.index(s.startIndex, offsetBy: 7) // on \n
let k = s.unicodeScalars.index(s.startIndex, offsetBy: 6) // on \r
let ss = s[i ..< j]
print(s[k]) // "\r\n"
print(ss[k]) // "\r" ( This behavior is explicitly required by SE-0180, but if it turns out to interfere with some generic algorithms, then that could be enough to make the rounding behavior more desirable. (This isn't as clear cut as it may appear on first glance -- as a rule, generic algorithms don't take indices from one collection and apply them to another, so the indices involved must have been passed in as arguments. This considerably limits the set of algorithms that may be impacted; and ideally the results should be so bad that no one would file them as just an "emergent property of applying the usual Unicode rules for decoding those code units". SE-0180 is infuriating.) |
75e4dfe
to
cf6ee03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick pass by of first 1/3rd
@@ -57,17 +57,34 @@ extension String: BidirectionalCollection { | |||
/// `endIndex`. | |||
/// - Returns: The index value immediately after `i`. | |||
public func index(after i: Index) -> Index { | |||
let i = _guts.ensureMatchingEncoding(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh cool, this isn't already inlinable :-). Would we want to consider inlining the matching encoding check so that a series of API calls can coalesce them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that would probably be the logical choice if this ends up being a significant performance regression.
I’m hoping we won’t need to do that, though, as we’d have to figure out how to add an inlinable wrapper around this. Playing tricks with SILGen names is the route we’ve chosen recently, but I wonder if we’d better come up with a less invasive alternative.
encodedOffset: priorOffset, characterStride: stride)._scalarAligned | ||
|
||
let r = Index(encodedOffset: priorOffset, characterStride: stride) | ||
return r._scalarAligned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r._graphemeAligned
, which would set both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would _characterAligned
be a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"character" is wildly ambiguous, I feel like. Swift.Character
is a name for grapheme clusters, but for internal code/names, I think it's better to be specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of the Swift stdlib, I think the definition of a character cannot possibly be clearer. String
is a Collection
of Character
values; these routines help identify the boundaries between Character
s. The words “string”, “collection” and “character” all have ambiguous meanings outside of this context, but inside of this codebase, we narrowed things down to extremely precise technical definitions.
If we must ignore the context this code is in (why?), then “grapheme” is just as ambiguous as character is — e.g., the “ly” in my name is two Character
s, but it is a single letter, and it is considered one grapheme according to at least some linguists. Unicode’s glossary has a non-technical definition of grapheme that doesn’t clear things up at all. The mere fact that it is a different name than Character
invites confusion — if I came across it, I would suspect that the different names convey a difference in meaning.
Unicode only has unambiguous definitions for “grapheme cluster” and “extended grapheme cluster”, and iiuc Character
implements the latter. So if we want to avoid confusion for people working on stdlib code who aren’t familiar with Swift’s Character
type (who are these people?), then we should probably go with _isExtendedGraphemeClusterAligned
, or _isEGCAligned
for short.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're missing the point. Stdlib API is formally layered on top of an understanding of Unicode semantics, and "character" in Unicode has many definitions. If you want to cite Union's Glossary, then read character. "Extended" is meaningless except to contrast with legacy clusters. "Grapheme cluster" is the more precise name, I only mention "grapheme" because it clearly contrasts with scalars and code units.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Character
is precisely defined in the Standard Library. If we must internally pretend this isn't so (why?), then I think we should stick to the corresponding technical term in Unicode, which is "extended grapheme cluster".
Unless I'm completely mistaken (please correct me if so!) the phrase "grapheme cluster" in Unicode refers to something other than what Character
implements. Therefore, in my view using that term would increase confusion, not decrease it.
C.f.
/// A single extended grapheme cluster that approximates a user-perceived
/// character. [...]
@frozen
public struct Character: Sendable {
`String.UnicodeScalarView` currently lacks proper index validation in its `index(after:)` and `index(before:)` methods, leading to out-of-bounds memory accesses when index navigation methods in this view are given invalid indices. (Also see swiftlang#41598 and swiftlang#41417) rdar://89498541
6d856ea
to
859c1aa
Compare
Hm, (I'm pretty sure the undesirable effects include moving the substring's bounds in its base string by side effect, which should never happen.) |
499ec6b
to
e4aad9e
Compare
e4aad9e
to
66137c8
Compare
🤨 This will not improve my carefully curated net negative code contribution record. |
66137c8
to
8b39356
Compare
@swift-ci test |
Strictly for fun: @swift-ci benchmark |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
8b39356
to
dc15dff
Compare
The last commit made @swift-ci benchmark |
This comment was marked as outdated.
This comment was marked as outdated.
(Significant code size improvements, as expected. -O performance seemed to somewhat improve(?!), -Osize somewhat regressed, -Onone stayed as is.) |
…sequences of SE-0180
… indices Assign some previously reserved bits in String.Index and _StringObject to keep track of their associated storage encoding (either UTF-8 or UTF-16). None of these bits will be reliably set in processes that load binaries compiled with older stdlib releases, but when they do end up getting set, we can use them opportunistically to more reliably detect cases where an index is applied on a string with a mismatching encoding. As more and more code gets recompiled with 5.7+, the stdlib will gradually become able to detect such issues with complete accuracy. Code that misuses indices this way was always considered broken; however, String wasn’t able to reliably detect these runtime errors before. Therefore, I expect there is a large amount of broken code out there that keeps using bridged Cocoa String indices (UTF-16) after a mutation turns them into native UTF-8 strings. Therefore, instead of trapping, this commit silently corrects the issue, transcoding the offsets into the correct encoding. It would probably be a good idea to also emit a runtime warning in addition to recovering from the error. This would generate some noise that would gently nudge folks to fix their code. rdar://89369680
9be58fc
to
67adcab
Compare
@swift-ci benchmark |
@swift-ci test |
Oops, this is real.
So is this -- I did not realize we have implicit _modify accessors. Neat! |
Gah, are we really not able to run grapheme break tests on Linux?
|
@swift-ci test |
Yeah, the StdlibUnicodeUnittest uses Foundation to read the test files and for platforms besides Darwin ones we haven't built corelibs-foundation yet when we run Swift tests. |
@swift-ci test linux platform |
This fixes a Windows regression triggered by swiftlang#41417.
This fixes a Windows regression triggered by swiftlang#41417. (cherry picked from commit 66a8ae0)
In Swift 5.6 and below, (broken) code that acquired indices from a UTF-16-encoded string bridged from Cocoa and kept using them after a `makeContiguousUTF8` call (or other mutation) may have appeared to be working correctly as long as the string was ASCII. Since swiftlang#41417, the `String(_:within:)` initializers recognize miscoded indices and reject them by returning nil. This is technically correct, but it unfortunately may be a binary compatibility issue, as these used to return non-nil in previous versions. Mitigate this issue by accepting UTF-16 indices on a UTF-8 string, transcoding their offset as needed. (Attempting to use an UTF-8 index on a UTF-16 string is still rejected — we do not implicitly convert strings in that direction.) rdar://89369680
In Swift 5.6 and below, (broken) code that acquired indices from a UTF-16-encoded string bridged from Cocoa and kept using them after a `makeContiguousUTF8` call (or other mutation) may have appeared to be working correctly as long as the string was ASCII. Since swiftlang#41417, the `String(_:within:)` initializers recognize miscoded indices and reject them by returning nil. This is technically correct, but it unfortunately may be a binary compatibility issue, as these used to return non-nil in previous versions. Mitigate this issue by accepting UTF-16 indices on a UTF-8 string, transcoding their offset as needed. (Attempting to use an UTF-8 index on a UTF-16 string is still rejected — we do not implicitly convert strings in that direction.) rdar://89369680 (cherry picked from commit 4d557b0)
(This is a smarter attempt at solving the same family of issues that #41331 attempted to tackle.)
Resolves rdar://89482809.
Motivation
Swift String and its views generally work very reliably as long as we keep to only applying indices to each string view that come directly from it.
This PR fixes issues that happen when we don’t. There are three major sources of such issues:
SE-0180 String Index Overhaul
Unfortunately, back in Swift 4, SE-0180 unified the index types of string views, eliminating distinct indices and explicitly allowing an index from one view to be applied to any other (of the same underlying string), without limit.
The proposal (very loosely) defined the intended behavior to organically arise as if a new string was constructed from the underlying code units that are addressed by the indices given. This was never truly implemented — while we did get the unified String.Index type, the semantics of using a “lower-level” index on a “higher-level” collection (like using an arbitrary index in the UTF-8 view on the character-level
String
type itself) were never formally specified, and the behavior that actually emerged from the implementation does not match what the proposal intended (or any reasonable definition of expected behavior).In Swift 5.1, character- and scalar-based views (
String
,Substring
,String.UnicodeScalarView
,Substring.UnicodeScalarView
) started rounding indices that address sub-scalar positions down to the nearest scalar boundary, which addressed the most egregious correctness issues. (Such as the ability to create strings with invalid encodings.) However, we have not yet tackled the case where sub-character indices are applied to character views: the String implementation in Swift 5.6 is still full of unexpected/undesirable/nonsensical behavior in this area.As discussed on the forums, the right way to fix these issues is to either trap when these indexing operations are given unaligned indices, or to automatically round every index down to the nearest
Character
boundary. The latter option seems less surprising and probably more compatible with existing code, so that's what this PR implements.Accidental use of actually invalid indices
A separate source of issues in this area (unrelated to SE-0180) is the fact that indices don’t currently keep track of whether the numerical offset that they contain counts UTF-8 or UTF-16 code units. (Some strings bridged over from Cocoa can be encoded in UTF-16, while most others are in UTF-8.) This is usually not a problem, except when people mistakenly apply a UTF-16 index to a UTF-8 string (e.g., after a call to
makeContiguousUTF8
). This is always an error, as the index is definitely invalid in this case; however, this error is currently undiagnosed.This can stay undiagnosed for a very long time — until someone tries to run this code on a non-ASCII string, and gets surprised by getting a nonsensical result. The upshot is that there is a potentially large amount of Swift string processing code out there with dormant bugs of the form above.
The best way to improve this situation is to add bookkeeping to indices so that we record their encodings, allowing us to detect such problems and diagnose and/or recover from them. (For example, in the example above, we could either trap, or just emit a runtime warning and return the expected value by transcoding the string on the fly.)
This PR assigns two flag bits in each string index to track its encoding, and automatically transcodes UTF-16 indices if they are applied to an UTF-8 string, making the (still invalid!) code above produce consistently good results. No such conversion is done in the opposite direction, as we don't typically convert UTF-8 strings to UTF-16.
Small oversights
String
has a large API surface area, and some entry points are a bit behind when it comes to index validation. For example:String.replaceSubrange
does not always fully validate/sanitize its input indices, allowing the creation of invalid stringsSubstring.replaceSubrange
does not always preserve the location of itsstartIndex
/endIndex
(when the substring is a UTF-16 string bridged over from Cocoa)String.index(before:)
may perform out of bounds access to memoryString.distance(from:to:)
may perform out of bounds access to memoryString.UnicodeScalarView
may perform out of bounds access to memory in indexing methodsThis PR fixes these by thoroughly reviewing every String API that takes an index and carefully adding missing validation steps, as needed.
Summary of the changes applied
The
.index(after:)
,.index(before:)
,index(_:offsetBy:)
,index(_:offsetBy:limitedBy:)
, anddistance(from:to:)
methods onString
andSubstring
now start by rounding their input indices down to the nearestCharacter
boundary. This avoids SE-0180-related anomalies, such asdistance
not being well-defined on every pair of indices.To prevent this from causing a significant performance regression,
String.Index
now has a flag bit that tracks whether the index isCharacter
-aligned, i.e., whether it addresses a location that falls on an extended grapheme cluster boundary. This bit is set on indices that are returned from any method that guaranteesCharacter
alignment. (Such as the methods listed above.)SE-0180 explicitly allows exotic
Substring
s whosestartIndex
isn'tCharacter
aligned, and (loosely) prescribes how they should behave. This PR finally implements this behavior, so exotic substrings now behave in a well-defined, predictable way: they are now always valid collections, containing characters that "arise organically by applying the usual Unicode grapheme breaking rules" to the scalars that they contain. (And only the scalars they contain -- meaning that no data outside the substring bounds is used to determine grapheme breaks within it. This means that exotic substrings sometimes do not share their characters with the base string.)The
Character
-alignment bit inString.Index
always reflects grapheme breaking results within the entire string. Exotic substrings cannot set or use this bit, as their grapheme breaks may not match with the base string. Therefore, exotic substrings always do a full alignment dance, which makes their indexing operations quite slow. (I expect they are currently about three times slower than regular substrings, although this can vary.) This is much preferable to random trapping or other nonsensical behavior.String.Index
now includes two extra flag bits that track whether the encoded offset value is measured in UTF-8 or UTF-16 units. This is used to detect if someone attempts to use a UTF-16 index on a UTF-8 string, or vice versa.startIndex
is encoding-independent. (And so are indices in ASCII strings, although this PR doesn't actually make use of this.)String
's ABI allows, but the stdlib does not currently implement, non-contiguous UTF-8 encoded string forms. (We only have UTF-16 encoded foreign forms -- as used by some bridged NSString instances.) To continue allowing future versions of the stdlib to introduce UTF-8 foreign strings, this PR assigns a new flag bit in the large string representation that signals this potential future case. This bit is used by the (inlinable!) encoding validation logic above.If in the future we decide to give up on the idea of UTF-8 foreign strings, we can then stop checking this bit in a future stdlib release, slightly simplifying logic. (However, this PR irrevocably assigns this bit in the ABI, so we won't be able to reclaim it for another use later.)
In each string view, methods that take indices now consistently perform the following index validation steps, in this precise order:
i
, ensure thati
's offset uses the expected coding units. Mitigate mismatches by transcoding the index or trapping, as described in point 5 above.i
addresses a location within the string's storage.String
,Substring
and theirUnicodeScalarView
s] Round downi
to the nearest Unicode scalar boundary.String
andSubstring
] Round downi
to the nearestCharacter
boundary.index(before:)
, ensure that the resultingi
is above thestartIndex
.Of particular note is that bounds checking must come before scalar/character alignment, or the alignment operations may end up accessing out-of-bounds memory.
index(before:)
is a tricky edge case, as it needs ani > startIndex
check that must not be done as part of regular bounds checking -- because alignment operations may movei
to thestartIndex
, invalidating the result of the earlier check.Performance Impact
Full index validation takes 2-5 steps and at least as many branches. As is often the case, ensuring that we do things correctly (and never access out-of-bounds memory) tends to slow things down a bit. I'm pretty sure the performance regressions can be mostly/entirely mitigated. This PR already implements some mitigations, with more likely coming later:
The new character-alignment bit in string indices avoids having to do a full random-access search for a grapheme break in the >99% case where the index simply comes from the same string view.
replaceSubrange
needed to be rewritten in every range-replaceable string view to fix its behavior. The new implementations contain algorithmic improvements that dramatically boost performance compared to the previous versions, despite the additional index validation. (Especially forSubstring
and the twoUnicodeScalarView
s.)The grapheme breaking algorithm was refactored to support exotic Substrings. As part of this work, some duplicated work was eliminated and the forward & backward directions now have their own standalone code paths, simplifying logic.
Higher-level indexing operations (
distance(from:to:)
,index(_:offsetBy:)
,index(_:offsetBy:limitedBy:)
) no longer delegate to generic implementations inBidirectionalCollection
; rather they use hand-written implementations that are tailored to strings. This fixes some edge cases that are unique to strings, as well as allowing us to avoid index validation overhead while iterating over indices; this should make the new implementations just as fast or faster as the versions we had in <=5.6.Some care was taken not to introduce unnecessary ARC traffic, with releasenone annotations added as needed.
To shortcut the number of steps needed to validate an index, I added direct fast paths that test for multiple index flags (encoding, scalar/character alignment) at once, and only going through the full validation procedure if the flags indicate that we aren't on the happy path.