[stdlib] Fix String indexing edge cases, anomalies & validation bugs #41417

lorentey · 2022-02-16T23:28:31Z

(This is a smarter attempt at solving the same family of issues that #41331 attempted to tackle.)

Resolves rdar://89482809.

Motivation

Swift String and its views generally work very reliably as long as we keep to only applying indices to each string view that come directly from it.

This PR fixes issues that happen when we don’t. There are three major sources of such issues:

SE-0180 String Index Overhaul

Unfortunately, back in Swift 4, SE-0180 unified the index types of string views, eliminating distinct indices and explicitly allowing an index from one view to be applied to any other (of the same underlying string), without limit.

The proposal (very loosely) defined the intended behavior to organically arise as if a new string was constructed from the underlying code units that are addressed by the indices given. This was never truly implemented — while we did get the unified String.Index type, the semantics of using a “lower-level” index on a “higher-level” collection (like using an arbitrary index in the UTF-8 view on the character-level String type itself) were never formally specified, and the behavior that actually emerged from the implementation does not match what the proposal intended (or any reasonable definition of expected behavior).

In Swift 5.1, character- and scalar-based views (String, Substring, String.UnicodeScalarView, Substring.UnicodeScalarView) started rounding indices that address sub-scalar positions down to the nearest scalar boundary, which addressed the most egregious correctness issues. (Such as the ability to create strings with invalid encodings.) However, we have not yet tackled the case where sub-character indices are applied to character views: the String implementation in Swift 5.6 is still full of unexpected/undesirable/nonsensical behavior in this area.

let s = "Hello,\r\nworld!"

let i = s.startIndex
let j = s.unicodeScalars.index(s.startIndex, offsetBy: 7) // between \r and \n

print(s.distance(from: j, to: i)) // -7
print(s.distance(from: i, to: j)) // 💥 Fatal error: String index is out of bounds

let ss = s[i ..< j]
print(ss.count) // 💥 Fatal error: String index is out of bounds

As discussed on the forums, the right way to fix these issues is to either trap when these indexing operations are given unaligned indices, or to automatically round every index down to the nearest Character boundary. The latter option seems less surprising and probably more compatible with existing code, so that's what this PR implements.

Accidental use of actually invalid indices

A separate source of issues in this area (unrelated to SE-0180) is the fact that indices don’t currently keep track of whether the numerical offset that they contain counts UTF-8 or UTF-16 code units. (Some strings bridged over from Cocoa can be encoded in UTF-16, while most others are in UTF-8.) This is usually not a problem, except when people mistakenly apply a UTF-16 index to a UTF-8 string (e.g., after a call to makeContiguousUTF8). This is always an error, as the index is definitely invalid in this case; however, this error is currently undiagnosed.

var str = (some UTF-16 string bridged over from Cocoa)
let i = str.firstIndex(of: “X”)!
str.makeContiguousUTF8() // documented to invalidate all indices
// str is now UTF-8 encoded
print(str[i]) // illegal, i isn’t valid any more — but it still tends to succeed. 
              // For ASCII strings, it even returns a value that looks right!

This can stay undiagnosed for a very long time — until someone tries to run this code on a non-ASCII string, and gets surprised by getting a nonsensical result. The upshot is that there is a potentially large amount of Swift string processing code out there with dormant bugs of the form above.

The best way to improve this situation is to add bookkeeping to indices so that we record their encodings, allowing us to detect such problems and diagnose and/or recover from them. (For example, in the example above, we could either trap, or just emit a runtime warning and return the expected value by transcoding the string on the fly.)

This PR assigns two flag bits in each string index to track its encoding, and automatically transcodes UTF-16 indices if they are applied to an UTF-8 string, making the (still invalid!) code above produce consistently good results. No such conversion is done in the opposite direction, as we don't typically convert UTF-8 strings to UTF-16.

Small oversights

String has a large API surface area, and some entry points are a bit behind when it comes to index validation. For example:

String.replaceSubrange does not always fully validate/sanitize its input indices, allowing the creation of invalid strings
Substring.replaceSubrange does not always preserve the location of its startIndex/endIndex (when the substring is a UTF-16 string bridged over from Cocoa)
String.index(before:) may perform out of bounds access to memory
String.distance(from:to:) may perform out of bounds access to memory
String.UnicodeScalarView may perform out of bounds access to memory in indexing methods
too many similar issues to precisely list here

This PR fixes these by thoroughly reviewing every String API that takes an index and carefully adding missing validation steps, as needed.

Summary of the changes applied

The .index(after:), .index(before:), index(_:offsetBy:), index(_:offsetBy:limitedBy:), and distance(from:to:) methods on String and Substring now start by rounding their input indices down to the nearest Character boundary. This avoids SE-0180-related anomalies, such as distance not being well-defined on every pair of indices.
To prevent this from causing a significant performance regression, String.Index now has a flag bit that tracks whether the index is Character-aligned, i.e., whether it addresses a location that falls on an extended grapheme cluster boundary. This bit is set on indices that are returned from any method that guarantees Character alignment. (Such as the methods listed above.)
SE-0180 explicitly allows exotic Substrings whose startIndex isn't Character aligned, and (loosely) prescribes how they should behave. This PR finally implements this behavior, so exotic substrings now behave in a well-defined, predictable way: they are now always valid collections, containing characters that "arise organically by applying the usual Unicode grapheme breaking rules" to the scalars that they contain. (And only the scalars they contain -- meaning that no data outside the substring bounds is used to determine grapheme breaks within it. This means that exotic substrings sometimes do not share their characters with the base string.)
The Character-alignment bit in String.Index always reflects grapheme breaking results within the entire string. Exotic substrings cannot set or use this bit, as their grapheme breaks may not match with the base string. Therefore, exotic substrings always do a full alignment dance, which makes their indexing operations quite slow. (I expect they are currently about three times slower than regular substrings, although this can vary.) This is much preferable to random trapping or other nonsensical behavior.
String.Index now includes two extra flag bits that track whether the encoded offset value is measured in UTF-8 or UTF-16 units. This is used to detect if someone attempts to use a UTF-16 index on a UTF-8 string, or vice versa.
- When a UTF-16 index is applied to a UTF-8 string, string operations now automatically transcode the offset value to UTF-8 units. This can be quite slow, but it produces consistently good results in the case of someone keeping indices around after mutating a bridged string. This is preferable to trapping, to avoid breaking (invalid) code that appeared to work correctly before.
- We do trap if a UTF-8 index is applied to a UTF-16 string, as we don't implicitly convert strings in that direction. This case (from what I've seen so far) never happens "organically". This carries a slight risk of a binary compatibility issue -- if some Swift applications did get this wrong, then they will now crash. If this happens, we'll be able to adjust the trap to e.g. make it only happen if the app was built with a newer SDK.
- Some indices may originate from older versions of the stdlib, in which case they won't have either of these bits set. In this case, we revert to assuming the index has the correct encoding, like before.
- Some indices may have both their UTF-8 and UTF-16 flags set. This isn't an error -- some indices work in either encoding. For example, every startIndex is encoding-independent. (And so are indices in ASCII strings, although this PR doesn't actually make use of this.)
String's ABI allows, but the stdlib does not currently implement, non-contiguous UTF-8 encoded string forms. (We only have UTF-16 encoded foreign forms -- as used by some bridged NSString instances.) To continue allowing future versions of the stdlib to introduce UTF-8 foreign strings, this PR assigns a new flag bit in the large string representation that signals this potential future case. This bit is used by the (inlinable!) encoding validation logic above.

If in the future we decide to give up on the idea of UTF-8 foreign strings, we can then stop checking this bit in a future stdlib release, slightly simplifying logic. (However, this PR irrevocably assigns this bit in the ABI, so we won't be able to reclaim it for another use later.)
In each string view, methods that take indices now consistently perform the following index validation steps, in this precise order:
- For each supplied index i, ensure that i's offset uses the expected coding units. Mitigate mismatches by transcoding the index or trapping, as described in point 5 above.
- Perform bounds checking to ensure that i addresses a location within the string's storage.
- [Only in String, Substring and their UnicodeScalarViews] Round down i to the nearest Unicode scalar boundary.
- [Only in String and Substring] Round down i to the nearest Character boundary.
- In index(before:), ensure that the resulting i is above the startIndex.
Of particular note is that bounds checking must come before scalar/character alignment, or the alignment operations may end up accessing out-of-bounds memory. index(before:) is a tricky edge case, as it needs an i > startIndex check that must not be done as part of regular bounds checking -- because alignment operations may move i to the startIndex, invalidating the result of the earlier check.

Performance Impact

Full index validation takes 2-5 steps and at least as many branches. As is often the case, ensuring that we do things correctly (and never access out-of-bounds memory) tends to slow things down a bit. I'm pretty sure the performance regressions can be mostly/entirely mitigated. This PR already implements some mitigations, with more likely coming later:

The new character-alignment bit in string indices avoids having to do a full random-access search for a grapheme break in the >99% case where the index simply comes from the same string view.
replaceSubrange needed to be rewritten in every range-replaceable string view to fix its behavior. The new implementations contain algorithmic improvements that dramatically boost performance compared to the previous versions, despite the additional index validation. (Especially for Substring and the two UnicodeScalarViews.)
The grapheme breaking algorithm was refactored to support exotic Substrings. As part of this work, some duplicated work was eliminated and the forward & backward directions now have their own standalone code paths, simplifying logic.
Higher-level indexing operations (distance(from:to:), index(_:offsetBy:), index(_:offsetBy:limitedBy:)) no longer delegate to generic implementations in BidirectionalCollection; rather they use hand-written implementations that are tailored to strings. This fixes some edge cases that are unique to strings, as well as allowing us to avoid index validation overhead while iterating over indices; this should make the new implementations just as fast or faster as the versions we had in <=5.6.
Some care was taken not to introduce unnecessary ARC traffic, with releasenone annotations added as needed.
To shortcut the number of steps needed to validate an index, I added direct fast paths that test for multiple index flags (encoding, scalar/character alignment) at once, and only going through the full validation procedure if the flags indicate that we aren't on the happy path.

lorentey · 2022-02-16T23:41:17Z

One big insight here is that if a Substring's boundaries aren't character aligned, then the contents of the substring (and the positions of its grapheme breaks) can be wildly different than its base string. This is fine with Collection (it doesn't/can't say anything about slicing with indices that aren't in self.indices), but the current implementation is ill-prepared for such cases, and it gets dangerously confused.

Rounding indices down to the nearest character boundary (like we do with sub-scalar indices) doesn't currently feel to me like the best approach to deal with this. It seems better to allow such Substrings to behave like standalone strings that happen to contain the same Unicode scalars, so this is what this PR implements.

This does mean that non-aligned substrings cannot use or set grapheme caches in String.Index values, as the cluster length indicated wouldn't necessarily be valid.

Testing still keeps popping up more cases to fix. I think this PR will ultimately require String.index(after:) to sometimes search back to find the nearest global character boundary -- without that, advancing forward will always be inconsistent with advancing back, making it very difficult to assign precise semantics for these operations (or to write tests for them).

lorentey · 2022-02-17T04:14:45Z

Just out of idle curiosity:

@swift-ci benchmark

(The new branches will definitely add some overhead, but how much? And will simplifying scalar decoding make up for it?)

lorentey · 2022-02-17T04:28:02Z

@swift-ci test

lorentey · 2022-02-17T04:35:18Z

(Ah benchmarking is currently offline; it will come back soon but meanwhile I'll just satisfy my curiosity at home)

karwa · 2022-02-18T15:09:01Z

One big insight here is that if a Substring's boundaries aren't character aligned, then the contents of the substring (and the positions of its grapheme breaks) can be wildly different than its base string.

Does this mean the Substring will produce indices which its base collection does not consider to be valid indexes?

Rounding indices down to the nearest character boundary (like we do with sub-scalar indices) doesn't currently feel to me like the best approach to deal with this. It seems better to allow such Substrings to behave like standalone strings that happen to contain the same Unicode scalars, so this is what this PR implements.

Hmm, I wonder. Strings (and Substrings) are semantically collections of Characters, not Unicode scalars, and the requirement from Collection that they produce the same indices, and that those indices are compatible with each other, seems to suggest that we may indeed need character alignment, not just scalar alignment.

If I understand what you're saying (and I'm not 100% sure that I do), character alignment would make things more predictable and help solve some of the implementation difficulties, would it not? It would be a slight change to SE-0180, but given your example of how broken the current behaviour is, personally I think an amendment would not be unreasonable.

lorentey · 2022-02-18T19:42:01Z

Does this mean the Substring will produce indices which its base collection does not consider to be valid indexes?

I don't think so -- if a Substring's bounds are valid indices in the original string, then all its indices will be valid, too. (Once the full implementation of this PR lands. Currently Substring doesn't always have a set of valid indices, as its Collection conformance is sometimes broken.)

The details of course depend on what exactly we mean by a "valid index", but (for the two mutually incompatible definitions below) Substring always shares its indices with its base.

If we define valid indices as those that are in the String.Indices view, then Substrings whose bounds are "valid" in this sense also share their valid indices with their base String.
If we define valid indices as those that subscript accepts (which I'd argue is an unreasonable definition, but let's suppose), then Substring indices still come out as always valid in the base string, even though they aren't necessarily reachable from the base string's start index.

From my view, String's Collection conformance isn't in any danger here -- Collection only cares about indices that are in the indices view (i.e., indices that are reachable from startIndex by index(after:)), and doesn't (can't!) require any specific behavior for indices that are outside that. Collection doesn't (cannot) dictate how the result of the slicing subscript should behave if the given range starts/ends on an index that isn't "valid" in the first, narrower, sense.

SE-0180 widened String's set of acceptable indices, adding a menagerie of index values that aren't valid as far Collection is concerned. The issue we need to solve is that the current implementation falls short of ensuring some very reasonable expectations about these indices, such as that substrings whose bounds fall on one of these properly conform to Collection, or that the magnitude of String.distance is symmetric. (The first one is clearly a bug; the second is technically not one, but it makes the semantics of these indices unhelpfully complicated.)

Hmm, I wonder. Strings (and Substrings) are semantically collections of Characters, not Unicode scalars, and the requirement from Collection that they produce the same indices, and that those indices are compatible with each other, seems to suggest that we may indeed need character alignment, not just scalar alignment.

I think I agree this would have been a much simpler option than what SE-0180 went for.

However, that proposal clearly intended to allow sub-character slicing -- it (very loosely) defined what the intended semantics are, and even called out explicit rounding-to-character bounds operations as desirable future work. There are many reasons to dislike this proposal, but it did get accepted, and I think we should make an honest effort to fix its implementation before going back to Swift Evolution.

For now, this PR tries to work out if we can fix the implementation of SE-0180 to work as as closely as possible to what was originally accepted. If at the end of this work, the resulting behavior still feels clearly wrong, or if the performance implications are unbearable, or we run into a fundamental problem, then we can start pushing for overhauling it.

(Things don't seem too bad to me so far; although the situation seems terrible on main, the bugs seem fixable. I remember surprisingly few bug reports about this mess, which I'm taking to mean people aren't exercising this part of the stdlib that much in production. Perhaps folks try and give up when they encounter one of these bogus traps, and mentally file these under "Unicode is complicated", rather than reporting them. This mindset will be very unhealthy/harmful in the long term, but it could mean that we have some leeway in fixing this behavior without breaking too many existing apps.)

(I know @milseman would be very much on board with entirely rolling SE-0180 back, restoring the original distinct index types. FWIW, I'm a bit hesitant/scared about that, because it did make the string APIs a lot less fussy when going from higher-level views down, and because we'd have to maintain ABI/source compatibility with current releases and language versions. That would take a lot of effort we could spend on something more interesting, as well as potentially canceling out most of the performance benefits of undoing SE-0180.)

lorentey · 2022-02-18T20:07:21Z

One argument that could potentially become decisive is that String in SE-0180 can't satisfy the expectation that subscripting returns the same characters for each index in both a substring and its base.

(For all strings s and all indices i, j, and all k in i..<j, it must hold true that s[i..<j][k] == s[k])

let s = "Hello,\r\nworld!"

let i = s.startIndex
let j = s.unicodeScalars.index(s.startIndex, offsetBy: 7) // on \n
let k = s.unicodeScalars.index(s.startIndex, offsetBy: 6) // on \r

let ss = s[i ..< j]
print(s[k]) // "\r\n"
print(ss[k]) // "\r"

(~~This can also happen in the middle a substring, when e.g. regional indicators are involved.~~ Edit: Nah, I think this may only occur on the edges, and perhaps only at the end.)

This behavior is explicitly required by SE-0180, but if it turns out to interfere with some generic algorithms, then that could be enough to make the rounding behavior more desirable.

(This isn't as clear cut as it may appear on first glance -- as a rule, generic algorithms don't take indices from one collection and apply them to another, so the indices involved must have been passed in as arguments. This considerably limits the set of algorithms that may be impacted; and ideally the results should be so bad that no one would file them as just an "emergent property of applying the usual Unicode rules for decoding those code units". SE-0180 is infuriating.)

milseman

Quick pass by of first 1/3rd

milseman · 2022-02-25T17:40:20Z

stdlib/public/core/StringCharacterView.swift

@@ -57,17 +57,34 @@ extension String: BidirectionalCollection {
  ///   `endIndex`.
  /// - Returns: The index value immediately after `i`.
  public func index(after i: Index) -> Index {
+    let i = _guts.ensureMatchingEncoding(i)


Oh cool, this isn't already inlinable :-). Would we want to consider inlining the matching encoding check so that a series of API calls can coalesce them?

Yep, that would probably be the logical choice if this ends up being a significant performance regression.

I’m hoping we won’t need to do that, though, as we’d have to figure out how to add an inlinable wrapper around this. Playing tricks with SILGen names is the route we’ve chosen recently, but I wonder if we’d better come up with a less invasive alternative.

milseman · 2022-02-25T17:41:26Z

stdlib/public/core/StringCharacterView.swift

-      encodedOffset: priorOffset, characterStride: stride)._scalarAligned
+
+    let r = Index(encodedOffset: priorOffset, characterStride: stride)
+    return r._scalarAligned


r._graphemeAligned, which would set both

Would _characterAligned be a better name?

"character" is wildly ambiguous, I feel like. Swift.Character is a name for grapheme clusters, but for internal code/names, I think it's better to be specific

In the context of the Swift stdlib, I think the definition of a character cannot possibly be clearer. String is a Collection of Character values; these routines help identify the boundaries between Characters. The words “string”, “collection” and “character” all have ambiguous meanings outside of this context, but inside of this codebase, we narrowed things down to extremely precise technical definitions.

If we must ignore the context this code is in (why?), then “grapheme” is just as ambiguous as character is — e.g., the “ly” in my name is two Characters, but it is a single letter, and it is considered one grapheme according to at least some linguists. Unicode’s glossary has a non-technical definition of grapheme that doesn’t clear things up at all. The mere fact that it is a different name than Character invites confusion — if I came across it, I would suspect that the different names convey a difference in meaning.

Unicode only has unambiguous definitions for “grapheme cluster” and “extended grapheme cluster”, and iiuc Character implements the latter. So if we want to avoid confusion for people working on stdlib code who aren’t familiar with Swift’s Character type (who are these people?), then we should probably go with _isExtendedGraphemeClusterAligned, or _isEGCAligned for short.

I think you're missing the point. Stdlib API is formally layered on top of an understanding of Unicode semantics, and "character" in Unicode has many definitions. If you want to cite Union's Glossary, then read character. "Extended" is meaningless except to contrast with legacy clusters. "Grapheme cluster" is the more precise name, I only mention "grapheme" because it clearly contrasts with scalars and code units.

Character is precisely defined in the Standard Library. If we must internally pretend this isn't so (why?), then I think we should stick to the corresponding technical term in Unicode, which is "extended grapheme cluster".

Unless I'm completely mistaken (please correct me if so!) the phrase "grapheme cluster" in Unicode refers to something other than what Character implements. Therefore, in my view using that term would increase confusion, not decrease it.

C.f.

/// A single extended grapheme cluster that approximates a user-perceived /// character. [...] @frozen public struct Character: Sendable {

stdlib/public/core/StringCharacterView.swift

`String.UnicodeScalarView` currently lacks proper index validation in its `index(after:)` and `index(before:)` methods, leading to out-of-bounds memory accesses when index navigation methods in this view are given invalid indices. (Also see swiftlang#41598 and swiftlang#41417) rdar://89498541

lorentey · 2022-03-02T07:37:44Z

Hm, Substring.replaceSubrange currently delegates to Slice, which can have very undesirable effects if the replacement changes EGC boundaries in the surrounding string, or if the substring boundaries aren't themselves character-aligned. Slice isn't prepared to correctly adjust startIndex/endIndex in these edge cases, and it also does some extra work in the common case that we should be able to avoid.

(I'm pretty sure the undesirable effects include moving the substring's bounds in its base string by side effect, which should never happen.)

lorentey · 2022-03-17T03:25:52Z

+1,875 −259

🤨 This will not improve my carefully curated net negative code contribution record.

lorentey · 2022-03-24T23:53:30Z

@swift-ci test

lorentey · 2022-03-24T23:53:41Z

Strictly for fun:

@swift-ci benchmark

lorentey · 2022-03-25T02:23:01Z

The last commit made String.subscript non-inlinable.

@swift-ci benchmark

lorentey · 2022-03-25T03:56:17Z

(Significant code size improvements, as expected. -O performance seemed to somewhat improve(?!), -Osize somewhat regressed, -Onone stayed as is.)

…sequences of SE-0180

… indices Assign some previously reserved bits in String.Index and _StringObject to keep track of their associated storage encoding (either UTF-8 or UTF-16). None of these bits will be reliably set in processes that load binaries compiled with older stdlib releases, but when they do end up getting set, we can use them opportunistically to more reliably detect cases where an index is applied on a string with a mismatching encoding. As more and more code gets recompiled with 5.7+, the stdlib will gradually become able to detect such issues with complete accuracy. Code that misuses indices this way was always considered broken; however, String wasn’t able to reliably detect these runtime errors before. Therefore, I expect there is a large amount of broken code out there that keeps using bridged Cocoa String indices (UTF-16) after a mutation turns them into native UTF-8 strings. Therefore, instead of trapping, this commit silently corrects the issue, transcoding the offsets into the correct encoding. It would probably be a good idea to also emit a runtime warning in addition to recovering from the error. This would generate some noise that would gently nudge folks to fix their code. rdar://89369680

lorentey · 2022-04-10T07:19:46Z

@swift-ci benchmark

lorentey · 2022-04-14T00:23:31Z

@swift-ci test

lorentey · 2022-04-14T02:11:42Z

+Var Substring._bounds is a new API without @available attribute

Oops, this is real.

+Accessor String.unicodeScalars.Modify() has been removed

So is this -- I did not realize we have implicit _modify accessors. Neat!

lorentey · 2022-04-14T02:13:56Z

Gah, are we really not able to run grapheme break tests on Linux?

/home/build-user/swift/test/stdlib/StringIndex.swift:608:17: error: cannot find 'graphemeBreakTests' in scope
  for string in graphemeBreakTests.map { $0.0 } {
                ^~~~~~~~~~~~~~~~~~

lorentey · 2022-04-14T02:15:53Z

@swift-ci test

Azoy · 2022-04-14T04:56:41Z

Gah, are we really not able to run grapheme break tests on Linux?

/home/build-user/swift/test/stdlib/StringIndex.swift:608:17: error: cannot find 'graphemeBreakTests' in scope
  for string in graphemeBreakTests.map { $0.0 } {
                ^~~~~~~~~~~~~~~~~~

Yeah, the StdlibUnicodeUnittest uses Foundation to read the test files and for platforms besides Darwin ones we haven't built corelibs-foundation yet when we run Swift tests.

lorentey · 2022-04-14T06:48:15Z

@swift-ci test linux platform

This fixes a Windows regression triggered by swiftlang#41417.

Cherry-picked from swiftlang#41417

This fixes a Windows regression triggered by swiftlang#41417. (cherry picked from commit 66a8ae0)

In Swift 5.6 and below, (broken) code that acquired indices from a UTF-16-encoded string bridged from Cocoa and kept using them after a `makeContiguousUTF8` call (or other mutation) may have appeared to be working correctly as long as the string was ASCII. Since swiftlang#41417, the `String(_:within:)` initializers recognize miscoded indices and reject them by returning nil. This is technically correct, but it unfortunately may be a binary compatibility issue, as these used to return non-nil in previous versions. Mitigate this issue by accepting UTF-16 indices on a UTF-8 string, transcoding their offset as needed. (Attempting to use an UTF-8 index on a UTF-16 string is still rejected — we do not implicitly convert strings in that direction.) rdar://89369680

In Swift 5.6 and below, (broken) code that acquired indices from a UTF-16-encoded string bridged from Cocoa and kept using them after a `makeContiguousUTF8` call (or other mutation) may have appeared to be working correctly as long as the string was ASCII. Since swiftlang#41417, the `String(_:within:)` initializers recognize miscoded indices and reject them by returning nil. This is technically correct, but it unfortunately may be a binary compatibility issue, as these used to return non-nil in previous versions. Mitigate this issue by accepting UTF-16 indices on a UTF-8 string, transcoding their offset as needed. (Attempting to use an UTF-8 index on a UTF-16 string is still rejected — we do not implicitly convert strings in that direction.) rdar://89369680 (cherry picked from commit 4d557b0)

lorentey force-pushed the the-horror-of-se-0180 branch 2 times, most recently from 75e4dfe to cf6ee03 Compare February 25, 2022 05:29

milseman reviewed Feb 25, 2022

View reviewed changes

lorentey mentioned this pull request Mar 1, 2022

[stdlib] String.UnicodeScalarView: Fix some index validation issues #41599

Merged

lorentey force-pushed the the-horror-of-se-0180 branch from 6d856ea to 859c1aa Compare March 2, 2022 03:47

lorentey force-pushed the the-horror-of-se-0180 branch 2 times, most recently from 499ec6b to e4aad9e Compare March 3, 2022 05:15

lorentey mentioned this pull request Mar 12, 2022

Vectorize UTF16 offset calculations #41684

Closed

lorentey force-pushed the the-horror-of-se-0180 branch from e4aad9e to 66137c8 Compare March 17, 2022 03:21

lorentey force-pushed the the-horror-of-se-0180 branch from 66137c8 to 8b39356 Compare March 24, 2022 23:49

This comment was marked as outdated.

Sign in to view

lorentey force-pushed the the-horror-of-se-0180 branch from 8b39356 to dc15dff Compare March 25, 2022 02:22

This comment was marked as outdated.

Sign in to view

lorentey added 2 commits March 24, 2022 20:59

[stdlib] Adjust/fix String’s indexing operations to deal with the con…

683b9fa

…sequences of SE-0180

lorentey force-pushed the the-horror-of-se-0180 branch from 9be58fc to 67adcab Compare April 10, 2022 07:15

lorentey added 4 commits April 10, 2022 16:49

[stdlib] Minor enhancements

bbb0048

[stdlib] String.Index: Remove custom printing

d3df05c

[stdlib][NFC] Doc adjustments

dcfc26c

[stdlib] Remove unused fn

ed7d60c

lorentey added 2 commits April 13, 2022 18:38

[stdlib] String: be more consistent about when markEncoding is called

b33fefb

[stdlib] Revert noop change

89d69a9

lorentey added 2 commits April 13, 2022 19:15

[stdlib] Fix ABI and portability issues

cb2194c

[stdlib][NFC] Remove obsolete note

c67f08b

lorentey merged commit 57f0e67 into swiftlang:main Apr 14, 2022

lorentey mentioned this pull request Apr 15, 2022

[test] Fix Windows regression triggered by #41417 #42385

Merged

lorentey added a commit to lorentey/swift that referenced this pull request Apr 15, 2022

[test] Move string test helper methods to StdlibUnittest

66a8ae0

This fixes a Windows regression triggered by swiftlang#41417.

lorentey added a commit to lorentey/swift that referenced this pull request Apr 15, 2022

[stdlib] Fix String indexing edge cases, anomalies & validation bugs

09afcac

Cherry-picked from swiftlang#41417

lorentey added a commit to lorentey/swift that referenced this pull request Apr 15, 2022

[test] Move string test helper methods to StdlibUnittest

0d03f1f

This fixes a Windows regression triggered by swiftlang#41417. (cherry picked from commit 66a8ae0)

lorentey mentioned this pull request Apr 15, 2022

[5.7][stdlib] Fix String indexing edge cases, anomalies & validation bugs #42402

Merged

lorentey deleted the the-horror-of-se-0180 branch April 17, 2022 21:16

lorentey mentioned this pull request Apr 19, 2022

[stdlib] Make String.Index(_:within:) initializers more permissive #42442

Merged

lorentey mentioned this pull request Apr 21, 2022

[5.7][stdlib] Make String.Index(_:within:) initializers more permissive #42509

Merged

lorentey mentioned this pull request Oct 12, 2022

AttributedString with CRLF sequence crashes while accessing CharacterView #61532

Open

lorentey mentioned this pull request May 2, 2023

rdar://105027055: Improve AttributedString view slicing swiftlang/swift-foundation#92

Merged

[stdlib] Fix String indexing edge cases, anomalies & validation bugs #41417

[stdlib] Fix String indexing edge cases, anomalies & validation bugs #41417

Uh oh!

Conversation

lorentey commented Feb 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

SE-0180 String Index Overhaul

Accidental use of actually invalid indices

Small oversights

Summary of the changes applied

Performance Impact

Uh oh!

lorentey commented Feb 16, 2022

Uh oh!

lorentey commented Feb 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentey commented Feb 17, 2022

Uh oh!

lorentey commented Feb 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

karwa commented Feb 18, 2022

Uh oh!

lorentey commented Feb 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentey commented Feb 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milseman left a comment

Choose a reason for hiding this comment

Uh oh!

milseman Feb 25, 2022

Choose a reason for hiding this comment

Uh oh!

lorentey Feb 26, 2022

Choose a reason for hiding this comment

Uh oh!

milseman Feb 25, 2022

Choose a reason for hiding this comment

Uh oh!

lorentey Feb 26, 2022

Choose a reason for hiding this comment

Uh oh!

milseman Feb 26, 2022

Choose a reason for hiding this comment

Uh oh!

lorentey Feb 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

milseman Feb 27, 2022

Choose a reason for hiding this comment

Uh oh!

lorentey Mar 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

lorentey commented Mar 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentey commented Mar 17, 2022

Uh oh!

lorentey commented Mar 24, 2022

Uh oh!

lorentey commented Mar 24, 2022

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

lorentey commented Mar 25, 2022

Uh oh!

This comment was marked as outdated.

lorentey commented Mar 25, 2022

Uh oh!

lorentey commented Apr 10, 2022

Uh oh!

lorentey commented Feb 16, 2022 •

edited

Loading

lorentey commented Feb 17, 2022 •

edited

Loading

lorentey commented Feb 17, 2022 •

edited

Loading

lorentey commented Feb 18, 2022 •

edited

Loading

lorentey commented Feb 18, 2022 •

edited

Loading

lorentey Feb 26, 2022 •

edited

Loading

lorentey Mar 2, 2022 •

edited

Loading

lorentey commented Mar 2, 2022 •

edited

Loading