-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[DO NOT MERGE][stdlib] Collection: switch to using '<' instead of '==' in some loop termination conditions #41331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… termination conditions `IndexingIterator.next()`, `Collection.distance(from:to:)` and `Collection.index(_:offsetBy:limitedBy:)` algorithms currently use `i != end` as their termination condition. This works great when the indices are valid in the collection they run on, but it leads to unterminated loops in cases where a Collection type allows (or even encourages) substricting by index values that aren't members of `indices`. Unfortunately we have a small handful of types in the Standard Library that encourage such sloppy indexing -- the big one is `String`, whose various encoding views all use the same `Index` (since SE-0180), but there isn't necessarily a one-to-one mapping between indices of one view and the next. ``` let s = "💩Cafe\u{301} Banana" // "💩" is encoded in four bytes, "\u{301}" is encoded in two let start = s.utf8.index(s.startIndex, offsetBy: 2) let end = s.utf8.index(s.startIndex, offsetBy: 8) ``` In this case, `start` and `end` aren't strictly valid in the regular `String` value (as they point to the middle of the "💩" emoji and to the second Unicode scalar of the decomposed "é" grapheme cluster, respectively), but `String` still sort-of-kind-of accepts them: ``` print(s[start]) // "💩" print(s[end]) // "\u{301}" a.k.a. "́" print(s[start ..< end]) // "💩La Cafe", note no acute accent ``` The behavior seems to be somewhat sensible -- the indices are rounded to down to the nearest Unicode Scalar boundary, then a substring is constructed out of that. However, this illusion breaks down when we iterate through the substring: ``` for c in s[start ..< end] { print(c) } // 💩 // C // a // f // é // Swift/Collection.swift:711: Fatal error: Out of bounds: index >= endIndex ``` ``` for i in s[start ..< end].indices { print("Character: \(s[i]) Index: \(i)") } // Character: 💩 Index: Index(_rawBits: 1) // Character: C Index: Index(_rawBits: 262401) // Character: a Index: Index(_rawBits: 327937) // Character: f Index: Index(_rawBits: 393473) // Character: é Index: Index(_rawBits: 459521) // Swift/Substring.swift:165: Fatal error: Cannot increment beyond endIndex ``` This is rather surprising. The issue is that `Substring` and `DefaultIndices` both use `IndexingIterator` as their iterator type, and `IndexingIterator.next()` is defined to stop when `_position == _elements.endIndex`. Because `_elements.endIndex` falls in between two reachable indices, the condition never triggers and we run off the end of the substring, skipping over our endIndex. Worse, the entire final grapheme cluster "é" is returned as if it was part of the substring, which it isn't, accoring to how the substring is printed. Similar issues can be found when calling `distance(from:to:)` or `index(_:offsetBy:limitedBy:)`. Besides `String`, `LazyFilterCollection` also suffers from similar index sharing issues with the base collection, although the expectations aren't that well documented there. Now in String's case, we could choose to catch this and trap when the substring is created with these iffy indices; however, SE-0180 rejected this approach in favor of [this passage](https://github.com/apple/swift-evolution/blob/main/proposals/0180-string-index-overhaul.md#comparison-and-subscript-semantics): > With respect to subscripts, an index that does not fall on an exact boundary in a given String or Substring view will be treated as falling at its encodedOffset in the underlying code units, with the actual contents of the result being an emergent property of applying the usual Unicode rules for decoding those code units. For example, when slicing a string with an index i that falls between two Character boundaries, i.encodedOffset is treated as a position in the string's underlying code units, and the Characters of the result are determined by performing standard Unicode grapheme breaking on the resulting sequence of code units. However, iterating through a substring with such irregular bounds therefore ought to behave as if we were iterating over a string composed of the same code units -- the current behavior is clearly violating this by overrunning the end index. (A strict reading of this passage implies that `s[start]` should return the replacement character `"\u{FFFD}"`, as it attempts to read an invalid (partial) sequence of UTF-8 code units. This does not seem desirable to me; the current rounding-to-nearest-scalar behavior is very likely to be more useful in practice. We can't really split scalars without violating `String`'s fundamental invariant that its underlying encoding must always be valid, so rounding to the nearest scalar boundary is the best we can do.) This PR updates `IndexingIterator.next()` and the default implementations of `Collection.distance(from:to:)` and `Collection.index(_:offsetBy:limitedBy:)` to compare indices using `<` instead of `==` in their termination conditions, thereby resolving this subtle issue. (Another potential benefit is an (anecdotal) code gen improvement in some integer-indexed collection types -- `<` seems to enable better code analysis in some cases.) One huge potential drawback is that 'Index.<` might be much slower than `Index.==` -- for example, a hypothetical collection type that implements a classic linked list (like `std::list` in C++) would likely have an O(n) `Index.<` implementation while `==` would take constant time. This does not seem like a strong enough objection to me -- linked lists aren't that practical (or popular) these days, and it doesn't seem too much of a burden for such collection types to provide custom iterator / index navigation implementations. (After all, they will need to deal with far trickier issues like ARC stack exhaustion while destroying long lists.) (There are no such performance issues with more practical Collection implementations, such as tree-based containers.) Tweaking the implementation of such basic functions can also turn out to be source-breaking (or even ABI-breaking) enough that we won't be able to ship this change. In that case, it might be worth special casing `String.Index` in the implementation of these methods. rdar://88705758
@swift-ci test |
@swift-ci benchmark |
(Sadly I don't expect this would uncover anything that this might break) @swift-ci test source compatibility |
There is a very unfortunate thing about let p = s[start ..< end]
print(p.startIndex == start) // false -- there is rounding!
print(p.endIndex == end) // true, unfortunately This means that such substrings aren't valid collections in the current stdlib. One way to fix this would be to add a special case to |
Build failed |
Performance (x86_64): -O
Code size: -O
Performance (x86_64): -Osize
Code size: -Osize
Performance (x86_64): -Onone
Code size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview
|
Build failed |
It is relatively simple to demonstrate. In this example, compare the code generated for I'd also recommend building a toolchain and benchmarking this with generic collections that don't use Array as their storage. If this helps solve issues with String, that's great, but I think there is value to be unlocked via this change even if it turns out that String needs some other delicate workarounds. |
The benchmark results are a wash, which is expected -- except for These are testing a Sequence algorithm over The correctness issue uncovered by the unit tests is more pressing, though. |
(Note: Fixing |
As far as integer-indexed collections go, I think the indices of most correct Collections ultimately boil down to some integer or another. Even if they're opaque indices, since they represent a position of some sort, they usually wrap one or more integers. Yes, there are linked-lists, but Collection already requires that its indices be I've been mentioning this particular thing for some time, as I've seen really great improvements in my libraries by using it (especially together with other tweaks, such as unsigned integer indexes). But I also appreciate that the standard library maintainers have limited time and must prioritise correctness over performance. I just hope it stays on the radar for whenever there's time to work on it :) |
This is definitely on my radar -- I'll keep looking for an excuse to land it. It's very difficult to ship potentially ABI breaking behavioral changes like this purely for performance reasons, especially if the performance benefits do not show up as a clear win in benchmarks. (The risk can be quite large -- debugging / dealing with binary compatibility issues tends to take a long time. Every minute spent on it takes resources away from more productive work, so we need to always weigh such risk against expected benefits.) |
Note: This is just an experiment; we may well end up solving this problem in another way, or leaving it as is for now.
IndexingIterator.next()
,Collection.distance(from:to:)
andCollection.index(_:offsetBy:limitedBy:)
algorithms currently usei != end
as their termination condition. This works great when the indices are valid in the collection they run on, but it leads to unterminated loops in cases where a Collection type allows (or even encourages) substricting by index values that aren't members ofindices
.Unfortunately we have a small handful of types in the Standard Library that encourage such sloppy indexing -- the big one is
String
, whose various encoding views all use the sameIndex
(since SE-0180), but there isn't necessarily a one-to-one mapping between indices of one view and the next.In this case,
start
andend
aren't strictly valid in the regularString
value (as they point to the middle of the "💩" emoji and to the second Unicode scalar of the decomposed "é" grapheme cluster, respectively), butString
still sort-of-kind-of accepts them:The behavior seems to be somewhat sensible -- the indices are rounded to down to the nearest Unicode Scalar boundary, then a substring is constructed out of that.
However, this illusion breaks down when we iterate through the substring:
This is rather surprising. The issue is that
Substring
andDefaultIndices
both useIndexingIterator
as their iterator type, andIndexingIterator.next()
is defined to stop when_position == _elements.endIndex
. Because_elements.endIndex
falls in between two reachable indices, the condition never triggers and we run off the end of the substring, skipping over our endIndex. Worse, the entire final grapheme cluster "é" is returned as if it was part of the substring, which it isn't, accoring to how the substring is printed.Similar issues can be found when calling
distance(from:to:)
orindex(_:offsetBy:limitedBy:)
. BesidesString
,LazyFilterCollection
also suffers from similar index sharing issues with the base collection, although the expectations aren't that well documented there.Now in String's case, we could choose to catch this and trap when the substring is created with these iffy indices; however, SE-0180 rejected this approach in favor of this passage:
However, iterating through a substring with such irregular bounds therefore ought to behave as if we were iterating over a string composed of the same code units -- the current behavior is clearly violating this by overrunning the end index.
(A strict reading of this passage implies that
s[start]
should return the replacement character"\u{FFFD}"
, as it attempts to read an invalid (partial) sequence of UTF-8 code units. This does not seem desirable to me; the current rounding-to-nearest-scalar behavior is very likely to be more useful in practice. We can't really split scalars without violatingString
's fundamental invariant that its underlying encoding must always be valid, so rounding to the nearest scalar boundary is the best we can do.)This PR updates
IndexingIterator.next()
and the default implementations ofCollection.distance(from:to:)
andCollection.index(_:offsetBy:limitedBy:)
to compare indices using<
instead of==
in their termination conditions, thereby resolving this subtle issue. (Another potential benefit is an (anecdotal) code gen improvement in some integer-indexed collection types --<
seems to enable better code analysis in some cases.)One huge potential drawback is that
Index.<
might be much slower thanIndex.==
-- for example, a hypothetical collection type that implements a classic linked list (likestd::list
in C++) would likely have an O(n)Index.<
implementation while==
would take constant time. This does not seem like a strong enough objection to me -- linked lists aren't that practical (or popular) these days, and it doesn't seem too much of a burden for such collection types to provide custom iterator / index navigation implementations. (After all, they will need to deal with far trickier issues like ARC stack exhaustion while destroying long lists.)(There are no such performance issues with more practical Collection implementations, such as tree-based containers.)
Tweaking the implementation of such basic functions can also turn out to be source-breaking (or even ABI-breaking) enough that we won't be able to ship this change. In that case, it might be worth special casing
String.Index
in the implementation of these methods.rdar://88705758