Skip to content

Commit 86bac55

Browse files
committed
[stdlib] Collection: switch to using '<' instead of '==' in some loop termination conditions
`IndexingIterator.next()`, `Collection.distance(from:to:)` and `Collection.index(_:offsetBy:limitedBy:)` algorithms currently use `i != end` as their termination condition. This works great when the indices are valid in the collection they run on, but it leads to unterminated loops in cases where a Collection type allows (or even encourages) substricting by index values that aren't members of `indices`. Unfortunately we have a small handful of types in the Standard Library that encourage such sloppy indexing -- the big one is `String`, whose various encoding views all use the same `Index` (since SE-0180), but there isn't necessarily a one-to-one mapping between indices of one view and the next. ``` let s = "💩Cafe\u{301} Banana" // "💩" is encoded in four bytes, "\u{301}" is encoded in two let start = s.utf8.index(s.startIndex, offsetBy: 2) let end = s.utf8.index(s.startIndex, offsetBy: 8) ``` In this case, `start` and `end` aren't strictly valid in the regular `String` value (as they point to the middle of the "💩" emoji and to the second Unicode scalar of the decomposed "é" grapheme cluster, respectively), but `String` still sort-of-kind-of accepts them: ``` print(s[start]) // "💩" print(s[end]) // "\u{301}" a.k.a. "́" print(s[start ..< end]) // "💩La Cafe", note no acute accent ``` The behavior seems to be somewhat sensible -- the indices are rounded to down to the nearest Unicode Scalar boundary, then a substring is constructed out of that. However, this illusion breaks down when we iterate through the substring: ``` for c in s[start ..< end] { print(c) } // 💩 // C // a // f // é // Swift/Collection.swift:711: Fatal error: Out of bounds: index >= endIndex ``` ``` for i in s[start ..< end].indices { print("Character: \(s[i]) Index: \(i)") } // Character: 💩 Index: Index(_rawBits: 1) // Character: C Index: Index(_rawBits: 262401) // Character: a Index: Index(_rawBits: 327937) // Character: f Index: Index(_rawBits: 393473) // Character: é Index: Index(_rawBits: 459521) // Swift/Substring.swift:165: Fatal error: Cannot increment beyond endIndex ``` This is rather surprising. The issue is that `Substring` and `DefaultIndices` both use `IndexingIterator` as their iterator type, and `IndexingIterator.next()` is defined to stop when `_position == _elements.endIndex`. Because `_elements.endIndex` falls in between two reachable indices, the condition never triggers and we run off the end of the substring, skipping over our endIndex. Worse, the entire final grapheme cluster "é" is returned as if it was part of the substring, which it isn't, accoring to how the substring is printed. Similar issues can be found when calling `distance(from:to:)` or `index(_:offsetBy:limitedBy:)`. Besides `String`, `LazyFilterCollection` also suffers from similar index sharing issues with the base collection, although the expectations aren't that well documented there. Now in String's case, we could choose to catch this and trap when the substring is created with these iffy indices; however, SE-0180 rejected this approach in favor of [this passage](https://github.com/apple/swift-evolution/blob/main/proposals/0180-string-index-overhaul.md#comparison-and-subscript-semantics): > With respect to subscripts, an index that does not fall on an exact boundary in a given String or Substring view will be treated as falling at its encodedOffset in the underlying code units, with the actual contents of the result being an emergent property of applying the usual Unicode rules for decoding those code units. For example, when slicing a string with an index i that falls between two Character boundaries, i.encodedOffset is treated as a position in the string's underlying code units, and the Characters of the result are determined by performing standard Unicode grapheme breaking on the resulting sequence of code units. However, iterating through a substring with such irregular bounds therefore ought to behave as if we were iterating over a string composed of the same code units -- the current behavior is clearly violating this by overrunning the end index. (A strict reading of this passage implies that `s[start]` should return the replacement character `"\u{FFFD}"`, as it attempts to read an invalid (partial) sequence of UTF-8 code units. This does not seem desirable to me; the current rounding-to-nearest-scalar behavior is very likely to be more useful in practice. We can't really split scalars without violating `String`'s fundamental invariant that its underlying encoding must always be valid, so rounding to the nearest scalar boundary is the best we can do.) This PR updates `IndexingIterator.next()` and the default implementations of `Collection.distance(from:to:)` and `Collection.index(_:offsetBy:limitedBy:)` to compare indices using `<` instead of `==` in their termination conditions, thereby resolving this subtle issue. (Another potential benefit is an (anecdotal) code gen improvement in some integer-indexed collection types -- `<` seems to enable better code analysis in some cases.) One huge potential drawback is that 'Index.<` might be much slower than `Index.==` -- for example, a hypothetical collection type that implements a classic linked list (like `std::list` in C++) would likely have an O(n) `Index.<` implementation while `==` would take constant time. This does not seem like a strong enough objection to me -- linked lists aren't that practical (or popular) these days, and it doesn't seem too much of a burden for such collection types to provide custom iterator / index navigation implementations. (After all, they will need to deal with far trickier issues like ARC stack exhaustion while destroying long lists.) (There are no such performance issues with more practical Collection implementations, such as tree-based containers.) Tweaking the implementation of such basic functions can also turn out to be source-breaking (or even ABI-breaking) enough that we won't be able to ship this change. In that case, it might be worth special casing `String.Index` in the implementation of these methods. rdar://88705758
1 parent 5e1e0c1 commit 86bac55

File tree

2 files changed

+6
-10
lines changed

2 files changed

+6
-10
lines changed

stdlib/public/core/BidirectionalCollection.swift

Lines changed: 3 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -259,9 +259,7 @@ extension BidirectionalCollection {
259259
}
260260
var i = i
261261
for _ in stride(from: 0, to: distance, by: -1) {
262-
if i == limit {
263-
return nil
264-
}
262+
guard i > limit else { return nil }
265263
formIndex(before: &i)
266264
}
267265
return i
@@ -278,13 +276,13 @@ extension BidirectionalCollection {
278276
var count = 0
279277

280278
if start < end {
281-
while start != end {
279+
while start < end {
282280
count += 1
283281
formIndex(after: &start)
284282
}
285283
}
286284
else if start > end {
287-
while start != end {
285+
while start > end {
288286
count -= 1
289287
formIndex(before: &start)
290288
}

stdlib/public/core/Collection.swift

Lines changed: 3 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -123,7 +123,7 @@ extension IndexingIterator: IteratorProtocol, Sequence {
123123
@inlinable
124124
@inline(__always)
125125
public mutating func next() -> Elements.Element? {
126-
if _position == _elements.endIndex { return nil }
126+
guard _position < _elements.endIndex else { return nil }
127127
let element = _elements[_position]
128128
_elements.formIndex(after: &_position)
129129
return element
@@ -899,7 +899,7 @@ extension Collection {
899899

900900
var start = start
901901
var count = 0
902-
while start != end {
902+
while start < end {
903903
count = count + 1
904904
formIndex(after: &start)
905905
}
@@ -989,9 +989,7 @@ extension Collection {
989989

990990
var i = i
991991
for _ in stride(from: 0, to: n, by: 1) {
992-
if i == limit {
993-
return nil
994-
}
992+
guard i < limit else { return nil }
995993
formIndex(after: &i)
996994
}
997995
return i

0 commit comments

Comments
 (0)