|
| 1 | +# Feature name |
| 2 | + |
| 3 | +* Proposal: [SE-NNNN](NNNN-string-index-printing.md) |
| 4 | +* Authors: [Karoy Lorentey](https://github.com/lorentey) |
| 5 | +* Review Manager: TBD |
| 6 | +* Status: **Awaiting review** |
| 7 | +* Implementation: [swiftlang/swift#75433](https://github.com/swiftlang/swift/pull/75433) |
| 8 | +* Review: ([pitch](https://forums.swift.org/t/improving-string-index-s-printed-descriptions/57027)) |
| 9 | + |
| 10 | +## Introduction |
| 11 | + |
| 12 | +This proposal conforms `String.Index` to `CustomStringConvertible`. |
| 13 | + |
| 14 | +## Motivation |
| 15 | + |
| 16 | +String indices represent offsets from the start of the string's underlying storage representation, referencing a particular UTF-8 or UTF-16 code unit, depending on the string's encoding. (Most Swift strings are UTF-8 encoded, but strings bridged over from Objective-C may remain in their original UTF-16 encoded form.) |
| 17 | + |
| 18 | +If you have ever tried printing a string index, you have probably noticed that the output is gobbledygook: |
| 19 | + |
| 20 | +```swift |
| 21 | +let string = "👋🏼 Helló" |
| 22 | + |
| 23 | +print(string.startIndex) // ⟹ Index(_rawBits: 15) |
| 24 | +print(string.endIndex) // ⟹ Index(_rawBits: 983047) |
| 25 | +print(string.utf16.index(after: string.startIndex)) // ⟹ Index(_rawBits: 16388) |
| 26 | +print(string.firstRange(of: "ell")!) // ⟹ Index(_rawBits: 655623)..<Index(_rawBits: 852487) |
| 27 | +``` |
| 28 | + |
| 29 | +These displays are generated via the default reflection-based string conversion code path, which fails to produce a comprehensible result. Not being able to print string indices in a sensible way needlessly complicates their use: it obscures what these things are, and it is an endless source of frustration while working with strings in Swift. |
| 30 | + |
| 31 | +## Proposed solution |
| 32 | + |
| 33 | +This proposal supplies the missing `CustomStringConvertible` conformance on `String.Index`, resolving this long-standing issue. |
| 34 | + |
| 35 | +```swift |
| 36 | +let string = "👋🏼 Helló" |
| 37 | + |
| 38 | +print(string.startIndex) // ⟹ 0[any] |
| 39 | +print(string.endIndex) // ⟹ 15[utf8] |
| 40 | +print(string.utf16.index(after: string.startIndex)) // ⟹ 0[utf8]+1 |
| 41 | +print(string.firstRange(of: "ell")!) // ⟹ 10[utf8]..<13[utf8] |
| 42 | +``` |
| 43 | + |
| 44 | +The sample description strings shown above are illustrative, not normative. This proposal does not specify the exact format and information content of the string returned by the `description` implementation on `String.Index`. As is the case with most conformances to `CustomStringConvertible`, the purpose of these descriptions is to expose internal implementation details for debugging purposes. As those implementation details evolve, the descriptions may need to be changed to match them. Such changes are not generally expected to be part of the Swift Evolution process; so we need to keep the content of these descriptions unspecified. |
| 45 | + |
| 46 | +(With that said, the example displays shown above are not newly invented -- they have already proved their usefulness in actual use. They were developed while working on subtle string processing problems in Swift 5.7, and [LLDB has been shipping them as built-in data formatters][lldb] since the Swift 5.8 release. |
| 47 | + |
| 48 | +In the displays shown, string indices succinctly display their storage offset, their expected encoding, and an (optional) transcoded offset value. For example, the output `15[utf8]` indicates that the index is addressing the code unit at offset 15 in a UTF-8 encoded `String` value. The `startIndex` is at offset zero, which works the same with _any_ encoding, so it is displayed as `0[any]`. As of Swift 6.0, on some platforms string instances may store their text in UTF-16, and so indices within such strings use `[utf16]` to specify that their offsets are measured in UTF-16 code units. |
| 49 | + |
| 50 | +The `+1` in `0[utf8]+1` is an offset into a _transcoded_ Unicode scalar; this index addresses the trailing surrogate in the UTF-16 transcoding of the first scalar within the string, which has to be outside the Basic Multilingual Plane (or it wouldn't require surrogates). In our particular case, the code point is U+1F44B WAVING HAND SIGN, encoded in UTF-8 as `F0 9F 91 8B`, and in UTF-16 as `D83D DC4B`. The index is addressing the UTF-16 code unit `DC4B`, which does not actually exist anywhere in the string's storage -- it needs to be computed on every access, by transcoding the UTF-8 data for this scalar, and offsetting into the result. |
| 51 | + |
| 52 | +[lldb]: https://github.com/swiftlang/llvm-project/pull/5515 |
| 53 | + |
| 54 | +All of this is really useful information to see while developing or debugging string algorithms, but it is also deeply specific to the particular implementation of `String` that ships in Swift 6.0; therefore it is inherently unstable, and it may change in any Swift release.) |
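
The transcoding described above can be observed with existing public API. The following is a minimal sketch rather than part of the proposal; the local names (`scalar`, `utf8Units`, `utf16Units`, `trailing`) are chosen purely for illustration.

```swift
let string = "👋🏼 Helló"

// The first scalar, U+1F44B, occupies four code units in UTF-8...
let scalar = string.unicodeScalars.first!
let utf8Units = Array(String(scalar).utf8)    // F0 9F 91 8B
// ...and a surrogate pair in UTF-16.
let utf16Units = Array(String(scalar).utf16)  // D83D DC4B

// Advancing one position in the UTF-16 view lands on the trailing
// surrogate, even if the string's storage only holds UTF-8 data.
let trailing = string.utf16[string.utf16.index(after: string.startIndex)]
print(String(trailing, radix: 16, uppercase: true)) // DC4B
```

The subscript returns `DC4B` regardless of the storage encoding; for UTF-8 storage, this is the value that has to be computed on access by transcoding the scalar.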
| 55 | + |
| 56 | +<!-- |
| 57 | +``` |
| 58 | +Characters: | 👋🏼 | " " | H | e | l | l | ó | |
| 59 | +Scalars: | 👋 | "\u{1F3FC}" | " " | H | e | l | l | ó | |
| 60 | +UTF-8: | f0 9f 91 8b | f0 9f 8f bc | 20 | 48 | 65 | 6c | 6c | c3 b3 | |
| 61 | +UTF-16: | d83d dc4b | d83c dffc | 20 | 48 | 65 | 6c | 6c | f3 | |
| 62 | +``` |
| 63 | +--> |
| 64 | + |
| 65 | +## Detailed design |
| 66 | + |
| 67 | +```swift |
| 68 | +@available(SwiftStdlib 6.1, *) |
| 69 | +extension String.Index: CustomStringConvertible {} |
| 70 | + |
| 71 | +extension String.Index { |
| 72 | + @backDeployed(before: SwiftStdlib 6.1) |
| 73 | + public var description: String {...} |
| 74 | +} |
| 75 | +``` |
| 76 | + |
| 77 | +## Source compatibility |
| 78 | + |
| 79 | +The new conformance changes the result of converting a `String.Index` value to a string. This changes observable behavior: code that attempts to parse the result of `String(describing:)` can be misled by the change of format. |
| 80 | + |
| 81 | +However, the documentation for `String(describing:)` and `String(reflecting:)` explicitly states that when the input type conforms to none of the standard string conversion protocols, the result of these operations is unspecified. |
| 82 | + |
| 83 | +Changing the value of an unspecified result is not considered to be a source incompatible change. |
| 84 | + |
| 85 | +## ABI compatibility |
| 86 | + |
| 87 | +The proposal retroactively conforms a previously existing standard type to a previously existing standard protocol. This is technically an ABI breaking change -- on ABI stable platforms, we may have preexisting Swift binaries that assume that `String.Index is CustomStringConvertible` returns `false`, or ones that are implementing this conformance on their own. |
| 88 | + |
| 89 | +We do not expect this to be an issue in practice. |
| 90 | + |
| 91 | +## Implications on adoption |
| 92 | + |
| 93 | +The `String.Index.description` property is defined to be backdeployable, but the conformance itself is not. (It cannot be.) |
| 94 | + |
| 95 | +Code that runs on ABI stable platforms will not get the nicer displays when running on earlier versions of the Swift Standard Library. |
| 96 | + |
| 97 | +```swift |
| 98 | +let str = "🐕 Doggo" |
| 99 | +print(str.firstRange(of: "Dog")!) |
| 100 | +// older stdlib: Index(_rawBits: 327943)..<Index(_rawBits: 524551) |
| 101 | +// newer stdlib: 5[utf8]..<8[utf8] |
| 102 | +``` |
| 103 | + |
| 104 | +This can be somewhat mitigated by explicitly invoking the `description` property, but this isn't recommended as general practice. |
| 105 | + |
| 106 | +```swift |
| 107 | +print(str.endIndex.description) |
| 108 | +// always: 10[utf8] |
| 109 | +``` |
| 110 | + |
| 111 | +## Future directions |
| 112 | + |
| 113 | +### Additional `CustomStringConvertible` conformances |
| 114 | + |
| 115 | +Other preexisting types in the Standard Library may also usefully gain `CustomStringConvertible` conformances in the future: |
| 116 | + |
| 117 | +- `Set.Index`, `Dictionary.Index` |
| 118 | +- `Slice`, `DefaultIndices` |
| 119 | +- `PartialRangeFrom`, `PartialRangeUpTo`, `PartialRangeThrough` |
| 120 | +- `CollectionDifference`, `CollectionDifference.Index` |
| 121 | +- `FlattenSequence`, `FlattenSequence.Index` |
| 122 | +- `LazyPrefixWhileSequence`, `LazyPrefixWhileSequence.Index` |
| 123 | +- etc. |
| 124 | + |
| 125 | +### New String API to expose the information in these descriptions |
| 126 | + |
| 127 | +The information exposed in the index descriptions shown above is mostly retrievable through public APIs, but not entirely: perhaps most importantly, there is no way to get the expected encoding of a string index through the stdlib's public API surface. The lack of such an API may encourage interested Swift developers to try retrieving this information by parsing the unstable `description` string, or by bitcasting indices to peek at the underlying bit patterns -- neither of which would be healthy for the Swift ecosystem overall. It is therefore desirable to eventually expose this information as well, through API additions like the drafts below: |
| 128 | + |
| 129 | +```swift |
| 130 | +extension String { |
| 131 | + @frozen enum StorageEncoding { |
| 132 | + case utf8 |
| 133 | + case utf16 |
| 134 | + } |
| 135 | + |
| 136 | + /// The storage encoding of this string instance. The encoding view |
| 137 | + /// corresponding to this encoding behaves like a random-access collection. |
| 138 | + /// |
| 139 | + /// - Complexity: O(1) |
| 140 | + var encoding: StorageEncoding { get } |
| 141 | +} |
| 142 | + |
| 143 | +extension String.Index { |
| 144 | + /// The encoding of the string that produced this index, or nil if the |
| 145 | + /// encoding is not known. |
| 146 | + /// |
| 147 | + /// - Complexity: O(1) |
| 148 | + var encoding: String.StorageEncoding? { get } |
| 149 | + |
| 150 | + /// The offset of this position within the UTF-8 storage of the `String` |
| 151 | + /// instance that produced it. `nil` if the offset is not known to be valid |
| 152 | + /// in UTF-8 encoded storage. |
| 153 | + /// |
| 154 | + /// - Complexity: O(1) |
| 155 | + @available(SwiftStdlib 5.7, *) |
| 156 | + var utf8Offset: Int? { get } |
| 157 | + |
| 158 | + /// The offset of this position within the UTF-16 storage of the `String` |
| 159 | + /// instance that produced it. `nil` if the offset is not known to be valid |
| 160 | + /// in UTF-16 encoded storage. |
| 161 | + /// |
| 162 | + /// - Complexity: O(1) |
| 163 | + @available(SwiftStdlib 5.7, *) |
| 164 | + var utf16Offset: Int? { get } |
| 165 | +} |
| 166 | +``` |
| 167 | + |
| 168 | +One major limitation is that string indices don't necessarily know their expected encoding, so the `encoding` property suggested above has to return an optional. (Indices of ASCII strings and the start index of all strings are the same no matter the encoding, and Swift runtimes prior to 5.7 did not track the encoding of string indices at all.) The `utf8Offset` and `utf16Offset` properties would correct and reinstate the functionality that got removed by [SE-0241] with the deprecation of `encodingOffset`. |
| 169 | + |
| 170 | +[SE-0241]: https://github.com/swiftlang/swift-evolution/blob/main/proposals/0241-string-index-explicit-encoding-offset.md |
| 171 | + |
| 172 | +Given that these APIs are quite obscure/subtle, and they pose some interesting design challenges on their own, these additions are deferred to a future proposal. The interface suggested above does not include exposing "transcoded offsets"; I expect the eventual proposal would need to cover those, too. |
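
For comparison, here is a rough sketch of what can already be retrieved with public API today, assuming the caller still has access to the original string. The local names are illustrative only. Note that none of these calls reveal the index's expected encoding, which is the gap the draft API above is meant to fill.

```swift
let string = "👋🏼 Helló"
let i = string.firstIndex(of: "e")!

// Offsets are measured against the original string, in the units of each view.
let utf8Offset = string.utf8.distance(from: string.startIndex, to: i)
let utf16Offset = i.utf16Offset(in: string)
let scalarOffset = string.unicodeScalars.distance(from: string.startIndex, to: i)
let characterOffset = string.distance(from: string.startIndex, to: i)

print(utf8Offset, utf16Offset, scalarOffset, characterOffset) // 10 6 4 2
```

Unlike the proposed `utf8Offset` and `utf16Offset` properties, these computations need the string itself, and the distance calculations are not guaranteed to run in constant time.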
| 173 | + |
| 174 | +## Alternatives considered |
| 175 | + |
| 176 | +None. |