Commit 1c267ad

Add proposal for useful String.Index descriptions (#2529)
* Add proposal for useful String.Index descriptions
* Edits and amendments
  - Add note on LLDB already shipping these displays as data formatters.
  - Expand Future Directions section with potential API additions that expose the underlying information for programmatic use.
1 parent c225716 commit 1c267ad

1 file changed

+176
-0
lines changed

@@ -0,0 +1,176 @@
1+
# Improving `String.Index`'s printed descriptions
2+
3+
* Proposal: [SE-NNNN](NNNN-string-index-printing.md)
4+
* Authors: [Karoy Lorentey](https://github.com/lorentey)
5+
* Review Manager: TBD
6+
* Status: **Awaiting review**
7+
* Implementation: [swiftlang/swift#75433](https://github.com/swiftlang/swift/pull/75433)
8+
* Review: ([pitch](https://forums.swift.org/t/improving-string-index-s-printed-descriptions/57027))
9+
10+
## Introduction
11+
12+
This proposal conforms `String.Index` to `CustomStringConvertible`.
13+
14+
## Motivation
15+
16+
String indices represent offsets from the start of the string's underlying storage representation, referencing a particular UTF-8 or UTF-16 code unit, depending on the string's encoding. (Most Swift strings are UTF-8 encoded, but strings bridged over from Objective-C may remain in their original UTF-16 encoded form.)
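
To make the notion of a storage offset concrete, here is a minimal sketch that measures the same position in three different units, using only existing public API. The variable names are illustrative, and the UTF-8 figure matches the storage offset only when the string is natively UTF-8 encoded.

```swift
// "👋🏼 Helló", spelled with escapes so the exact scalars are unambiguous.
let string = "\u{1F44B}\u{1F3FC} Hell\u{F3}"
let i = string.firstIndex(of: "e")!

// Distance from the start, measured in UTF-8 code units.
let utf8Offset = string.utf8.distance(from: string.startIndex, to: i)
// Distance from the start, measured in UTF-16 code units.
let utf16Offset = i.utf16Offset(in: string)
// Distance from the start, measured in Characters (grapheme clusters).
let characterOffset = string.distance(from: string.startIndex, to: i)

print(utf8Offset)      // 10
print(utf16Offset)     // 6
print(characterOffset) // 3
```

The same index is 10, 6, or 3 steps from the start depending on the unit, which is why a printed index needs to say which encoding its offset refers to.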
17+
18+
If you have ever tried printing a string index, you probably noticed that the output is gobbledygook:
19+
20+
```swift
21+
let string = "👋🏼 Helló"
22+
23+
print(string.startIndex) // ⟹ Index(_rawBits: 15)
24+
print(string.endIndex) // ⟹ Index(_rawBits: 983047)
25+
print(string.utf16.index(after: string.startIndex)) // ⟹ Index(_rawBits: 16388)
26+
print(string.firstRange(of: "ell")!) // ⟹ Index(_rawBits: 655623)..<Index(_rawBits: 852487)
27+
```
28+
29+
These displays are generated via the default reflection-based string conversion code path, which fails to produce a comprehensible result. Not being able to print string indices in a sensible way needlessly complicates their use: it obscures what these things are, and it is an endless source of frustration while working with strings in Swift.
30+
31+
## Proposed solution
32+
33+
This proposal supplies the missing `CustomStringConvertible` conformance on `String.Index`, resolving this long-standing issue.
34+
35+
```swift
36+
let string = "👋🏼 Helló"
37+
38+
print(string.startIndex) // ⟹ 0[any]
39+
print(string.endIndex) // ⟹ 15[utf8]
40+
print(string.utf16.index(after: string.startIndex)) // ⟹ 0[utf8]+1
41+
print(string.firstRange(of: "ell")!) // ⟹ 10[utf8]..<13[utf8]
42+
```
43+
44+
The sample description strings shown above are illustrative, not normative. This proposal does not specify the exact format and information content of the string returned by the `description` implementation on `String.Index`. As is the case with most conformances to `CustomStringConvertible`, the purpose of these descriptions is to expose internal implementation details for debugging purposes. As those implementation details evolve, the descriptions may need to be changed to match them. Such changes are not generally expected to be part of the Swift Evolution process; so we need to keep the content of these descriptions unspecified.
45+
46+
(With that said, the example displays shown above are not newly invented -- they have already proved their usefulness in actual use. They were developed while working on subtle string processing problems in Swift 5.7, and [LLDB has been shipping them as built-in data formatters][lldb] since the Swift 5.8 release.)
47+
48+
In the displays shown, string indices succinctly display their storage offset, their expected encoding, and an (optional) transcoded offset value. For example, the output `15[utf8]` indicates that the index is addressing the code unit at offset 15 in a UTF-8 encoded `String` value. The `startIndex` is at offset zero, which works the same with _any_ encoding, so it is displayed as `0[any]`. As of Swift 6.0, on some platforms string instances may store their text in UTF-16, and so indices within such strings use `[utf16]` to specify that their offsets are measured in UTF-16 code units.
49+
50+
The `+1` in `0[utf8]+1` is an offset into a _transcoded_ Unicode scalar; this index addresses the trailing surrogate in the UTF-16 transcoding of the first scalar within the string, which has to be outside the Basic Multilingual Plane (or it wouldn't require surrogates). In our particular case, the code point is U+1F44B WAVING HAND SIGN, encoded in UTF-8 as `F0 9F 91 8B`, and in UTF-16 as `D83D DC4B`. The index is addressing the UTF-16 code unit `DC4B`, which does not actually exist anywhere in the string's storage -- it needs to be computed on every access, by transcoding the UTF-8 data for this scalar, and offsetting into the result.
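
The transcoding described above can be observed through the existing `utf8` and `utf16` views. This is a small sketch using only public API; the variable names are illustrative.

```swift
// "👋🏼 Helló", spelled with escapes so the exact scalars are unambiguous.
let string = "\u{1F44B}\u{1F3FC} Hell\u{F3}"

// The first Unicode scalar, U+1F44B, in both encodings.
let firstScalar = string.unicodeScalars.first!
let utf8Units = Array(String(firstScalar).utf8)
let utf16Units = Array(String(firstScalar).utf16)
print(utf8Units.map { String($0, radix: 16) })  // ["f0", "9f", "91", "8b"]
print(utf16Units.map { String($0, radix: 16) }) // ["d83d", "dc4b"]

// Stepping one position forward in the UTF-16 view lands on the trailing
// surrogate, a code unit that is not physically present in UTF-8 storage.
let trailing = string.utf16.index(after: string.startIndex)
print(String(string.utf16[trailing], radix: 16)) // dc4b
```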
51+
52+
[lldb]: https://github.com/swiftlang/llvm-project/pull/5515
53+
54+
All of this is really useful information to see while developing or debugging string algorithms, but it is also deeply specific to the particular implementation of `String` that ships in Swift 6.0; therefore it is inherently unstable, and it may change in any Swift release.
55+
56+
<!--
57+
```
58+
Characters: | 👋🏼 | " " | H | e | l | l | ó |
59+
Scalars: | 👋 | "\u{1F3FC}" | " " | H | e | l | l | ó |
60+
UTF-8: | f0 9f 91 8b | f0 9f 8f bc | 20 | 48 | 65 | 6c | 6c | c3 b3 |
61+
UTF-16: | d83d dc4b | d83c dffc | 20 | 48 | 65 | 6c | 6c | f3 |
62+
```
63+
-->
64+
65+
## Detailed design
66+
67+
```swift
68+
@available(SwiftStdlib 6.1, *)
69+
extension String.Index: CustomStringConvertible {}
70+
71+
extension String.Index {
72+
@backDeployed(before: SwiftStdlib 6.1)
73+
public var description: String {...}
74+
}
75+
```
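
To illustrate the shape of such a conformance, the sketch below applies the same pattern to a hypothetical stand-in type. `IndexSketch`, its fields, and its formatting logic are assumptions made for illustration only; they do not reflect how `String.Index` stores its data or how the actual `description` is implemented.

```swift
// Hypothetical model type; NOT the standard library's String.Index.
struct IndexSketch: CustomStringConvertible {
  enum Encoding: String {
    case unknown = "any"
    case utf8
    case utf16
  }

  var offset: Int                // storage offset in code units
  var encoding: Encoding         // expected encoding of that offset
  var transcodedOffset: Int = 0  // offset into a transcoded scalar, if any

  // Mimics the illustrative display format shown earlier in this proposal.
  var description: String {
    var result = "\(offset)[\(encoding.rawValue)]"
    if transcodedOffset != 0 {
      result += "+\(transcodedOffset)"
    }
    return result
  }
}

print(IndexSketch(offset: 0, encoding: .unknown))                  // 0[any]
print(IndexSketch(offset: 15, encoding: .utf8))                    // 15[utf8]
print(IndexSketch(offset: 0, encoding: .utf8, transcodedOffset: 1)) // 0[utf8]+1
```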
76+
77+
## Source compatibility
78+
79+
The new conformance changes the result of converting a `String.Index` value to a string. This changes observable behavior: code that attempts to parse the result of `String(describing:)` can be misled by the change of format.
80+
81+
However, the documentation for `String(describing:)` and `String(reflecting:)` explicitly states that when the input type conforms to none of the standard string conversion protocols, the result of these operations is unspecified.
82+
83+
Changing the value of an unspecified result is not considered to be a source incompatible change.
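
The difference between the two code paths can be seen with a pair of hypothetical types (`Plain` and `Described` are invented for this sketch). Only the second one has output that code may reasonably rely on.

```swift
// No string conversion protocol: the output comes from the reflection-based
// fallback and is unspecified.
struct Plain {
  var raw: UInt64
}

// With CustomStringConvertible, the type itself decides what gets printed.
struct Described: CustomStringConvertible {
  var raw: UInt64
  var description: String { "raw=\(raw)" }
}

// Exact text of the first line is unspecified, so none is shown here.
print(String(describing: Plain(raw: 15)))
print(String(describing: Described(raw: 15))) // raw=15
```

Code that parsed the first form was already depending on unspecified behavior, which is the situation `String.Index` is in today.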
84+
85+
## ABI compatibility
86+
87+
The proposal retroactively conforms a previously existing standard type to a previously existing standard protocol. This is technically an ABI breaking change -- on ABI stable platforms, we may have preexisting Swift binaries that assume that `String.Index is CustomStringConvertible` returns `false`, or ones that are implementing this conformance on their own.
88+
89+
We do not expect this to be an issue in practice.
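
For reference, the kind of dynamic check that could observe the new conformance looks like the sketch below. The helper `hasCustomDescription` and the type `Opaque` are hypothetical names introduced for illustration.

```swift
// A type with no string conversion conformances.
struct Opaque {
  var bits: UInt64
}

// Reports whether a value's dynamic type conforms to CustomStringConvertible.
func hasCustomDescription<T>(_ value: T) -> Bool {
  value is CustomStringConvertible
}

print(hasCustomDescription(42))               // true
print(hasCustomDescription(Opaque(bits: 15))) // false

// For String.Index, the answer depends on which version of the standard
// library the program runs against, so no expected output is given.
print(hasCustomDescription("abc".startIndex))
```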
90+
91+
## Implications on adoption
92+
93+
The `String.Index.description` property is defined to be backdeployable, but the conformance itself is not. (It cannot be.)
94+
95+
Code that runs on ABI stable platforms will not get the nicer displays when running on earlier versions of the Swift Standard Library.
96+
97+
```swift
98+
let str = "🐕 Doggo"
99+
print(str.firstRange(of: "Dog")!)
100+
// older stdlib: Index(_rawBits: 327943)..<Index(_rawBits: 524551)
101+
// newer stdlib: 5[utf8]..<8[utf8]
102+
```
103+
104+
This can be somewhat mitigated by explicitly invoking the `description` property, but this isn't recommended as general practice.
105+
106+
```swift
107+
print(str.endIndex.description)
108+
// always: 10[utf8]
109+
```
110+
111+
## Future directions
112+
113+
### Additional `CustomStringConvertible` conformances
114+
115+
Other preexisting types in the Standard Library may also usefully gain `CustomStringConvertible` conformances in the future:
116+
117+
- `Set.Index`, `Dictionary.Index`
118+
- `Slice`, `DefaultIndices`
119+
- `PartialRangeFrom`, `PartialRangeUpTo`, `PartialRangeThrough`
120+
- `CollectionDifference`, `CollectionDifference.Index`
121+
- `FlattenSequence`, `FlattenSequence.Index`
122+
- `LazyPrefixWhileSequence`, `LazyPrefixWhileSequence.Index`
123+
- etc.
124+
125+
### New String API to expose the information in these descriptions
126+
127+
The information exposed in the index descriptions shown above is mostly retrievable through public APIs, but not entirely: perhaps most importantly, there is no way to get the expected encoding of a string index through the stdlib's public API surface. The lack of such an API may encourage interested Swift developers to try retrieving this information by parsing the unstable `description` string, or by bitcasting indices to peek at the underlying bit patterns -- neither of which would be healthy for the Swift ecosystem overall. It is therefore desirable to eventually expose this information as well, through API additions like the drafts below:
128+
129+
```swift
130+
extension String {
131+
@frozen enum StorageEncoding {
132+
case utf8
133+
case utf16
134+
}
135+
136+
/// The storage encoding of this string instance. The encoding view
137+
/// corresponding to this encoding behaves like a random-access collection.
138+
///
139+
/// - Complexity: O(1)
140+
var encoding: StorageEncoding { get }
141+
}
142+
143+
extension String.Index {
144+
/// The encoding of the string that produced this index, or nil if the
145+
/// encoding is not known.
146+
///
147+
/// - Complexity: O(1)
148+
var encoding: String.StorageEncoding? { get }
149+
150+
/// The offset of this position within the UTF-8 storage of the `String`
151+
/// instance that produced it. `nil` if the offset is not known to be valid
152+
/// in UTF-8 encoded storage.
153+
///
154+
/// - Complexity: O(1)
155+
@available(SwiftStdlib 5.7, *)
156+
var utf8Offset: Int? { get }
157+
158+
/// The offset of this position within the UTF-16 storage of the `String`
159+
/// instance that produced it. `nil` if the offset is not known to be valid
160+
/// in UTF-16 encoded storage.
161+
///
162+
/// - Complexity: O(1)
163+
@available(SwiftStdlib 5.7, *)
164+
var utf16Offset: Int? { get }
165+
}
166+
```
167+
168+
One major limitation is that string indices don't necessarily know their expected encoding, so the `encoding` property suggested above has to return an optional. (Indices of ASCII strings and the start index of all strings are the same no matter the encoding, and Swift runtimes prior to 5.7 did not track the encoding of string indices at all.) The `utf8Offset` and `utf16Offset` properties would correct and reinstate the functionality that got removed by [SE-0241] with the deprecation of `encodingOffset`.
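
For comparison, the offset APIs that exist today (introduced by SE-0241) always take the string as an explicit argument. The sketch below uses only those existing APIs; the drafted `encoding`, `utf8Offset`, and `utf16Offset` properties above are not available and are not used here.

```swift
// "Helló", with the last letter spelled as a single precomposed scalar.
let s = "Hell\u{F3}"

// Existing API: build an index from a UTF-16 offset, and read it back.
let i = String.Index(utf16Offset: 4, in: s)
print(s[i])                  // ó
print(i.utf16Offset(in: s))  // 4

// A UTF-8 offset can currently be recovered by measuring a distance in the
// UTF-8 view, again with the string in hand.
print(s.utf8.distance(from: s.startIndex, to: i)) // 4

// The two encodings disagree about where the string ends.
print(s.utf16.count) // 5
print(s.utf8.count)  // 6
```

None of these calls can answer which encoding an index expects on its own, which is the gap the drafted `encoding` property is meant to fill.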
169+
170+
[SE-0241]: https://github.com/swiftlang/swift-evolution/blob/main/proposals/0241-string-index-explicit-encoding-offset.md
171+
172+
Given that these APIs are quite obscure/subtle, and they pose some interesting design challenges on their own, these additions are deferred to a future proposal. The interface suggested above does not include exposing "transcoded offsets"; I expect the eventual proposal would need to cover those, too.
173+
174+
## Alternatives considered
175+
176+
None.
