Commit c57441c

First draft of String proposal 1
1 parent 458478d commit c57441c

1 file changed

+342
-0
lines changed

proposals/0161-StringRevision1.md

Lines changed: 342 additions & 0 deletions
@@ -0,0 +1,342 @@
1+
# String Revision: Collection Conformance, C Interop, Transcoding
2+
3+
* Proposal: [SE-NNNN](NNNN-StringCollection.md)
4+
* Authors: [Ben Cohen](https://github.com/airspeedswift), [Dave Abrahams](http://github.com/dabrahams/)
5+
* Review Manager: TBD
6+
* Status: **Awaiting review**
7+
8+
## Introduction
9+
10+
This proposal is to implement a subset of the changes from the [Swift 4
11+
String
12+
Manifesto](https://github.com/apple/swift/blob/master/docs/StringManifesto.md).
13+
14+
Specifically:
15+
16+
* Make `String` conform to `BidirectionalCollection`
17+
* Make `String` conform to `RangeReplaceableCollection`
18+
* Create a `Substring` type for `String.SubSequence`
19+
* Create a `Unicode` protocol to allow for generic operations over both types.
20+
* Consolidate on a concise set of C interop methods.
21+
* Revise the transcoding infrastructure.
22+
23+
Other existing aspects of `String` remain unchanged for the purposes of this
24+
proposal.
25+
26+
## Motivation
27+
28+
This proposal follows up on a number of recommendations found in the manifesto:
29+
30+
`Collection` conformance was dropped from `String` in Swift 2. After
31+
reevaluation, the feeling is that the minor semantic discrepancies (mainly with
32+
`RangeReplaceableCollection`) are outweighed by the significant benefits of
33+
restoring these conformances. For more detail on the reasoning, see
34+
[here](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again).
35+
36+
While it is not a collection, the Swift 3 string does have slicing operations.
37+
`String` is currently serving as its own subsequence, allowing substrings
38+
to share storage with their "owner". This can lead to memory leaks when small substrings of larger
39+
strings are stored long-term (see [here](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#substrings)
40+
for more detail on this problem). Introducing a separate type of `Substring` to
41+
serve as `String.SubSequence` is recommended to resolve this issue, in a similar
42+
fashion to `ArraySlice`.
43+
44+
As noted in the manifesto, support for interoperation with nul-terminated C
45+
strings in Swift 3 is scattered and incoherent, with six ways to transform a C
46+
string into a `String` and four ways to do the inverse. These APIs should be
47+
replaced with a simpler set of methods on `String`.
48+
49+
## Proposed solution
50+
51+
A new type, `Substring`, will be introduced. Similar to `ArraySlice`, it will
52+
be documented as only for short- to medium-term storage:
53+
54+
> **Important**
55+
>
56+
> Long-term storage of `Substring` instances is discouraged. A substring holds a
57+
> reference to the entire storage of a larger string, not just to the portion it
58+
> presents, even after the original string’s lifetime ends. Long-term storage of
59+
> a substring may therefore prolong the lifetime of elements that are no longer
60+
> otherwise accessible, which can appear to be memory leakage.
61+
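A minimal sketch of what this note means in practice, assuming a toolchain in which slicing operations already return `Substring` (Swift 4 or later). The `Record` type and the sample text are hypothetical and exist only for illustration:

```swift
// Hypothetical type that keeps a small piece of a large string for a long time.
struct Record {
    // Stored as String, not Substring, so it owns only its own characters.
    let key: String
}

let line = "id=42; followed by a long payload that should not be kept alive"

// The slice is a Substring that shares storage with `line`.
let slice = line.prefix(while: { $0 != "=" })

// Copying into a String drops the reference to the larger storage.
let record = Record(key: String(slice))
```

Keeping the stored property typed as `String` makes the copy explicit at the point where the value starts to outlive the original string.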
62+
Aside from minor differences, such as having a `SubSequence` of `Self` and a
63+
larger size to describe the range of the subsequence, `Substring`
64+
will be near-identical to `String` from a user perspective.
65+
66+
In order to be able to write extensions across both `String` and `Substring`,
67+
a new `Unicode` protocol to which the two types will conform will be
68+
introduced. For the purposes of this proposal, `Unicode` will be defined as a
69+
protocol to be used whenever you would previously extend `String`. It should be
70+
possible to substitute `extension Unicode { ... }` in Swift 4 wherever
71+
`extension String { ... }` was written in Swift 3, with one exception: any
72+
passing of `self` into an API that takes a concrete `String` will need to be
73+
rewritten as `String(self)`. If `Self` is a `String` then this should
74+
effectively optimize to a no-op, whereas if `Self` is a `Substring` then this
75+
will force a copy, helping to avoid the "memory leak" problems described above.
76+
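The substitution described above can be sketched as follows. This is an illustration under assumptions, not the proposal's implementation: it uses `StringProtocol`, the name under which this protocol shipped in Swift 4 and later, in place of the `Unicode` name proposed here, and `shout` is a hypothetical API that accepts only a concrete `String`:

```swift
// Hypothetical API that accepts only a concrete String.
func shout(_ text: String) -> String {
    return text.uppercased() + "!"
}

// `StringProtocol` stands in here for the protocol this proposal calls `Unicode`.
extension StringProtocol {
    var shouted: String {
        // `self` may be a String or a Substring, so convert explicitly.
        return shout(String(self))
    }
}

let whole = "hello"
let part = whole.prefix(2)   // Substring

let fromString = whole.shouted
let fromSubstring = part.shouted
```

The same extension serves both types; only the `String(self)` conversion differs from what a Swift 3 `extension String` would have contained.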
77+
The exact nature of the protocol – such as which methods should be protocol
78+
requirements vs which can be implemented as protocol extensions – is considered
79+
an implementation detail and so is not covered in this proposal.
80+
81+
`Unicode` will conform to `BidirectionalCollection`.
82+
`RangeReplaceableCollection` conformance will be added directly onto the
83+
`String` and `Substring` types, as it is possible future `Unicode`-conforming
84+
types might not be range-replaceable (e.g. an immutable type that wraps
85+
a `const char *`).
86+
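As a rough illustration of what the two conformances make available directly on `String`, assuming a toolchain where they are in place (Swift 4 or later); the sample text is arbitrary:

```swift
var greeting = "Hello, world"

// BidirectionalCollection: access from the end and reverse traversal.
let lastCharacter = greeting.last
let backwards = String(greeting.reversed())

// RangeReplaceableCollection: in-place replacement, removal, and appending.
let firstWordEnd = greeting.index(greeting.startIndex, offsetBy: 5)
greeting.replaceSubrange(greeting.startIndex..<firstWordEnd, with: "Howdy")
greeting.removeLast(5)
greeting.append(contentsOf: "Swift")
```

All of these operate on `Character` elements, so generic algorithms written against the collection protocols apply to strings without going through a separate view.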
87+
The C string interop methods will be updated to those described
88+
[here](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#c-string-interop):
89+
a single `withCString` operation and two `init(cString:)` constructors, one for
90+
UTF-8 and one for arbitrary encodings. The primary change is to remove
91+
"non-repairing" variants of construction from nul-terminated C strings. In both
92+
of the construction APIs, any invalid encoding sequence detected will have its
93+
longest valid prefix replaced by `U+FFFD`, the Unicode replacement character,
94+
per the Unicode specification. This covers the common case. The replacement is
95+
done physically in the underlying storage and the validity of the result is
96+
recorded in the String's encoding such that future accesses need not be slowed
97+
down by possible error repair separately. Construction that is aborted when
98+
encoding errors are detected can be accomplished using APIs on the encoding.
99+
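The repairing behavior can be observed with the construction and `withCString` APIs that exist in the standard library today. This sketch uses the `UnsafePointer<UInt8>` overload of `init(cString:)` to sidestep platform differences in `CChar`; it does not show the proposed encoding-generic initializer, and the byte values are chosen only for illustration:

```swift
// "H", an invalid byte (0xFF never appears in well-formed UTF-8), "i", then NUL.
let rawBytes: [UInt8] = [0x48, 0xFF, 0x69, 0x00]

// Construction repairs the ill-formed byte instead of failing.
let repaired = rawBytes.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

// withCString exposes a nul-terminated UTF-8 view of the string;
// count code units up to the terminator.
let utf8Length = "h\u{E9}llo".withCString { pointer -> Int in
    var count = 0
    while pointer[count] != 0 { count += 1 }
    return count
}
```

The invalid byte comes back as a single `U+FFFD`, and the accented character occupies two UTF-8 code units in the C view.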
100+
The current transcoding support will be updated to improve usability and
101+
performance. The primary changes will be:
102+
103+
- to allow transcoding directly from one encoding to another without having
104+
to triangulate through an intermediate scalar value
105+
- to add the ability to transcode an input collection in reverse, allowing the
106+
different views on `String` to be made bi-directional
107+
- to have decoding take a collection rather than an iterator, and return an
108+
index of its progress into the source, allowing that method to be static
109+
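For contrast, this is the shape of the existing iterator-based entry point that these changes are meant to improve on. It is a sketch of the current `transcode(_:from:to:stoppingOnError:into:)` function, not of the proposed API; the input string is arbitrary:

```swift
// UTF-8 code units for "né" (the é is encoded as two bytes).
let utf8Units = Array("n\u{E9}".utf8)

var utf16Units: [UInt16] = []

// The input is consumed through an iterator, one code unit at a time.
// The result is true if an encoding error was found in the input.
let foundError = transcode(
    utf8Units.makeIterator(),
    from: UTF8.self,
    to: UTF16.self,
    stoppingOnError: false
) { codeUnit in
    utf16Units.append(codeUnit)
}
```

Because the source is an iterator, the caller gets no position back and cannot walk the input in reverse, which is what the collection-based design addresses.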
110+
The standard library currently lacks a `Latin1` codec, so an
111+
`enum Latin1: UnicodeEncoding` type will be added.
112+
113+
## Detailed design
114+
115+
The following additions will be made to the standard library:
116+
117+
```swift
118+
protocol Unicode: BidirectionalCollection {
119+
// Implementation detail as described above
120+
}
121+
122+
extension String: Unicode, RangeReplaceableCollection {
123+
typealias SubSequence = Substring
124+
}
125+
126+
struct Substring: Unicode, RangeReplaceableCollection {
127+
typealias SubSequence = Substring
128+
// near-identical API surface area to String
129+
}
130+
```
131+
132+
The subscript operations on `String` will be amended to return `Substring`:
133+
134+
```swift
135+
struct String {
136+
subscript(bounds: Range<String.Index>) -> Substring { get }
137+
subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
138+
}
139+
```
140+
141+
Note that properties or methods that due to their nature create new `String`
142+
storage (such as `lowercased()`) will _not_ change.
143+
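A short sketch of the resulting types, assuming a toolchain in which these subscripts are in place (Swift 4.2 or later for `firstIndex(of:)`); the sample text is arbitrary:

```swift
let text = "Hello, World"
let comma = text.firstIndex(of: ",")!

// Both range subscripts produce a Substring view onto `text`.
let halfOpen = text[text.startIndex..<comma]    // "Hello"
let closed = text[text.startIndex...comma]      // "Hello,"

// Operations that must build new storage still return String.
let lowered = halfOpen.lowercased()
```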
144+
C string interop will be consolidated on the following methods:
145+
146+
```swift
147+
extension String {
148+
/// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
149+
///
150+
/// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
151+
/// bytes ending just before the first zero byte (NUL character).
152+
init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
153+
154+
/// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
155+
///
156+
/// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
157+
/// the given `encoding`, ending just before the first zero code unit.
158+
/// - Parameter encoding: describes the encoding in which the code units
159+
/// should be interpreted.
160+
init<Encoding: UnicodeEncoding>(
161+
cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
162+
encoding: Encoding)
163+
164+
/// Invokes the given closure on the contents of the string, represented as a
165+
/// pointer to a nul-terminated sequence of UTF-8 code units.
166+
func withCString<Result>(
167+
_ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
168+
}
169+
```
170+
171+
Additionally, the current ability to pass a Swift `String` into C methods that
172+
take a C string will remain as-is.
173+
174+
A new protocol, `UnicodeEncoding`, will be added to replace the current
175+
`UnicodeCodec` protocol:
176+
177+
```swift
178+
public enum UnicodeParseResult<T, Index> {
179+
/// Indicates valid input was recognized.
180+
///
181+
/// `resumptionPoint` is the end of the parsed region
182+
case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?
183+
/// Indicates invalid input was recognized.
184+
///
185+
/// `resumptionPoint` is the next position at which to continue parsing after
186+
/// the invalid input is repaired.
187+
case error(resumptionPoint: Index)
188+
189+
/// Indicates that there was no more input to consume.
190+
case emptyInput
191+
192+
/// If any input was consumed, the point from which to continue parsing.
193+
var resumptionPoint: Index? {
194+
switch self {
195+
case .valid(_,let r): return r
196+
case .error(let r): return r
197+
case .emptyInput: return nil
198+
}
199+
}
200+
}
201+
202+
/// An encoding for text with UnicodeScalar as a common currency type
203+
public protocol UnicodeEncoding {
204+
/// The maximum number of code units in an encoded unicode scalar value
205+
static var maxLengthOfEncodedScalar: Int { get }
206+
207+
/// A type that can represent a single UnicodeScalar as it is encoded in this
208+
/// encoding.
209+
associatedtype EncodedScalar : EncodedScalarProtocol
210+
211+
/// Produces a scalar of this encoding if possible; returns `nil` otherwise.
212+
static func encode<Scalar: EncodedScalarProtocol>(
213+
_:Scalar) -> Self.EncodedScalar?
214+
215+
/// Parse a single unicode scalar forward from `input`.
216+
///
217+
/// - Parameter knownCount: a number of code units known to exist in `input`.
218+
/// **Note:** passing a known compile-time constant is strongly advised,
219+
/// even if it's zero.
220+
static func parseScalarForward<C: Collection>(
221+
_ input: C, knownCount: Int /* = 0, via extension */
222+
) -> UnicodeParseResult<EncodedScalar, C.Index>
223+
where C.Iterator.Element == EncodedScalar.Iterator.Element
224+
225+
/// Parse a single unicode scalar in reverse from `input`.
226+
///
227+
/// - Parameter knownCount: a number of code units known to exist in `input`.
228+
/// **Note:** passing a known compile-time constant is strongly advised,
229+
/// even if it's zero.
230+
static func parseScalarReverse<C: BidirectionalCollection>(
231+
_ input: C, knownCount: Int /* = 0 , via extension */
232+
) -> UnicodeParseResult<EncodedScalar, C.Index>
233+
where C.Iterator.Element == EncodedScalar.Iterator.Element
234+
}
235+
236+
/// Parsing multiple unicode scalar values
237+
extension UnicodeEncoding {
238+
@discardableResult
239+
public static func parseForward<C: Collection>(
240+
_ input: C,
241+
repairingIllFormedSequences makeRepairs: Bool = true,
242+
into output: (EncodedScalar) throws->Void
243+
) rethrows -> (remainder: C.SubSequence, errorCount: Int)
244+
245+
@discardableResult
246+
public static func parseReverse<C: BidirectionalCollection>(
247+
_ input: C,
248+
repairingIllFormedSequences makeRepairs: Bool = true,
249+
into output: (EncodedScalar) throws->Void
250+
) rethrows -> (remainder: C.SubSequence, errorCount: Int)
251+
where C.SubSequence : BidirectionalCollection,
252+
C.SubSequence.SubSequence == C.SubSequence,
253+
C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
254+
}
255+
```
256+
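To make the collection-plus-resumption-index idea concrete, here is a deliberately simplified sketch of a forward parse loop for Latin-1 input. Everything in it is hypothetical: `SketchParseResult` and `SketchLatin1` are stand-ins invented for this example, the `knownCount` parameter and the `.error` case are omitted, and scalars are returned as `UnicodeScalar` rather than as an `EncodedScalar` type:

```swift
// Simplified, hypothetical stand-in for the result enum above
// (no .error case, because every Latin-1 byte is valid).
enum SketchParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case emptyInput
}

// Hypothetical Latin-1 decoder; not the proposed `Latin1` type itself.
enum SketchLatin1 {
    // Static: all progress is reported through the returned index,
    // so no decoder state needs to be stored between calls.
    static func parseScalarForward<C: Collection>(
        _ input: C
    ) -> SketchParseResult<UnicodeScalar, C.Index> where C.Element == UInt8 {
        guard let byte = input.first else { return .emptyInput }
        // Each Latin-1 byte maps directly to the scalar with the same value.
        let next = input.index(after: input.startIndex)
        return .valid(UnicodeScalar(byte), resumptionPoint: next)
    }
}

let latin1Bytes: [UInt8] = [0x63, 0x61, 0x66, 0xE9]   // "café" in Latin-1
var decoded = ""
var position = latin1Bytes.startIndex

// Re-enter the parser at the index it handed back until input runs out.
while case .valid(let scalar, resumptionPoint: let next) =
        SketchLatin1.parseScalarForward(latin1Bytes[position...]) {
    decoded.unicodeScalars.append(scalar)
    position = next
}
```

The caller owns the position, so the same static function can be driven from any point in the input without a stateful decoder instance.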
257+
258+
`UnicodeCodec` will be updated to refine `UnicodeEncoding`, and all
259+
existing codecs will conform to it.
260+
261+
Note, depending on whether this change lands before or after some of the
262+
generics features, generic `where` clauses may need to be added temporarily.
263+
264+
## Source compatibility
265+
266+
Adding collection conformance to `String` should not materially impact source
267+
stability as it is purely additive: Swift 3's `String` interface currently
268+
fulfills all of the requirements for a bidirectional range replaceable
269+
collection.
270+
271+
Altering `String`'s slicing operations to return a different type is source
272+
breaking. The following mitigating steps are proposed:
273+
274+
- Add a deprecated subscript operator that will run in Swift 3 compatibility
275+
mode and which will return a `String` not a `Substring`.
276+
277+
- Add deprecated versions of all current slicing methods to similarly return a
278+
`String`.
279+
280+
i.e.:
281+
282+
```swift
283+
extension String {
284+
@available(swift, obsoleted: 4)
285+
subscript(bounds: Range<Index>) -> String {
286+
return String(characters[bounds])
287+
}
288+
289+
@available(swift, obsoleted: 4)
290+
subscript(bounds: ClosedRange<Index>) -> String {
291+
return String(characters[bounds])
292+
}
293+
}
294+
```
295+
296+
In a review of 77 popular Swift projects found on GitHub, these changes
297+
resolved any build issues in the 12 projects that assumed an explicit `String`
298+
type returned from slicing operations.
299+
300+
Due to the change in internal implementation, these operations
301+
will be _O(n)_ rather than _O(1)_. This is not expected to be a major concern,
302+
based on experiences from a similar change made to Java, but projects will be
303+
able to work around performance issues without upgrading to Swift 4 by
304+
explicitly typing slices as `Substring`, which will call the Swift 4 variant,
305+
and which will be available but not invoked by default in Swift 3 mode.
306+
307+
The C string interoperability methods outside the ones described in the
308+
detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode, and
309+
be removed in a subsequent release. `UnicodeCodec` will be similarly deprecated.
310+
311+
## Effect on ABI stability
312+
313+
As a fundamental currency type for Swift, it is essential that the `String`
314+
type (and its associated subsequence) is in a good long-term state before being
315+
locked down when Swift declares ABI stability. Shrinking the size of `String`
316+
to be 64 bits is an important part of this.
317+
318+
## Effect on API resilience
319+
320+
Decisions about the API resilience of the `String` type are still to be
321+
determined, but are not adversely affected by this proposal.
322+
323+
## Alternatives considered
324+
325+
For a more in-depth discussion of some of the trade-offs in string design, see
326+
the manifesto and associated [evolution thread]().
327+
328+
This proposal does not yet introduce an implicit conversion from `Substring` to
329+
`String`. The decision on whether to add this will be deferred pending feedback
330+
on the initial implementation. The intention is to make a preview toolchain
331+
available for feedback, including on whether this implicit conversion is
332+
necessary, prior to the release of Swift 4.
333+
334+
Several of the types related to `String`, such as the encodings, would ideally
335+
reside inside a namespace rather than live at the top level of the standard
336+
library. The best namespace for this is probably `Unicode`, but this is also
337+
the name of the protocol. At some point if we gain the ability to nest enums
338+
and types inside protocols, they should be moved there. Putting them inside
339+
`String` or some other enum namespace is probably not worthwhile in the
340+
meantime.
341+
342+
