|
| 1 | +# String Revision: Collection Conformance, C Interop, Transcoding |
| 2 | + |
| 3 | +* Proposal: [SE-NNNN](NNNN-StringCollection.md) |
| 4 | +* Authors: [Ben Cohen](https://github.com/airspeedswift), [Dave Abrahams](http://github.com/dabrahams/) |
| 5 | +* Review Manager: TBD |
| 6 | +* Status: **Awaiting review** |
| 7 | + |
| 8 | +## Introduction |
| 9 | + |
| 10 | +This proposal is to implement a subset of the changes from the [Swift 4 |
| 11 | +String |
| 12 | +Manifesto](https://github.com/apple/swift/blob/master/docs/StringManifesto.md). |
| 13 | + |
| 14 | +Specifically: |
| 15 | + |
| 16 | + * Make `String` conform to `BidirectionalCollection` |
| 17 | + * Make `String` conform to `RangeReplaceableCollection` |
| 18 | + * Create a `Substring` type for `String.SubSequence` |
| 19 | + * Create a `Unicode` protocol to allow for generic operations over both types. |
| 20 | + * Consolidate on a concise set of C interop methods. |
| 21 | + * Revise the transcoding infrastructure. |
| 22 | + |
| 23 | +Other existing aspects of `String` remain unchanged for the purposes of this |
| 24 | +proposal. |
| 25 | + |
| 26 | +## Motivation |
| 27 | + |
| 28 | +This proposal follows up on a number of recommendations found in the manifesto: |
| 29 | + |
| 30 | +`Collection` conformance was dropped from `String` in Swift 2. After |
| 31 | +reevaluation, the feeling is that the minor semantic discrepancies (mainly with |
| 32 | +`RangeReplaceableCollection`) are outweighed by the significant benefits of |
| 33 | +restoring these conformances. For more detail on the reasoning, see |
| 34 | +[here](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again). |
| 35 | + |
| 36 | +While it is not a collection, the Swift 3 string does have slicing operations. |
| 37 | +`String` is currently serving as its own subsequence, allowing substrings |
| 38 | +to share storage with their "owner". This can lead to memory leaks when small substrings of larger |
| 39 | +strings are stored long-term (see [here](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#substrings) |
| 40 | +for more detail on this problem). Introducing a separate type of `Substring` to |
| 41 | +serve as `String.SubSequence` is recommended to resolve this issue, in a similar |
| 42 | +fashion to `ArraySlice`. |
| 43 | + |
| 44 | +As noted in the manifesto, support for interoperation with nul-terminated C |
| 45 | +strings in Swift 3 is scattered and incoherent, with six ways to transform a C |
| 46 | +string into a `String` and four ways to do the inverse. These APIs should be |
| 47 | +replaced with a simpler set of methods on `String`. |
| 48 | + |
| 49 | +## Proposed solution |
| 50 | + |
| 51 | +A new type, `Substring`, will be introduced. Similar to `ArraySlice`, it will |
| 52 | +be documented as only for short- to medium-term storage: |
| 53 | + |
| 54 | +> **Important** |
| 55 | +> |
| 56 | +> Long-term storage of `Substring` instances is discouraged. A substring holds a |
| 57 | +> reference to the entire storage of a larger string, not just to the portion it |
| 58 | +> presents, even after the original string’s lifetime ends. Long-term storage of |
| 59 | +> a substring may therefore prolong the lifetime of elements that are no longer |
| 60 | +> otherwise accessible, which can appear to be memory leakage. |
| 61 | + |
| 62 | +Aside from minor differences, such as having a `SubSequence` of `Self` and a |
| 63 | +larger size to describe the range of the subsequence, `Substring` |
| 64 | +will be near-identical to `String` from a user perspective. |
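The storage-sharing behavior described above can be sketched as follows. This is a minimal illustration, assuming a toolchain in which slicing a `String` already yields a `Substring`; `makeLog` and the variable names are hypothetical and exist only for this example.

```swift
// Hypothetical helper that builds a large string ending in a short tail.
func makeLog() -> String {
    return String(repeating: "x", count: 10_000) + " tail"
}

let big = makeLog()

// Slicing yields a Substring that shares storage with `big`.
let slice: Substring = big.suffix(4)

// Copying into a String gives the value its own storage, so holding it
// long-term does not keep the large buffer alive.
let stored = String(slice)

print(stored)        // tail
print(stored.count)  // 4
```

Keeping `slice` in a long-lived property would retain the whole buffer, which is why the explicit `String(slice)` copy is the form suited to long-term storage.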
| 65 | + |
| 66 | +In order to be able to write extensions across both `String` and `Substring`, |
| 67 | +a new `Unicode` protocol to which the two types will conform will be |
| 68 | +introduced. For the purposes of this proposal, `Unicode` will be defined as a |
| 69 | +protocol to be used whenever you would previously extend `String`. It should be |
| 70 | +possible to substitute `extension Unicode { ... }` in Swift 4 wherever |
| 71 | +`extension String { ... }` was written in Swift 3, with one exception: any |
| 72 | +passing of `self` into an API that takes a concrete `String` will need to be |
| 73 | +rewritten as `String(self)`. If `Self` is a `String` then this should |
| 74 | +effectively optimize to a no-op, whereas if `Self` is a `Substring` then this |
| 75 | +will force a copy, helping to avoid the "memory leak" problems described above. |
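The substitution rule above can be sketched as follows. The proposed `Unicode` protocol is not available under that name, so this example uses the standard library's `StringProtocol` as a stand-in, which is an assumption of the sketch. `shout` and `shouted` are hypothetical names.

```swift
// Hypothetical API that requires a concrete String.
func shout(_ s: String) -> String {
    return s.uppercased() + "!"
}

// `StringProtocol` stands in for the protocol this proposal calls `Unicode`.
extension StringProtocol {
    func shouted() -> String {
        // `self` may be a String or a Substring, so convert explicitly
        // before passing it to an API that takes a concrete String.
        return shout(String(self))
    }
}

let greeting = "hello world"
let firstWord = greeting.prefix(5)  // a Substring

print(greeting.shouted())   // HELLO WORLD!
print(firstWord.shouted())  // HELLO!
```

The single extension serves both types; only the call that crosses into a `String`-taking API needs the explicit conversion.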
| 76 | + |
| 77 | +The exact nature of the protocol, such as which methods should be protocol |
| 78 | +requirements and which can be implemented as protocol extensions, is considered |
| 79 | +an implementation detail and so is not covered in this proposal. |
| 80 | + |
| 81 | +`Unicode` will conform to `BidirectionalCollection`. |
| 82 | +`RangeReplaceableCollection` conformance will be added directly onto the |
| 83 | +`String` and `Substring` types, as it is possible future `Unicode`-conforming |
| 84 | +types might not be range-replaceable (e.g. an immutable type that wraps |
| 85 | +a `const char *`). |
| 86 | + |
| 87 | +The C string interop methods will be updated to those described |
| 88 | +[here](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#c-string-interop): |
| 89 | +a single `withCString` operation and two `init(cString:)` constructors, one for |
| 90 | +UTF8 and one for arbitrary encodings. The primary change is to remove |
| 91 | +"non-repairing" variants of construction from nul-terminated C strings. In both |
| 92 | +of the construction APIs, any invalid encoding sequence detected will have its |
| 93 | +longest valid prefix replaced by `U+FFFD`, the Unicode replacement character, |
| 94 | +per the Unicode specification. This covers the common case. The replacement is |
| 95 | +done physically in the underlying storage and the validity of the result is |
| 96 | +recorded in the String's encoding such that future accesses need not be slowed |
| 97 | +down by possible error repair separately. Construction that is aborted when |
| 98 | +encoding errors are detected can be accomplished using APIs on the encoding. |
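The repairing construction described above can be observed with the existing `init(cString:)` and `withCString` APIs. The following is a small sketch rather than a specification of the proposed behavior; the byte values are chosen only for illustration.

```swift
// NUL-terminated bytes: "Hi", then 0xFF (never valid in UTF-8), then NUL.
let bytes: [UInt8] = [0x48, 0x69, 0xFF, 0x00]

let repaired = bytes.withUnsafeBufferPointer { buffer in
    // baseAddress is non-nil because the array is non-empty.
    String(cString: buffer.baseAddress!)
}

// The invalid byte is replaced by U+FFFD instead of failing construction.
print(repaired == "Hi\u{FFFD}")  // true

// Round trip through a NUL-terminated UTF-8 buffer.
let copy = "héllo".withCString { String(cString: $0) }
print(copy == "héllo")  // true
```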
| 99 | + |
| 100 | +The current transcoding support will be updated to improve usability and |
| 101 | +performance. The primary changes will be: |
| 102 | + |
| 103 | + - to allow transcoding directly from one encoding to another without having |
| 104 | + to triangulate through an intermediate scalar value |
| 105 | + - to add the ability to transcode an input collection in reverse, allowing the |
| 106 | + different views on `String` to be made bi-directional |
| 107 | + - to have decoding take a collection rather than an iterator, and return an |
| 108 | + index of its progress into the source, allowing that method to be static |
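For contrast with the last point, the existing `UnicodeCodec` decoding API takes an iterator and is an instance method, so the caller has no index into the source to resume from. A minimal sketch using the current `UTF8` codec, with input bytes chosen only for illustration:

```swift
let bytes: [UInt8] = [0x41, 0xFF, 0x42]  // "A", an invalid byte, "B"

var decoder = UTF8()                 // decoding state lives in an instance
var iterator = bytes.makeIterator()  // input is consumed through an iterator
var scalars: [UnicodeScalar] = []
var errorCount = 0

decoding: while true {
    switch decoder.decode(&iterator) {
    case .scalarValue(let scalar):
        scalars.append(scalar)
    case .error:
        // Repair by substituting the replacement character.
        errorCount += 1
        scalars.append("\u{FFFD}")
    case .emptyInput:
        break decoding
    }
}

print(scalars.count)  // 3
print(errorCount)     // 1
```

Because the iterator is consumed as decoding proceeds, the position of the error within `bytes` is not reported, which is the gap that returning an index is meant to close.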
| 109 | + |
| 110 | +The standard library currently lacks a `Latin1` codec, so an |
| 111 | +`enum Latin1: UnicodeEncoding` type will be added. |
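As a rough illustration of what such a codec has to do: each Latin-1 byte has the same numeric value as the Unicode scalar it encodes, so decoding is a direct mapping. `decodeLatin1` below is a hypothetical helper written for this sketch, not the proposed `Latin1` type.

```swift
// Hypothetical helper: maps each Latin-1 byte to the Unicode scalar
// with the same numeric value.
func decodeLatin1<C: Collection>(_ bytes: C) -> String where C.Element == UInt8 {
    var view = String.UnicodeScalarView()
    for byte in bytes {
        view.append(UnicodeScalar(byte))
    }
    return String(view)
}

// 0xE9 is "é" in Latin-1 and also U+00E9 in Unicode.
let decoded = decodeLatin1([0x63, 0x61, 0x66, 0xE9] as [UInt8])
print(decoded == "caf\u{E9}")  // true
```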
| 112 | + |
| 113 | +## Detailed design |
| 114 | + |
| 115 | +The following additions will be made to the standard library: |
| 116 | + |
| 117 | +```swift |
| 118 | +protocol Unicode: BidirectionalCollection { |
| 119 | + // Implementation detail as described above |
| 120 | +} |
| 121 | + |
| 122 | +extension String: Unicode, RangeReplaceableCollection { |
| 123 | + typealias SubSequence = Substring |
| 124 | +} |
| 125 | + |
| 126 | +struct Substring: Unicode, RangeReplaceableCollection { |
| 127 | + typealias SubSequence = Substring |
| 128 | + // near-identical API surface area to String |
| 129 | +} |
| 130 | +``` |
| 131 | + |
| 132 | +The subscript operations on `String` will be amended to return `Substring`: |
| 133 | + |
| 134 | +```swift |
| 135 | +struct String { |
| 136 | + subscript(bounds: Range<String.Index>) -> Substring { get } |
| 137 | + subscript(bounds: ClosedRange<String.Index>) -> Substring { get } |
| 138 | +} |
| 139 | +``` |
| 140 | + |
| 141 | +Note that properties or methods that due to their nature create new `String` |
| 142 | +storage (such as `lowercased()`) will _not_ change. |
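The distinction between slicing and storage-creating operations can be checked directly. This is a small sketch assuming a toolchain in which the amended subscripts are available; the sample text is arbitrary.

```swift
let text = "Hello, World"
let comma = text.firstIndex(of: ",")!

// Range and ClosedRange subscripts both return Substring.
let head = text[text.startIndex..<comma]
let headWithComma = text[text.startIndex...comma]

print(type(of: head))           // Substring
print(type(of: headWithComma))  // Substring

// Operations that build new storage still return String,
// even when called on a Substring.
let lowered = head.lowercased()
print(type(of: lowered))  // String
print(lowered)            // hello
```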
| 143 | + |
| 144 | +C string interop will be consolidated on the following methods: |
| 145 | + |
| 146 | +```swift |
| 147 | +extension String { |
| 148 | + /// Constructs a `String` having the same contents as `nulTerminatedUTF8`. |
| 149 | + /// |
| 150 | + /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded |
| 151 | + /// bytes ending just before the first zero byte (NUL character). |
| 152 | + init(cString nulTerminatedUTF8: UnsafePointer<CChar>) |
| 153 | + |
| 154 | + /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`. |
| 155 | + /// |
| 156 | + /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in |
| 157 | + /// the given `encoding`, ending just before the first zero code unit. |
| 158 | + /// - Parameter encoding: describes the encoding in which the code units |
| 159 | + /// should be interpreted. |
| 160 | + init<Encoding: UnicodeEncoding>( |
| 161 | + cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>, |
| 162 | + encoding: Encoding) |
| 163 | + |
| 164 | + /// Invokes the given closure on the contents of the string, represented as a |
| 165 | + /// pointer to a nul-terminated sequence of UTF-8 code units. |
| 166 | + func withCString<Result>( |
| 167 | + _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result |
| 168 | +} |
| 169 | +``` |
| 170 | + |
| 171 | +Additionally, the current ability to pass a Swift `String` into C methods that |
| 172 | +take a C string will remain as-is. |
| 173 | + |
| 174 | +A new protocol, `UnicodeEncoding`, will be added to replace the current |
| 175 | +`UnicodeCodec` protocol: |
| 176 | + |
| 177 | +```swift |
| 178 | +public enum UnicodeParseResult<T, Index> { |
| 179 | + /// Indicates valid input was recognized. |
| 180 | + /// |
| 181 | + /// `resumptionPoint` is the end of the parsed region |
| 182 | + case valid(T, resumptionPoint: Index) // FIXME: should these be reordered? |
| 183 | + /// Indicates invalid input was recognized. |
| 184 | + /// |
| 185 | + /// `resumptionPoint` is the next position at which to continue parsing after |
| 186 | + /// the invalid input is repaired. |
| 187 | + case error(resumptionPoint: Index) |
| 188 | + |
| 189 | + /// Indicates that there was no more input to consume. |
| 190 | + case emptyInput |
| 191 | + |
| 192 | + /// If any input was consumed, the point from which to continue parsing. |
| 193 | + var resumptionPoint: Index? { |
| 194 | + switch self { |
| 195 | + case .valid(_, let r): return r |
| 196 | + case .error(let r): return r |
| 197 | + case .emptyInput: return nil |
| 198 | + } |
| 199 | + } |
| 200 | +} |
| 201 | + |
| 202 | +/// An encoding for text with UnicodeScalar as a common currency type |
| 203 | +public protocol UnicodeEncoding { |
| 204 | + /// The maximum number of code units in an encoded unicode scalar value |
| 205 | + static var maxLengthOfEncodedScalar: Int { get } |
| 206 | + |
| 207 | + /// A type that can represent a single UnicodeScalar as it is encoded in this |
| 208 | + /// encoding. |
| 209 | + associatedtype EncodedScalar : EncodedScalarProtocol |
| 210 | + |
| 211 | + /// Produces a scalar of this encoding if possible; returns `nil` otherwise. |
| 212 | + static func encode<Scalar: EncodedScalarProtocol>( |
| 213 | + _:Scalar) -> Self.EncodedScalar? |
| 214 | + |
| 215 | + /// Parse a single unicode scalar forward from `input`. |
| 216 | + /// |
| 217 | + /// - Parameter knownCount: a number of code units known to exist in `input`. |
| 218 | + /// **Note:** passing a known compile-time constant is strongly advised, |
| 219 | + /// even if it's zero. |
| 220 | + static func parseScalarForward<C: Collection>( |
| 221 | + _ input: C, knownCount: Int /* = 0, via extension */ |
| 222 | + ) -> UnicodeParseResult<EncodedScalar, C.Index> |
| 223 | + where C.Iterator.Element == EncodedScalar.Iterator.Element |
| 224 | + |
| 225 | + /// Parse a single unicode scalar in reverse from `input`. |
| 226 | + /// |
| 227 | + /// - Parameter knownCount: a number of code units known to exist in `input`. |
| 228 | + /// **Note:** passing a known compile-time constant is strongly advised, |
| 229 | + /// even if it's zero. |
| 230 | + static func parseScalarReverse<C: BidirectionalCollection>( |
| 231 | + _ input: C, knownCount: Int /* = 0 , via extension */ |
| 232 | + ) -> UnicodeParseResult<EncodedScalar, C.Index> |
| 233 | + where C.Iterator.Element == EncodedScalar.Iterator.Element |
| 234 | +} |
| 235 | + |
| 236 | +/// Parsing multiple unicode scalar values |
| 237 | +extension UnicodeEncoding { |
| 238 | + @discardableResult |
| 239 | + public static func parseForward<C: Collection>( |
| 240 | + _ input: C, |
| 241 | + repairingIllFormedSequences makeRepairs: Bool = true, |
| 242 | + into output: (EncodedScalar) throws -> Void |
| 243 | + ) rethrows -> (remainder: C.SubSequence, errorCount: Int) |
| 244 | + |
| 245 | + @discardableResult |
| 246 | + public static func parseReverse<C: BidirectionalCollection>( |
| 247 | + _ input: C, |
| 248 | + repairingIllFormedSequences makeRepairs: Bool = true, |
| 249 | + into output: (EncodedScalar) throws -> Void |
| 250 | + ) rethrows -> (remainder: C.SubSequence, errorCount: Int) |
| 251 | + where C.SubSequence : BidirectionalCollection, |
| 252 | + C.SubSequence.SubSequence == C.SubSequence, |
| 253 | + C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element |
| 254 | +} |
| 255 | +``` |
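To make the shape of this design concrete, here is a self-contained sketch of a collection-based parser that reports a resumption index. It is not the proposed implementation: `SketchParseResult` and `SketchASCII` are hypothetical names, the "encoding" is plain ASCII (any byte of 0x80 or above is treated as ill-formed), and the scalar type is `UnicodeScalar` rather than an `EncodedScalar`.

```swift
// Simplified stand-in for the proposal's parse result type.
enum SketchParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput
}

// Hypothetical single-code-unit encoding used only for illustration.
enum SketchASCII {
    /// Parses one scalar from the front of `input`; static, with no decoder state.
    static func parseScalarForward<C: Collection>(
        _ input: C
    ) -> SketchParseResult<UnicodeScalar, C.Index> where C.Element == UInt8 {
        guard let byte = input.first else { return .emptyInput }
        let next = input.index(after: input.startIndex)
        if byte < 0x80 {
            return .valid(UnicodeScalar(byte), resumptionPoint: next)
        } else {
            return .error(resumptionPoint: next)
        }
    }

    /// Parses all of `input`, repairing errors with U+FFFD.
    /// Returns the number of repairs made.
    static func parseForward<C: Collection>(
        _ input: C,
        into output: (UnicodeScalar) -> Void
    ) -> Int where C.Element == UInt8 {
        var remainder = input[...]
        var errorCount = 0
        while true {
            switch parseScalarForward(remainder) {
            case .valid(let scalar, let next):
                output(scalar)
                remainder = remainder[next...]
            case .error(let next):
                errorCount += 1
                output("\u{FFFD}")
                remainder = remainder[next...]
            case .emptyInput:
                return errorCount
            }
        }
    }
}

var parsed: [UnicodeScalar] = []
let repairs = SketchASCII.parseForward([0x4F, 0x4B, 0xC3, 0x21] as [UInt8]) {
    parsed.append($0)
}
print(parsed.count)  // 4
print(repairs)       // 1
```

Because each step returns an index into the source, the driver loop can slice the remaining input and continue, and no per-decoder state needs to be carried between calls.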
| 256 | + |
| 257 | + |
| 258 | +`UnicodeCodec` will be updated to refine `UnicodeEncoding`, and all |
| 259 | +existing codecs will conform to it. |
| 260 | + |
| 261 | +Note, depending on whether this change lands before or after some of the |
| 262 | +generics features, generic `where` clauses may need to be added temporarily. |
| 263 | + |
| 264 | +## Source compatibility |
| 265 | + |
| 266 | +Adding collection conformance to `String` should not materially impact source |
| 267 | +stability as it is purely additive: Swift 3's `String` interface currently |
| 268 | +fulfills all of the requirements for a bidirectional range replaceable |
| 269 | +collection. |
| 270 | + |
| 271 | +Altering `String`'s slicing operations to return a different type is source |
| 272 | +breaking. The following mitigating steps are proposed: |
| 273 | + |
| 274 | + - Add a deprecated subscript operator that will run in Swift 3 compatibility |
| 275 | + mode and which will return a `String`, not a `Substring`. |
| 276 | + |
| 277 | + - Add deprecated versions of all current slicing methods to similarly return a |
| 278 | + `String`. |
| 279 | + |
| 280 | +i.e.: |
| 281 | + |
| 282 | +```swift |
| 283 | +extension String { |
| 284 | + @available(swift, obsoleted: 4) |
| 285 | + subscript(bounds: Range<Index>) -> String { |
| 286 | + return String(characters[bounds]) |
| 287 | + } |
| 288 | + |
| 289 | + @available(swift, obsoleted: 4) |
| 290 | + subscript(bounds: ClosedRange<Index>) -> String { |
| 291 | + return String(characters[bounds]) |
| 292 | + } |
| 293 | +} |
| 294 | +``` |
| 295 | + |
| 296 | +In a review of 77 popular Swift projects found on GitHub, these changes |
| 297 | +resolved any build issues in the 12 projects that assumed an explicit `String` |
| 298 | +type returned from slicing operations. |
| 299 | + |
| 300 | +Due to the change in internal implementation, these compatibility operations |
| 301 | +will be _O(n)_ rather than _O(1)_. This is not expected to be a major concern, |
| 302 | +based on experiences from a similar change made to Java, but projects will be |
| 303 | +able to work around performance issues without upgrading to Swift 4 by |
| 304 | +explicitly typing slices as `Substring`, which will call the Swift 4 variant, |
| 305 | +and which will be available but not invoked by default in Swift 3 mode. |
| 306 | + |
| 307 | +The C string interoperability methods outside the ones described in the |
| 308 | +detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode, and |
| 309 | +be removed in a subsequent release. `UnicodeCodec` will be similarly deprecated. |
| 310 | + |
| 311 | +## Effect on ABI stability |
| 312 | + |
| 313 | +As a fundamental currency type for Swift, it is essential that the `String` |
| 314 | +type (and its associated subsequence) is in a good long-term state before being |
| 315 | +locked down when Swift declares ABI stability. Shrinking the size of `String` |
| 316 | +to be 64 bits is an important part of this. |
| 317 | + |
| 318 | +## Effect on API resilience |
| 319 | + |
| 320 | +Decisions about the API resilience of the `String` type are still to be |
| 321 | +determined, but are not adversely affected by this proposal. |
| 322 | + |
| 323 | +## Alternatives considered |
| 324 | + |
| 325 | +For a more in-depth discussion of some of the trade-offs in string design, see |
| 326 | +the manifesto and associated [evolution thread](). |
| 327 | + |
| 328 | +This proposal does not yet introduce an implicit conversion from `Substring` to |
| 329 | +`String`. The decision on whether to add this will be deferred pending feedback |
| 330 | +on the initial implementation. The intention is to make a preview toolchain |
| 331 | +available for feedback, including on whether this implicit conversion is |
| 332 | +necessary, prior to the release of Swift 4. |
| 333 | + |
| 334 | +Several of the types related to `String`, such as the encodings, would ideally |
| 335 | +reside inside a namespace rather than live at the top level of the standard |
| 336 | +library. The best namespace for this is probably `Unicode`, but this is also |
| 337 | +the name of the protocol. At some point if we gain the ability to nest enums |
| 338 | +and types inside protocols, they should be moved there. Putting them inside |
| 339 | +`String` or some other enum namespace is probably not worthwhile in the |
| 340 | +meantime. |
| 341 | + |
| 342 | + |