Similarly, `String`s can be created from `UTF8Span`s without re-validating their contents.
+
+```swift
+extension String {
+  /// Creates a String containing a copy of the UTF-8 content in `codeUnits`.
+  /// Skips
+  /// validation.
+  public init(copying codeUnits: UTF8Span)
 }
 ```

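As a minimal sketch of the copy path, assuming a Swift 6.2 or later toolchain in which `UTF8Span`, `Array.span`, and `String.init(copying:)` are available: the hypothetical helper `makeString(fromUTF8:)` validates the bytes once when it forms the span, and the `String` copy then relies on that result instead of validating again.

```swift
// Sketch only: `makeString(fromUTF8:)` is a hypothetical helper, and the
// availability annotation assumes the OS releases that ship Swift 6.2.
@available(macOS 26, iOS 26, tvOS 26, watchOS 26, visionOS 26, *)
func makeString(fromUTF8 bytes: [UInt8]) throws -> String {
  // Validation happens here, once; invalid input throws.
  let validated = try UTF8Span(validating: bytes.span)
  // The copy trusts the span's invariant and does not re-validate.
  return String(copying: validated)
}
```

The only place an error can surface is the `UTF8Span` initializer, which is the point of routing untrusted bytes through the validating path.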
@@ -218,7 +227,7 @@ We propose a `UTF8Span.UnicodeScalarIterator` type that can do scalar processing

 ```swift
 extension UTF8Span {
-  /// Returns an iterator that will decode the code units into
+  /// Returns an iterator that will decode the code units into
   /// `Unicode.Scalar`s.
   ///
   /// The resulting iterator has the same lifetime constraints as `self`.
@@ -316,7 +325,7 @@ extension UTF8Span {

 We similarly propose a `UTF8Span.CharacterIterator` type that can do grapheme-breaking forwards and backwards.

-The `CharacterIterator` assumes that the start and end of the `UTF8Span` is the start and end of content.
+The `CharacterIterator` assumes that the start and end of the `UTF8Span` is the start and end of content.

 Any scalar-aligned position is a valid place to start or reset the grapheme-breaking algorithm to, though you could get different `Character` output if resetting to a position that isn't `Character`-aligned relative to the start of the `UTF8Span` (e.g. in the middle of a series of regional indicators).

@@ -343,15 +352,15 @@ extension UTF8Span {
     /// Return the `Character` starting at `currentCodeUnitOffset`. After the
     /// function returns, `currentCodeUnitOffset` holds the position at the
     /// end of the `Character`, which is also the start of the next
-    /// `Character`.
+    /// `Character`.
     ///
     /// Returns `nil` if at the end of the `UTF8Span`.
     public mutating func next() -> Character?

     /// Return the `Character` ending at `currentCodeUnitOffset`. After the
     /// function returns, `currentCodeUnitOffset` holds the position at the
     /// start of the returned `Character`, which is also the end of the
-    /// previous `Character`.
+    /// previous `Character`.
     ///
     /// Returns `nil` if at the start of the `UTF8Span`.
     public mutating func previous() -> Character?
@@ -395,7 +404,7 @@ extension UTF8Span {
     ///
     /// Note: This is only for very specific, low-level use cases. If
     /// `codeUnitOffset` is not properly scalar-aligned, this function can
-    /// result in undefined behavior when, e.g., `next()` is called.
+    /// result in undefined behavior when, e.g., `next()` is called.
     ///
     /// If `i` is scalar-aligned, but not `Character`-aligned, you may get
     /// different results from running `Character` iteration.
@@ -445,13 +454,6 @@ extension UTF8Span {
 }
 ```

-We also support literal (i.e. non-canonical) pattern matching against `StaticString`.
   /// Whether `self` orders less than `other` under Unicode Canonical
   /// Equivalence using normalized code-unit order (in NFC).
-  public func isCanonicallyLessThan(
+  public func canonicallyPrecedes(
     _ other: UTF8Span
   ) -> Bool
 }
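To illustrate what ordering under canonical equivalence means, here is a small sketch that uses `String`, whose comparison operators already follow canonical equivalence, instead of `UTF8Span` itself. The variable names are illustrative. Two canonically equivalent spellings of "é" compare as equal, so neither precedes the other, even though their UTF-8 bytes differ and a raw byte comparison puts one of them first.

```swift
let decomposed = "e\u{301}"   // "e" + COMBINING ACUTE ACCENT, UTF-8: 65 CC 81
let precomposed = "\u{E9}"    // LATIN SMALL LETTER E WITH ACUTE, UTF-8: C3 A9

// Canonical ordering: the strings are equivalent, so neither precedes the other.
let canonicallyOrdered = decomposed < precomposed || precomposed < decomposed

// Binary ordering: plain lexicographic comparison of the UTF-8 code units.
let binaryPrecedes = decomposed.utf8.lexicographicallyPrecedes(precomposed.utf8)

print(canonicallyOrdered, binaryPrecedes)  // prints "false true"
```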
@@ -483,17 +485,17 @@ Slicing a `UTF8Span` is nuanced and depends on the caller's desired use. They ca

 ```swift
 extension UTF8Span {
-  /// Returns whether contents are known to be all-ASCII. A return value of
-  /// `true` means that all code units are ASCII. A return value of `false`
+  /// Returns whether contents are known to be all-ASCII. A return value of
+  /// `true` means that all code units are ASCII. A return value of `false`
   /// means there _may_ be non-ASCII content.
   ///
   /// ASCII-ness is checked and remembered during UTF-8 validation, so this
-  /// is often equivalent to is-ASCII, but there are some situations where
+  /// is often equivalent to is-ASCII, but there are some situations where
   /// we might return `false` even when the content happens to be all-ASCII.
   ///
-  /// For example, a UTF-8 span generated from a `String` that at some point
-  /// contained non-ASCII content would report false for `isKnownASCII`, even
-  /// if that String had subsequent mutation operations that removed any
+  /// For example, a UTF-8 span generated from a `String` that at some point
+  /// contained non-ASCII content would report false for `isKnownASCII`, even
+  /// if that String had subsequent mutation operations that removed any
   /// non-ASCII content.
   public var isKnownASCII: Bool { get }

@@ -621,16 +623,24 @@ extension UTF8Span {
 ```


-
 ### More alignments and alignment queries

 Future API could include word iterators (either [simple](https://www.unicode.org/reports/tr18/#Simple_Word_Boundaries) or [default](https://www.unicode.org/reports/tr18/#Default_Word_Boundaries)), line iterators, etc.

 Similarly, we could add API directly to `UTF8Span` for testing whether a given code unit offset is suitably aligned (including scalar or grapheme-cluster alignment checks).

+### `~=` and other operators
+
+`UTF8Span` supports both binary equivalence and Unicode canonical equivalence. For example, a textual format parser using `UTF8Span` might operate in terms of binary equivalence for processing the textual format itself and then in terms of Unicode canonical equivalence when interpreting the content of the fields.
+
+We are deferring, as future work, any decision on what the "default" comparison semantics should be. That decision would include defining a `~=` operator (which would allow one to switch over a `UTF8Span` and match against literals).
+
+It may also be the case that it makes more sense for a library or application to define wrapper types around `UTF8Span` which can define `~=` with their preferred comparison semantics.
+
+
 ### Creating `String` copies

-We could add an initializer to `String` that makes an owned copy of a `UTF8Span`'s contents. Such an initializer can skip UTF-8 validation.
+We could add an initializer to `String` that makes an owned copy of a `UTF8Span`'s contents. Such an initializer can skip UTF-8 validation.

 Alternatively, we could defer adding anything until more of the `Container` protocol story is clear.

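The wrapper-type idea from the `~=` discussion above can be sketched as follows. Because `UTF8Span` is non-escapable and tied to a recent toolchain, this sketch wraps a `String` instead; `BinaryMatched` and `classify` are hypothetical names, not part of the proposal. The wrapper opts into binary (code-unit) matching, so a `switch` can tell apart spellings that `String`'s default canonical comparison treats as equal.

```swift
/// Hypothetical wrapper that chooses binary (code-unit) equality for
/// pattern matching instead of canonical equivalence.
struct BinaryMatched {
  let text: String

  static func ~= (pattern: String, value: BinaryMatched) -> Bool {
    pattern.utf8.elementsEqual(value.text.utf8)
  }
}

func classify(_ field: BinaryMatched) -> String {
  switch field {
  case "caf\u{E9}": return "precomposed"     // é as a single scalar
  case "cafe\u{301}": return "decomposed"    // e followed by a combining accent
  default: return "other"
  }
}

print(classify(BinaryMatched(text: "cafe\u{301}")))  // prints "decomposed"
print("caf\u{E9}" == "cafe\u{301}")                  // prints "true"
```

With plain `String` matching, the first `case` would have matched both spellings, which is the kind of semantic choice the text above leaves to the wrapper's author.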
@@ -640,7 +650,7 @@ Future API could include checks for whether the content is in a particular norma

 ### UnicodeScalarView and CharacterView

-Like `Span`, we are deferring adding any collection-like types to non-escapable `UTF8Span`. Future work could include adding view types that conform to a new `Container`-like protocol.
+Like `Span`, we are deferring adding any collection-like types to non-escapable `UTF8Span`. Future work could include adding view types that conform to a new `Container`-like protocol.

 See "Alternatives Considered" below for more rationale on not adding `Collection`-like API in this proposal.

@@ -695,6 +705,26 @@ Many printing and logging protocols and facilities operate in terms of `String`.

 ## Alternatives considered

+### Problems arising from the unsafe init
+
+The combination of the unsafe init on `UTF8Span` and the copying init on `String` creates a new kind of easily accessible backdoor to `String`'s security and safety, namely the invariant that it holds validly encoded UTF-8 when in native form.
+
+Currently, String is 100% safe outside of crazy custom subclass shenanigans (only on ObjC platforms) or arbitrarily scribbling over memory (which is true of all of Swift). Both are highly visible and require writing many lines of advanced-knowledge code.
+
+Without these two API, it is in theory possible to skip validation and produce a String instance of the [indirect contiguous UTF-8](https://forums.swift.org/t/piercing-the-string-veil/21700) flavor through a custom subclass of NSString. But, it is only available on Obj-C platforms and involves creating a custom subclass of `NSString`, having knowledge of lazy bridging internals (which can and sometimes do change from release to release of Swift), and writing very specialized code. The product would be an unsafe lazily bridged instance of `String`, which could more than offset any performance gains from the workaround itself.
+
+With these two API, you can get to UB via:
+
+```swift
+let codeUnits = unsafe UTF8Span(unsafeAssumingValidUTF8: bytes)
+...
+String(copying: codeUnits)
+```
+
+We are (very) weakly in favor of keeping the unsafe init, because there are many low-level situations in which the valid-UTF8 invariant is held by the system itself (such as a data structure using a custom allocator).
+
+
+
 ### Invalid start / end of input UTF-8 encoding errors

 Earlier prototypes had `.invalidStartOfInput` and `.invalidEndOfInput` UTF8 validation errors to communicate that the input was perhaps incomplete or not sliced along scalar boundaries. In this scenario, `.invalidStartOfInput` is equivalent to `.unexpectedContinuation` with the range's lower bound equal to 0 and `.invalidEndOfInput` is equivalent to `.truncatedScalar` with the range's upper bound equal to `count`.
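The equivalence described above can be sketched with stand-in types. `FailureKind`, `ValidationFailure`, and the two helper functions are hypothetical names chosen for illustration and are not the standard library's declarations; the sketch only shows how a caller could recover the "cut at the start" and "cut at the end" cases from an error kind plus a byte range.

```swift
// Hypothetical stand-ins for the validation error described above; these
// are not the standard library's declarations.
enum FailureKind { case unexpectedContinuation, truncatedScalar }

struct ValidationFailure {
  var kind: FailureKind
  var byteOffsets: Range<Int>
}

/// True when the failure is what `.invalidStartOfInput` used to report.
func indicatesInvalidStart(_ failure: ValidationFailure) -> Bool {
  failure.kind == .unexpectedContinuation && failure.byteOffsets.lowerBound == 0
}

/// True when the failure is what `.invalidEndOfInput` used to report.
func indicatesInvalidEnd(_ failure: ValidationFailure, count: Int) -> Bool {
  failure.kind == .truncatedScalar && failure.byteOffsets.upperBound == count
}

// "é" is C3 A9 in UTF-8. A chunk that begins at A9 starts with a
// continuation byte; a 5-byte chunk whose last byte is C3 stops partway
// through the scalar. Both failures are built by hand for illustration.
let startsMidScalar = ValidationFailure(kind: .unexpectedContinuation, byteOffsets: 0..<1)
let endsMidScalar = ValidationFailure(kind: .truncatedScalar, byteOffsets: 4..<5)

print(indicatesInvalidStart(startsMidScalar))        // prints "true"
print(indicatesInvalidEnd(endsMidScalar, count: 5))  // prints "true"
```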
@@ -765,7 +795,7 @@ Scalar-alignment can still be checked and managed by the caller through the `res

 #### View Collections

-Another forumulation of these operations could be to provide a collection-like API phrased in terms of indices. Because `Collection`s are `Escapable`, we cannot conform nested `View` types to `Collection` so these would not benefit from any `Collection`-generic code, algorithms, etc.
+Another formulation of these operations could be to provide a collection-like API phrased in terms of indices. Because `Collection`s are `Escapable`, we cannot conform nested `View` types to `Collection` so these would not benefit from any `Collection`-generic code, algorithms, etc.

 A benefit of such `Collection`-like views is that it could help serve as adapter code for migration. Existing `Collection`-generic algorithms and methods could be converted to support `UTF8Span` via copy-paste-edit. That is, a developer could interact with `UTF8Span` ala: