Skip to content

[String] Add UTF-8 fast-paths for Foundation initializers #21959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 23, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 25 additions & 3 deletions stdlib/public/Darwin/Foundation/NSStringAPI.swift
Original file line number Diff line number Diff line change
Expand Up @@ -178,6 +178,10 @@ extension String {
/// Creates a string by copying the data from a given
/// C array of UTF8-encoded bytes.
public init?(utf8String bytes: UnsafePointer<CChar>) {
if let str = String(validatingUTF8: bytes) {
self = str
return
}
if let ns = NSString(utf8String: bytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not remove the NSString-and-bridge part of this entirely?
Then we could also make it inlineable since it trivially forwards to the stdlib function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to preserve behavior if the bytes are invalidly encoded. We should probably deprecate this initializer.

self = String._unconditionallyBridgeFromObjectiveC(ns)
} else {
Expand All @@ -202,12 +206,18 @@ extension String {
/// - Parameters:
/// - bytes: A sequence of bytes to interpret using `encoding`.
/// - encoding: The ecoding to use to interpret `bytes`.
public init? <S: Sequence>(bytes: __shared S, encoding: Encoding)
where S.Iterator.Element == UInt8 {
public init?<S: Sequence>(bytes: __shared S, encoding: Encoding)
where S.Iterator.Element == UInt8 {
let byteArray = Array(bytes)
if encoding == .utf8,
let str = byteArray.withUnsafeBufferPointer({ String._tryFromUTF8($0) })
{
self = str
return
}

if let ns = NSString(
bytes: byteArray, length: byteArray.count, encoding: encoding.rawValue) {

self = String._unconditionallyBridgeFromObjectiveC(ns)
} else {
return nil
Expand Down Expand Up @@ -365,6 +375,10 @@ extension String {
cString: UnsafePointer<CChar>,
encoding enc: Encoding
) {
if enc == .utf8, let str = String(validatingUTF8: cString) {
self = str
return
}
if let ns = NSString(cString: cString, encoding: enc.rawValue) {
self = String._unconditionallyBridgeFromObjectiveC(ns)
} else {
Expand All @@ -381,6 +395,14 @@ extension String {
/// Returns a `String` initialized by converting given `data` into
/// Unicode characters using a given `encoding`.
public init?(data: __shared Data, encoding: Encoding) {
if encoding == .utf8,
let str = data.withUnsafeBytes({
String._tryFromUTF8($0.bindMemory(to: UInt8.self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm, in previous discussions, we've agreed that the safety of binding Data is dubious given the fact that the underlying buffer may have been previously bound. Are we making the assertion that binding to a trivial type here is safe? /CC @atrick

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to express the UTF-8 validation in terms of raw buffer pointers instead of typed? (Or overload to make this possible safely?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is something to be concerned with, but if it is:

The fix would be for Data to provide a way to access its contents as a UBP<UInt8>, since Data is the owner (or knows who the owner is) and is the only one allowed to do so. If String exposed something taking a URBP, then treated it as a collection of UInt8s, that seems like it's sweeping the potential time bomb under the living room rug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data might know who the owner is, but it doesn't necessarily make it safe to bind the contents to UInt8, since the previous owner may have bound the buffer to a nontrivial type, and Data cannot unbind it — URBPs specifically do allow reading as UInt8s via a raw interface, though, making it always safe to read from.

I'm not terribly concerned (as we've previously discussed — we can't prevent someone from doing something like reading arbitrary data bound to some non-trivial type U and trying to pass it along to be interpreted as UTF-8), but we did do a lot in Data to make sure we don't invoke undefined behavior like this silently on others' behalf. @atrick might have stronger opinions here, but I agree with him in that we shouldn't be repeating that here.

Copy link
Contributor

@karwa karwa Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI withContiguousStorageIfAvailable provides a typed pointer, so Data could use that. Unfortunately, URBP doesn't support it (I guess because it doesn't want to bind random memory). EDIT: Oops, I thought Data supported that, but it's doing what URBP does.

I encountered that issue with #22028
Currently, if you write:

class MyClass {}
var s = [MyClass(), MyClass(), MyClass()]
let sbytes = withUnsafeBytes(of: &s) { String(decoding: $0, as: UTF8.self) }

you will invoke a fast-path which calls assumingMemoryBound(to: UInt8.self) on the raw pointer, which IIUC is technically undefined behaviour. But it was marked "unsafe"...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any String utilities that we want to accept a raw byte stream, like Data, including validateUTF8 and _uncheckedFromUTF8 should either take URBP or some ContiguouslyStored protocol. Either will be source compatible with the existing APIs, so they can probably be fixed later.

}) {
self = str
return
}

guard let s = NSString(data: data, encoding: encoding.rawValue) else { return nil }
self = String._unconditionallyBridgeFromObjectiveC(s)
}
Expand Down
36 changes: 27 additions & 9 deletions stdlib/public/core/StringCreate.swift
Original file line number Diff line number Diff line change
Expand Up @@ -15,13 +15,33 @@
internal func _allASCII(_ input: UnsafeBufferPointer<UInt8>) -> Bool {
// NOTE: Avoiding for-in syntax to avoid bounds checks
//
// TODO(String performance): Vectorize and/or incorporate into validity
// checking, perhaps both.
// TODO(String performance): SIMD-ize
//
let ptr = input.baseAddress._unsafelyUnwrappedUnchecked
var i = 0
while i < input.count {
guard ptr[i] <= 0x7F else { return false }

let count = input.count
let stride = MemoryLayout<UInt>.stride
let address = Int(bitPattern: ptr)

let wordASCIIMask = UInt(truncatingIfNeeded: 0x8080_8080_8080_8080 as UInt64)
let byteASCIIMask = UInt8(truncatingIfNeeded: wordASCIIMask)

while (address &+ i) % stride != 0 && i < count {
guard ptr[i] & byteASCIIMask == 0 else { return false }
i &+= 1
}

while (i &+ stride) <= count {
let word: UInt = UnsafePointer(
bitPattern: address &+ i
)._unsafelyUnwrappedUnchecked.pointee
guard word & wordASCIIMask == 0 else { return false }
i &+= stride
}

while i < count {
guard ptr[i] & byteASCIIMask == 0 else { return false }
i &+= 1
}
return true
Expand All @@ -42,12 +62,10 @@ extension String {
return storage.asString
}

@usableFromInline
internal static func _tryFromUTF8(
_ input: UnsafeBufferPointer<UInt8>
) -> String? {
public // SPI(Foundation)
static func _tryFromUTF8(_ input: UnsafeBufferPointer<UInt8>) -> String? {
guard case .success(let extraInfo) = validateUTF8(input) else {
return nil
return nil
}

return String._uncheckedFromUTF8(input, isASCII: extraInfo.isASCII)
Expand Down
4 changes: 4 additions & 0 deletions stdlib/public/core/StringUTF8Validation.swift
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,10 @@ extension UTF8ValidationResult: Equatable {}
private struct UTF8ValidationError: Error {}

internal func validateUTF8(_ buf: UnsafeBufferPointer<UInt8>) -> UTF8ValidationResult {
if _allASCII(buf) {
return .success(UTF8ExtraInfo(isASCII: true))
}

var iter = buf.makeIterator()
var lastValidIndex = buf.startIndex

Expand Down