[stdlib] Speed up Character.init significantly for small characters. #6850

Merged: 2 commits, Jan 26, 2017
53 changes: 40 additions & 13 deletions stdlib/public/core/Character.swift
@@ -125,11 +125,32 @@ public struct Character :
utf8CodeUnitCount: Builtin.Word,
isASCII: Builtin.Int1
) {
self = Character(
String(
_builtinExtendedGraphemeClusterLiteral: start,
utf8CodeUnitCount: utf8CodeUnitCount,
isASCII: isASCII))
// Most character literals are going to be fewer than eight UTF-8 code
// units; for those, build the small character representation directly.
let maxCodeUnitCount = MemoryLayout<UInt64>.size
if _fastPath(Int(utf8CodeUnitCount) <= maxCodeUnitCount) {
var buffer: UInt64 = ~0
_memcpy(
dest: UnsafeMutableRawPointer(Builtin.addressof(&buffer)),
src: UnsafeMutableRawPointer(start),
size: UInt(utf8CodeUnitCount))
// Copying the bytes directly from the literal into an integer assumes
// little endianness, so convert the copied data into host endianness.
let utf8Chunk = UInt64(littleEndian: buffer)
Collaborator


Q: is this kosher for big-endian systems?

@allevato (Member, Author), Jan 17, 2017


The .small representation orders the packed code units from low to high bits. Since we're doing a memcpy of code units directly into the memory occupied by a UInt64, it's assuming little-endianness everywhere, and the UInt64.init(littleEndian:) call is necessary to ensure the correct integer representation on big-endian systems. The same logic, which I adapted, already exists in the function called by Character.init(String:).

Collaborator


Got it. Maybe some of the comments from the other place would be illuminating here too.
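
For readers following the endianness discussion, here is a minimal sketch of the packing step in plain Swift. `packUTF8` is a hypothetical helper written for illustration, not a standard library function; it uses public APIs in place of the `Builtin` calls in the diff, and assumes unused bytes are padded with 0xFF as the `~0` initial value suggests.

```swift
/// Hypothetical helper (not part of the standard library): packs up to eight
/// UTF-8 code units into a UInt64, first code unit in the lowest byte, with
/// unused bytes left as 0xFF padding.
func packUTF8(_ codeUnits: [UInt8]) -> UInt64? {
    guard codeUnits.count <= MemoryLayout<UInt64>.size else { return nil }
    var buffer: UInt64 = ~0  // every byte starts out as 0xFF
    withUnsafeMutableBytes(of: &buffer) { raw in
        // Byte-for-byte copy into the integer's storage, like the memcpy above.
        raw.copyBytes(from: codeUnits)
    }
    // The bytes now sit in memory order, which is the little-endian encoding
    // of the packed value. Converting from little-endian is a no-op on
    // little-endian hosts and a byte swap on big-endian ones.
    return UInt64(littleEndian: buffer)
}

let packedA = packUTF8([0x61])                  // "a"
let packedE = packUTF8(Array("\u{E9}".utf8))    // "é" is 0xC3 0xA9 in UTF-8
```

Without the `UInt64(littleEndian:)` conversion, a big-endian host would see the first code unit in the highest byte rather than the lowest.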

let bits = maxCodeUnitCount &* 8 &- 1
// Verify that the highest bit is set; the small representation drops it
// when truncating to 63 bits.
if _fastPath(utf8Chunk & (1 << numericCast(bits)) != 0) {
_representation = .small(Builtin.trunc_Int64_Int63(utf8Chunk._value))
return
}
}
// For anything that doesn't fit in 63 bits, build the large
// representation.
self = Character(_largeRepresentationString: String(
_builtinExtendedGraphemeClusterLiteral: start,
utf8CodeUnitCount: utf8CodeUnitCount,
isASCII: isASCII))
}

/// Creates a character with the specified value.
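
The high-bit test in this hunk can be sketched without `Builtin` types. The two functions below are hypothetical stand-ins, not standard library API. They assume that a chunk qualifies for the small form only when its top bit is set (which is what the `!= 0` test checks), and that a reader of the small form would restore that bit when widening back to 64 bits; the second point is an assumption, since that code is not shown in this diff.

```swift
/// Hypothetical stand-in for the truncation to 63 bits: returns the low
/// 63 bits when the top bit of `chunk` is set, or nil when the value would
/// need the large representation instead.
func smallPayload(_ chunk: UInt64) -> UInt64? {
    let highBit: UInt64 = 1 << 63
    guard chunk & highBit != 0 else { return nil }
    return chunk & ~highBit
}

/// Assumed inverse: puts the dropped top bit back to recover the chunk.
func restoreChunk(_ payload: UInt64) -> UInt64 {
    return payload | (1 << 63)
}

let chunkForA: UInt64 = 0xFFFF_FFFF_FFFF_FF61   // "a" plus 0xFF padding
let eightASCII: UInt64 = 0x6161_6161_6161_6161  // "aaaaaaaa", no padding left
```

Here `chunkForA` fits the small form because its top byte is padding, while `eightASCII` has its top bit clear and falls through to the large path.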
@@ -183,15 +204,21 @@ public struct Character :
_representation = .small(Builtin.trunc_Int64_Int63(initialUTF8._value))
}
else {
if let native = s._core.nativeBuffer,
native.start == s._core._baseAddress! {
_representation = .large(native._storage)
return
}
var nativeString = ""
nativeString.append(s)
_representation = .large(nativeString._core.nativeBuffer!._storage)
self = Character(_largeRepresentationString: s)
}
}

/// Creates a Character from a String that is already known to require the
/// large representation.
internal init(_largeRepresentationString s: String) {
if let native = s._core.nativeBuffer,
native.start == s._core._baseAddress! {
_representation = .large(native._storage)
return
}
var nativeString = ""
nativeString.append(s)
_representation = .large(nativeString._core.nativeBuffer!._storage)
}

/// Returns the index of the lowest byte that is 0xFF, or 8 if
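
The doc comment above is cut off in this view. As a rough illustration of the idea it describes, the sketch below scans a packed chunk for its lowest 0xFF byte. `lowestPaddingByteIndex` is a hypothetical name, and returning 8 when no byte is 0xFF is an assumption about the truncated sentence. Because 0xFF never appears in valid UTF-8, the returned index equals the number of code units stored in the chunk.

```swift
/// Hypothetical sketch: index of the lowest byte of `chunk` that equals 0xFF,
/// or 8 when no byte is 0xFF (assumed behavior; the original comment is
/// truncated).
func lowestPaddingByteIndex(_ chunk: UInt64) -> Int {
    for index in 0..<8 {
        // Shift the byte of interest into the low 8 bits and truncate.
        let byte = UInt8(truncatingIfNeeded: chunk >> UInt64(index * 8))
        if byte == 0xFF { return index }
    }
    return 8
}
```

A linear scan keeps the sketch easy to check; a production version could use bit tricks instead, but nothing in this diff shows which approach the standard library takes.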