[String] Use a UTF-8 representation for native strings #20315

Merged
merged 45 commits into from
Nov 5, 2018
4ab45df
[String] Drop in initial UTF-8 String prototype
milseman Nov 3, 2018
de893b6
[String] Fix overlays to request index interchange for native strings
milseman Oct 15, 2018
89d18e1
[String] Refactor helper code into UnicodeHelpers.swift.
milseman Oct 15, 2018
9bf2c4d
[String] Use small string at string creation
milseman Oct 15, 2018
7c00552
[String] In-place append and other RRC improvements
milseman Oct 15, 2018
f23a3c1
[String] Bounds checking and Index cleanup
milseman Oct 15, 2018
95ef4bc
[String] Emit literals as UTF-8 rather than UTF-16
milseman Oct 15, 2018
fe7c3ce
[String] Refactorings and cleanup
milseman Oct 28, 2018
2e368a3
[String] Introduce StringBreadcrumbs
milseman Oct 28, 2018
f27d3db
[tests] Fix, cleanup, and clarify some SIL character tests
milseman Sep 27, 2018
a0e639e
[String] Grapheme breaking fast-paths
milseman Oct 28, 2018
f1a35bd
String comparison iterator for UTF8 strings
Sep 22, 2018
c51aa59
[String] Cleanup normalization code.
milseman Sep 24, 2018
fa4c8a6
Give StringProtocol.SubSequence a default of Substring to supress war…
airspeedswift Sep 26, 2018
f56b098
[stdlib] Add consuming/owned annotations to Collection implementation…
airspeedswift Sep 21, 2018
752423b
[String] Remove dead code and decls
milseman Oct 28, 2018
9d9f900
[String] Define performance flags and plumb them throughout
milseman Oct 28, 2018
8851bac
[String] Inlining, NFC fast paths, and more.
milseman Oct 29, 2018
79e9f26
integrating utf8 validation
weissi Sep 28, 2018
bee9374
use legacy replacement ranges to fix tests
weissi Oct 17, 2018
7376009
Add benchmarks and tests for the normalized iterator (#32)
Oct 10, 2018
7aea406
[String] NFC iterator fast-paths
milseman Oct 28, 2018
bacc7ee
fix the normalization unit tests
Oct 12, 2018
9b5eb23
[test] Fix test: add in explicit mkdir
milseman Oct 15, 2018
70b8de1
[test] Replace hard-code SIL test with Swift-based one
milseman Oct 16, 2018
d92098b
[String] Performance improvements to comparison
milseman Oct 12, 2018
a37d110
[String] Constant-fold small strings from literals.
milseman Oct 28, 2018
9135c07
[String] Perform small string append in-register
milseman Oct 28, 2018
d5da6fd
[String] More comparison speedups and cleanup
milseman Oct 14, 2018
75728eb
[String] Implement in-place generic RRC
milseman Oct 21, 2018
b87bff4
[test] Test the unique-native String RRC optimization path
milseman Oct 21, 2018
e2c2e47
[test] Test the breadcrumbing String<->Cocoa interface
milseman Oct 24, 2018
cb0fbc6
[String] 5X Faster getCharacters implementation
milseman Oct 25, 2018
e6582c3
[test] Adjust String tests for UTF-8 representation.
milseman Oct 29, 2018
40aae6b
[String] 32-bit platform support
lorentey Oct 25, 2018
921bb99
[test] Adjust test: we do more constant folding than before!
milseman Oct 26, 2018
948655e
[String] Cleanups, comments, documentation
milseman Oct 25, 2018
53ccd9e
[string] Less inlining for code size.
milseman Oct 27, 2018
53ee971
[test] Update test to reflect DCE
milseman Oct 29, 2018
d112e5f
[test] Dump in new API and ABI reference point
milseman Oct 29, 2018
3820393
[stdlib] Ensure that reserved capacity survives CoW copies (#46)
lorentey Oct 29, 2018
c04dcf3
[String] More efficient breadcrumb-scanning code.
milseman Oct 29, 2018
1939d16
Make corelibs-foundation build
milseman Oct 29, 2018
ec6729a
[String] Assertion logic and isASCII bug fix.
milseman Nov 3, 2018
fee2787
[String] Invalidate breadcrumbs on mutation.
milseman Nov 5, 2018
225 changes: 124 additions & 101 deletions benchmark/single-source/StringComparison.swift

Large diffs are not rendered by default.

70 changes: 58 additions & 12 deletions benchmark/single-source/StringComparison.swift.gyb
@@ -30,32 +30,58 @@ extension String {
}
}

% Names = ["ascii", "latin1", "fastPrenormal", "slowerPrenormal", "nonBMPSlowestPrenormal", "emoji", "abnormal", "zalgo", "longSharedPrefix"]
% AllWorkloads = ["ascii", "latin1", "fastPrenormal", "slowerPrenormal", "nonBMPSlowestPrenormal", "emoji", "abnormal", "zalgo", "longSharedPrefix"]
% ComparisonWorkloads = AllWorkloads
% HashingWorkloads = ["ascii", "latin1", "fastPrenormal", "slowerPrenormal", "nonBMPSlowestPrenormal", "emoji", "abnormal", "zalgo", "longSharedPrefix"]

public let StringComparison = [
% for Name in Names:
// TODO(UTF8 post-merge): Disable longSharedPrefix hashing benchmark, which is
// enabled here for 1-to-1 comparison vs master

// TODO(UTF8 post-merge): Enable NormalizedIteratorWorkloads for ["ascii",
// "latin1", "fastPrenormal", "slowerPrenormal", "nonBMPSlowestPrenormal",
// "emoji", "abnormal", "zalgo"]

% NormalizedIteratorWorkloads = []

public let StringComparison: [BenchmarkInfo] = [
% for Name in ComparisonWorkloads:
BenchmarkInfo(
name: "StringComparison_${Name}",
runFunction: run_StringComparison_${Name},
tags: [.validation, .api, .String],
setUpFunction: { blackHole(Workload_${Name}) }),
% end # Names
setUpFunction: { blackHole(Workload_${Name}) }
),
% end # ComparisonWorkloads
]

public let StringHashing = [
% for Name in Names:
public let StringHashing: [BenchmarkInfo] = [
% for Name in HashingWorkloads:
BenchmarkInfo(
name: "StringHashing_${Name}",
runFunction: run_StringHashing_${Name},
tags: [.validation, .api, .String],
setUpFunction: { blackHole(Workload_${Name}) }),
% end # Names
setUpFunction: { blackHole(Workload_${Name}) }
),
% end # HashingWorkloads
]

% for Name in Names:
public let NormalizedIterator: [BenchmarkInfo] = [
% for Name in NormalizedIteratorWorkloads:
BenchmarkInfo(
name: "NormalizedIterator_${Name}",
runFunction: run_NormalizedIterator_${Name},
tags: [.validation, .String],
setUpFunction: { blackHole(Workload_${Name}) }
),
% end # NormalizedIteratorWorkloads
]

% for Name in AllWorkloads:
var Workload_${Name}: Workload! = Workload.${Name}

% end # AllWorkloads

%for Name in ComparisonWorkloads:
@inline(never)
public func run_StringComparison_${Name}(_ N: Int) {
let workload: Workload = Workload_${Name}
@@ -70,6 +96,9 @@ public func run_StringComparison_${Name}(_ N: Int) {
}
}

% end # ComparisonWorkloads

%for Name in HashingWorkloads:
@inline(never)
public func run_StringHashing_${Name}(_ N: Int) {
let workload: Workload = Workload.${Name}
@@ -81,8 +110,25 @@ public func run_StringHashing_${Name}(_ N: Int) {
}
}
}

% end # Names

% end # HashingWorkloads

%for Name in NormalizedIteratorWorkloads:
@inline(never)
public func run_NormalizedIterator_${Name}(_ N: Int) {
let workload: Workload = Workload.${Name}
let tripCount = workload.tripCount
let payload = workload.payload
for _ in 1...tripCount*N {
for str in payload {
str._withNFCCodeUnits { cu in
blackHole(cu)
}
}
}
}

% end # NormalizedIteratorWorkloads

struct Workload {
static let N = 100
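
Workload names such as "abnormal" and "fastPrenormal" suggest that the comparison and hashing benchmarks cover strings that are not stored in normalized form. The sketch below is not taken from this PR; it uses only the public `String` API, and the variable names are illustrative. It shows the behaviour those benchmarks appear to exercise: two strings with different UTF-8 code units still compare equal and hash equally when they are canonically equivalent.

```swift
// Illustrative values, not part of the benchmark workloads.
let precomposed = "caf\u{E9}"    // "é" as the single scalar U+00E9
let decomposed = "cafe\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT

// The stored UTF-8 code units differ...
let precomposedBytes = Array(precomposed.utf8)
let decomposedBytes = Array(decomposed.utf8)

// ...but String equality is defined by canonical equivalence,
// so the comparison has to look past the raw bytes.
let sameString = (precomposed == decomposed)

// Hashable requires equal values to produce equal hashes.
let sameHash = (precomposed.hashValue == decomposed.hashValue)

print(precomposedBytes.count, decomposedBytes.count, sameString, sameHash)
```

With a UTF-8 native representation, the all-ASCII and already-normalized cases can plausibly be compared byte by byte, which may be why the workloads separate "ascii" and the "Prenormal" variants from "abnormal" and "zalgo".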
1 change: 1 addition & 0 deletions benchmark/utils/main.swift
@@ -266,6 +266,7 @@ registerBenchmark(NSErrorTest)
registerBenchmark(NSStringConversion)
registerBenchmark(NibbleSort)
registerBenchmark(NopDeinit)
registerBenchmark(NormalizedIterator)
registerBenchmark(ObjectAllocation)
#if os(macOS) || os(iOS) || os(watchOS) || os(tvOS)
registerBenchmark(ObjectiveCBridging)
2 changes: 0 additions & 2 deletions include/swift/AST/DiagnosticsSema.def
@@ -2825,8 +2825,6 @@ ERROR(builtin_unicode_scalar_literal_broken_proto,none,
ERROR(unicode_scalar_literal_broken_proto,none,
"protocol 'ExpressibleByUnicodeScalarLiteral' is broken", ())

ERROR(builtin_utf16_extended_grapheme_cluster_literal_broken_proto,none,
"protocol '_ExpressibleByBuiltinUTF16ExtendedGraphemeClusterLiteral' is broken", ())
ERROR(builtin_extended_grapheme_cluster_literal_broken_proto,none,
"protocol '_ExpressibleByBuiltinExtendedGraphemeClusterLiteral' is broken", ())
ERROR(extended_grapheme_cluster_literal_broken_proto,none,
1 change: 0 additions & 1 deletion include/swift/AST/KnownIdentifiers.def
@@ -161,7 +161,6 @@ IDENTIFIER_(builtinUnicodeScalarLiteral)
IDENTIFIER(unicodeScalarLiteral)

IDENTIFIER(stringLiteral)
IDENTIFIER_(builtinUTF16StringLiteral)
IDENTIFIER_(builtinStringLiteral)
IDENTIFIER(StringLiteralType)
IDENTIFIER(stringInterpolation)
2 changes: 0 additions & 2 deletions include/swift/AST/KnownProtocols.def
@@ -88,12 +88,10 @@ EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByImageLiteral)
EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByFileReferenceLiteral)

BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinBooleanLiteral)
BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinUTF16ExtendedGraphemeClusterLiteral)
BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinExtendedGraphemeClusterLiteral)
BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinFloatLiteral)
BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinIntegerLiteral)
BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinStringLiteral)
BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinUTF16StringLiteral)
BUILTIN_EXPRESSIBLE_BY_LITERAL_PROTOCOL_(ExpressibleByBuiltinUnicodeScalarLiteral)

#undef EXPRESSIBLE_BY_LITERAL_PROTOCOL
2 changes: 0 additions & 2 deletions lib/IRGen/GenMeta.cpp
@@ -4040,12 +4040,10 @@ SpecialProtocol irgen::getSpecialProtocolID(ProtocolDecl *P) {
case KnownProtocolKind::ExpressibleByImageLiteral:
case KnownProtocolKind::ExpressibleByFileReferenceLiteral:
case KnownProtocolKind::ExpressibleByBuiltinBooleanLiteral:
case KnownProtocolKind::ExpressibleByBuiltinUTF16ExtendedGraphemeClusterLiteral:
case KnownProtocolKind::ExpressibleByBuiltinExtendedGraphemeClusterLiteral:
case KnownProtocolKind::ExpressibleByBuiltinFloatLiteral:
case KnownProtocolKind::ExpressibleByBuiltinIntegerLiteral:
case KnownProtocolKind::ExpressibleByBuiltinStringLiteral:
case KnownProtocolKind::ExpressibleByBuiltinUTF16StringLiteral:
case KnownProtocolKind::ExpressibleByBuiltinUnicodeScalarLiteral:
case KnownProtocolKind::OptionSet:
case KnownProtocolKind::BridgedNSError:
8 changes: 1 addition & 7 deletions lib/SILOptimizer/Utils/Local.cpp
@@ -732,16 +732,10 @@ bool StringConcatenationOptimizer::extractStringConcatOperands() {
auto AILeftOperandsNum = AILeft->getNumOperands();
auto AIRightOperandsNum = AIRight->getNumOperands();

// makeUTF16 should have following parameters:
// (start: RawPointer, utf16CodeUnitCount: Word)
// makeUTF8 should have following parameters:
// (start: RawPointer, utf8CodeUnitCount: Word, isASCII: Int1)
if (!((FRILeftFun->hasSemanticsAttr("string.makeUTF16") &&
AILeftOperandsNum == 4) ||
(FRILeftFun->hasSemanticsAttr("string.makeUTF8") &&
if (!((FRILeftFun->hasSemanticsAttr("string.makeUTF8") &&
AILeftOperandsNum == 5) ||
(FRIRightFun->hasSemanticsAttr("string.makeUTF16") &&
AIRightOperandsNum == 4) ||
(FRIRightFun->hasSemanticsAttr("string.makeUTF8") &&
AIRightOperandsNum == 5)))
return false;
73 changes: 11 additions & 62 deletions lib/Sema/CSApply.cpp
@@ -2094,55 +2094,24 @@ namespace {
Diag<> brokenBuiltinProtocolDiag;

if (isStringLiteral) {
// If the string contains only ASCII, force a UTF8 representation
bool forceASCII = stringLiteral != nullptr;
if (forceASCII) {
for (auto c: stringLiteral->getValue()) {
if (c & (1 << 7)) {
forceASCII = false;
break;
}
}
}

literalType = tc.Context.Id_StringLiteralType;

literalFuncName = DeclName(tc.Context, DeclBaseName::createConstructor(),
{ tc.Context.Id_stringLiteral });

// If the string contains non-ASCII and the type can handle
// UTF-16 string literals, prefer them.
builtinProtocol = tc.getProtocol(
expr->getLoc(),
KnownProtocolKind::ExpressibleByBuiltinUTF16StringLiteral);

if (!forceASCII && (tc.conformsToProtocol(
type, builtinProtocol, cs.DC,
ConformanceCheckFlags::InExpression))) {
builtinLiteralFuncName =
DeclName(tc.Context, DeclBaseName::createConstructor(),
{tc.Context.Id_builtinUTF16StringLiteral,
tc.Context.getIdentifier("utf16CodeUnitCount")});

if (stringLiteral)
stringLiteral->setEncoding(StringLiteralExpr::UTF16);
else
magicLiteral->setStringEncoding(StringLiteralExpr::UTF16);
} else {
// Otherwise, fall back to UTF-8.
builtinProtocol = tc.getProtocol(
expr->getLoc(),
KnownProtocolKind::ExpressibleByBuiltinStringLiteral);
builtinLiteralFuncName
= DeclName(tc.Context, DeclBaseName::createConstructor(),
{ tc.Context.Id_builtinStringLiteral,
tc.Context.getIdentifier("utf8CodeUnitCount"),
tc.Context.getIdentifier("isASCII") });
if (stringLiteral)
stringLiteral->setEncoding(StringLiteralExpr::UTF8);
else
magicLiteral->setStringEncoding(StringLiteralExpr::UTF8);
}
KnownProtocolKind::ExpressibleByBuiltinStringLiteral);
builtinLiteralFuncName
= DeclName(tc.Context, DeclBaseName::createConstructor(),
{ tc.Context.Id_builtinStringLiteral,
tc.Context.getIdentifier("utf8CodeUnitCount"),
tc.Context.getIdentifier("isASCII") });
if (stringLiteral)
stringLiteral->setEncoding(StringLiteralExpr::UTF8);
else
magicLiteral->setStringEncoding(StringLiteralExpr::UTF8);

brokenProtocolDiag = diag::string_literal_broken_proto;
brokenBuiltinProtocolDiag = diag::builtin_string_literal_broken_proto;
} else if (isGraphemeClusterLiteral) {
@@ -2163,26 +2132,6 @@ namespace {
diag::extended_grapheme_cluster_literal_broken_proto;
brokenBuiltinProtocolDiag =
diag::builtin_extended_grapheme_cluster_literal_broken_proto;

auto *builtinUTF16ExtendedGraphemeClusterProtocol = tc.getProtocol(
expr->getLoc(),
KnownProtocolKind::ExpressibleByBuiltinUTF16ExtendedGraphemeClusterLiteral);
if (tc.conformsToProtocol(type,
builtinUTF16ExtendedGraphemeClusterProtocol,
cs.DC, ConformanceCheckFlags::InExpression)) {
builtinLiteralFuncName
= DeclName(tc.Context, DeclBaseName::createConstructor(),
{ tc.Context.Id_builtinExtendedGraphemeClusterLiteral,
tc.Context.getIdentifier("utf16CodeUnitCount") });

builtinProtocol = builtinUTF16ExtendedGraphemeClusterProtocol;
brokenBuiltinProtocolDiag =
diag::builtin_utf16_extended_grapheme_cluster_literal_broken_proto;
if (stringLiteral)
stringLiteral->setEncoding(StringLiteralExpr::UTF16);
else
magicLiteral->setStringEncoding(StringLiteralExpr::UTF16);
}
} else {
// Otherwise, we should have just one Unicode scalar.
literalType = tc.Context.Id_UnicodeScalarLiteralType;
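
After this change the type checker no longer picks between a UTF-16 and a UTF-8 builtin initializer: every string literal goes through the `utf8CodeUnitCount:isASCII:` path. As a rough user-level sketch of what those two arguments describe, assuming only the public `ExpressibleByStringLiteral` protocol (the type `ByteCountingLiteral` is hypothetical and not part of this PR):

```swift
// Hypothetical type for illustration only; it is not part of the PR.
struct ByteCountingLiteral: ExpressibleByStringLiteral {
  let utf8Count: Int
  let isASCII: Bool

  init(stringLiteral value: String) {
    // `utf8` is a view over the string's UTF-8 code units.
    utf8Count = value.utf8.count
    // A string is ASCII when every UTF-8 code unit has its high bit clear.
    isASCII = value.utf8.allSatisfy { $0 < 0x80 }
  }
}

let plain: ByteCountingLiteral = "cafe"
let accented: ByteCountingLiteral = "caf\u{E9}"   // U+00E9 takes two UTF-8 code units

print(plain.utf8Count, plain.isASCII)
print(accented.utf8Count, accented.isASCII)
```

The high-bit test mirrors the `c & (1 << 7)` check in the removed `forceASCII` loop; the difference is that a non-ASCII literal no longer changes which initializer is called, only the value passed as `isASCII`.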
102 changes: 102 additions & 0 deletions stdlib/private/StdlibUnicodeUnittest/StdlibUnicodeUnittest.swift
@@ -12,6 +12,108 @@

import StdlibUnittest

extension String {
func parseUTF8CodeUnits() -> [UInt8] {
var utf8 = [UInt8]()
let units = self.split(separator: " ")
let scalars = units.compactMap { string -> Unicode.Scalar? in
let i = Int(string, radix: 16)!
return Unicode.Scalar(i)

}

for scalar in scalars {
utf8 += String(scalar).utf8
}
return utf8
}

func parseUTF16CodeUnits() -> [UInt16] {
var utf16 = [UInt16]()
let units = self.split(separator: " ")
let scalars = units.compactMap { string -> Unicode.Scalar? in
let i = Int(string, radix: 16)!
return Unicode.Scalar(i)
}

for scalar in scalars {
utf16 += scalar.utf16
}
return utf16
}
}

public struct NormalizationTest {
public let loc: SourceLoc
public let sourceUTF16: [UInt16]
public let source: [UInt8]
public let NFC: [UInt8]
public let NFD: [UInt8]
public let NFKC: [UInt8]
public let NFKD: [UInt8]

init(
loc: SourceLoc,
source: String,
NFC: String,
NFD: String,
NFKC: String,
NFKD: String
) {
self.loc = loc
self.sourceUTF16 = source.parseUTF16CodeUnits()
self.source = source.parseUTF8CodeUnits()
self.NFC = NFC.parseUTF8CodeUnits()
self.NFD = NFD.parseUTF8CodeUnits()
self.NFKC = NFKC.parseUTF8CodeUnits()
self.NFKD = NFKD.parseUTF8CodeUnits()
}
}

// Normalization tests are currently only available on Darwin, awaiting a sensible
// file API...
#if _runtime(_ObjC)
import Foundation
public let normalizationTests: [NormalizationTest] = {
var tests = [NormalizationTest]()

let file = CommandLine.arguments[2]
let fileURL = URL(fileURLWithPath: file)

let fileContents = try! String(contentsOf: fileURL) + "" // go faster

var lineNumber: UInt = 0
for line in fileContents.split(separator: "\n") {
lineNumber += 1
guard line.hasPrefix("#") == false else {
continue
}

let content = line.split(separator: "#").first!

guard !content.isEmpty else {
continue
}
guard !content.hasPrefix("@") else {
continue
}

let columns = content.split(separator: ";").filter { $0 != " " }.map(String.init)
let test = NormalizationTest(
loc: SourceLoc(file, lineNumber),
source: columns[0],
NFC: columns[1],
NFD: columns[2],
NFKC: columns[3],
NFKD: columns[4])

tests.append(test)
}

return tests
}()
#endif

public struct UTFTest {
public struct Flags : OptionSet {
public let rawValue: Int
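
The `normalizationTests` loader above splits each data line on `#` and `;` and turns every column of space-separated hex scalars into UTF-8 code units. A minimal standalone sketch of that parsing step follows. It assumes a line in the style of the Unicode `NormalizationTest.txt` data file; the helper name `utf8Bytes(fromHexScalars:)` and the sample line are illustrative rather than taken from the PR, and unlike the code in the diff it skips malformed tokens instead of trapping.

```swift
// Hypothetical helper: converts "0044 0307" style fields into UTF-8 code units.
func utf8Bytes(fromHexScalars field: Substring) -> [UInt8] {
  var bytes = [UInt8]()
  for token in field.split(separator: " ") {
    // Skip tokens that are not hex numbers or not valid Unicode scalar values.
    guard let value = UInt32(token, radix: 16),
          let scalar = Unicode.Scalar(value) else { continue }
    bytes.append(contentsOf: String(scalar).utf8)
  }
  return bytes
}

// Illustrative line: source; NFC; NFD; NFKC; NFKD; followed by a comment.
let sampleLine = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"

// Drop the trailing comment, then split into columns as the loader does.
let content = sampleLine.split(separator: "#").first!
let columns = content.split(separator: ";").filter { $0 != " " }

let source = utf8Bytes(fromHexScalars: columns[0])
let nfd = utf8Bytes(fromHexScalars: columns[2])

print(columns.count, source, nfd)
```

Here `source` holds the three-byte UTF-8 encoding of U+1E0A, while `nfd` holds `D` followed by the two-byte encoding of U+0307, which is the kind of expected-output pair a `NormalizationTest` value stores.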
2 changes: 1 addition & 1 deletion stdlib/private/StdlibUnittest/StdlibCoreExtras.swift
@@ -28,7 +28,7 @@ import Foundation
//

func findSubstring(_ haystack: Substring, _ needle: String) -> String.Index? {
return findSubstring(String(haystack._ephemeralContent), needle)
return findSubstring(haystack._ephemeralString, needle)
}

func findSubstring(_ string: String, _ substring: String) -> String.Index? {