Skip to content

Optimize matching to match on scalar values when possible #525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Jul 12, 2022
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
e518bdf
Add unicode and dna benchmarks
rctcwyvrn Jun 23, 2022
eb64036
Fixup naming, turn off firstmatch, add benchmark filtering by regex
rctcwyvrn Jun 23, 2022
1237f9a
Make --exclude-ns actually work correctly
rctcwyvrn Jun 23, 2022
fdf2c23
Add usage comment for generateFasta
rctcwyvrn Jun 23, 2022
eeb38e9
First ver
rctcwyvrn Jun 28, 2022
6f76f36
Merge branch 'main' into match-scalar
rctcwyvrn Jun 29, 2022
3a2b324
Remove matchseq entirely
rctcwyvrn Jun 29, 2022
df8919e
Merge branch 'just-one-more-benchmark-suite' into temp
rctcwyvrn Jun 29, 2022
c2ee8cc
Finish up matchScalar
rctcwyvrn Jun 29, 2022
809b085
Factor out nextScalarIndex for matchBitsetScalar
rctcwyvrn Jun 29, 2022
06c77c7
Add scalar mode support for matching bitsets + fix bug
rctcwyvrn Jun 29, 2022
e3d7ad7
Emit matchScalar in quotedLiteral when in unicode scalar mode
rctcwyvrn Jun 29, 2022
3e1e088
Add tests
rctcwyvrn Jun 29, 2022
b4c7c8c
Cleanup
rctcwyvrn Jun 30, 2022
5359e31
Revert "Merge branch 'just-one-more-benchmark-suite' into temp"
rctcwyvrn Jun 30, 2022
6e4c2bd
Cleanup
rctcwyvrn Jul 4, 2022
1a359b4
Add case-insensitive match instructions
rctcwyvrn Jul 4, 2022
097ffeb
Remove extra instructions and use payload bits instead
rctcwyvrn Jul 4, 2022
76667a2
Comment out compiletests for now
rctcwyvrn Jul 4, 2022
6a1f6e9
Fix compile tests
rctcwyvrn Jul 5, 2022
fce8f9a
Merge branch 'main' into scalar-optimizations-clean
rctcwyvrn Jul 7, 2022
0860368
Fix scalar matching in grapheme semantic mode
hamishknight Jul 8, 2022
c6bc811
Preserve scalar syntax in DSL conversion
hamishknight Jul 8, 2022
bca1d2b
Change scalar semantics to match #565
rctcwyvrn Jul 11, 2022
79e60ac
Merge branch 'not-to-scale' into scalar-optimizations-clean
rctcwyvrn Jul 11, 2022
22cc9d5
Add edge case test
rctcwyvrn Jul 11, 2022
7a9923d
Always match .scalar under grapheme semantics
rctcwyvrn Jul 11, 2022
47f8e66
Merge branch 'main' into scalar-optimizations-clean
rctcwyvrn Jul 11, 2022
7aa98d1
Add new instructions to compile tests
rctcwyvrn Jul 11, 2022
113cfe3
Add an XFAIL test for scalar coalescing
rctcwyvrn Jul 12, 2022
9949b8e
Fix XCTExpectFailure for linux
rctcwyvrn Jul 12, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 11 additions & 9 deletions Sources/_StringProcessing/ByteCodeGen.swift
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,11 @@ fileprivate extension Compiler.ByteCodeGen {
emitCharacter(c)

case let .scalar(s):
emitScalar(s)
if options.semanticLevel == .graphemeCluster {
emitCharacter(Character(s))
} else {
emitMatchScalar(s)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should fail tests, and if it's not we may need to write the tests and XFAIL them. E.g. /a\u{301}/, IIUC, as written would try to make a proper Character from the combining scalar.


case let .assertion(kind):
try emitAssertion(kind.ast)
Expand Down Expand Up @@ -92,7 +96,7 @@ fileprivate extension Compiler.ByteCodeGen {
guard options.semanticLevel == .graphemeCluster else {
for char in s {
for scalar in char.unicodeScalars {
emitScalar(scalar)
emitMatchScalar(scalar)
}
}
return
Expand Down Expand Up @@ -273,22 +277,20 @@ fileprivate extension Compiler.ByteCodeGen {
}
}

mutating func emitScalar(_ s: UnicodeScalar) {
// A scalar in grapheme semantic mode must match a full charcter, so
// perform a boundary check
let boundaryCheck = options.semanticLevel == .graphemeCluster
mutating func emitMatchScalar(_ s: UnicodeScalar) {
assert(options.semanticLevel == .unicodeScalar)
if options.isCaseInsensitive && s.properties.isCased {
builder.buildMatchScalarCaseInsensitive(s, boundaryCheck: boundaryCheck)
builder.buildMatchScalarCaseInsensitive(s, boundaryCheck: false)
} else {
builder.buildMatchScalar(s, boundaryCheck: boundaryCheck)
builder.buildMatchScalar(s, boundaryCheck: false)
}
}

mutating func emitCharacter(_ c: Character) {
// Unicode scalar mode matches the specific scalars that comprise a character
if options.semanticLevel == .unicodeScalar {
for scalar in c.unicodeScalars {
emitScalar(scalar)
emitMatchScalar(scalar)
}
return
}
Expand Down
2 changes: 1 addition & 1 deletion Tests/RegexBuilderTests/RegexDSLTests.swift
Original file line number Diff line number Diff line change
Expand Up @@ -1159,7 +1159,7 @@ class RegexDSLTests: XCTestCase {
XCTAssertNotNil(try r3.matchingSemantics(.unicodeScalar).wholeMatch(in: "👨‍👨‍👧‍👦"))

let r4 = Regex { "é" as UnicodeScalar }
XCTAssertNil(
XCTAssertNotNil(
try r4.firstMatch(in: "e\u{301}")
)
XCTAssertNotNil(
Expand Down