Commit 61965c3

Restrict character property fuzzy matching to "pattern whitespace"
I wasn't aware of this Unicode property when initially implementing this. It's a more restricted set of whitespace that Unicode recommends for parsing patterns. It's the same set of whitespace used for extended syntax. UAX44-LM3 itself doesn't appear to specify the exact set of whitespace to match against, but this is no more restrictive than the engines I'm aware of.
1 parent c13980f commit 61965c3
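
For readers unfamiliar with the property, the following is a minimal sketch of the loose-matching normalization this commit describes. It is not the parser's actual code: `isPatternWhitespaceScalar` and `normalizedPropertyName` are hypothetical names introduced here for illustration, the code-point list is the Pattern_White_Space set as published in Unicode's PropList.txt, and the "is" prefix handling is left out.

```swift
// Hypothetical helper (not from the commit): true for the code points
// Unicode assigns the Pattern_White_Space property.
func isPatternWhitespaceScalar(_ scalar: Unicode.Scalar) -> Bool {
  switch scalar.value {
  case 0x09...0x0D,      // tab, line feed, vertical tab, form feed, carriage return
       0x20,             // space
       0x85,             // next line
       0x200E, 0x200F,   // left-to-right and right-to-left marks
       0x2028, 0x2029:   // line and paragraph separators
    return true
  default:
    return false
  }
}

// Hypothetical helper (not from the commit): drops pattern whitespace, "_"
// and "-", then lowercases, in the spirit of UAX44-LM3.
func normalizedPropertyName(_ name: String) -> String {
  var kept = String.UnicodeScalarView()
  for scalar in name.unicodeScalars
  where !isPatternWhitespaceScalar(scalar) && scalar != "_" && scalar != "-" {
    kept.append(scalar)
  }
  return String(kept).lowercased()
}
```

Under these assumptions, `normalizedPropertyName(" General_Category ")` yields `"generalcategory"`, while a non-breaking space (U+00A0) survives the filter, so `"L\u{A0}l"` does not collapse to `"ll"`. That is the behavior the new diagnostic test below relies on.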

2 files changed: 15 additions and 1 deletion

Sources/_RegexParser/Regex/Parse/CharacterPropertyClassification.swift

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ extension Source {
     // This follows the rules provided by UAX44-LM3, including trying to drop an
     // "is" prefix, which isn't required by UTS#18 RL1.2, but is nice for
     // consistency with other engines and the Unicode.Scalar.Properties names.
-    let str = str.filter { !$0.isWhitespace && $0 != "_" && $0 != "-" }
+    let str = str.filter { !$0.isPatternWhitespace && $0 != "_" && $0 != "-" }
       .lowercased()
     if let m = match(str) {
       return m

Tests/RegexTests/ParseTests.swift

Lines changed: 14 additions & 0 deletions
@@ -2061,6 +2061,16 @@ extension RegexTests {
       """, changeMatchingOptions(matchingOptions(adding: .extended))
     )
 
+    parseWithDelimitersTest(#"""
+      #/
+      \p{
+        gc
+        =
+        digit
+      }
+      /#
+      """#, prop(.generalCategory(.decimalNumber)))
+
     // MARK: Delimiter skipping: Make sure we can skip over the ending delimiter
     // if it's clear that it's part of the regex syntax.

@@ -2486,6 +2496,10 @@ extension RegexTests {
     diagnosticTest(#"\p{aaa\p{b}}"#, .unknownProperty(key: nil, value: "aaa"))
     diagnosticTest(#"[[:{:]]"#, .unknownProperty(key: nil, value: "{"))
 
+    // We only filter pattern whitespace, which doesn't include things like
+    // non-breaking spaces.
+    diagnosticTest(#"\p{L\#u{A0}l}"#, .unknownProperty(key: nil, value: "L\u{A0}l"))
+
     // MARK: Matching options
 
     diagnosticTest("(?-y{g})", .cannotRemoveTextSegmentOptions)
