Skip to content

Allow unbounded quoted sequences \Q... #432

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 15, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions Sources/_RegexParser/Regex/Parse/Diagnostics.swift
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@ enum ParseError: Error, Hashable {
case invalidEscape(Character)
case confusableCharacter(Character)

case quoteMayNotSpanMultipleLines

case cannotReferToWholePattern

case quantifierRequiresOperand(String)
Expand Down Expand Up @@ -138,6 +140,8 @@ extension ParseError: CustomStringConvertible {
return "invalid escape sequence '\\\(c)'"
case .confusableCharacter(let c):
return "'\(c)' is confusable for a metacharacter; use '\\u{...}' instead"
case .quoteMayNotSpanMultipleLines:
return "quoted sequence may not span multiple lines in multi-line literal"
case .cannotReferToWholePattern:
return "cannot refer to whole pattern here"
case .quantifierRequiresOperand(let q):
Expand Down
19 changes: 17 additions & 2 deletions Sources/_RegexParser/Regex/Parse/LexicalAnalysis.swift
Original file line number Diff line number Diff line change
Expand Up @@ -579,7 +579,7 @@ extension Source {

/// Try to consume quoted content
///
/// Quote -> '\Q' (!'\E' .)* '\E'
/// Quote -> '\Q' (!'\E' .)* '\E'?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the quote get terminated by a group or does it have to be terminated by the end of the regex? In that case, it seems more like:

    Quote -> '\Q' (!'\E' .)* ('\E' | <end-of-input>)

(or whatever our notational convention was for end of input)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has to be terminated by the end of the regex yeah, I can update the grammar to be explicit about that. I don't believe we ever added a convention for it, <end-of-input> seems reasonable to me tho.

///
/// With `SyntaxOptions.experimentalQuotes`, also accepts
///
Expand All @@ -592,9 +592,24 @@ extension Source {
mutating func lexQuote(context: ParsingContext) throws -> AST.Quote? {
let str = try recordLoc { src -> String? in
if src.tryEat(sequence: #"\Q"#) {
return try src.expectQuoted(endingWith: #"\E"#).value
let contents = src.lexUntil { src in
src.isEmpty || src.tryEat(sequence: #"\E"#)
}.value

// In multi-line literals, the quote may not span multiple lines.
if context.syntax.contains(.multilineExtendedSyntax),
contents.spansMultipleLinesInRegexLiteral {
throw ParseError.quoteMayNotSpanMultipleLines
}

// The sequence must not be empty in a custom character class.
if context.isInCustomCharacterClass && contents.isEmpty {
throw ParseError.expectedNonEmptyContents
}
return contents
}
if context.experimentalQuotes, src.tryEat("\"") {
// TODO: Can experimental quotes be empty?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we'll be dropping experimental mode in favor of (?"quoted")

return try src.expectQuoted(endingWith: "\"", ignoreEscaped: true).value
}
return nil
Expand Down
10 changes: 8 additions & 2 deletions Sources/_RegexParser/Regex/Parse/Parse.swift
Original file line number Diff line number Diff line change
Expand Up @@ -592,6 +592,13 @@ public func parse<S: StringProtocol>(
return ast
}

extension String {
/// Whether the given string is considered multi-line for a regex literal.
var spansMultipleLinesInRegexLiteral: Bool {
unicodeScalars.contains(where: { $0 == "\n" || $0 == "\r" })
}
}

/// Retrieve the default set of syntax options that a delimiter and literal
/// contents indicates.
fileprivate func defaultSyntaxOptions(
Expand All @@ -601,8 +608,7 @@ fileprivate func defaultSyntaxOptions(
case .forwardSlash:
// For an extended syntax forward slash e.g #/.../#, extended syntax is
// permitted if it spans multiple lines.
if delim.poundCount > 0 &&
contents.unicodeScalars.contains(where: { $0 == "\n" || $0 == "\r" }) {
if delim.poundCount > 0 && contents.spansMultipleLinesInRegexLiteral {
return .multilineExtendedSyntax
}
return .traditional
Expand Down
33 changes: 31 additions & 2 deletions Tests/RegexTests/ParseTests.swift
Original file line number Diff line number Diff line change
Expand Up @@ -754,6 +754,14 @@ extension RegexTests {
// This follows the PCRE behavior.
parseTest(#"\Q\\E"#, quote("\\"))

// ICU allows quotes to be empty outside of custom character classes.
parseTest(#"\Q\E"#, quote(""))

// Quotes may be unterminated.
parseTest(#"\Qab"#, quote("ab"))
parseTest(#"\Q"#, quote(""))
parseTest("\\Qab\\", quote("ab\\"))

parseTest(#"a" ."b"#, concat("a", quote(" ."), "b"),
syntax: .experimental)
parseTest(#"a" .""b""#, concat("a", quote(" ."), quote("b")),
Expand Down Expand Up @@ -2539,8 +2547,6 @@ extension RegexTests {
diagnosticTest(#"(?P"#, .expected(")"))
diagnosticTest(#"(?R"#, .expected(")"))

diagnosticTest(#"\Qab"#, .expected("\\E"))
diagnosticTest("\\Qab\\", .expected("\\E"))
diagnosticTest(#""ab"#, .expected("\""), syntax: .experimental)
diagnosticTest(#""ab\""#, .expected("\""), syntax: .experimental)
diagnosticTest("\"ab\\", .expectedEscape, syntax: .experimental)
Expand Down Expand Up @@ -2619,6 +2625,9 @@ extension RegexTests {
// TODO: Custom diagnostic for missing '\Q'
diagnosticTest(#"\E"#, .invalidEscape("E"))

diagnosticTest(#"[\Q\E]"#, .expectedNonEmptyContents)
diagnosticTest(#"[\Q]"#, .expected("]"))

// PCRE treats these as octal, but we require a `0` prefix.
diagnosticTest(#"[\1]"#, .invalidEscape("1"))
diagnosticTest(#"[\123]"#, .invalidEscape("1"))
Expand Down Expand Up @@ -2711,6 +2720,26 @@ extension RegexTests {
""", .cannotRemoveExtendedSyntaxInMultilineMode
)

diagnosticWithDelimitersTest(#"""
#/
\Q
\E
/#
"""#, .quoteMayNotSpanMultipleLines)

diagnosticWithDelimitersTest(#"""
#/
\Qabc
\E
/#
"""#, .quoteMayNotSpanMultipleLines)

diagnosticWithDelimitersTest(#"""
#/
\Q
/#
"""#, .quoteMayNotSpanMultipleLines)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be the regex /\Q/, i.e. it's not spanning multiple lines because there's no additional line? I'm ok either way, and perhaps reporting an error is helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Currently we don't do any whitespace stripping, so if we just allowed it as-is, it would parse as \Q\n \E. That being said, it seems reasonable enough to strip everything after-and-including the last newline given we enforce that the closing delimiter must appear on a newline. Does that sound reasonable to you, or should be leave it as an error for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... for now, if someone wants verbatim multi-line content, they have to use a string inside builder for that. We'll eventually get at least interpolation if not quoting as well. Otherwise the multi-line literal ignores the lines and whitespaces. It might be the case these are all the same regex:

Regex { "re gex" }

#/
  re\Q gex
/#

#/
  re\Q gex\E
/#

#/
  \Qre gex
/#

#/
  re\ gex
/#

/re\Q gex/

/re\Q gex\E/

For errors:

#/
  re\Q  # error: it spans lines (same logic as isolated option setting?)
  gex
/#

What about:

#/
  re\Q gex
  

/#

Newlines are a consideration of the host language. So perhaps \Q in a literal cannot quote a newline and a matching \E has to appear on the same line or it's the last semantic line in the literal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By semantic line, do you mean any line that doesn't consist entirely of whitespace? So:

#/
 re\Q gex
  

/#

would be legal as we'd strip all trailing whitespace after the x character, forming the quote \Q gex\E?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A related question here is whether (?-x) should be allowed if it's used on the last semantic line, e.g:

// Okay
#/
(?-x) abc


/#

// Not okay
#/
(?-x)abc
d
/#

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably have a consistent behavior between (?-x) and unterminated \Q whichever way we go. For now at least, I'm tempted to say we should keep everything using the rule where semantic whitespace cannot span multiple lines anywhere in the literal, so (?-x) and unterminated \Q are always invalid. So this test case would remain as-is. Given this rule only affects literals, we'd be free to lift in the future if we wanted. Does that sound reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM


// MARK: Group specifiers

diagnosticTest(#"(*"#, .unknownGroupKind("*"))
Expand Down