Parse escaped backreferences and subpatterns #88

hamishknight · 2021-12-17T13:03:34Z

Parse the escaped syntaxes for backreferences and subpatterns (the latter are so syntactically similar, it made sense to also parse them).

This doesn't yet handle the non-escaped syntax for either, in particular Python-style backreferences `(?P=...)`. These are syntactically similar to groups (despite being atoms), so they will require some more thought on how to parse.

milseman

Overall LGTM, though I suspect the parser will be the one to make the call on whether `\n` (a backslash followed by a number) is a backreference or an octal escape. Otherwise the AST needs API to help clients decide.

Sources/_MatchingEngine/Regex/AST/Atom.swift

Sources/_MatchingEngine/Regex/Parse/LexicalAnalysis.swift

milseman · 2021-12-17T16:07:50Z

@swift-ci please test linux platform

hamishknight · 2021-12-17T21:05:26Z

Updated to disambiguate octal vs backreferences in the parser, and also extended the `\nnn` syntax to work more generally in custom character classes. How does this look, @milseman?

hamishknight · 2021-12-17T22:09:49Z

@swift-ci please test Linux

milseman

LGTM

milseman · 2021-12-17T23:23:32Z

Sources/_MatchingEngine/Regex/Parse/LexicalAnalysis.swift

+            num <= numberOfPriorGroups {
+          src.advance(digits.count)
+          return .backreference(.absolute(num))
+        }


Should we just return the octal literal?

We can do, though we'd need to call into `expectUnicodeScalar`, as we can only read up to 3 octal digits, and it wouldn't cover the `\0nn` case unless we also add a condition for that here.

Where is that logic when we fall through?

In `expectUnicodeScalar`, called from `expectEscaped`.

It seems more straightforward to make the explicit call, especially if we have context-local constraints on the number of digits or other aspects of interpretation. Otherwise, would the general case have to reconstruct this logic?

Fall-through also means that any intermediary code may have to be concerned with or reason about this possibility. In general I'm a fan of local reasoning. But this can be done after this PR, and it's not a huge deal.

> Otherwise, would the general case have to reconstruct this logic?

I might not be following what you mean by this, but I don't think so. The backreference and Unicode scalar cases differ in what they accept, but there is a syntactic ambiguity between them. In general, we just need to make sure we try to lex a backreference before a Unicode scalar. IMO if we change this logic to produce a scalar, we should also change the scalar logic to be able to produce a backreference, so we no longer need to reason about which is called first. I'm happy to do that in a follow-up, but I'm going to merge this for now to clean up the option parsing PR.

Sources/_MatchingEngine/Regex/Parse/LexicalAnalysis.swift

Implement octal disambiguation for the `\nnn` syntax, where a backreference is only formed if there have been that many prior groups, or the number begins with 8 or 9, or is less than 10. In addition, generalize the `\0nn` syntax to support arbitrary `\nnn` octal sequences inside and outside character classes.
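For readers skimming the thread, the disambiguation rule in this commit message can be sketched roughly as below. This is not the PR's code: `NumericEscape` and `classifyNumericEscape` are hypothetical names, and two details are assumptions taken from the discussion rather than from the diff, namely that a leading `0` always means octal and that at most 3 octal digits are read.

```swift
/// Hypothetical result type for this sketch; the PR's real AST differs.
enum NumericEscape: Equatable {
  case backreference(Int)
  case octalScalar(Unicode.Scalar)
}

/// Classifies the digits that follow a backslash, e.g. "12" for `\12`.
/// `priorGroups` is the number of groups seen before this escape.
func classifyNumericEscape(_ digits: String, priorGroups: Int) -> NumericEscape? {
  let decimalDigits: ClosedRange<Character> = "0"..."9"
  let octalDigits: ClosedRange<Character> = "0"..."7"

  guard let first = digits.first,
        digits.allSatisfy({ decimalDigits.contains($0) }),
        let num = Int(digits) else { return nil }

  // Assumption: a leading zero (`\0nn`) is always octal, never a backreference.
  // Otherwise a backreference forms if the number is below 10, begins with
  // 8 or 9 (which cannot start an octal sequence), or names a prior group.
  if first != "0",
     num < 10 || first == "8" || first == "9" || num <= priorGroups {
    return .backreference(num)
  }

  // Fall back to octal: read at most 3 leading octal digits as a scalar value.
  // Any remaining digits would be left for the caller in a real lexer.
  let octal = digits.prefix(3).prefix(while: { octalDigits.contains($0) })
  guard let value = UInt32(octal, radix: 8),
        let scalar = Unicode.Scalar(value) else { return nil }
  return .octalScalar(scalar)
}

// `\12` with 12 prior groups refers to group 12 ...
let withGroups = classifyNumericEscape("12", priorGroups: 12)
// ... but with only 3 prior groups it is octal 12, i.e. U+000A.
let withoutGroups = classifyNumericEscape("12", priorGroups: 3)
```

Keeping the decision in one function is in the spirit of the local-reasoning point raised in review, though the PR itself relies on trying the backreference lexer before the scalar one.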

hamishknight · 2021-12-20T19:42:51Z

@swift-ci please test Linux

hamishknight requested a review from milseman December 17, 2021 13:03

hamishknight mentioned this pull request Dec 17, 2021

Syntax Status and Roadmap #63

Closed

milseman approved these changes Dec 17, 2021

Sources/_MatchingEngine/Regex/AST/Atom.swift (outdated, resolved)

Sources/_MatchingEngine/Regex/Parse/LexicalAnalysis.swift (outdated, resolved)

hamishknight force-pushed the backrefs branch from 2f3208a to 9865167 Compare December 17, 2021 21:03

hamishknight force-pushed the backrefs branch from 9865167 to 555b0c4 Compare December 17, 2021 22:09

milseman approved these changes Dec 17, 2021

rxwei reviewed Dec 18, 2021

Sources/_MatchingEngine/Regex/Parse/LexicalAnalysis.swift (outdated, resolved)

hamishknight added 4 commits December 18, 2021 10:49

Add error case for unknown '(*...'

32fbefd

Missed this when implementing the rest of the group kinds. Because we don't backtrack, we should throw an error here after consuming a `*` in a group.

Handle end-of-input in lexUntil(eating:)

6cfceaf

Throw an error if we reach the end of input before we encounter the closing delimiter we expect. Also add an overload of `lexUntil(eating:)` that takes a character.
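A minimal sketch of the end-of-input behavior described in this commit, assuming a simplified cursor type. `Source`, `LexError`, and the body of `lexUntil(eating:)` here are illustrative stand-ins, not the parser's actual definitions; only the method name and its character-taking overload come from the commit message.

```swift
/// Hypothetical error type for this sketch.
enum LexError: Error, Equatable {
  case expected(Character)
}

/// A simplified stand-in for the parser's source cursor.
struct Source {
  var remaining: Substring

  init(_ input: String) { remaining = input[...] }

  /// Returns the text before `delimiter` and consumes through the delimiter.
  /// Throws if the input ends before the delimiter is found; in this sketch
  /// nothing is consumed in that case.
  mutating func lexUntil(eating delimiter: Character) throws -> String {
    guard let end = remaining.firstIndex(of: delimiter) else {
      throw LexError.expected(delimiter)
    }
    let contents = String(remaining[..<end])
    remaining = remaining[remaining.index(after: end)...]
    return contents
  }
}

// Closing delimiter present: the contents are returned, the delimiter eaten.
var ok = Source("name>rest")
let name = try! ok.lexUntil(eating: ">")

// Closing delimiter missing: an error is thrown instead of running off the end.
var unterminated = Source("name")
let failed = (try? unterminated.lexUntil(eating: ">")) == nil
```

Throwing here, rather than returning whatever was read, means callers never have to check separately whether the closing delimiter was actually seen.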

hamishknight force-pushed the backrefs branch from 555b0c4 to 3b5533f Compare December 18, 2021 10:49

hamishknight mentioned this pull request Dec 18, 2021

Parse matching options #91

Merged

hamishknight merged commit 0a7d4bb into swiftlang:main Dec 20, 2021

hamishknight deleted the backrefs branch December 20, 2021 20:43
