-
Notifications
You must be signed in to change notification settings - Fork 50
Parse escaped backreferences and subpatterns #88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, though I suspect the parser will be the one to make the call whether \n
is a backreference or an octal reference. Otherwise the AST needs API to help clients decide.
@swift-ci please test linux platform |
2f3208a
to
9865167
Compare
Updated to disambiguate octal vs backreferences in the parser, and also extended the |
9865167
to
555b0c4
Compare
@swift-ci please test Linux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
num <= numberOfPriorGroups { | ||
src.advance(digits.count) | ||
return .backreference(.absolute(num)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just return the octal literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do, though we'd need to call into expectUnicodeScalar
as we can only read up to 3 digits of octal, and it wouldn't cover the \0nn
case unless we also add a condition for that here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is that logic when we fall through?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In expectUnicodeScalar
, called from expectEscaped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems more straight-forward to make the explicit call, especially if we have context-local constraints on the number of digits or other aspects of interpretation. Otherwise, would the general case have to reconstruct this logic?
Fall-through also means that any intermediary code may have to be concerned with or reason about this possibility. In general I'm a fan of local reasoning. But this can be done after this PR, and it's not a huge deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, would the general case have to reconstruct this logic?
I might not be following what you mean by this, but I don't think so. Both the backreference and unicode scalar cases have differences in what they accept, but there is a syntactic ambiguity between them. In general, we just need to make sure to try lex as a backreference before a unicode scalar. IMO if we change this logic to produce a scalar, we should also change the scalar logic to be able to produce a backreference so we no longer need to reason about which is called first. I'm happy to do that in a follow-up, but going to merge this for now to cleanup the option parsing PR
Missed this when implementing the rest of the group kinds. Because we don't backtrack, we should throw an error here after consuming a `*` in a group.
Throw an error if we reach the end of input before we encounter the closing delimiter we expect. Also add an overload of `lexUntil(eating:)` that takes a character.
Parse the escaped syntaxes for backreferences and subpatterns (the latter are so syntactically similar, it made sense to also parse them). This doesn't yet handle the non-escaped syntax for either, in particular Python-style backreferences `(?P=...)`. These are syntactically similar to groups (despite being atoms), so will require some more thought on how to parse.
Implement octal disambiguation for the `\nnn` syntax where a backreference is only formed if there have been that many prior groups, or it begins with 8 or 9, or is less than 10. In addition, generalize the \0nn syntax to support arbitrary \nnn octal sequences inside and outside character classes.
555b0c4
to
3b5533f
Compare
@swift-ci please test Linux |
Parse the escaped syntaxes for backreferences and subpatterns (the latter are so syntactically similar, it made sense to also parse them).
This doesn't yet handle the non-escaped syntax for either, in particular Python-style backreferences
(?P=...)
. These are syntactically similar to groups (despite being atoms), so will require some more thought on how to parse.