Skip to content

Parse escaped backreferences and subpatterns #88

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 20, 2021

Conversation

hamishknight
Copy link
Contributor

Parse the escaped syntaxes for backreferences and subpatterns (the latter are so syntactically similar, it made sense to also parse them).

This doesn't yet handle the non-escaped syntax for either, in particular Python-style backreferences (?P=...). These are syntactically similar to groups (despite being atoms), so will require some more thought on how to parse.

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, though I suspect the parser will be the one to make the call whether \n is a backreference or an octal reference. Otherwise the AST needs API to help clients decide.

@milseman
Copy link
Member

@swift-ci please test linux platform

@hamishknight
Copy link
Contributor Author

Updated to disambiguate octal vs backreferences in the parser, and also extended the \nnn syntax to work more generally in custom character classes, how does this look @milseman?

@hamishknight
Copy link
Contributor Author

@swift-ci please test Linux

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

num <= numberOfPriorGroups {
src.advance(digits.count)
return .backreference(.absolute(num))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just return the octal literal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do, though we'd need to call into expectUnicodeScalar as we can only read up to 3 digits of octal, and it wouldn't cover the \0nn case unless we also add a condition for that here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is that logic when we fall through?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In expectUnicodeScalar, called from expectEscaped

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems more straight-forward to make the explicit call, especially if we have context-local constraints on the number of digits or other aspects of interpretation. Otherwise, would the general case have to reconstruct this logic?

Fall-through also means that any intermediary code may have to be concerned with or reason about this possibility. In general I'm a fan of local reasoning. But this can be done after this PR, and it's not a huge deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, would the general case have to reconstruct this logic?

I might not be following what you mean by this, but I don't think so. Both the backreference and unicode scalar cases have differences in what they accept, but there is a syntactic ambiguity between them. In general, we just need to make sure to try lex as a backreference before a unicode scalar. IMO if we change this logic to produce a scalar, we should also change the scalar logic to be able to produce a backreference so we no longer need to reason about which is called first. I'm happy to do that in a follow-up, but going to merge this for now to cleanup the option parsing PR

Missed this when implementing the rest of the
group kinds. Because we don't backtrack, we should
throw an error here after consuming a `*` in a
group.
Throw an error if we reach the end of input before
we encounter the closing delimiter we expect. Also
add an overload of `lexUntil(eating:)` that takes
a character.
Parse the escaped syntaxes for backreferences and
subpatterns (the latter are so syntactically
similar, it made sense to also parse them).

This doesn't yet handle the non-escaped syntax for
either, in particular Python-style backreferences
`(?P=...)`. These are syntactically similar to
groups (despite being atoms), so will require some
more thought on how to parse.
Implement octal disambiguation for the `\nnn`
syntax where a backreference is only formed if
there have been that many prior groups, or it
begins with 8 or 9, or is less than 10.

In addition, generalize the \0nn syntax to
support arbitrary \nnn octal sequences inside and
outside character classes.
@hamishknight
Copy link
Contributor Author

@swift-ci please test Linux

@hamishknight hamishknight merged commit 0a7d4bb into swiftlang:main Dec 20, 2021
@hamishknight hamishknight deleted the backrefs branch December 20, 2021 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants