-
-
Notifications
You must be signed in to change notification settings - Fork 32.3k
bpo-30688: support \N{name} escapes in re patterns #2261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jonathaneunice, thanks for your PR! By analyzing the history of the files in this pull request, we identified @birkenfeld, @serhiy-storchaka and @tiran to be potential reviewers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a Misc/NEWS
and "What's New" entries and your name in Misc/ACKS
.
Doc/library/re.rst
Outdated
@@ -443,7 +443,7 @@ character ``'$'``. | |||
Most of the standard escapes supported by Python string literals are also | |||
accepted by the regular expression parser:: | |||
|
|||
\a \b \f \n | |||
\a \b \f \n \N{name} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just \N
.
Add '\N'
to the note below:
``'\u'`` and ``'\U'`` escape sequences are only recognized ...
escape += source.getwhile(100, UNICODE_NAME) | ||
escape += source.getwhile(1, CLOSING_BRACE) | ||
try: | ||
c = ord(literal_eval('"%s"' % escape)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer using unicodedata
. If there are issues with importing, import it not at the module startup, but only if it is needed. Since this feature is relatively complex, the implementation can be moved in a separate function for avoiding the code duplication.
elif c == "N" and source.istext: | ||
# named unicode escape e.g. \N{EM DASH} | ||
escape += source.getwhile(1, OPENING_BRACE) | ||
escape += source.getwhile(100, UNICODE_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use modified getuntil()
. Just add yet one parameter for specifying what is missing in error messages.
badly_formed = [r'\N{BUBBA DASH}', r'\N{EM DASH', | ||
r'\NEM DASH}', r'\NOGGIN'] | ||
for bad in badly_formed: | ||
with self.assertRaises(re.error): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use checkPatternError()
and test error messages.
for bad in badly_formed: | ||
with self.assertRaises(re.error): | ||
re.compile(bad) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add also tests for \N
in bytes patterns.
suites = [ | ||
[ # basic matches | ||
['\u2014', r'\u2014', '\N{EM DASH}', | ||
r'\N{EM DASH}'], # pattern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the last case enough?
for patterns, match_yes, match_no in suites: | ||
for pat in patterns: | ||
for target in match_yes: | ||
self.assertTrue(re.match(pat, target)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use subTest()
when check in a loop.
Actually I think that loops are not needed. It is enough to test just one case for a pattern.
[ # basic matches | ||
['\u2014', r'\u2014', '\N{EM DASH}', | ||
r'\N{EM DASH}'], # pattern | ||
['\u2014', '\N{EM DASH}', '—', '—and more'], # matches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard to see differences between different dashes on terminal. Use just \u2014
or \N{EM DASH}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use unicodedata
. It is faster, uses less memory and is safer.
To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request. If/when the requested changes have been made, please leave a comment that says, |
Since the original author didn't respond for long time, but the proposed feature looks worthy, I have created a new #5588 for addressing my requests. |
re
specially handles Unicode escapes (\uXXXX
and\UXXXXXXXX
) so that even raw strings (r'...'
) have symbolic Unicode characters. But it has not supported named Unicode escapes such asr'\N{EM DASH}'
, making the escapes for string literals diverge from those for regular expressions.This PR brings them back into alignment by supporting named Unicode escapes in
re
, along with accompanying updates to tests and docs.The implementation is straightforward, but several notes:
ast.literal_eval
rather thanunicodedata.lookup
to evaluate the names. While adequate and safe, it wasn't my first choice. But I had some issues importingunicodedata
which seemed to be related to the sequencing of the build. Tinkering with CPython's build automation is above my pay grade.Tokenizer.getuntil()
seems more apposite for tracking to the closing brace of a named escape, but I used the less-obvious.getwhile()
as.getuntil()
seems to have some baked-in assumptions about its use case, especially under error conditions. Wanting to tread lightly, I fell back to.getwhile()
.ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM
, which seems unlikely to be excelled any time soon.https://bugs.python.org/issue30688