Skip to content

bpo-30688: support \N{name} escapes in re patterns #2261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

jonathaneunice
Copy link
Contributor

@jonathaneunice jonathaneunice commented Jun 17, 2017

re specially handles Unicode escapes (\uXXXX and \UXXXXXXXX) so that even raw strings (r'...') have symbolic Unicode characters. But it has not supported named Unicode escapes such as r'\N{EM DASH}', making the escapes for string literals diverge from those for regular expressions.

This PR brings them back into alignment by supporting named Unicode escapes in re, along with accompanying updates to tests and docs.

The implementation is straightforward, but several notes:

  1. Uses ast.literal_eval rather than unicodedata.lookup to evaluate the names. While adequate and safe, it wasn't my first choice. But I had some issues importing unicodedata which seemed to be related to the sequencing of the build. Tinkering with CPython's build automation is above my pay grade.
  2. Tokenizer.getuntil() seems more apposite for tracking to the closing brace of a named escape, but I used the less-obvious .getwhile() as .getuntil() seems to have some baked-in assumptions about its use case, especially under error conditions. Wanting to tread lightly, I fell back to .getwhile().
  3. The 100-character maximum search width is sufficient; the longest current Unicode character name is 83 characters, for the delightful ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM, which seems unlikely to be excelled any time soon.

name length distribution

https://bugs.python.org/issue30688

@mention-bot
Copy link

@jonathaneunice, thanks for your PR! By analyzing the history of the files in this pull request, we identified @birkenfeld, @serhiy-storchaka and @tiran to be potential reviewers.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a Misc/NEWS and "What's New" entries and your name in Misc/ACKS.

@@ -443,7 +443,7 @@ character ``'$'``.
Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

\a \b \f \n
\a \b \f \n \N{name}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just \N.

Add '\N' to the note below:

``'\u'`` and ``'\U'`` escape sequences are only recognized ...

escape += source.getwhile(100, UNICODE_NAME)
escape += source.getwhile(1, CLOSING_BRACE)
try:
c = ord(literal_eval('"%s"' % escape))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer using unicodedata. If there are issues with importing, import it not at the module startup, but only if it is needed. Since this feature is relatively complex, the implementation can be moved in a separate function for avoiding the code duplication.

elif c == "N" and source.istext:
# named unicode escape e.g. \N{EM DASH}
escape += source.getwhile(1, OPENING_BRACE)
escape += source.getwhile(100, UNICODE_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use modified getuntil(). Just add yet one parameter for specifying what is missing in error messages.

badly_formed = [r'\N{BUBBA DASH}', r'\N{EM DASH',
r'\NEM DASH}', r'\NOGGIN']
for bad in badly_formed:
with self.assertRaises(re.error):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use checkPatternError() and test error messages.

for bad in badly_formed:
with self.assertRaises(re.error):
re.compile(bad)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add also tests for \N in bytes patterns.

suites = [
[ # basic matches
['\u2014', r'\u2014', '\N{EM DASH}',
r'\N{EM DASH}'], # pattern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the last case enough?

for patterns, match_yes, match_no in suites:
for pat in patterns:
for target in match_yes:
self.assertTrue(re.match(pat, target))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use subTest() when check in a loop.

Actually I think that loops are not needed. It is enough to test just one case for a pattern.

[ # basic matches
['\u2014', r'\u2014', '\N{EM DASH}',
r'\N{EM DASH}'], # pattern
['\u2014', '\N{EM DASH}', '—', '—and more'], # matches
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is hard to see differences between different dashes on terminal. Use just \u2014 or \N{EM DASH}.

@serhiy-storchaka serhiy-storchaka self-assigned this Jun 17, 2017
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use unicodedata. It is faster, uses less memory and is safer.

@brettcannon
Copy link
Member

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

@serhiy-storchaka
Copy link
Member

Since the original author didn't respond for long time, but the proposed feature looks worthy, I have created a new #5588 for addressing my requests.

@serhiy-storchaka serhiy-storchaka removed their assignment Dec 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants