bpo-30688: support \N{name} escapes in re patterns #2261

jonathaneunice · 2017-06-17T08:44:50Z

re specially handles Unicode escapes (\uXXXX and \UXXXXXXXX) so that even raw strings (r'...') can express Unicode characters symbolically. But it has not supported named Unicode escapes such as r'\N{EM DASH}', so the escapes accepted in string literals diverge from those accepted in regular expressions.

This PR brings them back into alignment by supporting named Unicode escapes in re, along with accompanying updates to tests and docs.

The implementation is straightforward, but several notes:

Uses ast.literal_eval rather than unicodedata.lookup to evaluate the names. While adequate and safe, it wasn't my first choice. But I had some issues importing unicodedata which seemed to be related to the sequencing of the build. Tinkering with CPython's build automation is above my pay grade.
Tokenizer.getuntil() seems more apposite for tracking to the closing brace of a named escape, but I used the less-obvious .getwhile() as .getuntil() seems to have some baked-in assumptions about its use case, especially under error conditions. Wanting to tread lightly, I fell back to .getwhile().
The 100-character maximum search width is sufficient; the longest current Unicode character name is 83 characters, for the delightful ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM, which seems unlikely to be excelled any time soon.
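
For illustration only, the two ways of resolving a character name mentioned in the notes above can be compared in a few lines. This is a sketch, not the PR's code: the helper names `resolve_with_literal_eval` and `resolve_with_lookup` and the constant `LONG_NAME` are hypothetical, and the `literal_eval` variant assumes the name contains no quote or backslash characters.

```python
import unicodedata
from ast import literal_eval

# The 83-character name quoted in the note above.
LONG_NAME = ("ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE "
             "WITH ALEF MAKSURA ISOLATED FORM")


def resolve_with_literal_eval(name):
    # Build the source text of a string literal containing the escape,
    # then let the literal evaluator decode it (hypothetical helper).
    return literal_eval('"\\N{%s}"' % name)


def resolve_with_lookup(name):
    # Ask the Unicode database directly (hypothetical helper).
    return unicodedata.lookup(name)


if __name__ == "__main__":
    print(resolve_with_literal_eval("EM DASH") == resolve_with_lookup("EM DASH"))
    print(len(LONG_NAME))
```

Both helpers return the same one-character string for a valid name; they differ in how an unknown name is reported, which matters for the error handling discussed in the review.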

https://bugs.python.org/issue30688

mention-bot · 2017-06-17T08:44:55Z

@jonathaneunice, thanks for your PR! By analyzing the history of the files in this pull request, we identified @birkenfeld, @serhiy-storchaka and @tiran to be potential reviewers.

serhiy-storchaka

Add Misc/NEWS and "What's New" entries, and add your name to Misc/ACKS.

serhiy-storchaka · 2017-06-17T10:18:29Z

Doc/library/re.rst

@@ -443,7 +443,7 @@ character ``'$'``.
 Most of the standard escapes supported by Python string literals are also
 accepted by the regular expression parser::

-   \a      \b      \f      \n
+   \a      \b      \f      \n      \N{name}


Just \N.

Add '\N' to the note below:

``'\u'`` and ``'\U'`` escape sequences are only recognized ...

serhiy-storchaka · 2017-06-17T10:26:31Z

Lib/sre_parse.py

+            escape += source.getwhile(100, UNICODE_NAME)
+            escape += source.getwhile(1, CLOSING_BRACE)
+            try:
+                c = ord(literal_eval('"%s"' % escape))


I prefer using unicodedata. If there are issues with importing it, import it lazily, only when it is needed, rather than at module startup. Since this feature is relatively complex, the implementation can be moved into a separate function to avoid code duplication.
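
A minimal sketch of what this suggestion could look like, assuming a standalone helper. The name `lookup_named_escape` and the use of `ValueError` are illustrative assumptions; the real parser would raise its own pattern error instead.

```python
def lookup_named_escape(name):
    """Return the code point for a Unicode character name.

    Hypothetical helper: unicodedata is imported inside the function,
    so the import cost (and any build-ordering problem) is only paid
    when a \\N{...} escape is actually encountered.
    """
    import unicodedata  # deferred import, not at module startup

    try:
        # lookup() may return a multi-character string for a named
        # sequence, in which case ord() raises TypeError.
        return ord(unicodedata.lookup(name))
    except (KeyError, TypeError):
        raise ValueError("undefined character name %r" % name) from None


if __name__ == "__main__":
    print(hex(lookup_named_escape("EM DASH")))
```

Keeping the lookup in one function means both call sites in the parser (inside and outside character classes) could share it.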

serhiy-storchaka · 2017-06-17T10:29:38Z

Lib/sre_parse.py

+        elif c == "N" and source.istext:
+            # named unicode escape e.g. \N{EM DASH}
+            escape += source.getwhile(1, OPENING_BRACE)
+            escape += source.getwhile(100, UNICODE_NAME)


You could use a modified getuntil(). Just add one more parameter that specifies what is missing, for use in error messages.
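
The idea can be sketched with a toy tokenizer. `MiniTokenizer` is a hypothetical stand-in written for this illustration, not the `Tokenizer` class in sre_parse, and it raises `ValueError` where the real code would raise a pattern error; the error wording is also an assumption.

```python
class MiniTokenizer:
    """Toy scanner illustrating a getuntil() with a 'name' parameter."""

    def __init__(self, string):
        self.string = string
        self.index = 0

    def getuntil(self, terminator, name):
        # Find the terminator starting from the current position.
        end = self.string.find(terminator, self.index)
        if end < 0:
            raise ValueError("missing %s, unterminated name" % terminator)
        result = self.string[self.index:end]
        if not result:
            # 'name' tells the caller what exactly was expected here.
            raise ValueError("missing %s" % name)
        # Consume the text and the terminator itself.
        self.index = end + len(terminator)
        return result


if __name__ == "__main__":
    tok = MiniTokenizer("EM DASH} and more")
    print(tok.getuntil("}", "character name"))
```

With the extra parameter, the same method can serve group names and character names while still producing a specific message for each.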

serhiy-storchaka · 2017-06-17T10:35:06Z

Lib/test/test_re.py

+        badly_formed = [r'\N{BUBBA DASH}', r'\N{EM DASH',
+                        r'\NEM DASH}', r'\NOGGIN']
+        for bad in badly_formed:
+            with self.assertRaises(re.error):


Use checkPatternError() and test error messages.
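
As a rough illustration of checking the message and position rather than only the exception type, here is a self-contained sketch. `check_pattern_error` is a hypothetical stand-in for the test suite's checkPatternError(), and the pattern used is an unrelated, long-standing error case chosen only because its message is stable.

```python
import re


def check_pattern_error(pattern, errmsg, pos=None):
    """Assert that compiling 'pattern' fails with the given message.

    Hypothetical helper for illustration; it relies on the 'msg' and
    'pos' attributes of re.error.
    """
    try:
        re.compile(pattern)
    except re.error as err:
        assert err.msg == errmsg, err.msg
        if pos is not None:
            assert err.pos == pos, err.pos
    else:
        raise AssertionError("%r compiled without error" % pattern)


if __name__ == "__main__":
    check_pattern_error("(", "missing ), unterminated subpattern", 0)
```

Testing the message text catches cases where a malformed escape is rejected for the wrong reason.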

serhiy-storchaka · 2017-06-17T10:36:42Z

Lib/test/test_re.py

+        for bad in badly_formed:
+            with self.assertRaises(re.error):
+                re.compile(bad)
+


Also add tests for \N in bytes patterns.
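
A small sketch of the behaviour such a test would pin down, assuming Python 3.8 or later, where named escapes are available in str patterns. The helper `compiles` is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if 'pattern' compiles, False on a pattern error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


if __name__ == "__main__":
    # str pattern: the named escape is recognised (Python 3.8+ assumed).
    print(compiles(r'\N{EM DASH}'))
    # bytes pattern: \N is not a valid escape and is rejected.
    print(compiles(br'\N{EM DASH}'))
```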

serhiy-storchaka · 2017-06-17T10:37:37Z

Lib/test/test_re.py

+        suites = [
+            [   # basic matches
+                ['\u2014', r'\u2014', '\N{EM DASH}',
+                 r'\N{EM DASH}'],                               # pattern


Isn't the last case enough?

serhiy-storchaka · 2017-06-17T10:40:38Z

Lib/test/test_re.py

+        for patterns, match_yes, match_no in suites:
+            for pat in patterns:
+                for target in match_yes:
+                    self.assertTrue(re.match(pat, target))


Use subTest() when checking in a loop.

Actually I think that loops are not needed. It is enough to test just one case for a pattern.
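
For reference, a looped check wrapped in subTest() might look like the sketch below. The class and method names are hypothetical, and the raw `\N{EM DASH}` pattern assumes Python 3.8 or later.

```python
import re
import unittest


class NamedEscapeTests(unittest.TestCase):
    """Hypothetical test case illustrating subTest() in a loop."""

    def test_equivalent_patterns(self):
        patterns = ['\u2014', r'\u2014', '\N{EM DASH}', r'\N{EM DASH}']
        for pat in patterns:
            # Each pattern is reported separately if it fails.
            with self.subTest(pattern=pat):
                self.assertTrue(re.match(pat, '\u2014and more'))
                self.assertIsNone(re.match(pat, '-'))


if __name__ == "__main__":
    unittest.main()
```

Without subTest(), the first failing pattern would abort the loop and hide which of the remaining patterns also fail.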

serhiy-storchaka · 2017-06-17T10:42:17Z

Lib/test/test_re.py

+            [   # basic matches
+                ['\u2014', r'\u2014', '\N{EM DASH}',
+                 r'\N{EM DASH}'],                               # pattern
+                ['\u2014', '\N{EM DASH}', '—', '—and more'],    # matches


It is hard to see the differences between different dashes on a terminal. Use just \u2014 or \N{EM DASH}.

serhiy-storchaka

Please use unicodedata. It is faster, uses less memory and is safer.

brettcannon · 2018-02-02T22:04:52Z

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

serhiy-storchaka · 2018-02-08T17:13:32Z

Since the original author hasn't responded for a long time but the proposed feature looks worthwhile, I have created a new PR, #5588, to address my requests.

added \N{name} escapes to re patterns

5f72f7a

the-knights-who-say-ni added the CLA signed label Jun 17, 2017

serhiy-storchaka reviewed Jun 17, 2017

serhiy-storchaka self-assigned this Jun 17, 2017

jonathaneunice added 2 commits June 17, 2017 12:08

mention limitation of \N sequences to Unicode patterns

7fb2983

mention \N escapes more tersely

4db797b

serhiy-storchaka requested changes Jul 16, 2017

Merge branch 'master' into fix-issue-30688

1113472

brettcannon added the awaiting changes label Feb 2, 2018

serhiy-storchaka mentioned this pull request Feb 8, 2018

bpo-30688: Support \N{name} escapes in re patterns. #5588

Merged

serhiy-storchaka closed this Feb 8, 2018

serhiy-storchaka removed their assignment Dec 6, 2018
