bpo-30688: support \N{name} escapes in re patterns #2261

Closed
wants to merge 4 commits into from
9 changes: 6 additions & 3 deletions Doc/library/re.rst
@@ -468,13 +468,13 @@ Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

\a \b \f \n
\r \t \u \U
\v \x \\
\N \r \t \u
\U \v \x \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
@@ -488,6 +488,9 @@ three digits in length.
.. versionchanged:: 3.6
Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.

.. versionchanged:: 3.7
The ``'\N{name}'`` escape sequence has been added. As in string literals,
it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).

.. seealso::

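The behaviour described by this documentation change can be illustrated with a short sketch. It assumes an interpreter whose `re` module already accepts `\N{name}` in patterns; without that support, compiling the raw pattern raises `re.error` instead. The variable names are illustrative only.

```python
import re

EM_DASH = "\u2014"

# Non-raw literal: Python's own string parser expands \N{EM DASH},
# so the regex parser only ever sees the single character U+2014.
from_literal = re.compile("\N{EM DASH}")

# Raw literal: the backslash sequence reaches the regex parser intact,
# which is the case this change is about.
from_raw = re.compile(r"\N{EM DASH}")

# Bytes patterns have no Unicode names, so the escape is rejected there.
try:
    re.compile(rb"\N{EM DASH}")
    bytes_rejected = False
except re.error:
    bytes_rejected = True

print(bool(from_literal.fullmatch(EM_DASH)))
print(bool(from_raw.fullmatch(EM_DASH)))
print(bytes_rejected)
```

Both compiled patterns match the same single character; only the raw form depends on the regex parser understanding the escape.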
28 changes: 28 additions & 0 deletions Lib/sre_parse.py
@@ -13,6 +13,7 @@
# XXX: show string offset and offending character for all errors

from sre_constants import *
from ast import literal_eval

SPECIAL_CHARS = ".\\[{()*+?^$|"
REPEAT_CHARS = "*+?{"
@@ -25,6 +26,11 @@

WHITESPACE = frozenset(" \t\n\r\v\f")

UNICODE_NAME = ASCIILETTERS | DIGITS | frozenset(' -')
CLOSING_BRACE = frozenset("}")
OPENING_BRACE = frozenset("{")


_REPEATCODES = frozenset({MIN_REPEAT, MAX_REPEAT})
_UNITCODES = frozenset({ANY, RANGE, IN, LITERAL, NOT_LITERAL, CATEGORY})

@@ -322,6 +328,17 @@ def _class_escape(source, escape):
            c = int(escape[2:], 16)
            chr(c) # raise ValueError for invalid code
            return LITERAL, c
        elif c == "N" and source.istext:
            # named unicode escape e.g. \N{EM DASH}
            escape += source.getwhile(1, OPENING_BRACE)
            escape += source.getwhile(100, UNICODE_NAME)

Review comment (Member):
You could use a modified getuntil(). Just add one more parameter specifying what is missing, for use in error messages.

            escape += source.getwhile(1, CLOSING_BRACE)
            try:
                c = ord(literal_eval('"%s"' % escape))

Review comment (Member):
I prefer using unicodedata. If importing it causes problems, import it only when it is needed rather than at module startup. Since this feature is relatively complex, the implementation can be moved into a separate function to avoid the code duplication.

            except SyntaxError:
                charname = escape[2:].strip('{}')
                raise source.error("unknown Unicode character name %s" % charname, len(escape))
            return LITERAL, c
        elif c in OCTDIGITS:
            # octal escape (up to three digits)
            escape += source.getwhile(2, OCTDIGITS)
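
The getuntil() suggestion in the review comment above could look roughly like the following sketch. `MiniSource` is a hypothetical stand-in for the parser's tokenizer, and the `name` parameter, the use of `ValueError`, and the error wording are all assumptions made for illustration; none of it is taken from sre_parse itself.

```python
class MiniSource:
    """Hypothetical, simplified tokenizer used only to illustrate the idea."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def getuntil(self, terminator, name):
        """Consume text up to `terminator` and return it.

        `name` describes what was expected, so that one method can produce
        a precise error message for different callers.
        """
        end = self.text.find(terminator, self.pos)
        if end == -1:
            raise ValueError("missing %s, unterminated %s" % (terminator, name))
        if end == self.pos:
            raise ValueError("missing %s" % name)
        result = self.text[self.pos:end]
        # Skip past the terminator so parsing can continue after it.
        self.pos = end + len(terminator)
        return result


source = MiniSource("EM DASH}rest")
charname = source.getuntil("}", "character name")
print(charname)
```

With a helper of this shape, the three bounded getwhile() calls would collapse into one call, and a missing brace or an empty name would each get its own message.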
@@ -370,6 +387,17 @@ def _escape(source, escape, state):
            c = int(escape[2:], 16)
            chr(c) # raise ValueError for invalid code
            return LITERAL, c
        elif c == "N" and source.istext:
            # named unicode escape e.g. \N{EM DASH}
            escape += source.getwhile(1, OPENING_BRACE)
            escape += source.getwhile(100, UNICODE_NAME)
            escape += source.getwhile(1, CLOSING_BRACE)
            try:
                c = ord(literal_eval('"%s"' % escape))
            except SyntaxError:
                charname = escape[2:].strip('{}')
                raise source.error("unknown Unicode character name %s" % charname, len(escape))
            return LITERAL, c
        elif c == "0":
            # octal escape
            escape += source.getwhile(2, OCTDIGITS)
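
The reviewer's preference for unicodedata, a deferred import, and a single shared function could be sketched as below. The helper name `named_char_code` and the choice of `ValueError` are hypothetical; in the parser the error would presumably go through `source.error()` instead.

```python
def named_char_code(name):
    """Return the code point for a Unicode character name (hypothetical helper)."""
    # Imported here rather than at module level, so the cost is only paid
    # when a pattern actually contains a \N{...} escape.
    import unicodedata
    try:
        return ord(unicodedata.lookup(name))
    except (KeyError, TypeError):
        # KeyError: the name is unknown.
        # TypeError: lookup() returned more than one character (a named
        # sequence), which ord() cannot convert to a single code point.
        raise ValueError("unknown Unicode character name %r" % name) from None


print(hex(named_char_code("EM DASH")))
```

Both `_class_escape()` and `_escape()` could then call the same function, which removes the duplicated block and the round trip through `literal_eval`.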
33 changes: 33 additions & 0 deletions Lib/test/test_re.py
@@ -700,6 +700,39 @@ def test_other_escapes(self):
            with self.subTest(c):
                self.assertRaises(re.error, re.compile, '[\\%c]' % c)

    def test_named_unicode_escapes(self):
        # test individual Unicode named escapes
        suites = [
            [ # basic matches
                ['\u2014', r'\u2014', '\N{EM DASH}',
                 r'\N{EM DASH}'], # pattern

Review comment (Member):
Isn't the last case enough?

                ['\u2014', '\N{EM DASH}', '—', '—and more'], # matches

Review comment (Member):
It is hard to tell different dashes apart in a terminal. Use just \u2014 or \N{EM DASH}.

                ['\u2015', '\N{EN DASH}'] # no match
            ],
            [ # character set matches
                ['[\u2014-\u2020]', r'[\u2014-\u2020]',
                 '[\N{EM DASH}-\N{DAGGER}]', r'[\N{EM DASH}-\N{DAGGER}]',
                 '[\u2014-\N{DAGGER}]', '[\N{EM DASH}-\u2020]',], # pattern
                ['\u2014', '\N{EM DASH}', '—', '—and more', '\u2020',
                 '\N{DAGGER}', '†', '\u2017', '\N{DOUBLE LOW LINE}'],
                ['\u2011', '\N{EN DASH}', '\u2013', 'xyz', '\u2021']
            ],
        ]

        for patterns, match_yes, match_no in suites:
            for pat in patterns:
                for target in match_yes:
                    self.assertTrue(re.match(pat, target))

Review comment (Member):
Use subTest() when checking in a loop.

Actually, I think the loops are not needed. It is enough to test just one case per pattern.

                for target in match_no:
                    self.assertIsNone(re.match(pat, target))

        # test errors in \N{name} handling - only valid names should pass
        badly_formed = [r'\N{BUBBA DASH}', r'\N{EM DASH',
                        r'\NEM DASH}', r'\NOGGIN']
        for bad in badly_formed:
            with self.assertRaises(re.error):

Review comment (Member):
Use checkPatternError() and test the error messages.

                re.compile(bad)

Review comment (Member):
Also add tests for \N in bytes patterns.

    def test_string_boundaries(self):
        # See http://bugs.python.org/issue10713
        self.assertEqual(re.search(r"\b(abc)\b", "abc").group(1),
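
Taken together, the review comments on this test (one case per pattern, subTest() in loops, checking error messages, and covering bytes patterns) might lead to something like the sketch below. It assumes an interpreter whose `re` module accepts `\N{name}`. `check_pattern_error` is a hypothetical stand-in for the checkPatternError() helper named in the review, and the message fragments are assumptions about the interpreter's wording, kept loose for that reason.

```python
import re
import unittest


class NamedEscapeSketch(unittest.TestCase):
    # Hypothetical stand-in for checkPatternError(): it only checks that
    # compilation fails and that the message contains a given fragment.
    def check_pattern_error(self, pattern, fragment):
        with self.assertRaises(re.error) as cm:
            re.compile(pattern)
        self.assertIn(fragment, cm.exception.msg)

    def test_match(self):
        # One target per pattern; subTest() reports each pattern separately.
        cases = [
            (r"\N{EM DASH}", "\u2014"),
            (r"[\N{EM DASH}-\N{DAGGER}]", "\u2017"),
        ]
        for pattern, target in cases:
            with self.subTest(pattern=pattern):
                self.assertTrue(re.fullmatch(pattern, target))

    def test_errors(self):
        # Fragments are assumed wording, not exact messages.
        cases = [
            (r"\N{BUBBA DASH}", "character name"),
            (r"\N{EM DASH", "missing"),
            (r"\NEM DASH}", "missing"),
        ]
        for pattern, fragment in cases:
            with self.subTest(pattern=pattern):
                self.check_pattern_error(pattern, fragment)

    def test_bytes(self):
        # \N has no meaning in a bytes pattern and should be rejected.
        self.check_pattern_error(rb"\N{EM DASH}", "bad escape")
```

The class can be run with `python -m unittest` from the file that contains it. Using escapes rather than literal dash characters in the targets also addresses the readability concern raised in the review.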