Commit 07985ef

Issue #22286: The "backslashreplace" error handlers now work with
decoding and translating.
1 parent 58f0201 commit 07985ef
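
As a rough illustration of what this change enables (a minimal sketch, assuming Python 3.5 or later; the variable names are illustrative only and do not come from the commit), the handler can be used when decoding, and the handler function accepts decode and translate errors:

```python
import codecs

# Decoding: the undecodable byte is kept as a \xNN escape instead of raising.
decoded = b'\x80abc'.decode("utf-8", "backslashreplace")
print(decoded)  # \x80abc

# The handler function can also be called directly with a decode error.
# It returns the replacement text and the position at which to resume.
decode_err = UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")
decode_result = codecs.backslashreplace_errors(decode_err)
print(decode_result)  # ('\\xff', 1)

# Translate errors are handled the same way.
translate_err = UnicodeTranslateError("\u3042", 0, 1, "ouch")
translate_result = codecs.backslashreplace_errors(translate_err)
print(translate_result)  # ('\\u3042', 1)
```

Before this commit, passing a decode or translate error to `codecs.backslashreplace_errors` raised `TypeError`, as the removed tests in `Lib/test/test_codeccallbacks.py` show.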

10 files changed

+196
-83
lines changed

Doc/howto/unicode.rst

Lines changed: 5 additions & 2 deletions
@@ -280,8 +280,9 @@ and optionally an *errors* argument.
 The *errors* argument specifies the response when the input string can't be
 converted according to the encoding's rules. Legal values for this argument are
 ``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
-``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the
-character out of the Unicode result).
+``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
+character out of the Unicode result), or ``'backslashreplace'`` (inserts a
+``\xNN`` escape sequence).
 The following examples show the differences::

 >>> b'\x80abc'.decode("utf-8", "strict") #doctest: +NORMALIZE_WHITESPACE
@@ -291,6 +292,8 @@ The following examples show the differences::
 invalid start byte
 >>> b'\x80abc'.decode("utf-8", "replace")
 '\ufffdabc'
+>>> b'\x80abc'.decode("utf-8", "backslashreplace")
+'\\x80abc'
 >>> b'\x80abc'.decode("utf-8", "ignore")
 'abc'

Doc/library/codecs.rst

Lines changed: 9 additions & 5 deletions
@@ -314,8 +314,8 @@ The following error handlers are only applicable to
 | | reference (only for encoding). Implemented |
 | | in :func:`xmlcharrefreplace_errors`. |
 +-------------------------+-----------------------------------------------+
-| ``'backslashreplace'`` | Replace with backslashed escape sequences |
-| | (only for encoding). Implemented in |
+| ``'backslashreplace'`` | Replace with backslashed escape sequences. |
+| | Implemented in |
 | | :func:`backslashreplace_errors`. |
 +-------------------------+-----------------------------------------------+
 | ``'namereplace'`` | Replace with ``\N{...}`` escape sequences |
@@ -350,6 +350,10 @@ In addition, the following error handler is specific to the given codecs:
 .. versionadded:: 3.5
 The ``'namereplace'`` error handler.

+.. versionchanged:: 3.5
+The ``'backslashreplace'`` error handlers now work with decoding and
+translating.
+
 The set of allowed values can be extended by registering a new named error
 handler:

@@ -417,9 +421,9 @@ functions:

 .. function:: backslashreplace_errors(exception)

-Implements the ``'backslashreplace'`` error handling (for encoding with
-:term:`text encodings <text encoding>` only): the
-unencodable character is replaced by a backslashed escape sequence.
+Implements the ``'backslashreplace'`` error handling (for
+:term:`text encodings <text encoding>` only): malformed data is
+replaced by a backslashed escape sequence.

 .. function:: namereplace_errors(exception)

Doc/library/functions.rst

Lines changed: 2 additions & 3 deletions
@@ -973,9 +973,8 @@ are always available. They are listed here in alphabetical order.
 Characters not supported by the encoding are replaced with the
 appropriate XML character reference ``&#nnn;``.

-* ``'backslashreplace'`` (also only supported when writing)
-replaces unsupported characters with Python's backslashed escape
-sequences.
+* ``'backslashreplace'`` replaces malformed data by Python's backslashed
+escape sequences.

 * ``'namereplace'`` (also only supported when writing)
 replaces unsupported characters with ``\N{...}`` escape sequences.

Doc/library/io.rst

Lines changed: 6 additions & 5 deletions
@@ -825,11 +825,12 @@ Text I/O
 exception if there is an encoding error (the default of ``None`` has the same
 effect), or pass ``'ignore'`` to ignore errors. (Note that ignoring encoding
 errors can lead to data loss.) ``'replace'`` causes a replacement marker
-(such as ``'?'``) to be inserted where there is malformed data. When
-writing, ``'xmlcharrefreplace'`` (replace with the appropriate XML character
-reference), ``'backslashreplace'`` (replace with backslashed escape
-sequences) or ``'namereplace'`` (replace with ``\N{...}`` escape sequences)
-can be used. Any other error handling name that has been registered with
+(such as ``'?'``) to be inserted where there is malformed data.
+``'backslashreplace'`` causes malformed data to be replaced by a
+backslashed escape sequence. When writing, ``'xmlcharrefreplace'``
+(replace with the appropriate XML character reference) or ``'namereplace'``
+(replace with ``\N{...}`` escape sequences) can be used. Any other error
+handling name that has been registered with
 :func:`codecs.register_error` is also valid.

 .. index::
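
The documented behaviour for text I/O can be sketched as follows. This is a minimal example, assuming Python 3.5 or later; the temporary file and its contents are made up for illustration and are not part of the commit.

```python
import os
import tempfile

# Write a byte sequence that is not valid UTF-8 (0xE9 is 'é' in Latin-1).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"caf\xe9\n")
    path = tmp.name

try:
    # When reading, malformed data is replaced by a backslashed escape
    # sequence instead of raising UnicodeDecodeError.
    with open(path, encoding="utf-8", errors="backslashreplace") as f:
        text = f.read()
finally:
    os.remove(path)

print(text)  # caf\xe9
```

Unlike `'replace'`, this keeps the value of the offending byte visible in the decoded text, which can help when diagnosing a file with an unexpected encoding.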

Doc/whatsnew/3.5.rst

Lines changed: 3 additions & 1 deletion
@@ -118,7 +118,9 @@ Other Language Changes

 Some smaller changes made to the core Python language are:

-* None yet.
+* Added the ``'namereplace'`` error handlers. The ``'backslashreplace'``
+error handlers now work with decoding and translating.
+(Contributed by Serhiy Storchaka in :issue:`19676` and :issue:`22286`.)



Lib/codecs.py

Lines changed: 6 additions & 3 deletions
@@ -127,7 +127,8 @@ class Codec:
 'surrogateescape' - replace with private code points U+DCnn.
 'xmlcharrefreplace' - Replace with the appropriate XML
 character reference (only for encoding).
-'backslashreplace' - Replace with backslashed escape sequences
+'backslashreplace' - Replace with backslashed escape sequences.
+'namereplace' - Replace with \\N{...} escape sequences
 (only for encoding).

 The set of allowed values can be extended via register_error.
@@ -359,7 +360,8 @@ def __init__(self, stream, errors='strict'):
 'xmlcharrefreplace' - Replace with the appropriate XML
 character reference.
 'backslashreplace' - Replace with backslashed escape
-sequences (only for encoding).
+sequences.
+'namereplace' - Replace with \\N{...} escape sequences.

 The set of allowed parameter values can be extended via
 register_error.
@@ -429,7 +431,8 @@ def __init__(self, stream, errors='strict'):

 'strict' - raise a ValueError (or a subclass)
 'ignore' - ignore the character and continue with the next
-'replace'- replace with a suitable replacement character;
+'replace'- replace with a suitable replacement character
+'backslashreplace' - Replace with backslashed escape sequences;

 The set of allowed parameter values can be extended via
 register_error.

Lib/test/test_codeccallbacks.py

Lines changed: 15 additions & 11 deletions
@@ -246,6 +246,11 @@ def handler_unicodeinternal(exc):
 "\u0000\ufffd"
 )

+self.assertEqual(
+b"\x00\x00\x00\x00\x00".decode("unicode-internal", "backslashreplace"),
+"\u0000\\x00"
+)
+
 codecs.register_error("test.hui", handler_unicodeinternal)

 self.assertEqual(
@@ -565,17 +570,6 @@ def test_badandgoodbackslashreplaceexceptions(self):
 codecs.backslashreplace_errors,
 UnicodeError("ouch")
 )
-# "backslashreplace" can only be used for encoding
-self.assertRaises(
-TypeError,
-codecs.backslashreplace_errors,
-UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")
-)
-self.assertRaises(
-TypeError,
-codecs.backslashreplace_errors,
-UnicodeTranslateError("\u3042", 0, 1, "ouch")
-)
 # Use the correct exception
 self.assertEqual(
 codecs.backslashreplace_errors(
@@ -701,6 +695,16 @@ def test_badandgoodnamereplaceexceptions(self):
 UnicodeEncodeError("ascii", "\udfff", 0, 1, "ouch")),
 ("\\udfff", 1)
 )
+self.assertEqual(
+codecs.backslashreplace_errors(
+UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")),
+("\\xff", 1)
+)
+self.assertEqual(
+codecs.backslashreplace_errors(
+UnicodeTranslateError("\u3042", 0, 1, "ouch")),
+("\\u3042", 1)
+)

 def test_badhandlerresults(self):
 results = ( 42, "foo", (1,2,3), ("foo", 1, 3), ("foo", None), ("foo",), ("foo", 1, 3), ("foo", None), ("foo",) )

Lib/test/test_codecs.py

Lines changed: 56 additions & 0 deletions
@@ -378,6 +378,10 @@ def test_lone_surrogates(self):
 before + after)
 self.assertEqual(test_sequence.decode(self.encoding, "replace"),
 before + self.ill_formed_sequence_replace + after)
+backslashreplace = ''.join('\\x%02x' % b
+for b in self.ill_formed_sequence)
+self.assertEqual(test_sequence.decode(self.encoding, "backslashreplace"),
+before + backslashreplace + after)

 class UTF32Test(ReadTest, unittest.TestCase):
 encoding = "utf-32"
@@ -1300,14 +1304,19 @@ def test_bug1251300(self):
 "unicode_internal")
 if sys.byteorder == "little":
 invalid = b"\x00\x00\x11\x00"
+invalid_backslashreplace = r"\x00\x00\x11\x00"
 else:
 invalid = b"\x00\x11\x00\x00"
+invalid_backslashreplace = r"\x00\x11\x00\x00"
 with support.check_warnings():
 self.assertRaises(UnicodeDecodeError,
 invalid.decode, "unicode_internal")
 with support.check_warnings():
 self.assertEqual(invalid.decode("unicode_internal", "replace"),
 '\ufffd')
+with support.check_warnings():
+self.assertEqual(invalid.decode("unicode_internal", "backslashreplace"),
+invalid_backslashreplace)

 @unittest.skipUnless(SIZEOF_WCHAR_T == 4, 'specific to 32-bit wchar_t')
 def test_decode_error_attributes(self):
@@ -2042,6 +2051,16 @@ def test_decode_with_string_map(self):
 ("ab\ufffd", 3)
 )

+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace", "ab"),
+("ab\\x02", 3)
+)
+
+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace", "ab\ufffe"),
+("ab\\x02", 3)
+)
+
 self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "ignore", "ab"),
 ("ab", 3)
@@ -2118,6 +2137,25 @@ def test_decode_with_int2str_map(self):
 ("ab\ufffd", 3)
 )

+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+{0: 'a', 1: 'b'}),
+("ab\\x02", 3)
+)
+
+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+{0: 'a', 1: 'b', 2: None}),
+("ab\\x02", 3)
+)
+
+# Issue #14850
+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+{0: 'a', 1: 'b', 2: '\ufffe'}),
+("ab\\x02", 3)
+)
+
 self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "ignore",
 {0: 'a', 1: 'b'}),
@@ -2194,6 +2232,18 @@ def test_decode_with_int2int_map(self):
 ("ab\ufffd", 3)
 )

+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+{0: a, 1: b}),
+("ab\\x02", 3)
+)
+
+self.assertEqual(
+codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
+{0: a, 1: b, 2: 0xFFFE}),
+("ab\\x02", 3)
+)
+
 self.assertEqual(
 codecs.charmap_decode(b"\x00\x01\x02", "ignore",
 {0: a, 1: b}),
@@ -2253,9 +2303,13 @@ def test_unicode_escape(self):

 self.assertRaises(UnicodeDecodeError, codecs.unicode_escape_decode, br"\U00110000")
 self.assertEqual(codecs.unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10))
+self.assertEqual(codecs.unicode_escape_decode(r"\U00110000", "backslashreplace"),
+(r"\x5c\x55\x30\x30\x31\x31\x30\x30\x30\x30", 10))

 self.assertRaises(UnicodeDecodeError, codecs.raw_unicode_escape_decode, br"\U00110000")
 self.assertEqual(codecs.raw_unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10))
+self.assertEqual(codecs.raw_unicode_escape_decode(r"\U00110000", "backslashreplace"),
+(r"\x5c\x55\x30\x30\x31\x31\x30\x30\x30\x30", 10))


 class UnicodeEscapeTest(unittest.TestCase):
@@ -2894,11 +2948,13 @@ def test_cp932(self):
 (b'[\xff]', 'strict', None),
 (b'[\xff]', 'ignore', '[]'),
 (b'[\xff]', 'replace', '[\ufffd]'),
+(b'[\xff]', 'backslashreplace', '[\\xff]'),
 (b'[\xff]', 'surrogateescape', '[\udcff]'),
 (b'[\xff]', 'surrogatepass', None),
 (b'\x81\x00abc', 'strict', None),
 (b'\x81\x00abc', 'ignore', '\x00abc'),
 (b'\x81\x00abc', 'replace', '\ufffd\x00abc'),
+(b'\x81\x00abc', 'backslashreplace', '\\x81\x00abc'),
 ))

 def test_cp1252(self):
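
The charmap cases exercised by these tests can be tried directly. The following is a small sketch mirroring the first `test_decode_with_string_map` assertion above; the name `decoding_table` is illustrative only.

```python
import codecs

# A string decoding table: byte 0 maps to 'a', byte 1 to 'b'.
# Byte 2 falls outside the table, so it triggers the error handler.
decoding_table = "ab"

text, consumed = codecs.charmap_decode(
    b"\x00\x01\x02", "backslashreplace", decoding_table)

print(text)      # ab\x02
print(consumed)  # 3
```

The unmapped byte is rendered as a `\xNN` escape, and all three input bytes are reported as consumed.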

Misc/NEWS

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@ Release date: TBA
 Core and Builtins
 -----------------

+- Issue #22286: The "backslashreplace" error handlers now work with
+decoding and translating.
+
 - Issue #23253: Delay-load ShellExecute[AW] in os.startfile for reduced
 startup overhead on Windows.
