
Commit 2f46cf6

miss-islington and Ma Lin authored
bpo-38056: overhaul Error Handlers section in codecs documentation (GH-15732)
* Some handlers were wrongly described as text-encoding only, but actually they can also be used in text-decoding.
* Add more description to each handler.
* Add two REPL examples.
* Add indexes for Error Handler's name.

Co-authored-by: Kyle Stanley <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
Co-authored-by: Jelle Zijlstra <[email protected]>
(cherry picked from commit 5bc2390)

Co-authored-by: Ma Lin <[email protected]>
1 parent cffa76d commit 2f46cf6

3 files changed: +127 -74 lines changed


Doc/glossary.rst

Lines changed: 10 additions & 1 deletion
@@ -1125,7 +1125,16 @@ Glossary
11251125
See also :term:`borrowed reference`.
11261126

11271127
text encoding
1128-
A codec which encodes Unicode strings to bytes.
1128+
A string in Python is a sequence of Unicode code points (in range
1129+
``U+0000``--``U+10FFFF``). To store or transfer a string, it needs to be
1130+
serialized as a sequence of bytes.
1131+
1132+
Serializing a string into a sequence of bytes is known as "encoding", and
1133+
recreating the string from the sequence of bytes is known as "decoding".
1134+
1135+
There are a variety of different text serialization
1136+
:ref:`codecs <standard-encodings>`, which are collectively referred to as
1137+
"text encodings".
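
As an illustrative sketch (not part of the commit), the encode/decode round trip described in this glossary entry can be tried with the built-in ``str.encode`` and ``bytes.decode`` methods. The sample string is borrowed from the REPL example in the codecs hunk of this commit; the variable names are arbitrary.

```python
# A str is a sequence of Unicode code points.
text = 'German ß, ♬'

# "Encoding": serialize the string into bytes with a text encoding.
data = text.encode('utf-8')

# "Decoding": recreate the string from the bytes.
restored = data.decode('utf-8')

print(type(data).__name__, len(text), len(data))
print(restored == text)
```

The byte length differs from the number of code points because UTF-8 uses more than one byte for non-ASCII characters.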
11291138

11301139
text file
11311140
A :term:`file object` able to read and write :class:`str` objects.

Doc/library/codecs.rst

Lines changed: 116 additions & 73 deletions
@@ -23,11 +23,11 @@
2323
This module defines base classes for standard Python codecs (encoders and
2424
decoders) and provides access to the internal Python codec registry, which
2525
manages the codec and error handling lookup process. Most standard codecs
26-
are :term:`text encodings <text encoding>`, which encode text to bytes,
27-
but there are also codecs provided that encode text to text, and bytes to
28-
bytes. Custom codecs may encode and decode between arbitrary types, but some
29-
module features are restricted to use specifically with
30-
:term:`text encodings <text encoding>`, or with codecs that encode to
26+
are :term:`text encodings <text encoding>`, which encode text to bytes (and
27+
decode bytes to text), but there are also codecs provided that encode text to
28+
text, and bytes to bytes. Custom codecs may encode and decode between arbitrary
29+
types, but some module features are restricted to be used specifically with
30+
:term:`text encodings <text encoding>` or with codecs that encode to
3131
:class:`bytes`.
3232

3333
The module defines the following functions for encoding and decoding with
@@ -300,58 +300,56 @@ codec will handle encoding and decoding errors.
300300
Error Handlers
301301
^^^^^^^^^^^^^^
302302

303-
To simplify and standardize error handling,
304-
codecs may implement different error handling schemes by
305-
accepting the *errors* string argument. The following string values are
306-
defined and implemented by all standard Python codecs:
303+
To simplify and standardize error handling, codecs may implement different
304+
error handling schemes by accepting the *errors* string argument:
307305

308-
.. tabularcolumns:: |l|L|
309-
310-
+-------------------------+-----------------------------------------------+
311-
| Value | Meaning |
312-
+=========================+===============================================+
313-
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); |
314-
| | this is the default. Implemented in |
315-
| | :func:`strict_errors`. |
316-
+-------------------------+-----------------------------------------------+
317-
| ``'ignore'`` | Ignore the malformed data and continue |
318-
| | without further notice. Implemented in |
319-
| | :func:`ignore_errors`. |
320-
+-------------------------+-----------------------------------------------+
321-
322-
The following error handlers are only applicable to
323-
:term:`text encodings <text encoding>`:
306+
>>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace')
307+
b'German \\xdf, \\u266c'
308+
>>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace')
309+
b'German &#223;, &#9836;'
324310

325311
.. index::
312+
pair: strict; error handler's name
313+
pair: ignore; error handler's name
314+
pair: replace; error handler's name
315+
pair: backslashreplace; error handler's name
316+
pair: surrogateescape; error handler's name
326317
single: ? (question mark); replacement character
327318
single: \ (backslash); escape sequence
328319
single: \x; escape sequence
329320
single: \u; escape sequence
330321
single: \U; escape sequence
331-
single: \N; escape sequence
322+
323+
The following error handlers can be used with all Python
324+
:ref:`standard-encodings` codecs:
325+
326+
.. tabularcolumns:: |l|L|
332327

333328
+-------------------------+-----------------------------------------------+
334329
| Value | Meaning |
335330
+=========================+===============================================+
336-
| ``'replace'`` | Replace with a suitable replacement |
337-
| | marker; Python will use the official |
338-
| | ``U+FFFD`` REPLACEMENT CHARACTER for the |
339-
| | built-in codecs on decoding, and '?' on |
340-
| | encoding. Implemented in |
341-
| | :func:`replace_errors`. |
331+
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass), |
332+
| | this is the default. Implemented in |
333+
| | :func:`strict_errors`. |
342334
+-------------------------+-----------------------------------------------+
343-
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character |
344-
| | reference (only for encoding). Implemented |
345-
| | in :func:`xmlcharrefreplace_errors`. |
335+
| ``'ignore'`` | Ignore the malformed data and continue without|
336+
| | further notice. Implemented in |
337+
| | :func:`ignore_errors`. |
338+
+-------------------------+-----------------------------------------------+
339+
| ``'replace'`` | Replace with a replacement marker. On |
340+
| | encoding, use ``?`` (ASCII character). On |
341+
| | decoding, use ```` (U+FFFD, the official |
342+
| | REPLACEMENT CHARACTER). Implemented in |
343+
| | :func:`replace_errors`. |
346344
+-------------------------+-----------------------------------------------+
347345
| ``'backslashreplace'`` | Replace with backslashed escape sequences. |
346+
| | On encoding, use hexadecimal form of Unicode |
347+
| | code point with formats ``\xhh`` ``\uxxxx`` |
348+
| | ``\Uxxxxxxxx``. On decoding, use hexadecimal |
349+
| | form of byte value with format ``\xhh``. |
348350
| | Implemented in |
349351
| | :func:`backslashreplace_errors`. |
350352
+-------------------------+-----------------------------------------------+
351-
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences |
352-
| | (only for encoding). Implemented in |
353-
| | :func:`namereplace_errors`. |
354-
+-------------------------+-----------------------------------------------+
355353
| ``'surrogateescape'`` | On decoding, replace byte with individual |
356354
| | surrogate code ranging from ``U+DC80`` to |
357355
| | ``U+DCFF``. This code will then be turned |
@@ -361,27 +359,55 @@ The following error handlers are only applicable to
361359
| | more.) |
362360
+-------------------------+-----------------------------------------------+
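
A minimal sketch of how the handlers in the table above behave, assuming the ``ascii`` codec and made-up sample data (neither the data nor the variable names come from the commit):

```python
text = 'German ß'          # 'ß' (U+00DF) cannot be encoded as ASCII
data = b'caf\xe9'          # byte 0xE9 cannot be decoded as ASCII

# Encoding side.
enc_ignore = text.encode('ascii', errors='ignore')
enc_replace = text.encode('ascii', errors='replace')
enc_backslash = text.encode('ascii', errors='backslashreplace')

# Decoding side.
dec_ignore = data.decode('ascii', errors='ignore')
dec_replace = data.decode('ascii', errors='replace')
dec_backslash = data.decode('ascii', errors='backslashreplace')

# 'surrogateescape' maps the bad byte to a lone surrogate on decoding
# and turns it back into the same byte on encoding.
escaped = data.decode('ascii', errors='surrogateescape')
round_trip = escaped.encode('ascii', errors='surrogateescape')

# 'strict' (the default) raises a UnicodeError subclass instead.
try:
    data.decode('ascii')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

print(enc_ignore, enc_replace, enc_backslash)
print(repr(dec_ignore), repr(dec_replace), repr(dec_backslash))
print(repr(escaped), round_trip, strict_raised)
```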
363361

362+
.. index::
363+
pair: xmlcharrefreplace; error handler's name
364+
pair: namereplace; error handler's name
365+
single: \N; escape sequence
366+
367+
The following error handlers are only applicable to encoding (within
368+
:term:`text encodings <text encoding>`):
369+
370+
+-------------------------+-----------------------------------------------+
371+
| Value | Meaning |
372+
+=========================+===============================================+
373+
| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character |
374+
| | reference, which is a decimal form of Unicode |
375+
| | code point with format ``&#num;`` Implemented |
376+
| | in :func:`xmlcharrefreplace_errors`. |
377+
+-------------------------+-----------------------------------------------+
378+
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences, |
379+
| | what appears in the braces is the Name |
380+
| | property from Unicode Character Database. |
381+
| | Implemented in :func:`namereplace_errors`. |
382+
+-------------------------+-----------------------------------------------+
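
The two encoding-only handlers can be compared side by side. This is a hedged sketch with an arbitrary sample string; the last step only checks that using such a handler for decoding fails, without relying on the exact exception type.

```python
text = 'German ß'

# Decimal numeric character reference: U+00DF is 223 in decimal.
xml_form = text.encode('ascii', errors='xmlcharrefreplace')

# \N{...} escape using the character's Unicode name.
named_form = text.encode('ascii', errors='namereplace')

# These handlers are documented as encoding-only, so a decoding
# attempt is expected to fail.
try:
    b'\xff'.decode('ascii', errors='namereplace')
    decode_failed = False
except Exception:
    decode_failed = True

print(xml_form)
print(named_form)
print(decode_failed)
```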
383+
384+
.. index::
385+
pair: surrogatepass; error handler's name
386+
364387
In addition, the following error handler is specific to the given codecs:
365388

366389
+-------------------+------------------------+-------------------------------------------+
367390
| Value | Codecs | Meaning |
368391
+===================+========================+===========================================+
369-
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate |
370-
| | utf-16-be, utf-16-le, | codes. These codecs normally treat the |
371-
| | utf-32-be, utf-32-le | presence of surrogates as an error. |
392+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding surrogate code|
393+
| | utf-16-be, utf-16-le, | point (``U+D800`` - ``U+DFFF``) as normal |
394+
| | utf-32-be, utf-32-le | code point. Otherwise these codecs treat |
395+
| | | the presence of surrogate code point in |
396+
| | | :class:`str` as an error. |
372397
+-------------------+------------------------+-------------------------------------------+
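
A small sketch of ``'surrogatepass'`` with the ``utf-8`` codec, assuming a string that contains the lone surrogate ``U+D800`` (the sample value is chosen for illustration only):

```python
lone = '\ud800'  # a lone surrogate code point

# By default the UTF-8 codec treats the surrogate as an error.
try:
    lone.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With 'surrogatepass' it is encoded like any other code point
# and can be decoded back with the same handler.
raw = lone.encode('utf-8', errors='surrogatepass')
back = raw.decode('utf-8', errors='surrogatepass')

print(strict_failed, raw, back == lone)
```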
373398

374399
.. versionadded:: 3.1
375400
The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.
376401

377402
.. versionchanged:: 3.4
378-
The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.
403+
The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\*
404+
codecs.
379405

380406
.. versionadded:: 3.5
381407
The ``'namereplace'`` error handler.
382408

383409
.. versionchanged:: 3.5
384-
The ``'backslashreplace'`` error handlers now works with decoding and
410+
The ``'backslashreplace'`` error handler now works with decoding and
385411
translating.
386412

387413
The set of allowed values can be extended by registering a new named error
@@ -424,42 +450,59 @@ functions:
424450

425451
.. function:: strict_errors(exception)
426452

427-
Implements the ``'strict'`` error handling: each encoding or
428-
decoding error raises a :exc:`UnicodeError`.
453+
Implements the ``'strict'`` error handling.
429454

455+
Each encoding or decoding error raises a :exc:`UnicodeError`.
430456

431-
.. function:: replace_errors(exception)
432457

433-
Implements the ``'replace'`` error handling (for :term:`text encodings
434-
<text encoding>` only): substitutes ``'?'`` for encoding errors
435-
(to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
436-
character) for decoding errors.
458+
.. function:: ignore_errors(exception)
437459

460+
Implements the ``'ignore'`` error handling.
438461

439-
.. function:: ignore_errors(exception)
462+
Malformed data is ignored; encoding or decoding is continued without
463+
further notice.
440464

441-
Implements the ``'ignore'`` error handling: malformed data is ignored and
442-
encoding or decoding is continued without further notice.
443465

466+
.. function:: replace_errors(exception)
444467

445-
.. function:: xmlcharrefreplace_errors(exception)
468+
Implements the ``'replace'`` error handling.
446469

447-
Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
448-
:term:`text encodings <text encoding>` only): the
449-
unencodable character is replaced by an appropriate XML character reference.
470+
Substitutes ``?`` (ASCII character) for encoding errors or ```` (U+FFFD,
471+
the official REPLACEMENT CHARACTER) for decoding errors.
450472

451473

452474
.. function:: backslashreplace_errors(exception)
453475

454-
Implements the ``'backslashreplace'`` error handling (for
455-
:term:`text encodings <text encoding>` only): malformed data is
456-
replaced by a backslashed escape sequence.
476+
Implements the ``'backslashreplace'`` error handling.
477+
478+
Malformed data is replaced by a backslashed escape sequence.
479+
On encoding, use the hexadecimal form of Unicode code point with formats
480+
``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use the hexadecimal form of
481+
byte value with format ``\xhh``.
482+
483+
.. versionchanged:: 3.5
484+
Works with decoding and translating.
485+
486+
487+
.. function:: xmlcharrefreplace_errors(exception)
488+
489+
Implements the ``'xmlcharrefreplace'`` error handling (for encoding within
490+
:term:`text encoding` only).
491+
492+
The unencodable character is replaced by an appropriate XML/HTML numeric
493+
character reference, which is a decimal form of Unicode code point with
494+
format ``&#num;`` .
495+
457496

458497
.. function:: namereplace_errors(exception)
459498

460-
Implements the ``'namereplace'`` error handling (for encoding with
461-
:term:`text encodings <text encoding>` only): the
462-
unencodable character is replaced by a ``\N{...}`` escape sequence.
499+
Implements the ``'namereplace'`` error handling (for encoding within
500+
:term:`text encoding` only).
501+
502+
The unencodable character is replaced by a ``\N{...}`` escape sequence. The
503+
set of characters that appear in the braces is the Name property from
504+
Unicode Character Database. For example, the German lowercase letter ``'ß'``
505+
will be converted to byte sequence ``\N{LATIN SMALL LETTER SHARP S}`` .
463506

464507
.. versionadded:: 3.5
465508

@@ -473,7 +516,7 @@ The base :class:`Codec` class defines these methods which also define the
473516
function interfaces of the stateless encoder and decoder:
474517

475518

476-
.. method:: Codec.encode(input[, errors])
519+
.. method:: Codec.encode(input, errors='strict')
477520

478521
Encodes the object *input* and returns a tuple (output object, length consumed).
479522
For instance, :term:`text encoding` converts
@@ -491,7 +534,7 @@ function interfaces of the stateless encoder and decoder:
491534
of the output object type in this situation.
492535

493536

494-
.. method:: Codec.decode(input[, errors])
537+
.. method:: Codec.decode(input, errors='strict')
495538

496539
Decodes the object *input* and returns a tuple (output object, length
497540
consumed). For instance, for a :term:`text encoding`, decoding converts
@@ -558,7 +601,7 @@ define in order to be compatible with the Python codec registry.
558601
object.
559602

560603

561-
.. method:: encode(object[, final])
604+
.. method:: encode(object, final=False)
562605

563606
Encodes *object* (taking the current state of the encoder into account)
564607
and returns the resulting encoded object. If this is the last call to
@@ -615,7 +658,7 @@ define in order to be compatible with the Python codec registry.
615658
object.
616659

617660

618-
.. method:: decode(object[, final])
661+
.. method:: decode(object, final=False)
619662

620663
Decodes *object* (taking the current state of the decoder into account)
621664
and returns the resulting decoded object. If this is the last call to
@@ -749,7 +792,7 @@ compatible with the Python codec registry.
749792
:func:`register_error`.
750793

751794

752-
.. method:: read([size[, chars, [firstline]]])
795+
.. method:: read(size=-1, chars=-1, firstline=False)
753796

754797
Decodes data from the stream and returns the resulting object.
755798

@@ -775,7 +818,7 @@ compatible with the Python codec registry.
775818
available on the stream, these should be read too.
776819

777820

778-
.. method:: readline([size[, keepends]])
821+
.. method:: readline(size=None, keepends=True)
779822

780823
Read one line from the input stream and return the decoded data.
781824

@@ -786,7 +829,7 @@ compatible with the Python codec registry.
786829
returned.
787830

788831

789-
.. method:: readlines([sizehint[, keepends]])
832+
.. method:: readlines(sizehint=None, keepends=True)
790833

791834
Read all lines available on the input stream and return them as a list of
792835
lines.
@@ -877,7 +920,7 @@ Encodings and Unicode
877920
---------------------
878921

879922
Strings are stored internally as sequences of code points in
880-
range ``0x0``--``0x10FFFF``. (See :pep:`393` for
923+
range ``U+0000``--``U+10FFFF``. (See :pep:`393` for
881924
more details about the implementation.)
882925
Once a string object is used outside of CPU and memory, endianness
883926
and how these arrays are stored as bytes become an issue. As with other
@@ -958,7 +1001,7 @@ encoding was used for encoding a string. Each charmap encoding can
9581001
decode any random byte sequence. However that's not possible with UTF-8, as
9591002
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
9601003
sequences. To increase the reliability with which a UTF-8 encoding can be
961-
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
1004+
detected, Microsoft invented a variant of UTF-8 (that Python calls
9621005
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
9631006
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
9641007
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
Overhaul the :ref:`error-handlers` documentation in :mod:`codecs`.
