Commit 582acb7

Merge issue 19548 changes from 3.4
2 parents 5d57539 + b9fdb7a commit 582acb7

9 files changed

+424
-377
lines changed

Doc/glossary.rst

Lines changed: 4 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -834,10 +834,13 @@ Glossary
834834
:meth:`~collections.somenamedtuple._asdict`. Examples of struct sequences
835835
include :data:`sys.float_info` and the return value of :func:`os.stat`.
836836

837+
text encoding
838+
A codec which encodes Unicode strings to bytes.
839+
837840
text file
838841
A :term:`file object` able to read and write :class:`str` objects.
839842
Often, a text file actually accesses a byte-oriented datastream
840-
and handles the text encoding automatically.
843+
and handles the :term:`text encoding` automatically.
841844

842845
.. seealso::
843846
A :term:`binary file` reads and write :class:`bytes` objects.
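
The new glossary entry defines a text encoding as a codec that encodes Unicode strings to bytes. The following is a minimal sketch of that distinction, not part of the commit. It assumes Python 3.4 or later, where `str.encode()` rejects codecs that are not text encodings; the choice of `hex_codec` and `rot_13` as contrasting codecs is illustrative.

```python
import codecs

# A text encoding maps str -> bytes (and bytes -> str when decoding).
encoded = "é".encode("utf-8")

# Some codecs are not text encodings: "hex_codec" maps bytes -> bytes and
# "rot_13" maps str -> str. They are reached through codecs.encode() and
# codecs.decode() rather than str.encode() / bytes.decode().
hexed = codecs.encode(b"abc", "hex_codec")
rotated = codecs.encode("abc", "rot_13")

# str.encode() only accepts text encodings, so this lookup is refused.
try:
    "abc".encode("hex_codec")
    rejected = False
except LookupError:
    rejected = True
```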

Doc/library/codecs.rst

Lines changed: 334 additions & 317 deletions
Large diffs are not rendered by default.

Doc/library/functions.rst

Lines changed: 5 additions & 3 deletions
@@ -940,15 +940,17 @@ are always available. They are listed here in alphabetical order.
940940
*encoding* is the name of the encoding used to decode or encode the file.
941941
This should only be used in text mode. The default encoding is platform
942942
dependent (whatever :func:`locale.getpreferredencoding` returns), but any
943-
encoding supported by Python can be used. See the :mod:`codecs` module for
943+
:term:`text encoding` supported by Python
944+
can be used. See the :mod:`codecs` module for
944945
the list of supported encodings.
945946

946947
*errors* is an optional string that specifies how encoding and decoding
947948
errors are to be handled--this cannot be used in binary mode.
948-
A variety of standard error handlers are available, though any
949+
A variety of standard error handlers are available
950+
(listed under :ref:`error-handlers`), though any
949951
error handling name that has been registered with
950952
:func:`codecs.register_error` is also valid. The standard names
951-
are:
953+
include:
952954

953955
* ``'strict'`` to raise a :exc:`ValueError` exception if there is
954956
an encoding error. The default value of ``None`` has the same

Doc/library/stdtypes.rst

Lines changed: 2 additions & 2 deletions
@@ -1512,7 +1512,7 @@ expression support in the :mod:`re` module).
15121512
a :exc:`UnicodeError`. Other possible
15131513
values are ``'ignore'``, ``'replace'``, ``'xmlcharrefreplace'``,
15141514
``'backslashreplace'`` and any other name registered via
1515-
:func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
1515+
:func:`codecs.register_error`, see section :ref:`error-handlers`. For a
15161516
list of possible encodings, see section :ref:`standard-encodings`.
15171517

15181518
.. versionchanged:: 3.1
@@ -2384,7 +2384,7 @@ arbitrary binary data.
23842384
error handling scheme. The default for *errors* is ``'strict'``, meaning
23852385
that encoding errors raise a :exc:`UnicodeError`. Other possible values are
23862386
``'ignore'``, ``'replace'`` and any other name registered via
2387-
:func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
2387+
:func:`codecs.register_error`, see section :ref:`error-handlers`. For a
23882388
list of possible encodings, see section :ref:`standard-encodings`.
23892389

23902390
.. note::

Doc/library/tarfile.rst

Lines changed: 1 addition & 1 deletion
@@ -798,7 +798,7 @@ metadata must be either decoded or encoded. If *encoding* is not set
798798
appropriately, this conversion may fail.
799799

800800
The *errors* argument defines how characters are treated that cannot be
801-
converted. Possible values are listed in section :ref:`codec-base-classes`.
801+
converted. Possible values are listed in section :ref:`error-handlers`.
802802
The default scheme is ``'surrogateescape'`` which Python also uses for its
803803
file system calls, see :ref:`os-filenames`.
804804

Lib/codecs.py

Lines changed: 30 additions & 41 deletions
@@ -347,8 +347,7 @@ def __init__(self, stream, errors='strict'):
347347

348348
""" Creates a StreamWriter instance.
349349
350-
stream must be a file-like object open for writing
351-
(binary) data.
350+
stream must be a file-like object open for writing.
352351
353352
The StreamWriter may use different error handling
354353
schemes by providing the errors keyword argument. These
@@ -422,8 +421,7 @@ def __init__(self, stream, errors='strict'):
422421

423422
""" Creates a StreamReader instance.
424423
425-
stream must be a file-like object open for reading
426-
(binary) data.
424+
stream must be a file-like object open for reading.
427425
428426
The StreamReader may use different error handling
429427
schemes by providing the errors keyword argument. These
@@ -451,13 +449,12 @@ def read(self, size=-1, chars=-1, firstline=False):
451449
""" Decodes data from the stream self.stream and returns the
452450
resulting object.
453451
454-
chars indicates the number of characters to read from the
455-
stream. read() will never return more than chars
456-
characters, but it might return less, if there are not enough
457-
characters available.
452+
chars indicates the number of decoded code points or bytes to
453+
return. read() will never return more data than requested,
454+
but it might return less, if there is not enough available.
458455
459-
size indicates the approximate maximum number of bytes to
460-
read from the stream for decoding purposes. The decoder
456+
size indicates the approximate maximum number of decoded
457+
bytes or code points to read for decoding. The decoder
461458
can modify this setting as appropriate. The default value
462459
-1 indicates to read and decode as much as possible. size
463460
is intended to prevent having to decode huge files in one
@@ -468,7 +465,7 @@ def read(self, size=-1, chars=-1, firstline=False):
468465
will be returned, the rest of the input will be kept until the
469466
next call to read().
470467
471-
The method should use a greedy read strategy meaning that
468+
The method should use a greedy read strategy, meaning that
472469
it should read as much data as is allowed within the
473470
definition of the encoding and the given size, e.g. if
474471
optional encoding endings or state markers are available
@@ -603,7 +600,7 @@ def readline(self, size=None, keepends=True):
603600
def readlines(self, sizehint=None, keepends=True):
604601

605602
""" Read all lines available on the input stream
606-
and return them as list of lines.
603+
and return them as a list.
607604
608605
Line breaks are implemented using the codec's decoder
609606
method and are included in the list entries.
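
The reworded docstrings for `read()` and `readlines()` can be exercised with a small sketch like the one below. It is illustrative only and assumes a UTF-8 `StreamReader` wrapped around an in-memory `io.BytesIO`; the sample text is made up for the example.

```python
import codecs
import io

raw = "héllo\nwörld\n".encode("utf-8")
reader = codecs.getreader("utf-8")(io.BytesIO(raw))

# chars limits how many decoded characters are returned; anything decoded
# beyond that is kept in the reader's buffer for the next call.
first = reader.read(chars=3)

# readlines() returns the remaining lines as a list, with the line
# endings included in each entry.
rest = reader.readlines()
```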
@@ -751,19 +748,18 @@ def __exit__(self, type, value, tb):
751748

752749
class StreamRecoder:
753750

754-
""" StreamRecoder instances provide a frontend - backend
755-
view of encoding data.
751+
""" StreamRecoder instances translate data from one encoding to another.
756752
757753
They use the complete set of APIs returned by the
758754
codecs.lookup() function to implement their task.
759755
760-
Data written to the stream is first decoded into an
761-
intermediate format (which is dependent on the given codec
762-
combination) and then written to the stream using an instance
763-
of the provided Writer class.
756+
Data written to the StreamRecoder is first decoded into an
757+
intermediate format (depending on the "decode" codec) and then
758+
written to the underlying stream using an instance of the provided
759+
Writer class.
764760
765-
In the other direction, data is read from the stream using a
766-
Reader instance and then return encoded data to the caller.
761+
In the other direction, data is read from the underlying stream using
762+
a Reader instance and then encoded and returned to the caller.
767763
768764
"""
769765
# Optional attributes set by the file wrappers below
@@ -775,22 +771,17 @@ def __init__(self, stream, encode, decode, Reader, Writer,
775771

776772
""" Creates a StreamRecoder instance which implements a two-way
777773
conversion: encode and decode work on the frontend (the
778-
input to .read() and output of .write()) while
779-
Reader and Writer work on the backend (reading and
780-
writing to the stream).
774+
data visible to .read() and .write()) while Reader and Writer
775+
work on the backend (the data in stream).
781776
782-
You can use these objects to do transparent direct
783-
recodings from e.g. latin-1 to utf-8 and back.
777+
You can use these objects to do transparent
778+
transcodings from e.g. latin-1 to utf-8 and back.
784779
785780
stream must be a file-like object.
786781
787-
encode, decode must adhere to the Codec interface, Reader,
782+
encode and decode must adhere to the Codec interface; Reader and
788783
Writer must be factory functions or classes providing the
789-
StreamReader, StreamWriter interface resp.
790-
791-
encode and decode are needed for the frontend translation,
792-
Reader and Writer for the backend translation. Unicode is
793-
used as intermediate encoding.
784+
StreamReader and StreamWriter interfaces resp.
794785
795786
Error handling is done in the same way as defined for the
796787
StreamWriter/Readers.
@@ -865,7 +856,7 @@ def __exit__(self, type, value, tb):
865856

866857
### Shortcuts
867858

868-
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
859+
def open(filename, mode='r', encoding=None, errors='strict', buffering=1):
869860

870861
""" Open an encoded file using the given mode and return
871862
a wrapped version providing transparent encoding/decoding.
@@ -875,10 +866,8 @@ def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
875866
codecs. Output is also codec dependent and will usually be
876867
Unicode as well.
877868
878-
Files are always opened in binary mode, even if no binary mode
879-
was specified. This is done to avoid data loss due to encodings
880-
using 8-bit values. The default file mode is 'rb' meaning to
881-
open the file in binary read mode.
869+
Underlying encoded files are always opened in binary mode.
870+
The default file mode is 'r', meaning to open the file in read mode.
882871
883872
encoding specifies the encoding which is to be used for the
884873
file.
@@ -914,13 +903,13 @@ def EncodedFile(file, data_encoding, file_encoding=None, errors='strict'):
914903
""" Return a wrapped version of file which provides transparent
915904
encoding translation.
916905
917-
Strings written to the wrapped file are interpreted according
918-
to the given data_encoding and then written to the original
919-
file as string using file_encoding. The intermediate encoding
906+
Data written to the wrapped file is decoded according
907+
to the given data_encoding and then encoded to the underlying
908+
file using file_encoding. The intermediate data type
920909
will usually be Unicode but depends on the specified codecs.
921910
922-
Strings are read from the file using file_encoding and then
923-
passed back to the caller as string using data_encoding.
911+
Bytes read from the file are decoded using file_encoding and then
912+
passed back to the caller encoded using data_encoding.
924913
925914
If file_encoding is not given, it defaults to data_encoding.
926915

Lib/test/test_codecs.py

Lines changed: 37 additions & 9 deletions
@@ -1140,6 +1140,8 @@ def test_recoding(self):
11401140
# Python used to crash on this at exit because of a refcount
11411141
# bug in _codecsmodule.c
11421142

1143+
self.assertTrue(f.closed)
1144+
11431145
# From RFC 3492
11441146
punycode_testcases = [
11451147
# A Arabic (Egyptian):
@@ -1592,6 +1594,16 @@ def test_incremental_encode(self):
15921594
self.assertEqual(encoder.encode("ample.org."), b"xn--xample-9ta.org.")
15931595
self.assertEqual(encoder.encode("", True), b"")
15941596

1597+
def test_errors(self):
1598+
"""Only supports "strict" error handler"""
1599+
"python.org".encode("idna", "strict")
1600+
b"python.org".decode("idna", "strict")
1601+
for errors in ("ignore", "replace", "backslashreplace",
1602+
"surrogateescape"):
1603+
self.assertRaises(Exception, "python.org".encode, "idna", errors)
1604+
self.assertRaises(Exception,
1605+
b"python.org".decode, "idna", errors)
1606+
15951607
class CodecsModuleTest(unittest.TestCase):
15961608

15971609
def test_decode(self):
@@ -1682,6 +1694,24 @@ def test_all(self):
16821694
for api in codecs.__all__:
16831695
getattr(codecs, api)
16841696

1697+
def test_open(self):
1698+
self.addCleanup(support.unlink, support.TESTFN)
1699+
for mode in ('w', 'r', 'r+', 'w+', 'a', 'a+'):
1700+
with self.subTest(mode), \
1701+
codecs.open(support.TESTFN, mode, 'ascii') as file:
1702+
self.assertIsInstance(file, codecs.StreamReaderWriter)
1703+
1704+
def test_undefined(self):
1705+
self.assertRaises(UnicodeError, codecs.encode, 'abc', 'undefined')
1706+
self.assertRaises(UnicodeError, codecs.decode, b'abc', 'undefined')
1707+
self.assertRaises(UnicodeError, codecs.encode, '', 'undefined')
1708+
self.assertRaises(UnicodeError, codecs.decode, b'', 'undefined')
1709+
for errors in ('strict', 'ignore', 'replace', 'backslashreplace'):
1710+
self.assertRaises(UnicodeError,
1711+
codecs.encode, 'abc', 'undefined', errors)
1712+
self.assertRaises(UnicodeError,
1713+
codecs.decode, b'abc', 'undefined', errors)
1714+
16851715
class StreamReaderTest(unittest.TestCase):
16861716

16871717
def setUp(self):
@@ -1815,13 +1845,10 @@ def test_basic(self):
18151845
# "undefined"
18161846

18171847
# The following encodings don't work in stateful mode
1818-
broken_unicode_with_streams = [
1848+
broken_unicode_with_stateful = [
18191849
"punycode",
18201850
"unicode_internal"
18211851
]
1822-
broken_incremental_coders = broken_unicode_with_streams + [
1823-
"idna",
1824-
]
18251852

18261853
class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
18271854
def test_basics(self):
@@ -1841,7 +1868,7 @@ def test_basics(self):
18411868
(chars, size) = codecs.getdecoder(encoding)(b)
18421869
self.assertEqual(chars, s, "encoding=%r" % encoding)
18431870

1844-
if encoding not in broken_unicode_with_streams:
1871+
if encoding not in broken_unicode_with_stateful:
18451872
# check stream reader/writer
18461873
q = Queue(b"")
18471874
writer = codecs.getwriter(encoding)(q)
@@ -1859,7 +1886,7 @@ def test_basics(self):
18591886
decodedresult += reader.read()
18601887
self.assertEqual(decodedresult, s, "encoding=%r" % encoding)
18611888

1862-
if encoding not in broken_incremental_coders:
1889+
if encoding not in broken_unicode_with_stateful:
18631890
# check incremental decoder/encoder and iterencode()/iterdecode()
18641891
try:
18651892
encoder = codecs.getincrementalencoder(encoding)()
@@ -1908,7 +1935,7 @@ def test_basics_capi(self):
19081935
from _testcapi import codec_incrementalencoder, codec_incrementaldecoder
19091936
s = "abc123" # all codecs should be able to encode these
19101937
for encoding in all_unicode_encodings:
1911-
if encoding not in broken_incremental_coders:
1938+
if encoding not in broken_unicode_with_stateful:
19121939
# check incremental decoder/encoder (fetched via the C API)
19131940
try:
19141941
cencoder = codec_incrementalencoder(encoding)
@@ -1948,7 +1975,7 @@ def test_seek(self):
19481975
for encoding in all_unicode_encodings:
19491976
if encoding == "idna": # FIXME: See SF bug #1163178
19501977
continue
1951-
if encoding in broken_unicode_with_streams:
1978+
if encoding in broken_unicode_with_stateful:
19521979
continue
19531980
reader = codecs.getreader(encoding)(io.BytesIO(s.encode(encoding)))
19541981
for t in range(5):
@@ -1981,7 +2008,7 @@ def test_decoder_state(self):
19812008
# Check that getstate() and setstate() handle the state properly
19822009
u = "abc123"
19832010
for encoding in all_unicode_encodings:
1984-
if encoding not in broken_incremental_coders:
2011+
if encoding not in broken_unicode_with_stateful:
19852012
self.check_state_handling_decode(encoding, u, u.encode(encoding))
19862013
self.check_state_handling_encode(encoding, u, u.encode(encoding))
19872014

@@ -2185,6 +2212,7 @@ def test_encodedfile(self):
21852212
f = io.BytesIO(b"\xc3\xbc")
21862213
with codecs.EncodedFile(f, "latin-1", "utf-8") as ef:
21872214
self.assertEqual(ef.read(), b"\xfc")
2215+
self.assertTrue(f.closed)
21882216

21892217
def test_streamreaderwriter(self):
21902218
f = io.BytesIO(b"\xc3\xbc")
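
The new `test_errors` and `test_undefined` cases check two documented behaviours: the `idna` codec only supports the `'strict'` error handler, and the `undefined` codec always fails. The sketch below restates those checks outside of `unittest`. The helper name `raises` is hypothetical, and the idna check catches `Exception` because that is all the test above asserts.

```python
import codecs

def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type (hypothetical helper)."""
    try:
        func(*args)
    except exc_type:
        return True
    return False

# 'strict' is accepted by the idna codec ...
idna_strict = "python.org".encode("idna", "strict")

# ... while other error handlers are rejected.
idna_ignore_rejected = raises(Exception, "python.org".encode, "idna", "ignore")

# The 'undefined' codec raises UnicodeError even for empty input.
undefined_rejected = raises(UnicodeError, codecs.encode, "", "undefined")
```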

Misc/NEWS

Lines changed: 8 additions & 0 deletions
@@ -1441,6 +1441,10 @@ C API
14411441
Documentation
14421442
-------------
14431443

1444+
- Issue #19548: Update the codecs module documentation to better cover the
1445+
distinction between text encodings and other codecs, together with other
1446+
clarifications. Patch by Martin Panter.
1447+
14441448
- Issue #22394: Doc/Makefile now supports ``make venv PYTHON=../python`` to
14451449
create a venv for generating the documentation, e.g.,
14461450
``make html PYTHON=venv/bin/python3``.
@@ -1477,6 +1481,10 @@ Documentation
14771481
Tests
14781482
-----
14791483

1484+
- Issue #19548: Added some additional checks to test_codecs to ensure that
1485+
statements in the updated documentation remain accurate. Patch by Martin
1486+
Panter.
1487+
14801488
- Issue #22838: All test_re tests now work with unittest test discovery.
14811489

14821490
- Issue #22173: Update lib2to3 tests to use unittest test discovery.

Modules/_codecsmodule.c

Lines changed: 3 additions & 3 deletions
@@ -54,9 +54,9 @@ PyDoc_STRVAR(register__doc__,
5454
"register(search_function)\n\
5555
\n\
5656
Register a codec search function. Search functions are expected to take\n\
57-
one argument, the encoding name in all lower case letters, and return\n\
58-
a tuple of functions (encoder, decoder, stream_reader, stream_writer)\n\
59-
(or a CodecInfo object).");
57+
one argument, the encoding name in all lower case letters, and either\n\
58+
return None, or a tuple of functions (encoder, decoder, stream_reader,\n\
59+
stream_writer) (or a CodecInfo object).");
6060

6161
static
6262
PyObject *codec_register(PyObject *self, PyObject *search_function)
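
The updated `register()` docstring states that a search function may return `None` as well as a tuple or a `CodecInfo` object. A minimal sketch of such a search function follows. The codec name `demo_utf8` and the function name `demo_search` are hypothetical; the example simply reuses the built-in UTF-8 codec under a new name.

```python
import codecs

def demo_search(name):
    """Hypothetical search function: knows one codec name, declines others."""
    if name != "demo_utf8":
        return None  # let other registered search functions try
    utf8 = codecs.lookup("utf-8")
    return codecs.CodecInfo(
        name="demo_utf8",
        encode=utf8.encode,
        decode=utf8.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
    )

codecs.register(demo_search)

encoded = "é".encode("demo_utf8")

# A name that no search function recognises ends in LookupError.
try:
    codecs.lookup("no_such_codec_demo")
    unknown_rejected = False
except LookupError:
    unknown_rejected = True
```

Returning `None` for unknown names matters because the search functions are tried in turn, so one that raised instead would prevent later ones from being consulted.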
