Commit 1bcb8a6

bpo-33409: Clarify PEP 538/540 relationship (GH-7534)
While locale coercion and UTF-8 mode turned out to be complementary ideas
rather than competing ones, it isn't immediately obvious why it's useful to
have both, or how they interact at runtime.

This updates both the Python 3.7 What's New doc and the PYTHONCOERCECLOCALE
and PYTHONUTF8 documentation in an attempt to clarify that relationship:

- in the respective What's New sections, add a closing paragraph explaining
  which problem each one solves, and pointing to the other PEP's section for
  the specific aspects it relies on the other PEP to solve
- use "locale-aware mode" as a more descriptive term for the default
  non-UTF-8 mode
- improve wording consistency between the PYTHONCOERCECLOCALE and PYTHONUTF8
  docs when they cover the same thing (mostly related to legacy locale
  detection and setting the standard stream error handler)
- improve the description of the locale coercion trigger conditions
  (including pointing out that setting LC_ALL turns off locale coercion)
- port the full description of the UTF-8 mode behaviour changes from PEP 540
  into the PYTHONUTF8 documentation
- be explicit that PYTHONIOENCODING still overrides the settings for the
  standard streams
- mention concrete examples of things that do and don't get their text
  encoding assumptions adjusted by the two text encoding assumption override
  techniques
1 parent 4acc140 commit 1bcb8a6

3 files changed

+106
-24
lines changed

Doc/using/cmdline.rst

Lines changed: 79 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -438,8 +438,10 @@ Miscellaneous options
438438
* Set the :attr:`~sys.flags.dev_mode` attribute of :attr:`sys.flags` to
439439
``True``
440440

441-
* ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the
442-
UTF-8 mode.
441+
* ``-X utf8`` enables UTF-8 mode for operating system interfaces, overriding
442+
the default locale-aware mode. ``-X utf8=0`` explicitly disables UTF-8
443+
mode (even when it would otherwise activate automatically).
444+
See :envvar:`PYTHONUTF8` for more details.
443445

444446
It also allows passing arbitrary values and retrieving them through the
445447
:data:`sys._xoptions` dictionary.
@@ -789,36 +791,49 @@ conflict.
789791
.. envvar:: PYTHONCOERCECLOCALE
790792

791793
If set to the value ``0``, causes the main Python command line application
792-
to skip coercing the legacy ASCII-based C locale to a more capable UTF-8
793-
based alternative.
794+
to skip coercing the legacy ASCII-based C and POSIX locales to a more
795+
capable UTF-8 based alternative.
794796

795-
If this variable is *not* set, or is set to a value other than ``0``, and
796-
the current locale reported for the ``LC_CTYPE`` category is the default
797-
``C`` locale, then the Python CLI will attempt to configure the following
798-
locales for the ``LC_CTYPE`` category in the order listed before loading the
799-
interpreter runtime:
797+
If this variable is *not* set (or is set to a value other than ``0``), the
798+
``LC_ALL`` locale override environment variable is also not set, and the
799+
current locale reported for the ``LC_CTYPE`` category is either the default
800+
``C`` locale, or else the explicitly ASCII-based ``POSIX`` locale, then the
801+
Python CLI will attempt to configure the following locales for the
802+
``LC_CTYPE`` category in the order listed before loading the interpreter
803+
runtime:
800804

801805
* ``C.UTF-8``
802806
* ``C.utf8``
803807
* ``UTF-8``
804808

805809
If setting one of these locale categories succeeds, then the ``LC_CTYPE``
806810
environment variable will also be set accordingly in the current process
807-
environment before the Python runtime is initialized. This ensures the
808-
updated setting is seen in subprocesses, as well as in operations that
809-
query the environment rather than the current C locale (such as Python's
810-
own :func:`locale.getdefaultlocale`).
811+
environment before the Python runtime is initialized. This ensures that in
812+
addition to being seen by both the interpreter itself and other locale-aware
813+
components running in the same process (such as the GNU ``readline``
814+
library), the updated setting is also seen in subprocesses (regardless of
815+
whether or not those processes are running a Python interpreter), as well as
816+
in operations that query the environment rather than the current C locale
817+
(such as Python's own :func:`locale.getdefaultlocale`).
811818

812819
Configuring one of these locales (either explicitly or via the above
813-
implicit locale coercion) will automatically set the error handler for
814-
:data:`sys.stdin` and :data:`sys.stdout` to ``surrogateescape``. This
815-
behavior can be overridden using :envvar:`PYTHONIOENCODING` as usual.
820+
implicit locale coercion) automatically enables the ``surrogateescape``
821+
:ref:`error handler <error-handlers>` for :data:`sys.stdin` and
822+
:data:`sys.stdout` (:data:`sys.stderr` continues to use ``backslashreplace``
823+
as it does in any other locale). This stream handling behavior can be
824+
overridden using :envvar:`PYTHONIOENCODING` as usual.
816825

817826
For debugging purposes, setting ``PYTHONCOERCECLOCALE=warn`` will cause
818827
Python to emit warning messages on ``stderr`` if either the locale coercion
819828
activates, or else if a locale that *would* have triggered coercion is
820829
still active when the Python runtime is initialized.
821830

831+
Also note that even when locale coercion is disabled, or when it fails to
832+
find a suitable target locale, :envvar:`PYTHONUTF8` will still activate by
833+
default in legacy ASCII-based locales. Both features must be disabled in
834+
order to force the interpreter to use ``ASCII`` instead of ``UTF-8`` for
835+
system interfaces.
836+
822837
Availability: \*nix
823838

824839
.. versionadded:: 3.7
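The coercion behaviour described in this hunk can be observed from the outside by starting a child interpreter in the legacy ``C`` locale. The following is a minimal sketch, not part of the commit: ``coercion_report`` and ``CHILD_CODE`` are hypothetical names, and the result depends on which target locales the running system provides, so no particular output is assumed.

```python
import json
import os
import subprocess
import sys

# Code run in the child interpreter: report what it sees after startup.
CHILD_CODE = """
import json, locale, os
print(json.dumps({
    "LC_CTYPE_env": os.environ.get("LC_CTYPE"),
    "lc_ctype": locale.setlocale(locale.LC_CTYPE, None),
}))
"""


def coercion_report(coerce_setting=None):
    """Start a child interpreter with LANG=C and report its LC_CTYPE state.

    Hypothetical helper for illustration only. ``coerce_setting`` is the
    value given to PYTHONCOERCECLOCALE (None leaves the variable unset).
    """
    env = dict(os.environ)
    # Drop settings that would override or bypass locale coercion.
    for name in ("LC_ALL", "LC_CTYPE", "PYTHONCOERCECLOCALE",
                 "PYTHONUTF8", "PYTHONIOENCODING"):
        env.pop(name, None)
    env["LANG"] = "C"
    if coerce_setting is not None:
        env["PYTHONCOERCECLOCALE"] = coerce_setting
    proc = subprocess.run([sys.executable, "-c", CHILD_CODE], env=env,
                          stdout=subprocess.PIPE, check=True)
    return json.loads(proc.stdout.decode("ascii"))


if __name__ == "__main__":
    print("coercion enabled: ", coercion_report())
    print("coercion disabled:", coercion_report("0"))
```

When coercion succeeds, the child should report one of the three target locales in its ``LC_CTYPE`` environment variable; with ``PYTHONCOERCECLOCALE=0`` the variable stays unset because the interpreter never writes it.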
@@ -834,10 +849,56 @@ conflict.
834849

835850
.. envvar:: PYTHONUTF8
836851

837-
If set to ``1``, enable the UTF-8 mode. If set to ``0``, disable the UTF-8
838-
mode. Any other non-empty string cause an error.
852+
If set to ``1``, enables the interpreter's UTF-8 mode, where ``UTF-8`` is
853+
used as the text encoding for system interfaces, regardless of the
854+
current locale setting.
855+
856+
This means that:
857+
858+
* :func:`sys.getfilesystemencoding()` returns ``'UTF-8'`` (the locale
859+
encoding is ignored).
860+
* :func:`locale.getpreferredencoding()` returns ``'UTF-8'`` (the locale
861+
encoding is ignored, and the function's ``do_setlocale`` parameter has no
862+
effect).
863+
* :data:`sys.stdin`, :data:`sys.stdout`, and :data:`sys.stderr` all use
864+
UTF-8 as their text encoding, with the ``surrogateescape``
865+
:ref:`error handler <error-handlers>` being enabled for :data:`sys.stdin`
866+
and :data:`sys.stdout` (:data:`sys.stderr` continues to use
867+
``backslashreplace`` as it does in the default locale-aware mode).
868+
869+
As a consequence of the changes in those lower level APIs, other higher
870+
level APIs also exhibit different default behaviours:
871+
872+
* Command line arguments, environment variables and filenames are decoded
873+
to text using the UTF-8 encoding.
874+
* :func:`os.fsdecode()` and :func:`os.fsencode()` use the UTF-8 encoding.
875+
* :func:`open()`, :func:`io.open()`, and :func:`codecs.open()` use the UTF-8
876+
encoding by default. However, they still use the strict error handler by
877+
default so that attempting to open a binary file in text mode is likely
878+
to raise an exception rather than producing nonsense data.
879+
880+
Note that the standard stream settings in UTF-8 mode can be overridden by
881+
:envvar:`PYTHONIOENCODING` (just as they can be in the default locale-aware
882+
mode).
883+
884+
If set to ``0``, the interpreter runs in its default locale-aware mode.
885+
886+
Setting any other non-empty string causes an error during interpreter
887+
initialisation.
888+
889+
If this environment variable is not set at all, then the interpreter defaults
890+
to using the current locale settings, *unless* the current locale is
891+
identified as a legacy ASCII-based locale
892+
(as described for :envvar:`PYTHONCOERCECLOCALE`), and locale coercion is
893+
either disabled or fails. In such legacy locales, the interpreter will
894+
default to enabling UTF-8 mode unless explicitly instructed not to do so.
895+
896+
Also available as the :option:`-X` ``utf8`` option.
897+
898+
Availability: \*nix
839899

840900
.. versionadded:: 3.7
901+
See :pep:`540` for more details.
841902

842903

843904
Debug-mode variables
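The UTF-8 mode effects listed in the ``PYTHONUTF8`` entry above can be checked by asking a child interpreter for the relevant settings. This is a sketch under stated assumptions rather than anything taken from the commit: ``utf8_mode_report`` and ``CHILD_CODE`` are hypothetical names, and it relies on ``sys.flags.utf8_mode``, which exists from Python 3.7 onwards.

```python
import codecs
import json
import os
import subprocess
import sys

# Code run in the child interpreter: collect the settings UTF-8 mode affects.
CHILD_CODE = """
import json, locale, sys
print(json.dumps({
    "utf8_mode": sys.flags.utf8_mode,
    "filesystem": sys.getfilesystemencoding(),
    "preferred": locale.getpreferredencoding(False),
    "stdout_errors": sys.stdout.errors,
    "stderr_errors": sys.stderr.errors,
}))
"""


def utf8_mode_report(*interpreter_args):
    """Run a child interpreter with the given options and report encodings.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    # Remove variables that would change the result independently of -X.
    for name in ("PYTHONUTF8", "PYTHONIOENCODING"):
        env.pop(name, None)
    cmd = [sys.executable, *interpreter_args, "-c", CHILD_CODE]
    proc = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    report = json.loads(proc.stdout.decode("ascii"))
    # Normalise spellings such as 'UTF-8' and 'utf-8' to one codec name.
    for key in ("filesystem", "preferred"):
        report[key] = codecs.lookup(report[key]).name
    return report


if __name__ == "__main__":
    print("-X utf8:  ", utf8_mode_report("-X", "utf8"))
    print("-X utf8=0:", utf8_mode_report("-X", "utf8=0"))
```

Encoding names are passed through ``codecs.lookup`` because different Python versions spell the same codec differently, which would otherwise make the two reports hard to compare.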

Doc/whatsnew/3.7.rst

Lines changed: 25 additions & 6 deletions
@@ -97,9 +97,10 @@ Significant improvements in the standard library:
9797

9898
CPython implementation improvements:
9999

100+
* Avoiding the use of ASCII as a default text encoding:
101+
* :ref:`PEP 538 <whatsnew37-pep538>`, legacy C locale coercion
102+
* :ref:`PEP 540 <whatsnew37-pep540>`, forced UTF-8 runtime mode
100103
* :ref:`PEP 552 <whatsnew37-pep552>`, deterministic .pycs
101-
* :ref:`PEP 538 <whatsnew37-pep538>`, legacy C locale coercion
102-
* :ref:`PEP 540 <whatsnew37-pep540>`, forced UTF-8 runtime mode
103104
* :ref:`the new development runtime mode <whatsnew37-devmode>`
104105
* :ref:`PEP 565 <whatsnew37-pep565>`, improved :exc:`DeprecationWarning`
105106
handling
@@ -184,7 +185,8 @@ PEP 538: Legacy C Locale Coercion
184185

185186
An ongoing challenge within the Python 3 series has been determining a sensible
186187
default strategy for handling the "7-bit ASCII" text encoding assumption
187-
currently implied by the use of the default C locale on non-Windows platforms.
188+
currently implied by the use of the default C or POSIX locale on non-Windows
189+
platforms.
188190

189191
:pep:`538` updates the default interpreter command line interface to
190192
automatically coerce that locale to an available UTF-8 based locale as
@@ -205,10 +207,18 @@ continues to be ``backslashreplace``, regardless of locale.
205207

206208
Locale coercion is silent by default, but to assist in debugging potentially
207209
locale related integration problems, explicit warnings (emitted directly on
208-
:data:`~sys.stderr` can be requested by setting ``PYTHONCOERCECLOCALE=warn``.
210+
:data:`~sys.stderr`) can be requested by setting ``PYTHONCOERCECLOCALE=warn``.
209211
This setting will also cause the Python runtime to emit a warning if the
210212
legacy C locale remains active when the core interpreter is initialized.
211213

214+
While :pep:`538`'s locale coercion has the benefit of also affecting extension
215+
modules (such as GNU ``readline``), as well as child processes (including those
216+
running non-Python applications and older versions of Python), it has the
217+
downside of requiring that a suitable target locale be present on the running
218+
system. To better handle the case where no suitable target locale is available
219+
(as occurs on RHEL/CentOS 7, for example), Python 3.7 also implements
220+
:ref:`whatsnew37-pep540`.
221+
212222
.. seealso::
213223

214224
:pep:`538` -- Coercing the legacy C locale to a UTF-8 based locale
@@ -231,8 +241,17 @@ The forced UTF-8 mode can be used to change the text handling behavior in
231241
an embedded Python interpreter without changing the locale settings of
232242
an embedding application.
233243

234-
The UTF-8 mode is enabled by default when the locale is "C". See
235-
:ref:`whatsnew37-pep538` for details.
244+
While :pep:`540`'s UTF-8 mode has the benefit of working regardless of which
245+
locales are available on the running system, it has the downside of having no
246+
effect on extension modules (such as GNU ``readline``), child processes running
247+
non-Python applications, and child processes running older versions of Python.
248+
To reduce the risk of corrupting text data when communicating with such
249+
components, Python 3.7 also implements :ref:`whatsnew37-pep538`.
250+
251+
The UTF-8 mode is enabled by default when the locale is ``C`` or ``POSIX``, and
252+
the :pep:`538` locale coercion feature fails to change it to a UTF-8 based
253+
alternative (whether that failure is due to ``PYTHONCOERCECLOCALE=0`` being set,
254+
``LC_ALL`` being set, or the lack of a suitable target locale).
236255

237256
.. seealso::
238257

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
Clarified the relationship between PEP 538's PYTHONCOERCECLOCALE and PEP
2+
540's PYTHONUTF8 mode.
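
To see how the two settings interact, a child interpreter can be started in the legacy ``C`` locale under different combinations of ``PYTHONCOERCECLOCALE``, ``LC_ALL`` and ``PYTHONUTF8``. The sketch below is illustrative only: ``default_utf8_mode`` is a hypothetical helper, and automatic activation of UTF-8 mode in the ``C`` locale is assumed to apply only on non-Windows platforms, as the documentation in this commit states.

```python
import os
import subprocess
import sys


def default_utf8_mode(**overrides):
    """Return sys.flags.utf8_mode from a child started with LANG=C.

    Hypothetical helper for illustration only. Keyword arguments are
    extra environment variables applied on top of the cleaned environment.
    """
    env = dict(os.environ)
    # Start from a clean slate so only the overrides influence the result.
    for name in ("LC_ALL", "LC_CTYPE", "PYTHONCOERCECLOCALE",
                 "PYTHONUTF8", "PYTHONIOENCODING"):
        env.pop(name, None)
    env["LANG"] = "C"
    env.update(overrides)
    proc = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env, stdout=subprocess.PIPE, check=True)
    return int(proc.stdout.decode("ascii").strip())


if __name__ == "__main__":
    scenarios = {
        "coercion disabled": {"PYTHONCOERCECLOCALE": "0"},
        "LC_ALL set to C": {"LC_ALL": "C"},
        "both features disabled": {"PYTHONCOERCECLOCALE": "0",
                                   "PYTHONUTF8": "0"},
    }
    for label, overrides in scenarios.items():
        print(label, "->", default_utf8_mode(**overrides))
```

Only the last scenario switches off both features, which is the combination the documentation describes as necessary to force ``ASCII`` for system interfaces.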
