PEP: 597
Title: Add optional EncodingWarning
Last-Modified: 07-Aug-2021
Author: Inada Naoki <songofacandy@gmail.com>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jun-2019
Python-Version: 3.10


Abstract
========

Add a new warning category ``EncodingWarning``. It is emitted when the
``encoding`` argument to ``open()`` is omitted and the default
locale-specific encoding is used.

The warning is disabled by default. A new ``-X warn_default_encoding``
command-line option and a new ``PYTHONWARNDEFAULTENCODING`` environment
variable can be used to enable it.

A ``"locale"`` argument value for ``encoding`` is added too. It
explicitly specifies that the locale encoding should be used, silencing
the warning.


Motivation
==========

Using the default encoding is a common mistake
----------------------------------------------

Developers using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, using ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake. Many Windows users cannot install
such packages if there is at least one non-ASCII character
(e.g. emoji, author names, copyright symbols, and the like)
in their UTF-8-encoded ``README.md`` file.

Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII
characters in their README, and 82 fail to install from source on
non-UTF-8 locales due to not specifying an encoding for a non-ASCII
file. [1]_
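
The ``README.md`` case can be reproduced on any platform by decoding a
UTF-8 file with a legacy code page. The following is a minimal sketch;
the file content and the choice of ``cp1252`` as a stand-in for a
Windows locale encoding are assumptions made for illustration:

.. code-block:: python

   import os
   import tempfile

   # Hypothetical README content containing one non-ASCII character.
   text = "Café\n"

   with tempfile.TemporaryDirectory() as tmp:
       path = os.path.join(tmp, "README.md")
       with open(path, "w", encoding="utf-8", newline="") as f:
           f.write(text)

       # Simulate a Western European Windows locale by naming its
       # legacy code page explicitly instead of relying on the default.
       with open(path, encoding="cp1252") as f:
           misread = f.read()

       # Passing the real encoding works regardless of the locale.
       with open(path, encoding="utf-8") as f:
           correct = f.read()

   print(repr(misread))   # the two UTF-8 bytes of "é" become two characters
   print(repr(correct))

Here the wrong encoding silently produces mojibake. With other locale
encodings the same read may raise ``UnicodeDecodeError`` instead, which
is the installation failure described above.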

Another example is ``logging.basicConfig(filename="log.txt")``.
Some users might expect it to use UTF-8 by default, but the locale
encoding is actually what is used. [2]_

Even Python experts may assume that the default encoding is UTF-8.
This creates bugs that only happen on Windows; see [3]_, [4]_, [5]_,
and [6]_ for example.

Emitting a warning when the ``encoding`` argument is omitted will help
find such mistakes.


Explicit way to use locale-specific encoding
--------------------------------------------

``open(filename)`` isn't explicit about which encoding is expected:

* If ASCII is assumed, this isn't a bug, but may result in decreased
  performance on Windows, particularly with non-Latin-1 locale encodings
* If UTF-8 is assumed, this may be a bug or a platform-specific script
* If the locale encoding is assumed, the behavior is as expected
  (but could change if future versions of Python modify the default)

From this point of view, ``open(filename)`` is not readable code.

``encoding=locale.getpreferredencoding(False)`` can be used to
specify the locale encoding explicitly, but it is too long and easy
to misuse (e.g. one can forget to pass ``False`` as its argument).

This PEP provides an explicit way to specify the locale encoding.


Prepare to change the default encoding to UTF-8
-----------------------------------------------

Since UTF-8 has become the de facto standard text encoding,
we might default to it for opening files in the future.

However, such a change will affect many applications and libraries.
If we start emitting ``DeprecationWarning`` everywhere the ``encoding``
argument is omitted, it will be too noisy and painful.

Although this PEP doesn't propose changing the default encoding,
it will help enable that change by:

* Reducing the number of omitted ``encoding`` arguments in libraries
  before we start emitting a ``DeprecationWarning`` by default.

* Allowing users to pass ``encoding="locale"`` to suppress
  the current warning and any ``DeprecationWarning`` added in the future,
  as well as retaining consistent behavior if later Python versions
  change the default, ensuring support for any Python version >=3.10.


Specification
=============

``EncodingWarning``
-------------------

Add a new ``EncodingWarning`` warning class as a subclass of
``Warning``. It is emitted when the ``encoding`` argument is omitted and
the default locale-specific encoding is used.


Options to enable the warning
-----------------------------

The ``-X warn_default_encoding`` option and the
``PYTHONWARNDEFAULTENCODING`` environment variable are added. They
are used to enable ``EncodingWarning``.

``sys.flags.warn_default_encoding`` is also added. The flag is true when
``EncodingWarning`` is enabled.

When the flag is set, ``io.TextIOWrapper()``, ``open()``, and modules
that use them will emit ``EncodingWarning`` when the ``encoding``
argument is omitted.

Since ``EncodingWarning`` is a subclass of ``Warning``, it is
shown by default (if the ``warn_default_encoding`` flag is set), unlike
``DeprecationWarning``.
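
As a rough illustration of the option in use, the sketch below starts a
child interpreter with ``-X warn_default_encoding`` and captures what it
prints. The ``CHILD`` program and the ``run_child()`` helper are
hypothetical names introduced for this example, and Python 3.10 or later
is assumed:

.. code-block:: python

   import subprocess
   import sys

   # Child program: report the flag, then open a file without ``encoding``.
   CHILD = (
       "import os, sys\n"
       "print(sys.flags.warn_default_encoding)\n"
       "open(os.devnull).close()\n"
   )


   def run_child(*interpreter_args):
       """Run CHILD in a fresh interpreter and return (stdout, stderr)."""
       proc = subprocess.run(
           [sys.executable, *interpreter_args, "-c", CHILD],
           capture_output=True,
           text=True,
       )
       return proc.stdout, proc.stderr


   out, err = run_child("-X", "warn_default_encoding")
   print(out, end="")
   print(err, end="")

The captured standard error should contain an ``EncodingWarning`` line
for the ``open()`` call. Setting ``PYTHONWARNDEFAULTENCODING=1`` in the
child's environment is the equivalent way to enable the flag.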


``encoding="locale"``
---------------------

``io.TextIOWrapper`` will accept ``"locale"`` as a valid argument to
``encoding``. It has the same meaning as the current ``encoding=None``,
except that ``io.TextIOWrapper`` doesn't emit ``EncodingWarning`` when
``encoding="locale"`` is specified.
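
A small sketch of the new argument value, assuming Python 3.10 or later
and an in-memory buffer chosen only to keep the example self-contained:

.. code-block:: python

   import io

   # "locale" is an explicit request for the locale encoding, so no
   # EncodingWarning is emitted even when the warning is enabled.
   buffer = io.BytesIO()
   wrapper = io.TextIOWrapper(buffer, encoding="locale", newline="\n")
   wrapper.write("plain ASCII text\n")
   wrapper.flush()

   # The wrapper reports the codec name it resolved for the locale.
   resolved = wrapper.encoding
   data = buffer.getvalue()
   print(resolved)

Because ``open()`` passes ``encoding`` through to ``io.TextIOWrapper``,
``open(path, encoding="locale")`` behaves the same way.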


``io.text_encoding()``
----------------------

``io.text_encoding()`` is a helper for functions with an
``encoding=None`` parameter that pass it to ``io.TextIOWrapper()`` or
``open()``.

A pure Python implementation will look like this::

   import sys


   def text_encoding(encoding, stacklevel=1):
       """A helper function to choose the text encoding.

       When *encoding* is not None, just return it.
       Otherwise, return the default text encoding (i.e. "locale").

       This function emits an EncodingWarning if *encoding* is None and
       sys.flags.warn_default_encoding is true.

       This function can be used in APIs with an encoding=None parameter
       that pass it to TextIOWrapper or open.
       However, please consider using encoding="utf-8" for new APIs.
       """
       if encoding is None:
           if sys.flags.warn_default_encoding:
               import warnings
               warnings.warn(
                   "'encoding' argument not specified.",
                   EncodingWarning, stacklevel + 2)
           encoding = "locale"
       return encoding

For example, ``pathlib.Path.read_text()`` can use it like this:

.. code-block:: python

   def read_text(self, encoding=None, errors=None):
       encoding = io.text_encoding(encoding)
       with self.open(mode='r', encoding=encoding, errors=errors) as f:
           return f.read()

By using ``io.text_encoding()``, ``EncodingWarning`` is emitted for
the caller of ``read_text()`` instead of ``read_text()`` itself.
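
Third-party wrappers can follow the same pattern. The sketch below uses
a hypothetical ``read_settings()`` function, not part of any library,
and assumes Python 3.10 or later:

.. code-block:: python

   import io


   def read_settings(path, encoding=None):
       """Hypothetical wrapper that forwards *encoding* to open()."""
       # Resolve None here so that, when -X warn_default_encoding is
       # active, the warning points at the caller of read_settings().
       encoding = io.text_encoding(encoding)
       with open(path, encoding=encoding) as f:
           return f.read()

A call such as ``read_settings("app.cfg")`` is then reported at the call
site, while ``read_settings("app.cfg", encoding="utf-8")`` emits no
warning.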


Affected standard library modules
---------------------------------

Many standard library modules will be affected by this change.

Most APIs accepting ``encoding=None`` will use ``io.text_encoding()``
as written in the previous section.

Where using the locale encoding as the default encoding is reasonable,
``encoding="locale"`` will be used instead. For example,
the ``subprocess`` module will use the locale encoding as the default
for pipes.

Many tests use ``open()`` without ``encoding`` specified to read
ASCII text files. They should be rewritten with ``encoding="ascii"``.


Rationale
=========

Opt-in warning
--------------

Although ``DeprecationWarning`` is suppressed by default, always
emitting ``DeprecationWarning`` when the ``encoding`` argument is
omitted would be too noisy.

Noisy warnings may lead developers to dismiss the
``DeprecationWarning``.


"locale" is not a codec alias
-----------------------------

We don't add "locale" as a codec alias because the locale can be
changed at runtime.

Additionally, ``TextIOWrapper`` checks ``os.device_encoding()``
when ``encoding=None``. This behavior cannot be implemented in
a codec.
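
One consequence can be checked directly: the codec registry does not
know the name ``"locale"``. The helper name in this sketch is
hypothetical:

.. code-block:: python

   import codecs


   def is_registered_codec(name):
       """Return True if *name* can be found in the codec registry."""
       try:
           codecs.lookup(name)
       except LookupError:
           return False
       return True


   utf8_known = is_registered_codec("utf-8")
   locale_known = is_registered_codec("locale")
   print(utf8_known, locale_known)

So ``"locale"`` is meaningful only to ``io.TextIOWrapper`` and the APIs
built on it, not to ``str.encode()`` or ``bytes.decode()``.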


Backward Compatibility
======================

The new warning is not emitted by default, so this PEP is 100%
backwards-compatible.


Forward Compatibility
=====================

Passing ``"locale"`` as the argument to ``encoding`` is not
forward-compatible. Code using it will not work on Python older than
3.10, and will instead raise ``LookupError: unknown encoding: locale``.

Until developers can drop Python 3.9 support, ``EncodingWarning``
can only be used for finding missing ``encoding="utf-8"`` arguments.
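
Code that must keep running on older versions could select the argument
at runtime. This is only a sketch: ``locale_encoding_argument()`` is a
hypothetical helper that this PEP does not provide, and its fallback
only approximates ``encoding=None`` because it skips the
``os.device_encoding()`` check:

.. code-block:: python

   import locale
   import sys


   def locale_encoding_argument():
       """Return a value for ``encoding=`` that selects the locale encoding.

       Hypothetical helper, not part of this PEP.
       """
       if sys.version_info >= (3, 10):
           return "locale"
       # Older versions reject "locale", so resolve the name manually.
       return locale.getpreferredencoding(False)

It would be used as ``open(path, encoding=locale_encoding_argument())``.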


How to Teach This
=================

For new users
-------------

Since ``EncodingWarning`` is used to write cross-platform code,
there is no need to teach it to new users.

We can just recommend using UTF-8 for text files and using
``encoding="utf-8"`` when opening them.


For experienced users
---------------------

Using ``open(filename)`` to read text files encoded in UTF-8 is a
common mistake. It may not work on Windows because UTF-8 is not the
default encoding.

You can use ``-X warn_default_encoding`` or
``PYTHONWARNDEFAULTENCODING=1`` to find this type of mistake.

Omitting the ``encoding`` argument is not a bug when opening text files
encoded in the locale encoding, but ``encoding="locale"`` is recommended
in Python 3.10 and later because it is more explicit.


Reference Implementation
========================

https://github.com/python/cpython/pull/19481


Discussions
===========

The latest discussion thread is:
https://mail.python.org/archives/list/python-dev@python.org/thread/SFYUP2TWD5JZ5KDLVSTZ44GWKVY4YNCV/


* Why not implement this in linters?

  * ``encoding="locale"`` and ``io.text_encoding()`` must be implemented
    in Python.

  * It is difficult to find all callers of functions wrapping
    ``open()`` or ``TextIOWrapper()`` (see the ``io.text_encoding()``
    section).

* Many developers will not use the option.

  * Some will, and report the warnings to libraries they use,
    so the option is worth it even if many developers don't enable it.

  * For example, I found [7]_ and [8]_ by running
    ``pip install -U pip``, and [9]_ by running ``tox``
    with the reference implementation. This demonstrates how this
    option can be used to find potential issues.


References
==========

.. [1] "Packages can't be installed when encoding is not UTF-8"
       (https://github.com/methane/pep597-pypi-ascii)

.. [2] "Logging - Inconsistent behaviour when handling unicode"
       (https://bugs.python.org/issue37111)

.. [3] Packaging tutorial in packaging.python.org didn't specify
       encoding to read a ``README.md``
       (https://github.com/pypa/packaging.python.org/pull/682)

.. [4] ``json.tool`` had used locale encoding to read JSON files.
       (https://bugs.python.org/issue33684)

.. [5] site: Potential UnicodeDecodeError when handling pth file
       (https://bugs.python.org/issue33684)

.. [6] pypa/pip: "Installing packages fails if Python 3 installed
       into path with non-ASCII characters"
       (https://github.com/pypa/pip/issues/9054)

.. [7] "site: Potential UnicodeDecodeError when handling pth file"
       (https://bugs.python.org/issue43214)

.. [8] "[pypa/pip] Use ``encoding`` option or binary mode for open()"
       (https://github.com/pypa/pip/pull/9608)

.. [9] "Possible UnicodeError caused by missing encoding="utf-8""
       (https://github.com/tox-dev/tox/issues/1908)


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   fill-column: 70
   coding: utf-8
   End: