Commit cc16be8

Issue #27781: Change file system encoding on Windows to UTF-8 (PEP 529)
1 parent cfbd48b commit cc16be8

18 files changed: +614 -832 lines changed


Doc/c-api/unicode.rst

Lines changed: 17 additions & 13 deletions
@@ -802,10 +802,11 @@ File System Encoding
 """"""""""""""""""""
 
 To encode and decode file names and other environment strings,
-:c:data:`Py_FileSystemEncoding` should be used as the encoding, and
-``"surrogateescape"`` should be used as the error handler (:pep:`383`). To
-encode file names during argument parsing, the ``"O&"`` converter should be
-used, passing :c:func:`PyUnicode_FSConverter` as the conversion function:
+:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
+:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
+(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
+argument parsing, the ``"O&"`` converter should be used, passing
+:c:func:`PyUnicode_FSConverter` as the conversion function:
 
 .. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)
 
@@ -820,8 +821,9 @@ used, passing :c:func:`PyUnicode_FSConverter` as the conversion function:
    .. versionchanged:: 3.6
       Accepts a :term:`path-like object`.
 
-To decode file names during argument parsing, the ``"O&"`` converter should be
-used, passing :c:func:`PyUnicode_FSDecoder` as the conversion function:
+To decode file names to :class:`str` during argument parsing, the ``"O&"``
+converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
+conversion function:
 
 .. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
 
@@ -840,7 +842,7 @@ used, passing :c:func:`PyUnicode_FSDecoder` as the conversion function:
 .. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
 
    Decode a string using :c:data:`Py_FileSystemDefaultEncoding` and the
-   ``"surrogateescape"`` error handler, or ``"strict"`` on Windows.
+   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
 
    If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
    locale encoding.
@@ -854,28 +856,28 @@ used, passing :c:func:`PyUnicode_FSDecoder` as the conversion function:
 
       The :c:func:`Py_DecodeLocale` function.
 
-   .. versionchanged:: 3.2
-      Use ``"strict"`` error handler on Windows.
+   .. versionchanged:: 3.6
+      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
 
 
 .. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
 
    Decode a null-terminated string using :c:data:`Py_FileSystemDefaultEncoding`
-   and the ``"surrogateescape"`` error handler, or ``"strict"`` on Windows.
+   and the :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
 
    If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
    locale encoding.
 
    Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
 
-   .. versionchanged:: 3.2
-      Use ``"strict"`` error handler on Windows.
+   .. versionchanged:: 3.6
+      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
 
 
 .. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
 
    Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
-   ``"surrogateescape"`` error handler, or ``"strict"`` on Windows, and return
+   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
    :class:`bytes`. Note that the resulting :class:`bytes` object may contain
    null bytes.
 
@@ -892,6 +894,8 @@ used, passing :c:func:`PyUnicode_FSDecoder` as the conversion function:
 
    .. versionadded:: 3.2
 
+   .. versionchanged:: 3.6
+      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
 
 wchar_t Support
 """""""""""""""

Doc/library/sys.rst

Lines changed: 40 additions & 11 deletions
@@ -428,25 +428,42 @@ always available.
 
 .. function:: getfilesystemencoding()
 
-   Return the name of the encoding used to convert Unicode filenames into
-   system file names. The result value depends on the operating system:
+   Return the name of the encoding used to convert between Unicode
+   filenames and bytes filenames. For best compatibility, str should be
+   used for filenames in all cases, although representing filenames as bytes
+   is also supported. Functions accepting or returning filenames should support
+   either str or bytes and internally convert to the system's preferred
+   representation.
 
-   * On Mac OS X, the encoding is ``'utf-8'``.
+   This encoding is always ASCII-compatible.
+
+   :func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
+   the correct encoding and errors mode are used.
 
-   * On Unix, the encoding is the user's preference according to the result of
-     nl_langinfo(CODESET).
+   * On Mac OS X, the encoding is ``'utf-8'``.
 
-   * On Windows NT+, file names are Unicode natively, so no conversion is
-     performed. :func:`getfilesystemencoding` still returns ``'mbcs'``, as
-     this is the encoding that applications should use when they explicitly
-     want to convert Unicode strings to byte strings that are equivalent when
-     used as file names.
+   * On Unix, the encoding is the locale encoding.
 
-   * On Windows 9x, the encoding is ``'mbcs'``.
+   * On Windows, the encoding may be ``'utf-8'`` or ``'mbcs'``, depending
+     on user configuration.
 
    .. versionchanged:: 3.2
       :func:`getfilesystemencoding` result cannot be ``None`` anymore.
 
+   .. versionchanged:: 3.6
+      Windows is no longer guaranteed to return ``'mbcs'``. See :pep:`529`
+      and :func:`_enablelegacywindowsfsencoding` for more information.
+
+.. function:: getfilesystemencodeerrors()
+
+   Return the name of the error mode used to convert between Unicode filenames
+   and bytes filenames. The encoding name is returned from
+   :func:`getfilesystemencoding`.
+
+   :func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
+   the correct encoding and errors mode are used.
+
+   .. versionadded:: 3.6
 
 .. function:: getrefcount(object)
 
@@ -1138,6 +1155,18 @@ always available.
       This function has been added on a provisional basis (see :pep:`411`
       for details.) Use it only for debugging purposes.
 
+.. function:: _enablelegacywindowsfsencoding()
+
+   Changes the default filesystem encoding and errors mode to 'mbcs' and
+   'replace' respectively, for consistency with versions of Python prior to 3.6.
+
+   This is equivalent to defining the :envvar:`PYTHONLEGACYWINDOWSFSENCODING`
+   environment variable before launching Python.
+
+   Availability: Windows
+
+   .. versionadded:: 3.6
+      See :pep:`529` for more details.
 
 .. data:: stdin
           stdout
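
The following Python sketch is not part of this commit. It illustrates how the two ``sys`` functions documented in the hunks above might be queried together, and spot-checks the "always ASCII-compatible" statement on a sample name. The helper names ``describe_fs_codec`` and ``is_ascii_compatible`` are hypothetical, and ``sys.getfilesystemencodeerrors()`` is assumed to be available (Python 3.6 or later).

```python
import sys


def describe_fs_codec():
    """Return the (encoding, errors) pair Python uses for file names."""
    encoding = sys.getfilesystemencoding()
    # getfilesystemencodeerrors() only exists on Python 3.6 and later.
    errors = sys.getfilesystemencodeerrors()
    return encoding, errors


def is_ascii_compatible(encoding):
    """Check that a plain ASCII name encodes to the same bytes as ASCII."""
    sample = "example-file_01.txt"
    return sample.encode(encoding) == sample.encode("ascii")


if __name__ == "__main__":
    encoding, errors = describe_fs_codec()
    print("encoding:", encoding)
    print("errors:", errors)
    print("ASCII sample unchanged:", is_ascii_compatible(encoding))
```

The printed values depend on the platform and configuration, so none are shown here. The check covers only one sample string and is not a proof of ASCII compatibility.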

Doc/using/cmdline.rst

Lines changed: 14 additions & 0 deletions
@@ -672,6 +672,20 @@ conflict.
       It now has no effect if set to an empty string.
 
 
+.. envvar:: PYTHONLEGACYWINDOWSFSENCODING
+
+   If set to a non-empty string, the default filesystem encoding and errors mode
+   will revert to their pre-3.6 values of 'mbcs' and 'replace', respectively.
+   Otherwise, the new defaults 'utf-8' and 'surrogatepass' are used.
+
+   This may also be enabled at runtime with
+   :func:`sys._enablelegacywindowsfsencoding()`.
+
+   Availability: Windows
+
+   .. versionadded:: 3.6
+      See :pep:`529` for more details.
+
 Debug-mode variables
 ~~~~~~~~~~~~~~~~~~~~
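
One way to observe the effect of the environment variable documented above is to start a child interpreter with and without it and compare the reported settings. The sketch below is not part of this commit. The helper name ``fs_codec_in_child`` and the value ``"1"`` are arbitrary choices for illustration (the documentation only requires a non-empty string). The variable is documented as Windows-only, so on other platforms both calls are expected to report the same pair.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: print the two filesystem codec settings.
CHILD_CODE = (
    "import sys; "
    "print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())"
)


def fs_codec_in_child(legacy):
    """Start a child Python and return its (encoding, errors) pair."""
    env = dict(os.environ)
    if legacy:
        # Any non-empty value requests the legacy behaviour (Windows only).
        env["PYTHONLEGACYWINDOWSFSENCODING"] = "1"
    else:
        env.pop("PYTHONLEGACYWINDOWSFSENCODING", None)
    output = subprocess.check_output([sys.executable, "-c", CHILD_CODE], env=env)
    encoding, errors = output.decode("ascii").split()
    return encoding, errors


if __name__ == "__main__":
    print("default:", fs_codec_in_child(legacy=False))
    print("legacy: ", fs_codec_in_child(legacy=True))
```

Running the child in a separate process matters because the variable is read when the interpreter starts. Changing ``os.environ`` inside an already running process would not affect that process.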

Doc/whatsnew/3.6.rst

Lines changed: 29 additions & 0 deletions
@@ -76,6 +76,8 @@ Security improvements:
 
 Windows improvements:
 
+* PEP 529: :ref:`Change Windows filesystem encoding to UTF-8 <pep-529>`
+
 * The ``py.exe`` launcher, when used interactively, no longer prefers
   Python 2 over Python 3 when the user doesn't specify a version (via
   command line arguments or a config file). Handling of shebang lines
@@ -218,6 +220,33 @@ evaluated at run time, and then formatted using the :func:`format` protocol.
 
 See :pep:`498` and the main documentation at :ref:`f-strings`.
 
+.. _pep-529:
+
+PEP 529: Change Windows filesystem encoding to UTF-8
+----------------------------------------------------
+
+Representing filesystem paths is best performed with str (Unicode) rather than
+bytes. However, there are some situations where using bytes is sufficient and
+correct.
+
+Prior to Python 3.6, data loss could result when using bytes paths on Windows.
+With this change, using bytes to represent paths is now supported on Windows,
+provided those bytes are encoded with the encoding returned by
+:func:`sys.getfilesystemencoding()`, which now defaults to ``'utf-8'``.
+
+Applications that do not use str to represent paths should use
+:func:`os.fsencode()` and :func:`os.fsdecode()` to ensure their bytes are
+correctly encoded. To revert to the previous behaviour, set
+:envvar:`PYTHONLEGACYWINDOWSFSENCODING` or call
+:func:`sys._enablelegacywindowsfsencoding`.
+
+See :pep:`529` for more information and discussion of code modifications that
+may be required.
+
+.. note::
+
+   This change is considered experimental for 3.6.0 beta releases. The default
+   encoding may change before the final release.
 
 PEP 487: Simpler customization of class creation
 ------------------------------------------------

Include/fileobject.h

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ PyAPI_FUNC(char *) Py_UniversalNewlineFgets(char *, int, FILE*, PyObject *);
    If non-NULL, this is different than the default encoding for strings
 */
 PyAPI_DATA(const char *) Py_FileSystemDefaultEncoding;
+PyAPI_DATA(const char *) Py_FileSystemDefaultEncodeErrors;
 PyAPI_DATA(int) Py_HasFileSystemDefaultEncoding;
 
 /* Internal API

Include/unicodeobject.h

Lines changed: 2 additions & 6 deletions
@@ -103,10 +103,6 @@ typedef wchar_t Py_UNICODE;
 # endif
 #endif
 
-#if defined(MS_WINDOWS)
-# define HAVE_MBCS
-#endif
-
 #ifdef HAVE_WCHAR_H
 /* Work around a cosmetic bug in BSDI 4.x wchar.h; thanks to Thomas Wouters */
 # ifdef _HAVE_BSDI
@@ -1657,7 +1653,7 @@ PyAPI_FUNC(PyObject *) PyUnicode_TranslateCharmap(
     );
 #endif
 
-#ifdef HAVE_MBCS
+#ifdef MS_WINDOWS
 
 /* --- MBCS codecs for Windows -------------------------------------------- */
 
@@ -1700,7 +1696,7 @@ PyAPI_FUNC(PyObject*) PyUnicode_EncodeCodePage(
     const char *errors /* error handling */
     );
 
-#endif /* HAVE_MBCS */
+#endif /* MS_WINDOWS */
 
 /* --- Decimal Encoder ---------------------------------------------------- */

Lib/os.py

Lines changed: 1 addition & 4 deletions
@@ -851,10 +851,7 @@ def getenvb(key, default=None):
 
 def _fscodec():
     encoding = sys.getfilesystemencoding()
-    if encoding == 'mbcs':
-        errors = 'strict'
-    else:
-        errors = 'surrogateescape'
+    errors = sys.getfilesystemencodeerrors()
 
     def fsencode(filename):
         """Encode filename (an os.PathLike, bytes, or str) to the filesystem
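
The hunk above shows only the start of ``_fscodec()``. As a rough, simplified sketch of the closure pattern it uses (not the actual ``os`` module source), the helpers below read the encoding and error handler once and reuse them for both directions. The name ``make_fscodec`` is hypothetical, and Python 3.6 or later is assumed for ``os.fspath()`` and ``sys.getfilesystemencodeerrors()``.

```python
import os
import sys


def make_fscodec():
    """Build fsencode/fsdecode helpers in the style of os._fscodec()."""
    encoding = sys.getfilesystemencoding()
    errors = sys.getfilesystemencodeerrors()  # Python 3.6+

    def fsencode(filename):
        """Return filename as bytes, encoding str with the filesystem codec."""
        filename = os.fspath(filename)  # accepts str, bytes, or os.PathLike
        if isinstance(filename, str):
            return filename.encode(encoding, errors)
        return filename

    def fsdecode(filename):
        """Return filename as str, decoding bytes with the filesystem codec."""
        filename = os.fspath(filename)
        if isinstance(filename, bytes):
            return filename.decode(encoding, errors)
        return filename

    return fsencode, fsdecode


if __name__ == "__main__":
    fsencode, fsdecode = make_fscodec()
    encoded = fsencode("report.txt")
    print(encoded, fsdecode(encoded))
```

Because both values now come from ``sys``, the helpers no longer need the platform-specific ``'mbcs'`` branch that this commit removes. Application code should still prefer ``os.fsencode()`` and ``os.fsdecode()`` over a hand-rolled version like this one.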
