Commit 9a5ee89

Removed the note and improved existing description based on the discussion in the issue
1 parent 96f31ef commit 9a5ee89

1 file changed: 14 additions, 26 deletions

Doc/library/codecs.rst

Lines changed: 14 additions & 26 deletions
@@ -990,32 +990,20 @@ code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-
-.. note::
-
-   **Python UTF-16 and UTF-32 Codec Behavior**
-
-   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
-   byte order suffix like ``-BE`` or ``-LE``) follow the platform's native
-   byte order when no BOM is present. This differs from the Unicode Standard
-   specification, which states that UTF-16 and UTF-32 encoding schemes should
-   default to big-endian byte order when no BOM is present and no higher-level
-   protocol specifies the byte order.
-
-   This behavior was chosen for practical compatibility reasons, as it avoids
-   byte swapping on the most common platforms, but developers should be aware
-   of this difference when exchanging data with systems that strictly follow
-   the Unicode specification.
+will always have to swap bytes on encoding and decoding.
+Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
+order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
+``-LE`` suffix) behaves the same way. Python follows prevailing platform
+practice so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+then bytes have to be swapped though. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, there's the so called BOM ("Byte Order Mark").
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+So when the first character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to be
+a ``U+FFFE`` the bytes have to be swapped on decoding.
 
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
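
The codec behavior described in the revised paragraph can be checked with a short Python sketch. It relies only on the standard ``codecs`` and ``sys`` modules; the sample string and the variable names are illustrative assumptions and are not part of the commit.

```python
import codecs
import sys

text = "A"

# Codecs with an explicit byte order never write a BOM.
be = text.encode("utf-32-be")   # b'\x00\x00\x00A'
le = text.encode("utf-32-le")   # b'A\x00\x00\x00'

# The plain codec writes a BOM followed by data in native byte order.
little = sys.byteorder == "little"
native_bom = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE
native_bytes = le if little else be
encoded = text.encode("utf-32")
assert encoded == native_bom + native_bytes

# With a BOM present, decoding follows the BOM on any platform.
assert (codecs.BOM_UTF32_BE + be).decode("utf-32") == text
assert (codecs.BOM_UTF32_LE + le).decode("utf-32") == text

# Without a BOM, the plain codec assumes native byte order,
# which is the deviation from the Unicode Standard's big-endian default.
assert native_bytes.decode("utf-32") == text
```

The same checks should hold for ``utf-16`` with ``codecs.BOM_UTF16_LE`` and ``codecs.BOM_UTF16_BE``. When exchanging BOM-less data with other systems, naming the byte order explicitly (``utf-32-be`` or ``utf-32-le``) avoids depending on the platform.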
