@@ -990,32 +990,20 @@ code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-
-.. note::
-
-   **Python UTF-16 and UTF-32 Codec Behavior**
-
-   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
-   byte order suffix like ``-BE`` or ``-LE``) follow the platform's native
-   byte order when no BOM is present. This differs from the Unicode Standard
-   specification, which states that UTF-16 and UTF-32 encoding schemes should
-   default to big-endian byte order when no BOM is present and no higher-level
-   protocol specifies the byte order.
-
-   This behavior was chosen for practical compatibility reasons, as it avoids
-   byte swapping on the most common platforms, but developers should be aware
-   of this difference when exchanging data with systems that strictly follow
-   the Unicode specification.
+will always have to swap bytes on encoding and decoding.
+Python's ``UTF-32`` codec avoids this problem by using the platform's native byte
+order when no BOM is present. The plain ``UTF-16`` codec (without a ``-BE`` or
+``-LE`` suffix) behaves the same way. Python follows prevailing platform
+practice so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+then bytes have to be swapped though. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, there's the so-called BOM ("Byte Order Mark").
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+So when the first character in a ``UTF-16`` or ``UTF-32`` byte sequence appears to be
+a ``U+FFFE`` the bytes have to be swapped on decoding.
 
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
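The codec behavior described in this hunk can be checked with a short sketch. It assumes CPython's built-in codecs; the sample string and all variable names are illustrative and not part of the patch.

```python
import codecs
import sys

text = "hi"  # illustrative sample, not taken from the patch

# Plain "utf-16" / "utf-32": the encoder writes a BOM, then native-order bytes.
encoded16 = text.encode("utf-16")
encoded32 = text.encode("utf-32")
has_native_bom16 = encoded16.startswith(codecs.BOM_UTF16)  # native-order BOM
has_native_bom32 = encoded32.startswith(codecs.BOM_UTF32)

# Without a BOM, the plain codecs assume the platform's native byte order.
suffix = "le" if sys.byteorder == "little" else "be"
raw16 = text.encode("utf-16-" + suffix)  # explicit suffix: no BOM is written
raw32 = text.encode("utf-32-" + suffix)
decoded16 = raw16.decode("utf-16")
decoded32 = raw32.decode("utf-32")

# With a BOM, the plain codec follows it, whichever byte order it signals.
from_be = (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + text.encode("utf-16-le")).decode("utf-16")
```

When exchanging BOM-less data with a system that follows the Unicode Standard's big-endian default, naming the byte order explicitly (``utf-16-be`` or ``utf-32-be``) avoids relying on the platform-dependent fallback.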