@@ -982,17 +982,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem: they encode in the
+platform's native byte order and assume that order when decoding data without a BOM.
+This follows prevailing platform
+practice, so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+the bytes have to be swapped. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
+This is the Unicode character ``U+FEFF``. It can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte-swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence appears to
+be ``U+FFFE``, the bytes have to be swapped on decoding.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
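The behaviour described in the changed paragraph can be checked with a short script. This is only an illustrative sketch, not part of the patch: the helper name ``detect_utf16_byte_order`` is hypothetical, and it covers ``UTF-16`` only, because the ``UTF-32`` little-endian BOM begins with the same two bytes and would need a separate check.

```python
import codecs
import sys


def detect_utf16_byte_order(data: bytes) -> str:
    """Return "big" or "little" for a UTF-16 byte sequence.

    Hypothetical helper for illustration: it looks for a leading BOM and,
    if none is present, falls back to the platform's native byte order,
    mirroring what the paragraph says about Python's ``UTF-16`` codec.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE (swapped BOM)
        return "little"
    return sys.byteorder


text = "abc"

# The generic codec prepends a BOM and writes the data in native order.
encoded = text.encode("utf-16")
assert detect_utf16_byte_order(encoded) == sys.byteorder

# The explicit-endian codecs write no BOM at all.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

# With a BOM, the generic decoder honours it (and strips it) on any platform.
assert (codecs.BOM_UTF16_BE + big).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + little).decode("utf-16") == text

# Without a BOM, the generic decoder assumes native byte order.
native = little if sys.byteorder == "little" else big
assert native.decode("utf-16") == text
```

Data whose byte order is known in advance is better decoded with ``utf-16-be`` or ``utf-16-le`` directly, since the fallback to native order makes the result of decoding BOM-less data platform dependent.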