Commit 96f31ef

gh-128571: Document UTF-16/32 native byte order
1 parent d6dd64a commit 96f31ef

1 file changed: +17 -0 lines changed

Doc/library/codecs.rst

Lines changed: 17 additions & 0 deletions
@@ -1000,6 +1000,23 @@ byte sequence. The byte swapped version of this character (``0xFFFE``) is an
 illegal character that may not appear in a Unicode text. So when the
 first character in a ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+
+.. note::
+
+   **Python UTF-16 and UTF-32 Codec Behavior**
+
+   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
+   byte order suffix like ``-BE`` or ``-LE``) follow the platform's native
+   byte order when no BOM is present. This differs from the Unicode Standard
+   specification, which states that UTF-16 and UTF-32 encoding schemes should
+   default to big-endian byte order when no BOM is present and no higher-level
+   protocol specifies the byte order.
+
+   This behavior was chosen for practical compatibility reasons, as it avoids
+   byte swapping on the most common platforms, but developers should be aware
+   of this difference when exchanging data with systems that strictly follow
+   the Unicode specification.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
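
The behavior described in the added note can be checked with a short script. The following is only a sketch, not part of the commit: it compares the plain ``utf-16`` codec with the explicit ``utf-16-be`` and ``utf-16-le`` codecs, and the helper ``decode_utf16_unicode_default`` is a hypothetical name introduced here to illustrate one way of applying the Unicode Standard's big-endian default to BOM-less input.

```python
import codecs
import sys

text = "hi"

# Explicit suffixes fix the byte order and never write a BOM.
be = text.encode("utf-16-be")  # b'\x00h\x00i'
le = text.encode("utf-16-le")  # b'h\x00i\x00'

# Encoding with the plain codec prepends a BOM in the native byte order.
with_bom = text.encode("utf-16")
native_bom = (
    codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
)

# Decoding BOM-less bytes with the plain codec assumes native byte order,
# so big-endian data only round-trips on a big-endian platform.
be_roundtrips = be.decode("utf-16") == text


def decode_utf16_unicode_default(data: bytes) -> str:
    """Hypothetical helper: honor a BOM if present, else assume big-endian."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # The plain codec reads the BOM, picks the byte order, and strips it.
        return data.decode("utf-16")
    # No BOM: fall back to the Unicode Standard's big-endian default.
    return data.decode("utf-16-be")
```

When exchanging BOM-less data with other systems, naming the byte order explicitly (``utf-16-be`` or ``utf-16-le``) sidesteps the platform dependence entirely. The same reasoning applies to ``utf-32``.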
