@@ -1000,6 +1000,23 @@ byte sequence. The byte swapped version of this character (``0xFFFE``) is an
 illegal character that may not appear in a Unicode text. So when the
 first character in a ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+
+.. note::
+
+   **Python UTF-16 and UTF-32 Codec Behavior**
+
+   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
+   byte order suffix like ``-BE`` or ``-LE``) assume the platform's native
+   byte order when decoding data that has no BOM. This differs from the
+   Unicode Standard, which states that the UTF-16 and UTF-32 encoding schemes
+   should default to big-endian byte order when no BOM is present and no
+   higher-level protocol specifies the byte order.
+
+   This behavior was chosen for practical compatibility reasons, as it avoids
+   byte swapping on the most common platforms. Developers should be aware
+   of this difference when exchanging data with systems that strictly follow
+   the Unicode Standard.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
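
The codec behavior described in the note added by this hunk can be checked with a short sketch. It is illustrative only and not part of the patch: it compares BOM-less decoding with the plain ``utf-16`` codec against ``sys.byteorder``, and shows one possible way to get the big-endian default instead. The helper name ``decode_utf16_be_default`` is hypothetical, not a standard library function.

```python
import codecs
import sys

text = "A"

# Codecs with an explicit byte order suffix never write a BOM.
be_bytes = text.encode("utf-16-be")  # big-endian code units, no BOM
le_bytes = text.encode("utf-16-le")  # little-endian code units, no BOM

# Select the BOM-less bytes that match this machine's native byte order.
native_bytes = le_bytes if sys.byteorder == "little" else be_bytes
native_bom = (
    codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
)

# Plain "utf-16" decodes BOM-less input using the native byte order.
decoded_native = native_bytes.decode("utf-16")

# Plain "utf-16" encoding prepends a BOM in the native byte order.
encoded = text.encode("utf-16")


def decode_utf16_be_default(data: bytes) -> str:
    """Hypothetical helper: honour a BOM if present, else assume big-endian."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # The plain codec consumes the BOM and uses the order it signals.
        return data.decode("utf-16")
    # No BOM: fall back to big-endian rather than the native order.
    return data.decode("utf-16-be")
```

When exchanging BOM-less data with systems that assume big-endian, naming the byte order explicitly (``utf-16-be`` or ``utf-16-le``) avoids depending on the platform the code happens to run on.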