Non-streaming decode() appears to remove the BOM? #88

dralley · 2022-09-05T22:29:13Z

Lines 2974 to 2980 in d4d7d2a

    
           pub fn decode<'a>(&'static self, bytes: &'a [u8]) -> (Cow<'a, str>, &'static Encoding, bool) { 
        
               let (encoding, without_bom) = match Encoding::for_bom(bytes) { 
        
                   Some((encoding, bom_length)) => (encoding, &bytes[bom_length..]), 
        
                   None => (self, bytes), 
        
               }; 
        
               let (cow, had_errors) = encoding.decode_without_bom_handling(without_bom); 
        
               (cow, encoding, had_errors)

Functionally, decode() and decode_with_bom_removal() seem pretty much the same? That doesn't seem correct? If there's a variant called "decode_with_bom_removal" then I would expect the standard variant not to remove the BOM.

Compare to:

encoding_rs/src/lib.rs

Lines 3019 to 3030 in d4d7d2a

    
           pub fn decode_with_bom_removal<'a>(&'static self, bytes: &'a [u8]) -> (Cow<'a, str>, bool) { 
        
               let without_bom = if self == UTF_8 && bytes.starts_with(b"\xEF\xBB\xBF") { 
        
                   &bytes[3..] 
        
               } else if (self == UTF_16LE && bytes.starts_with(b"\xFF\xFE")) 
        
                   || (self == UTF_16BE && bytes.starts_with(b"\xFE\xFF")) 
        
               { 
        
                   &bytes[2..] 
        
               } else { 
        
                   bytes 
        
               }; 
        
               self.decode_without_bom_handling(without_bom) 
        
           }

It's totally valid to decode the BOM, the BOM is a unicode character like any other character. Decoding a UTF-16 document with a BOM should yield a UTF-8 document with a BOM. Otherwise, you would just use the BOM-removing version...

use encoding_rs::*;

fn main() {
    // Two characters, '1' and then BOM character
    println!("{:?}", UTF_16LE.decode(&[0x31, 0x00, 0xFF, 0xFE]).0.as_bytes()); 
    // Nothing - BOM removed
    println!("{:?}", UTF_16LE.decode(&[0xFF, 0xFE]).0.as_bytes());
}

[49, 239, 187, 191]
[]

The text was updated successfully, but these errors were encountered:

dralley · 2022-09-05T22:51:18Z

The same thing happens with the streaming API. Am I missing something? Why does the decoder produced by new_decoder() strip the BOM when there is a separate method named new_decoder_with_bom_removal() that I am explicitly not using?

use encoding_rs::*; 

fn main() {
  
    let mut buf = [0; 100];
    
    dbg!(UTF_16LE.new_decoder().decode_to_utf8(
        &[0x31, 0x00, 0xFF, 0xFE],
        &mut buf,
        true,
    ));
    
    println!("{:?}", &buf);
    
    let mut buf = [0; 100];
    
    dbg!(UTF_16LE.new_decoder().decode_to_utf8(
        &[0xFF, 0xFE],
        &mut buf,
        true,
    ));
    
    println!("{:?}", &buf);
}

[src/main.rs:11] UTF_16LE.new_decoder().decode_to_utf8(&[0x31, 0x00, 0xFF, 0xFE], &mut buf,
    true) = (
    InputEmpty,
    4,
    4,
    false,
)
[src/main.rs:21] UTF_16LE.new_decoder().decode_to_utf8(&[0xFF, 0xFE], &mut buf, true) = (
    InputEmpty,
    2,
    0,
    false,
)

[49, 239, 187, 191, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

dralley · 2022-09-09T15:28:56Z

@hsivonen Is there a detail that I'm missing?

dralley · 2022-09-14T19:00:11Z

I could try to put up a PR, I just want to verify that there's not a reason behind this

hsivonen · 2022-09-15T13:20:07Z

decode changes the encoding that's used according to the BOM and removes the BOM.
decode_with_bom_removal removes the (leading) BOM if it matches the BOM for the encoding but doesn't change the encoding.
decode_without_bom_handling gives no special treatment to the BOM.

AFAICT, these are working as documented.

dralley · 2022-09-15T13:22:29Z

@hsivonen If what you say is true then the name of decode_with_bom_removal() doesn't indicate the actual difference between decode_with_bom_removal() and decode(). At minimum that is highly confusing.

It sounds like there is no way to respect the BOM (use the BOM-indicated encoding) without removing it, which is problematic - it's valid unicode, and sometimes you do want to keep it. For instance, if the data came from UTF-16 LE, and you want to perform some processing on it as UTF-8, before re-encoding it as UTF-16 LE. You could write the BOM separately, but that is sometimes less convenient than simply leaving it there to begin with.

dralley · 2022-09-15T13:24:28Z

Furthermore I don't see any documentation indicating that decode() removes the BOM, could you point out where it is documented?

dralley · 2022-09-16T16:07:37Z

So ignoring the documentation issue, I can understand why this is the case due to the way the WHATWG spec defines how decode should be done (though the name "_with_bom_removal()" is still misleading and doesn't quite match what the spec does).

Would you be open to merging function which did not do this?

hsivonen · 2022-09-20T05:51:25Z

Furthermore I don't see any documentation indicating that decode() removes the BOM, could you point out where it is documented?

Good point. It says "BOM sniffing" without saying explicitly that BOM sniffing means BOM removal.

What's the use case for deciding the encoding from BOM but still letting the BOM show through to output?

dralley · 2022-09-20T12:33:17Z

I'm working on improving the encoding support of quick-xml. One idea I had is that a user may want to get either:

a normalized series of XML events, which strips any non-meaningful whitespace including anything prior to the XML delcaration (including the BOM), or
a series of XML events, which would include whitespace between elements (and the BOM), which if passed back into a writer would exactly reproduce the document

The latter would potentially simplify making edits to existing documents, given that some software wants to see a UTF-8 BOM even though the spec recommends against it.

In order to pull that off, the BOM would need to not be stripped automatically. Whether it gets removed or not (alongside other things) would be handled by the Reader.

I'm open to a counterargument that it's not a good idea, though. One potential issue with the use case I just described is that there would be no way to automatically handle indentation for new additions into the document, because automatic indentation would need to be turned off to "exactly reproduce" everything else. But I think we would still need to know whether the BOM was present, at least.

hsivonen · 2022-09-20T12:42:59Z

The BOM isn't part of the XML information set, so an XML use case hasn't arisen before. Does you exact reproduction mode also retain whitespace on either side of the equals sign for attributes, etc.?

Since this issue is filed against the non-streaming API, you can easily find out if there was a BOM by inspecting the buffer with Encoding::for_bom().

dralley · 2022-09-20T12:56:11Z

Well, the streaming API does the same thing as mentioned in the first comment after the OP, I just first encountered this while doing some experimentation. The streaming API is what we would be using in practice.

The BOM isn't part of the XML information set

It's not, but neither is inter-element whitespace or any data prior to the XML declaration, hence why I figured it might make sense to treat them similarly. The BOM is relevant to some (mostly Windows) software even though, for UTF-8, it shouldn't be.

Does you exact reproduction mode also retain whitespace on either side of the equals sign for attributes, etc.?

At the moment this is just a concept, the actual encoding support is more important and is the main focus. You're right that it's non-trivial and would have to encompass a lot more than just whitespace between events. But I think that would be possible. Whether it's actually worthwhile, I'm not sure yet.

hsivonen · 2022-09-20T13:06:38Z

Well, the streaming API does the same thing as mentioned in the first comment after the OP, I just first encountered this while doing some experimentation. The streaming API is what we would be using in practice.

I see. Still, I think at least for now, I'm going to continue to treat this as out of scope, since the BOM with BOM semantics isn't supposed to be part of the logical content of the stream.

dralley · 2022-09-20T14:55:09Z

the BOM with BOM semantics isn't supposed to be part of the logical content of the stream.

This is the part I'm unsure on. Nothing about the BOM is particularly special, it's just a unicode character, etc. Isn't it just part of the document, then? Unicode says it should present as an invisible zero-width non-breaking space, which implies that it is expected to make it through to the presentation layer sometimes, and not just be stripped off in all cases.

Various documents I've read don't spell it out explicitly but they make it seem like part of the data stream itself, more akin to the #! shebang line at the beginning of a Python/Perl/Bash script than magic bytes at the beginning of a zip file. In one sense, the shebang line isn't "part of the script", but in another sense, it's just as much a part of the script as any other comment in the script.

https://www.w3.org/International/questions/qa-byte-order-mark
https://unicode.org/faq/utf_bom.html#bom1

hsivonen · 2022-09-21T06:31:22Z

See The Unicode Standard 15.0 D98 second bullet last sentence on page 131 (PDF page 157). See also the note in The Encoding Standard that explains that the naming doesn't match the naming in The Unicode Standard.

Note that the XML spec says of the BOM: "This is an encoding signature, not part of either the markup or the character data of the XML document." Also, the BNF in the XML spec doesn't consider the BOM part of the textual syntax.

dralley · 2022-09-23T02:05:33Z

That appears to be correct for UTF-16. Regarding UTF-8, the bullet points under D95 on the previous page are frustratingly unclear - it has no such language and actually seems to imply the opposite.

While there is obviously no need for a byte order signature when using UTF-8,
there are occasions when processes convert UTF-16 or UTF-32 data contain-
ing a byte order mark into UTF-8. When represented in UTF-8, the byte order
mark turns into the byte sequence . Its usage at the beginning of a
UTF-8 data stream is neither required nor recommended by the Unicode Stan-
dard, but its presence does not affect conformance to the UTF-8 encoding
scheme. Identification of the byte sequence at the beginning of a
data stream can, however, be taken as a near-certain indication that the data
stream is using the UTF-8 encoding scheme.

But considering I really don't feel like lawyering the spec to that degree, point taken, we should probably just accept that the BOM will be stripped and force the user to write it themselves on the output side, if they want it.

I will file a new issue about documentation.

dralley mentioned this issue Sep 14, 2022

[work-in-progress] Decoding BufReader implementation tafia/quick-xml#441

Draft

hsivonen closed this as completed Sep 15, 2022

dralley mentioned this issue Sep 23, 2022

Better document the behavior of decoding functions with respect to the BOM #89

Closed

BurntSushi mentioned this issue Feb 27, 2023

Added a method for returning the Decoder's current input encoding, and fix bugs BurntSushi/encoding_rs_io#11

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-streaming decode() appears to remove the BOM? #88

Non-streaming decode() appears to remove the BOM? #88

dralley commented Sep 5, 2022 •

edited

Loading

dralley commented Sep 5, 2022 •

edited

Loading

dralley commented Sep 9, 2022

dralley commented Sep 14, 2022

hsivonen commented Sep 15, 2022

dralley commented Sep 15, 2022 •

edited

Loading

dralley commented Sep 15, 2022

dralley commented Sep 16, 2022 •

edited

Loading

hsivonen commented Sep 20, 2022

dralley commented Sep 20, 2022 •

edited

Loading

hsivonen commented Sep 20, 2022

dralley commented Sep 20, 2022 •

edited

Loading

hsivonen commented Sep 20, 2022

dralley commented Sep 20, 2022 •

edited

Loading

hsivonen commented Sep 21, 2022

dralley commented Sep 23, 2022 •

edited

Loading

Non-streaming decode() appears to remove the BOM? #88

Non-streaming decode() appears to remove the BOM? #88

Comments

dralley commented Sep 5, 2022 • edited Loading

dralley commented Sep 5, 2022 • edited Loading

dralley commented Sep 9, 2022

dralley commented Sep 14, 2022

hsivonen commented Sep 15, 2022

dralley commented Sep 15, 2022 • edited Loading

dralley commented Sep 15, 2022

dralley commented Sep 16, 2022 • edited Loading

hsivonen commented Sep 20, 2022

dralley commented Sep 20, 2022 • edited Loading

hsivonen commented Sep 20, 2022

dralley commented Sep 20, 2022 • edited Loading

hsivonen commented Sep 20, 2022

dralley commented Sep 20, 2022 • edited Loading

hsivonen commented Sep 21, 2022

dralley commented Sep 23, 2022 • edited Loading

dralley commented Sep 5, 2022 •

edited

Loading

dralley commented Sep 5, 2022 •

edited

Loading

dralley commented Sep 15, 2022 •

edited

Loading

dralley commented Sep 16, 2022 •

edited

Loading

dralley commented Sep 20, 2022 •

edited

Loading

dralley commented Sep 20, 2022 •

edited

Loading

dralley commented Sep 20, 2022 •

edited

Loading

dralley commented Sep 23, 2022 •

edited

Loading