Non-streaming decode() appears to remove the BOM? #88
Comments
The same thing happens with the streaming API. Am I missing something? Why does the decoder produced by
    use encoding_rs::*;

    fn main() {
        let mut buf = [0; 100];
        dbg!(UTF_16LE.new_decoder().decode_to_utf8(
            &[0x31, 0x00, 0xFF, 0xFE],
            &mut buf,
            true,
        ));
        println!("{:?}", &buf);

        let mut buf = [0; 100];
        dbg!(UTF_16LE.new_decoder().decode_to_utf8(
            &[0xFF, 0xFE],
            &mut buf,
            true,
        ));
        println!("{:?}", &buf);
    }
|
@hsivonen Is there a detail that I'm missing? |
I could try to put up a PR, I just want to verify that there's not a reason behind this |
AFAICT, these are working as documented. |
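For readers following along, here is a rough std-only model of the behavior the two calls above exercise: a UTF-16LE BOM is dropped only when it is the very first code unit, while the same bytes later in the input decode to an ordinary U+FEFF. The function name `decode_utf16le_strip_leading_bom` is hypothetical and this is not how encoding_rs is implemented. It also ignores the UTF-8 and UTF-16BE BOMs, which a sniffing decoder would have to consider as well.

```rust
/// Hypothetical helper (not part of encoding_rs): decodes UTF-16LE bytes to a
/// String, dropping a BOM only when it is the very first code unit.
fn decode_utf16le_strip_leading_bom(bytes: &[u8]) -> String {
    // Drop FF FE only at offset 0; anywhere else the same bytes are ordinary text.
    let body = if bytes.starts_with(&[0xFF, 0xFE]) {
        &bytes[2..]
    } else {
        bytes
    };
    // Pair the bytes into little-endian code units.
    // Simplification: a trailing odd byte is silently ignored.
    let units = body
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    // Unpaired surrogates become U+FFFD.
    char::decode_utf16(units)
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Under this model, the input `[0x31, 0x00, 0xFF, 0xFE]` keeps its trailing U+FEFF because it is not at the start, while `[0xFF, 0xFE]` alone decodes to an empty string.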
@hsivonen If what you say is true then the name of It sounds like there is no way to respect the BOM (use the BOM-indicated encoding) without removing it, which is problematic: it's valid Unicode, and sometimes you do want to keep it. For instance, if the data came from UTF-16 LE, and you want to perform some processing on it as UTF-8, before re-encoding it as UTF-16 LE. You could write the BOM separately, but that is sometimes less convenient than simply leaving it there to begin with. |
Furthermore I don't see any documentation indicating that |
So ignoring the documentation issue, I can understand why this is the case due to the way the WHATWG spec defines how Would you be open to merging a function that did not do this? |
Good point. It says "BOM sniffing" without saying explicitly that BOM sniffing means BOM removal. What's the use case for deciding the encoding from BOM but still letting the BOM show through to output? |
I'm working on improving the encoding support of
The latter would potentially simplify making edits to existing documents, given that some software wants to see a UTF-8 BOM even though the spec recommends against it. In order to pull that off, the BOM would need to not be stripped automatically. Whether it gets removed or not (alongside other things) would be handled by the Reader. I'm open to a counterargument that it's not a good idea, though. One potential issue with the use case I just described is that there would be no way to automatically handle indentation for new additions into the document, because automatic indentation would need to be turned off to "exactly reproduce" everything else. But I think we would still need to know whether the BOM was present, at least. |
The BOM isn't part of the XML information set, so an XML use case hasn't arisen before. Does your exact reproduction mode also retain whitespace on either side of the equals sign for attributes, etc.? Since this issue is filed against the non-streaming API, you can easily find out if there was a BOM by inspecting the buffer with |
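As an illustration of checking for a BOM before handing the buffer to a decoder, a minimal std-only sketch might look like the following. The `Bom` enum and `sniff_bom` function are hypothetical names made up for this example and are not taken from the thread or from encoding_rs.

```rust
/// Hypothetical enum naming the three BOMs relevant to this discussion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Bom {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Hypothetical helper: returns the BOM kind and its length in bytes
/// if the buffer starts with one, or None otherwise.
fn sniff_bom(buf: &[u8]) -> Option<(Bom, usize)> {
    if buf.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Bom::Utf8, 3))
    } else if buf.starts_with(&[0xFF, 0xFE]) {
        Some((Bom::Utf16Le, 2))
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Some((Bom::Utf16Be, 2))
    } else {
        None
    }
}
```

A reader that wants to reproduce the input exactly could record the result of such a check up front and let the decoder strip the BOM as usual.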
Well, the streaming API does the same thing as mentioned in the first comment after the OP, I just first encountered this while doing some experimentation. The streaming API is what we would be using in practice.
It's not, but neither is inter-element whitespace or any data prior to the XML declaration, hence why I figured it might make sense to treat them similarly. The BOM is relevant to some (mostly Windows) software even though, for UTF-8, it shouldn't be.
At the moment this is just a concept, the actual encoding support is more important and is the main focus. You're right that it's non-trivial and would have to encompass a lot more than just whitespace between events. But I think that would be possible. Whether it's actually worthwhile, I'm not sure yet. |
I see. Still, I think at least for now, I'm going to continue to treat this as out of scope, since the BOM with BOM semantics isn't supposed to be part of the logical content of the stream. |
This is the part I'm unsure about. Nothing about the BOM is particularly special, it's just a Unicode character, etc. Isn't it just part of the document, then? Unicode says it should present as an invisible zero-width non-breaking space, which implies that it is expected to make it through to the presentation layer sometimes, and not just be stripped off in all cases. Various documents I've read don't spell it out explicitly but they make it seem like part of the data stream itself, more akin to the https://www.w3.org/International/questions/qa-byte-order-mark |
See The Unicode Standard 15.0 D98 second bullet last sentence on page 131 (PDF page 157). See also the note in The Encoding Standard that explains that the naming doesn't match the naming in The Unicode Standard. Note that the XML spec says of the BOM: "This is an encoding signature, not part of either the markup or the character data of the XML document." Also, the BNF in the XML spec doesn't consider the BOM part of the textual syntax. |
That appears to be correct for UTF-16. Regarding UTF-8, the bullet points under D95 on the previous page are frustratingly unclear - it has no such language and actually seems to imply the opposite.
But considering I really don't feel like lawyering the spec to that degree, point taken, we should probably just accept that the BOM will be stripped and force the user to write it themselves on the output side, if they want it. I will file a new issue about documentation. |
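For anyone landing here later, writing the BOM on the output side is not much code. Here is a minimal std-only sketch for the UTF-16 LE round-trip case mentioned earlier in the thread. The function name `encode_utf16le` and the `write_bom` flag are hypothetical, chosen only for this example.

```rust
/// Hypothetical helper: encodes `text` as UTF-16LE bytes, optionally
/// prefixed with the UTF-16LE BOM (FF FE).
fn encode_utf16le(text: &str, write_bom: bool) -> Vec<u8> {
    // Each UTF-8 byte yields at most one UTF-16 code unit (2 bytes),
    // so this capacity is an upper bound.
    let mut out = Vec::with_capacity(2 + text.len() * 2);
    if write_bom {
        out.extend_from_slice(&[0xFF, 0xFE]);
    }
    for unit in text.encode_utf16() {
        // Emit each code unit in little-endian byte order.
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}
```

Whether to pass `true` would be decided by the caller, for example based on whether the original input started with a BOM.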
encoding_rs/src/lib.rs
Lines 2974 to 2980 in d4d7d2a
Functionally, decode() and decode_with_bom_removal() seem pretty much the same? That doesn't seem correct? If there's a variant called "decode_with_bom_removal" then I would expect the standard variant not to remove the BOM.
Compare to:
encoding_rs/src/lib.rs
Lines 3019 to 3030 in d4d7d2a
It's totally valid to decode the BOM; the BOM is a Unicode character like any other character. Decoding a UTF-16 document with a BOM should yield a UTF-8 document with a BOM. Otherwise, you would just use the BOM-removing version...