Strip BOM from the event stream, add a method to Writer for writing BOM #459

dralley · 2022-08-17T03:14:24Z

No description provided.

dralley · 2022-08-17T03:27:33Z

What I actually want to do: provide either a trim_before_first_element configuration option, which would skip all data prior to the first XML element (including the BOM), or go further and provide a trim_interelement_text option, which would do the same while also trimming all text events in places where they aren't supposed to be (which would address #285 (comment)). If you want to fully reproduce the original event stream, you could just disable that trimming and you would get it all (including the BOM in a Text event).

But after attempting to implement it I realized it requires more substantial parsing changes than I'm prepared to make right now. This approach isn't ideal, but it covers most needs fairly well, addresses the original issue, and still moves in the right direction.

codecov-commenter · 2022-08-17T03:33:46Z

Codecov Report

Merging #459 (340ea04) into master (11e483a) will increase coverage by 0.81%.
The diff coverage is 63.63%.

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   52.45%   53.27%   +0.81%     
==========================================
  Files          29       29              
  Lines       13555    13527      -28     
==========================================
+ Hits         7110     7206      +96     
+ Misses       6445     6321     -124

Flag	Coverage Δ
unittests	`53.27% <63.63%> (+0.81%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
benches/microbenches.rs	`0.00% <ø> (ø)`
src/de/mod.rs	`82.24% <ø> (-0.32%)`	⬇️
src/events/mod.rs	`69.90% <ø> (+1.38%)`	⬆️
src/writer.rs	`49.34% <5.00%> (+0.42%)`	⬆️
src/encoding.rs	`83.87% <87.50%> (-12.91%)`	⬇️
src/reader/mod.rs	`90.43% <100.00%> (+0.10%)`	⬆️
src/reader/parser.rs	`98.67% <100.00%> (+0.05%)`	⬆️
... and 4 more

Mingun

I like the idea of the trim_interelement_text option (the name, probably, could be better), that's what I was thinking.

I'm not sure that we should remove StartText if we want to

- return text only as an str
- return the BOM, which is not an str (UTF BOMs seem to be representable as special characters, but I'm not sure that this is the general case. Should we bother about other possible BOMs?)

StartText solves this problem by separating the ordinary text from the BOM bytes and other unexpected characters before the declaration. Those are not allowed by the standard, but I think they could appear in the wild (one example is our own tests, where it is often convenient to have leading spaces before a declaration). Actually, it should be named StartBytes, because this is not text.

I agree that using the StartText event is not the best way to write a BOM; the writer should write it itself. That is why I was thinking about splitting events into events for reading and events for writing, which is the reason the public reader and writer modules were introduced and the Event::borrow method was implemented.

With different events we could fix #332 (because the reader event would no longer own the data, and we could return the right lifetime with the existing Attributes struct).

Different events for reading and writing also have the advantage that a reader event can hold bytes internally, decoded only on demand (we have a goal to step away from that, but it can have non-obvious implications), along with a span, whereas a writer event could

- own its content
- store its content in UTF-8 only (in fact, right now you should use only UTF-8: event generation in other encodings seems to be broken. The serde serializer, for example, always writes in UTF-8)
- have no span, which is useless for it anyway

What exactly prevents StartText?

tests/encodings.rs

src/reader/parser.rs

dralley · 2022-08-17T17:51:08Z

> I'm not sure that we should remove StartText if we want to
>
> - return text only as an str
> - return the BOM, which is not an str (UTF BOMs seem to be representable as special characters, but I'm not sure that this is the general case. Should we bother about other possible BOMs?)
>
> StartText solves this problem by separating the ordinary text from the BOM bytes and other unexpected characters before the declaration. Those are not allowed by the standard, but I think they could appear in the wild (one example is our own tests, where it is often convenient to have leading spaces before a declaration). Actually, it should be named StartBytes, because this is not text.

This is a misunderstanding of the BOM that I also shared until a day or two ago.

It's not that it's able to be represented as a special character but that it is literally just a character, with an assigned Unicode character number, which can be encoded in different encodings just like any other Unicode character - and if you put this character at the beginning of the document, and you know how different encodings represent that character, then you can just look at the first few bytes and see if they match how a particular encoding would encode the character U+FEFF.
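To make the "look at the first few bytes" idea concrete, here is a minimal sketch using only the standard library. The `BomEncoding` enum and the `detect_bom` function are hypothetical names made up for this illustration, not quick-xml's API, and UTF-32 is deliberately left out (its little-endian BOM starts with the same two bytes as the UTF-16 LE one, so a real detector would have to check it first).

```rust
/// Encodings recognisable from a leading U+FEFF (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding implied by a leading BOM together with the BOM length in bytes,
/// or `None` if the input does not start with one of the known byte sequences.
fn detect_bom(bytes: &[u8]) -> Option<(BomEncoding, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        // U+FEFF encoded as UTF-8
        Some((BomEncoding::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        // U+FEFF encoded as UTF-16, little-endian
        Some((BomEncoding::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        // U+FEFF encoded as UTF-16, big-endian
        Some((BomEncoding::Utf16Be, 2))
    } else {
        None
    }
}

fn main() {
    assert_eq!(detect_bom(b"\xEF\xBB\xBF<root/>"), Some((BomEncoding::Utf8, 3)));
    assert_eq!(detect_bom(b"\xFF\xFE<\0"), Some((BomEncoding::Utf16Le, 2)));
    assert_eq!(detect_bom(b"\xFE\xFF\0<"), Some((BomEncoding::Utf16Be, 2)));
    assert_eq!(detect_bom(b"<root/>"), None);
}
```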

In retrospect this makes total sense but it never really clicked into place for me until very recently.

The implications are that:

- str can store it just fine. It's just a Unicode character, UTF-8 encodes Unicode characters, str stores UTF-8, it all works.
- Encoders and decoders know how to handle it. You don't have to treat it differently, you can just pass it through the decoder. If you transcode a UTF-16 LE BOM to UTF-8, you get a UTF-8 BOM, because they are just two representations of the same character. The same applies to any combination of UTF-8, UTF-16, and UTF-32: you can pass it back and forth and it will always be correct for the document encoding.
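Both implications can be checked with the standard library alone. The sketch below is an illustration, not code from this PR; the helper names `bom_utf8`, `bom_utf16_le`, and `utf16_le_to_string` are made up for it. It encodes U+FEFF in two encodings, then decodes the UTF-16 LE bytes into a `String` and shows that the result is exactly the UTF-8 BOM.

```rust
/// U+FEFF encoded as UTF-8 bytes.
fn bom_utf8() -> Vec<u8> {
    '\u{FEFF}'.to_string().into_bytes()
}

/// U+FEFF encoded as UTF-16 little-endian bytes.
fn bom_utf16_le() -> Vec<u8> {
    let mut units = [0u16; 2];
    '\u{FEFF}'
        .encode_utf16(&mut units)
        .iter()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// Decodes UTF-16 LE bytes into a `String`, which is UTF-8 by definition.
/// Returns `None` for an odd byte count or invalid UTF-16.
fn utf16_le_to_string(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    // The same character, two byte representations.
    assert_eq!(bom_utf8(), vec![0xEF_u8, 0xBB, 0xBF]);
    assert_eq!(bom_utf16_le(), vec![0xFF_u8, 0xFE]);

    // Decoding the UTF-16 LE BOM yields a str holding U+FEFF,
    // whose in-memory bytes are the UTF-8 BOM.
    let decoded = utf16_le_to_string(&bom_utf16_le()).unwrap();
    assert_eq!(decoded, "\u{FEFF}");
    assert_eq!(decoded.as_bytes(), bom_utf8().as_slice());
}
```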

Therefore, since it is truly just Text, we don't need a separate StartText.

The problem with the BOM is not that it can't be represented with str but that it's annoying and undesirable to output as part of the XML event stream - which is approximately the same problem we have with un-trimmed whitespace between elements. So I think it makes sense to deal with it in the same way. If you want to parse a normalized XML representation (which most people do), you should trim all that stuff away. If you want an exact reproduction, don't trim it.
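As a rough sketch of the "trim it or don't" choice, assuming the text has already been decoded to a str: `strip_bom` and `leading_text` are hypothetical helpers for illustration only, and the `trim` flag merely stands in for whatever configuration option ends up controlling this. They are not the functions added in this PR.

```rust
/// Strips one leading U+FEFF from already-decoded text, if present.
fn strip_bom(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}

/// Hypothetical helper: returns the text found before the first element,
/// either verbatim (`trim == false`) or with the BOM and XML whitespace removed.
fn leading_text(raw: &str, trim: bool) -> &str {
    if !trim {
        // Exact reproduction: keep the BOM and whitespace as ordinary text.
        return raw;
    }
    // Normalized view: drop the BOM, then the XML whitespace characters.
    strip_bom(raw).trim_start_matches(|c: char| matches!(c, ' ' | '\t' | '\r' | '\n'))
}

fn main() {
    let raw = "\u{FEFF}  \n";
    assert_eq!(leading_text(raw, false), raw);
    assert_eq!(leading_text(raw, true), "");
    assert_eq!(strip_bom("\u{FEFF}<root/>"), "<root/>");
    assert_eq!(strip_bom("<root/>"), "<root/>");
}
```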

In terms of Writer, although we probably won't support writing different encodings any time soon (since encoding_rs doesn't support any other encodings), we don't need to worry about multiple different kinds of BOMs. Just write the UTF-8 BOM into the text stream the same as everything else and let the encoder handle it.
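A minimal sketch of what "just write the UTF-8 BOM into the text stream" could look like on top of `std::io::Write`. `SketchWriter` and its methods are hypothetical and exist only for this illustration; the actual `write_bom()` added to quick-xml's Writer may be implemented differently.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a writer that emits UTF-8 only.
struct SketchWriter<W: Write> {
    inner: W,
}

impl<W: Write> SketchWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner }
    }

    /// Writes U+FEFF like any other character; in UTF-8 that is EF BB BF.
    fn write_bom(&mut self) -> io::Result<()> {
        self.inner.write_all("\u{FEFF}".as_bytes())
    }

    /// Writes raw markup or text as UTF-8.
    fn write_str(&mut self, text: &str) -> io::Result<()> {
        self.inner.write_all(text.as_bytes())
    }

    fn into_inner(self) -> W {
        self.inner
    }
}

/// Renders a tiny document that starts with a BOM.
fn render() -> io::Result<Vec<u8>> {
    let mut writer = SketchWriter::new(Vec::new());
    writer.write_bom()?;
    writer.write_str("<root/>")?;
    Ok(writer.into_inner())
}

fn main() {
    let bytes = render().unwrap();
    assert_eq!(bytes, b"\xEF\xBB\xBF<root/>".to_vec());
}
```

Because the BOM is written as a character rather than as hard-coded bytes for each encoding, an encoding layer placed below such a writer would translate it along with the rest of the text.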

edit: I was also wrong about encoding_rs_io not letting you pass the BOM through. You can do that and it's safe for the reason I just described.

Mingun

> It's not that it's able to be represented as a special character but that it is literally just a character, with an assigned Unicode character number, which can be encoded in different encodings just like any other Unicode character - and if you put this character at the beginning of the document, and you know how different encodings represent that character, then you can just look at the first few bytes and see if they match how a particular encoding would encode the character U+FEFF.

OK, the official FAQ about that: https://unicode.org/faq/utf_bom.html#bom1

Also I've made some experiments:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4f7773f8164bd228c4f35329b915f38b

Generally approved, but please make the small improvements to the docs and tests.

Changelog.md

src/encoding.rs

src/reader/mod.rs

src/writer.rs

dralley · 2022-08-18T14:03:38Z

@Mingun re: test, I've marked it as an expected failure until we can address the same issue you pointed out. We can strip out the BOM here but it's too late to avoid emitting the text event entirely.

But we had the same issue before with *_with_bom_removal(), so it's not a regression.

dralley force-pushed the remove-bytesstart branch from dab8065 to 6daeced Compare August 17, 2022 03:18

dralley force-pushed the remove-bytesstart branch from 6daeced to f0acc46 Compare August 17, 2022 03:27

dralley requested a review from Mingun August 17, 2022 04:05

Mingun reviewed Aug 17, 2022

tests/encodings.rs (resolved)

src/reader/parser.rs (resolved)

Mingun mentioned this pull request Aug 17, 2022

[work-in-progress] Decoding BufReader implementation #441

Draft

Mingun approved these changes Aug 18, 2022

dralley added 2 commits August 18, 2022 09:25

Remove StartText

b1a9670

StartText would be out of place once all events are expected to contain UTF-8. Additionally the decoder implementation strips BOM bytes out of the bytestream so there's no good way to access them.

Add a write_bom() method to the Writer

08d4a3a

dralley force-pushed the remove-bytesstart branch from f0acc46 to b27d52f Compare August 18, 2022 13:31

Remove BOM from first-emitted text event

340ea04

dralley force-pushed the remove-bytesstart branch from b27d52f to 340ea04 Compare August 18, 2022 13:53

dralley merged commit bbe490b into tafia:master Aug 18, 2022

dralley deleted the remove-bytesstart branch August 18, 2022 14:14

Mingun added a commit to Mingun/quick-xml that referenced this pull request Aug 26, 2022

Remove excess test

fca7178

After removing StartText event in tafia#459 text events can be generated at the beginning of the stream

Mingun added a commit to Mingun/quick-xml that referenced this pull request Aug 26, 2022

Remove excess test

3692228

After removing StartText event in tafia#459 text events can be generated at the beginning of the stream

Mingun added a commit to Mingun/quick-xml that referenced this pull request Aug 27, 2022

Remove excess test

bf2a360

After removing StartText event in tafia#459 text events can be generated at the beginning of the stream

dralley pushed a commit that referenced this pull request Aug 27, 2022

Remove excess test

1b0259d

After removing StartText event in #459 text events can be generated at the beginning of the stream
