-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strip BOM from the event stream, add a method to Writer for writing BOM #459
Conversation
dab8065
to
6daeced
Compare
What I actually want to do: provide a configuration option for either But after attempting to implement it I realized it requires more substantial parsing changes then I'm prepared to make right now. This approach isn't ideal but it covers most needs fairly well, addresses the original issue, and still moves in the right direction. |
6daeced
to
f0acc46
Compare
Codecov Report
@@ Coverage Diff @@
## master #459 +/- ##
==========================================
+ Coverage 52.45% 53.27% +0.81%
==========================================
Files 29 29
Lines 13555 13527 -28
==========================================
+ Hits 7110 7206 +96
+ Misses 6445 6321 -124
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of the trim_interelement_text
option (the name, probably, could be better), that's what I was thinking.
I'm not sure that we should remove StartText
if we want
- return text only as an
str
- return the BOM, which is not an
str
(UTF BOMs seems to be representable as special characters, but I'm not sure that this is a general case. Should we bother about other possible BOMs?)
StartText
solves this problem by splitting out the ordinary text from the BOM bytes and other unexpected chars before the declaration, which are not according to the standard, but I think, could be used in the wild (the examples is our tests where it is often convenient to have leading spaces before a declaration). Actually, it should be named StartBytes
, because this is not a text
I agree that using StartText
event not the best way to write BOM, the writer should write it by himself. That is why I was thinking about splitting event into events for reading and events for writing, for which reason public reader
and writer
modules was introduced and Event::borrow
method was implemented.
Having different events we could fix #332 (because reader event no longer will own the data and we could return right lifetime with existing Attributes
struct).
Different events for reading and writing also has an advantage, that reader event can contain bytes internally and decoded only on demand (although we have a goal to step away from that, but is can have non-obvious implications) and a span, but writer event could
- own their content
- store their content in UTF-8 only (and in fact right now you should use only UTF-8 -- event generation in other encodings seems to be broken. serde serializer, for example, always writes in UTF-8)
- does not have a span which is anyway is useless for it
What exactly prevents StartText
?
This is a misunderstanding of the BOM that I also shared until a day or two ago. It's not that it's able to be represented as a special character but that it is literally just a character, with an assigned Unicode character number, which can be encoded in different encodings just like any other Unicode character - and if you put this character at the beginning of the document, and you know how different encodings represent that character, then you can just look at the first few bytes and see if they match how a particular encoding would encode the character U+FEFF. In retrospect this makes total sense but it never really clicked into place for me until very recently. The implications are that:
Therefore, since it is truly just The problem with the BOM is not that it can't be represented with In terms of edit: I was also wrong about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that it's able to be represented as a special character but that it is literally just a character, with an assigned Unicode character number, which can be encoded in different encodings just like any other Unicode character - and if you put this character at the beginning of the document, and you know how different encodings represent that character, then you can just look at the first few bytes and see if they match how a particular encoding would encode the character U+FEFF.
OK, the official FAQ about that: https://unicode.org/faq/utf_bom.html#bom1
Also I've made some experiments:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4f7773f8164bd228c4f35329b915f38b
Generally approved, but please made the small improvements in doc & tests
StartText would be out of place once all events are expected to contain UTF-8. Additionally the decoder implementation strips BOM bytes out of the bytestream so there's no good way to access them.
f0acc46
to
b27d52f
Compare
b27d52f
to
340ea04
Compare
@Mingun re: test, I've marked it as an expected failure until we can address the same issue you pointed out. We can strip out the BOM here but it's too late to avoid emitting the text event entirely. But we had the same issue before with |
After removing StartText event in tafia#459 text events can be generated at the beginning of the stream
After removing StartText event in tafia#459 text events can be generated at the beginning of the stream
After removing StartText event in tafia#459 text events can be generated at the beginning of the stream
After removing StartText event in #459 text events can be generated at the beginning of the stream
No description provided.