Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly detect encoding even without BOM #465

Merged
merged 11 commits into from
Aug 27, 2022

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Aug 26, 2022

This PR correctly processes input and allow to detect UTF-16 encoding even without BOM, using the algorithm, recommended by W3C. Previously it is not worked because we ran it too late, when the < (0x3C) byte was already recognized as start of a tag and the "start text" does not contained it.

There is one thin thing there: when we should strip BOM? Currently I chosen the consistent behavior across encoding feature enabled / disabled and parsing from bytes / from &str, but probably it is not the best choice.

The available variants can be summarized in that table:

from_reader from_str
--features encoding auto-detected, BOM stripped UTF-8, is it needed to strip possible BOM?
no --features encoding UTF-8, is it needed to strip possible BOM? UTF-8, is it needed to strip possible BOM?

I doubt about stripping BOM from &str arguments, because it is highly likely that BOM already was stripped when you've got the string, and it it the other BOM in the beginning, it is probably the content part (because BOM is ordinary character, U+FEFF, and can be used in any place like any other character).

However, when encoding feature is not enabled, the from_reader and from_str cases are the same -- input should be in UTF-8. Should we strip BOM in from_reader case then?


This PR also adds a set of files in all supported encodings most of which contains all characters that corresponding encoding can handle. The utility project for generation is in the test-gen folder

@Mingun Mingun added bug encoding Issues related to support of various encodings of the XML documents labels Aug 26, 2022
@Mingun Mingun requested a review from dralley August 26, 2022 16:32
src/encoding.rs Outdated Show resolved Hide resolved
Cargo.toml Show resolved Hide resolved
@@ -173,25 +180,26 @@ pub(crate) fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'
///
/// | Bytes |Detected encoding
/// |-------------|------------------------------------------
/// |`FE FF ## ##`|UTF-16, big-endian
/// | **BOM**
/// |`FE_FF_##_##`|UTF-16, big-endian
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underscores added because otherwise long text in lower cells shrink the first column and leads to text wrapping, which looks not good

@Mingun
Copy link
Collaborator Author

Mingun commented Aug 26, 2022

What about BOM stripping? Should we strip BOM when we parse from &str?

@dralley
Copy link
Collaborator

dralley commented Aug 26, 2022

I don't think so personally - #461 should handle it so that the input itself never needs to be transformed. Unless you think we need it would be useful in the interim before that issue is dealt with.

@Mingun
Copy link
Collaborator Author

Mingun commented Aug 26, 2022

Ok, then for now I leave it as implemented:

from_reader from_str
--features encoding auto-detected, BOM stripped if present UTF-8, BOM stripped if present
no --features encoding UTF-8, BOM stripped if present UTF-8, BOM stripped if present

PR is ready for review

@dralley
Copy link
Collaborator

dralley commented Aug 26, 2022

Thanks. I've been waiting for this to avoid overlapping work as I needed to make a few similar changes - will try to proceed with the decoding work this weekend.

## - [UTF-16LE]
## - [ISO-2022-JP]
##
## You should stop to process document when one of that encoding will be detected,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion makes sense in isolation, but I don't know that we want to contribute to even more churn than necessary? It's likely to either break or be unnecessary in the very next release.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just want to explicitly warn users about that UTF-16 and ISO-2022-JP are not supported for now. For example, parsing documents with some Chinese characters that represented as [ASCII byte, some byte] in UTF-16BE or [some byte, ASCII byte] in UTF16LE can confuse the parser. That can be avoided if you stop to processing such documents in the very beginning.

I want make this change because I want to cut release this weekend, and it is still unclear when the correct solution will be ready, so it is best to avoid using problematic encodings for now.

I hope that in the next release this note will be removed. It should not break anything -- once correct support will be implemented, users will be updated and remove their guard code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a remark, that restriction is temporary and will be eliminated once #158 is fixed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I do think finishing #158 this weekend is plausible but I won't promise it.

Copy link
Collaborator

@dralley dralley Aug 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, an intermediate "good enough to release" stage of it, in any case.

Copy link
Collaborator

@dralley dralley Aug 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or it would be, except for async. Keep forgetting about that...

Cargo.toml Outdated
@@ -54,9 +54,54 @@ async-tokio = ["tokio"]
## UTF-16 will not work (therefore, `quick-xml` is not [standard compliant]).
##
## List of supported encodings includes all encodings supported by [`encoding_rs`]
## crate, that satisfied the restriction above.
## crate, that satisfied the restriction above. So, the following encodings are
## **not supported**:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of redundancy between this, the lines above, and the lines below that can probably be cleaned up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove redundancy, if you have something else in mind, please make a suggestion.

Copy link
Collaborator

@dralley dralley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing it to approved so you can merge once the minor comments are addressed.

@Mingun Mingun mentioned this pull request Aug 27, 2022
18 tasks
@dralley dralley merged commit 9598c37 into tafia:master Aug 27, 2022
@Mingun Mingun deleted the detect-encoding branch August 27, 2022 17:13
Mingun added a commit to Mingun/quick-xml that referenced this pull request Sep 29, 2023
CDATA section was formed incorrectly and instead was recognized as a Start tag.

File introduced in PR tafia#465 and was made manually and not using generator, because
WHATWG does not have definition of this encoding as a separate entry in index.json.
Actually, this encoding the same as ISO-8859-8, but influences layout direction
when render text.

Wikipedia:
The WHATWG Encoding Standard used by HTML5 treats ISO-8859-8 and ISO-8859-8-I
as distinct encodings with the same mapping due to influence on the layout direction

So generator was fixed and file regenerated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug encoding Issues related to support of various encodings of the XML documents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants