Cannot parse UTF-16 xml #322
Labels
bug
encoding
Issues related to support of various encodings of the XML documents
enhancement
help wanted
UTF-16 xml parsing does not seem to work at all.
Current Behaviour
To test this, try changing this function to print tag names:
quick-xml/tests/unit_tests.rs, lines 904 to 921 in 118c07b
You would expect to see something like ["", "project"]. But instead, this comes out:
["", "㼀砀洀氀�", "\u{a00}�", "瀀爀漀樀攀挀琀�", "\u{a00}�", "⼀瀀爀漀樀攀挀琀�", "\u{a00}�"]
Likely Reason
It looks like the parser compares a single byte to check whether that character is b'<' (0x3C) or another special character, then increments the position by 1 byte. But in little-endian UTF-16, the < character is actually two bytes: 0x3C 0x00.
So when the parser tries to parse <?..., which in UTF-16 is 0xFF 0xFE 0x3C 0x00 0x3F 0x00 ... (0xFF 0xFE being the byte order mark), it consumes 0x3C (<) and thinks the element's raw name is 0x00 0x3F 0x00 ... (㼀 is 0x00 0x3F).
Changing the line in the above code to
txt.push(reader.decode(&e.local_name()[1..]).to_string())
makes the element names come out correct (even though Event::End and Event::Decl are lumped together with Event::Start):
["", "?xml", "\u{a00}�", "project", "\u{a00}�", "/project", "\u{a00}�"]
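The misalignment described above can be reproduced with the standard library alone. The following is a minimal sketch, not quick-xml's actual code: `utf16le_bytes`, `decode_utf16le_lossy`, and `naive_name` are hypothetical helpers written for this illustration, and turning a trailing odd byte into U+FFFD is an assumption chosen to match the output shown in this issue.

```rust
/// Encode a string as UTF-16LE bytes (no byte order mark).
fn utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Decode UTF-16LE bytes lossily.
/// Assumption: a trailing odd byte is replaced by U+FFFD.
fn decode_utf16le_lossy(bytes: &[u8]) -> String {
    let chunks = bytes.chunks_exact(2);
    let has_trailing_byte = !chunks.remainder().is_empty();
    let units: Vec<u16> = chunks
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    let mut out = String::from_utf16_lossy(&units);
    if has_trailing_byte {
        out.push('\u{FFFD}');
    }
    out
}

/// Hypothetical byte-wise scan: find the single byte b'<', skip exactly
/// one byte, and take everything up to the single byte b'>'.
/// This mirrors the suspected behaviour, not the real parser.
fn naive_name(bytes: &[u8]) -> Option<&[u8]> {
    let start = bytes.iter().position(|&b| b == b'<')? + 1;
    let len = bytes[start..].iter().position(|&b| b == b'>')?;
    Some(&bytes[start..start + len])
}
```

For the input `utf16le_bytes("<project>")`, `naive_name` returns a 15-byte slice that starts with the 0x00 half of `<`. Decoding that slice gives 瀀爀漀樀攀挀琀 followed by U+FFFD, while decoding the same slice with the first byte skipped (`&name[1..]`) gives "project", which is consistent with the workaround above.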