Cannot parse UTF-16 xml #322

Closed
BlueGreenMagick opened this issue Sep 27, 2021 · 1 comment
Labels
bug encoding Issues related to support of various encodings of the XML documents enhancement help wanted

Comments

@BlueGreenMagick
Contributor

UTF-16 xml parsing does not seem to work at all.

Current Behaviour

To test this, try changing this function to print tag names.

#[test]
#[cfg(feature = "encoding")]
fn test_unescape_and_decode_without_bom_removes_utf16le_bom() {
    let mut reader = Reader::from_file("./tests/documents/utf16le.xml").unwrap();
    reader.trim_text(true);

    let mut txt = Vec::new();
    let mut buf = Vec::new();

    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Text(e)) => txt.push(e.unescape_and_decode_without_bom(&mut reader).unwrap()),
            Ok(Event::Eof) => break,
            _ => (),
        }
    }
    assert_eq!(txt[0], "");
}

#[test]
#[cfg(feature = "encoding")]
fn test_unescape_and_decode_without_bom_removes_utf16le_bom() {
    let mut reader = Reader::from_file("./tests/documents/utf16le.xml").unwrap();
    reader.trim_text(true);

    let mut txt = Vec::new();
    let mut buf = Vec::new();

    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Text(e)) => txt.push(e.unescape_and_decode_without_bom(&mut reader).unwrap()),
            Ok(Event::Eof) => break,
            Ok(Event::Start(e)) => txt.push(reader.decode(&e.local_name()).to_string()), // add this line
            _ => (),
        }
    }
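    // Print the contents of txt. The panic!() forces the test to fail so that
    // cargo shows the captured output; the assert below is then never reached.
    println!("{:?}", txt);
    panic!();
    assert_eq!(txt[0], "");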
}

You would expect to see something like ["", "project"].
But instead, this comes out: ["", "㼀砀洀氀�", "\u{a00}�", "瀀爀漀樀攀挀琀�", "\u{a00}�", "⼀瀀爀漀樀攀挀琀�", "\u{a00}�"]

Likely Reason

It looks like the parser compares a single byte to check whether a character is b'<' (0x3C) or another special character, then advances the position by 1 byte. But in UTF-16LE, the < character is actually two bytes: 0x3C 0x00.
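To illustrate the difference between matching bytes and matching 16-bit code units, here is a minimal std-only sketch. It is not quick-xml code: `utf16le_bytes` and `find_lt_utf16le` are hypothetical helpers written for this example. It shows that a byte-wise search for 0x3C can also hit the low byte of an unrelated character such as U+013C, while a code-unit search only matches a real `<`.

```rust
/// Encode a string as UTF-16LE bytes (no BOM). Hypothetical helper for this sketch.
fn utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Find '<' by comparing whole 16-bit code units (0x3C 0x00) instead of single bytes.
/// Returns the byte offset of the match. Hypothetical helper, not a quick-xml API.
fn find_lt_utf16le(bytes: &[u8]) -> Option<usize> {
    bytes
        .chunks_exact(2)
        .position(|pair| pair[0] == 0x3C && pair[1] == 0x00)
        .map(|unit_index| unit_index * 2)
}

fn main() {
    // '<' occupies two bytes in UTF-16LE.
    assert_eq!(utf16le_bytes("<"), vec![0x3Cu8, 0x00]);

    // U+013C encodes as 0x3C 0x01, so its first byte looks like '<' to a byte-wise scan.
    let bytes = utf16le_bytes("\u{013C}<");
    assert_eq!(bytes, vec![0x3Cu8, 0x01, 0x3C, 0x00]);

    // A byte-wise search stops at offset 0, which is a false positive.
    let byte_wise = bytes.iter().position(|&b| b == b'<');
    assert_eq!(byte_wise, Some(0));

    // A code-unit search finds the real '<' at byte offset 2.
    assert_eq!(find_lt_utf16le(&bytes), Some(2));
}
```

Whether a fix inside the parser would take this shape, or would transcode the input to UTF-8 first, is a design decision this sketch does not try to settle.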

So when the parser tries to parse <?..., which in UTF-16LE (with BOM) is 0xFF 0xFE 0x3C 0x00 0x3F 0x00 ..., it consumes 0x3C (<) and thinks the element's raw name is 0x00 0x3F 0x00 .... (㼀 is 0x00 0x3F)
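The misalignment can be reproduced without quick-xml. The sketch below is a std-only approximation under stated assumptions: `name_after_lt` is a hypothetical stand-in for a byte-wise scanner, and `decode_utf16le_lossy` is a simplified decoder that maps a trailing odd byte to U+FFFD, which is how the output above appears to behave. Decoding the bytes left after a one-byte `<` produces the same garbled characters, and dropping one more byte realigns the name.

```rust
/// Encode a string as UTF-16LE bytes (no BOM). Hypothetical helper for this sketch.
fn utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Simplified lossy UTF-16LE decoder: a trailing odd byte becomes U+FFFD.
/// This only approximates what a real decoder does with truncated input.
fn decode_utf16le_lossy(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    let mut decoded = String::from_utf16_lossy(&units);
    if bytes.len() % 2 == 1 {
        decoded.push('\u{FFFD}');
    }
    decoded
}

/// Hypothetical byte-wise scanner: find the byte b'<' and return everything
/// after it, advancing by one byte only (the suspected behaviour).
fn name_after_lt(bytes: &[u8]) -> &[u8] {
    let pos = bytes
        .iter()
        .position(|&b| b == b'<')
        .expect("input contains no '<' byte");
    &bytes[pos + 1..]
}

fn main() {
    // "<?xml" in UTF-16LE: 3C 00 3F 00 78 00 6D 00 6C 00
    let doc = utf16le_bytes("<?xml");
    let raw = name_after_lt(&doc);

    // The "name" starts with the orphaned 0x00 and is one byte off.
    assert_eq!(raw[..3], [0x00, 0x3F, 0x00]);
    assert_eq!(
        decode_utf16le_lossy(raw),
        "\u{3F00}\u{7800}\u{6D00}\u{6C00}\u{FFFD}"
    );

    // Skipping the orphaned byte realigns the code units.
    assert_eq!(decode_utf16le_lossy(&raw[1..]), "?xml");
}
```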

Changing that line in the code above to txt.push(reader.decode(&e.local_name()[1..]).to_string()) makes the element names come out correct (even though Event::End and Event::Decl are lumped in with Event::Start): ["", "?xml", "\u{a00}�", "project", "\u{a00}�", "/project", "\u{a00}�"].

@Mingun Mingun added bug enhancement help wanted encoding Issues related to support of various encodings of the XML documents labels May 21, 2022
@dralley
Collaborator

dralley commented Sep 26, 2022

Considering this a duplicate of #158

@dralley dralley closed this as completed Sep 26, 2022