Problems with reading XML from buffering reader #469
The problem is not related to non-Latin1 characters; it is probabilistic. With Russian characters, the first CDATA end sequence … (see lines 761 to 765 in 6bedf6c).
Let me guess: the problem is in the buffering reader. When XML with Russian (or other two-or-more-byte) characters fits into the buffer, everything is fine, but if such XML is spread across several chunks, the problem occurs. Maybe there is an assumption that a char always has a one-byte size, but that's not true for UTF-8 encoding.
Yes, the problem is in the buffering reader only, but no, it also exists for one-byte characters (if you replace …). For all ASCII-compatible encodings supported by quick-xml, it is not possible to split a string inside a multi-byte character, because we only split at the boundaries of certain ASCII characters, and their byte values are never used as the second byte in multi-byte encodings (actually, all supported encodings are 1-2-byte encodings). That is all encodings except UTF-16BE, UTF-16LE, and ISO-2022-JP.
I'm hitting the same issue; is there any workaround, or do I need to downgrade the crate to 0.22?
A possible workaround is to read the whole document into a buffer and use the borrowing version (…). You can also help by working on a patch. If no one provides one, I'll fix it myself next weekend and release 0.25 with the fix. In the fix I expect tests for all events (most of them are affected) for both sync and async readers. The starting point for a test:

```rust
#[test]
fn issue469() {
    let xml = "<![CDATA[1]]>";
    //         ^^^^^^^^^^^^ - data that fits into the buffer
    let size = xml.match_indices("]]").next().unwrap().0 + 2;
    let br = BufReader::with_capacity(size, xml.as_bytes());
    let mut reader = Reader::from_reader(br);
    let mut buf = Vec::new();
    assert_eq!(
        reader.read_event_into(&mut buf).unwrap(),
        Event::CData(BytesCData::new("1"))
    );
    assert_eq!(
        reader.read_event_into(&mut buf).unwrap(),
        Event::Eof
    );
}
```

The fix should be simple: we need to check the last one or two bytes in the buffer before consuming it. If it is …
I would do so, but I can't, because my XML is about 24 GB :)
Probably you could use memory-mapped files to pretend that the XML is in memory.
That cannot be done either: the XML content is unzipped on the fly from an xz file and passed as a stream to the XML parser (via stdin).
Well, I'm sure that as an exercise it would be possible to unzip to memory, map the memory, and parse from it, but that bug will probably be fixed before that Rube Goldberg machine works :)
failures (8):
- reader::async_tokio::test::small_buffers::cdata1
- reader::async_tokio::test::small_buffers::cdata2
- reader::async_tokio::test::small_buffers::comment1
- reader::async_tokio::test::small_buffers::comment2
- reader::buffered_reader::test::small_buffers::cdata1
- reader::buffered_reader::test::small_buffers::cdata2
- reader::buffered_reader::test::small_buffers::comment1
- reader::buffered_reader::test::small_buffers::comment2
Fix #469 - parsing from buffered reader
Here is an example; when you run it, it crashes with something like this:

If the buffer capacity is increased to 512 (see `let br = BufReader::with_capacity(32, xml.as_bytes());`), the problem goes away. Replacing non-Latin1 characters with Latin1 ones also solves the problem.
ADDITIONAL INFO
It was not a problem before v0.23, so v0.22 parsed such files correctly.