-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Selectively deserializing XML elements with serde #165
Comments
Awesome! As for your question I usually deserialize things manually. Usually it is not too complicated to do (I tend to use internal Doing automatic serde-like deserialization is very nice too! But getting all the use cases right is complex. If I were you I'd probably open a PR on serde-xml to add a feature to use quick-xml instead of xml-rs. Not sure if they'd want to merge it but you'll probably get some idea about improving your code from there. Anyway, let me know what you finally decide! |
Doesn't look like they'd agree: RReverser/serde-xml-rs#61 I've asked again though: RReverser/serde-xml-rs#108 |
Meanwhile, just flushing some ideas I had, you can ignore me... My first thought was to make a read_to_end()-like method that returns an object that implements Deserializer. That meant putting serde code inside quick-xml though, which is not nice. Or maybe it is? Not sure. My second thought was making read_to_end() return the string it read; looks like inside of it, it keeps clearing the given buffer, but with a flag that could be made optional. Then I could parse that buffer with something else (e.g. serde-xml-rs). Problem is that it's basically being parsed twice, and the buffer might become huge. My third thought was to make use of what #161 is bringing, but that'd still mean parsing the text twice. Also, possibly seeking back in the reader, which might be hard with e.g. streams. My fourth thought was to make read_to_end() return a special object that implements Read, which could be used as an input for anything (including the current xml-rs/serde-xml-rs); problem is, I don't think it would be a valid XML, since it'd be missing namespace information from outer elements. My fifth thought was to make a read_to_end() method that returns a sub-XmlReader, which would borrow the reader, namespace info, etc, and whose Also, I think some of the above suffer from forgetting about namespace information from outer elements, so I can ignore them outright. |
I think the first idea is probably the best one, again if the purpose is to have something as competitive as possible. I would probably have a |
So, speed at any cost? You are a man after my own heart ❤️ I'll wait some more for a reply from serde-xml-rs, I'm not in any hurry, although I'm definitely not dropping this. |
Haha, no, not at any cost. My position is more that we shouldn't take shortcuts when we can avoid it. We should aim for code that is as efficient as possible, while being reasonably simple to reason about. |
Just to let you know, I've just started implement a |
Here's the generic Deserializer I made long ago, compatible with both quick-xml and xml-rs: Looks like it still works, just needed a couple tweaks to work with quick-xml 0.16. I haven't touched it since then so I don't remember much about it, but I think it was pretty complete, and was just missing the serializer part. It's probably a bit behind compared to the main serde-xml-rs repo. Feel free to use parts (or the whole) of it if you want! |
I would love to use this/see this happen. I'm working on a baseball data scraper which parses a LOT of XML files. I've built it all using serde-xml-rs, so would be really happy if I could just swap out xml-rs for quick-xml without losing serde support and get a free performance boost. |
I am working on a port of serde-xml into quick-xml. |
First, thanks for taking this on, it is much appreciated. I'm having trouble getting anything to deserialize, I get an Err (Start,) for all the various XML files I tried replacing For this link: http://gd2.mlb.com/components/game/rok/year_2008/month_07/day_10/gid_2008_07_10_dblrok_ddorok_2/linescore.xml I had the following structs: struct LineScoreData {
game_pk: u32,
game_type: char,
venue: String,
venue_w_chan_loc: String,
venue_id: u32,
time: String,
time_zone: String,
ampm: String,
home_team_id: u32,
home_team_city: String,
home_team_name: String,
home_league_id: u32,
away_team_id: u32,
away_team_city: String,
away_team_name: String,
away_league_id: u32,
#[serde(rename="linescore", skip_serializing)]
innings: Vec<LineScore>,
}
struct LineScore {
#[serde(rename="away_inning_runs")]
away_runs: u32,
#[serde(rename="home_inning_runs", deserialize_with = "empty_string_is_none")]
//needs to be an Option, since home team doesn't always bat.
home_runs: Option<u32>,
// Keeping the inning as a string, since we'll need it to construct URLs later
inning: String,
} |
Thank you for the quicjk response! |
Works! I'll start testing the others later today and feed back any errors I bump into. |
Awesome! |
For this XML file: http://gd2.mlb.com/components/game/mlb/year_2008/month_05/day_10/gid_2008_05_10_cinmlb_nynmlb_1/players.xml I used the following structs: #[derive(Deserialize, Serialize, Debug)]
struct Game {
#[serde(rename="team")]
teams: Vec<Team>,
umpires: Umpires
}
#[derive(Deserialize, Serialize, Debug)]
struct Team {
#[serde(rename="type")]
home_away: HomeAway,
id: String,
name: String,
#[serde(rename="player")]
players: Vec<Player>,
#[serde(rename="coach")]
coaches: Vec<Coach>,
}
#[derive(Deserialize, Serialize, Debug)]
enum HomeAway {
#[serde(rename="home")]
Home,
#[serde(rename="away")]
Away,
}
#[derive(Deserialize, Serialize, Debug, Clone)]
struct Player {
id: u32,
#[serde(rename="first")]
name_first: String,
#[serde(rename="last")]
name_last: String,
game_position: Option<String>,
bat_order: Option<u8>,
position: String,
}
#[derive(Deserialize, Serialize, Debug)]
struct Coach {
position: String,
#[serde(rename="first")]
name_first: String,
#[serde(rename="last")]
name_last: String,
id: u32,
} This gave an error of "invalid type: string "away", expected enum HomeAway". |
Fixed! |
Hi!
I want to parse a 3MB XML file with serde. I used serde-xml-rs, and it is painfully slow; I hacked my way through serde-xml-rs's code to make it work with quick-xml instead (the APIs are very similar after all), and that sped it up tremendously (from 1.5s to 0.1s).
But! I don't want to deserialize an entire 3MB XML into a giant struct (which has loads of small heap-allocated vecs inside it), when all I want is to scan for a specific element inside it, and deserialize just that one.
I thought of using quick-xml's events to reach the element I want, then read_to_end() to get the whole element as a big text block, and then use serde-xml-rs to parse the text block as xml; except this approach loses all namespacing/encoding info.
I also thought of implementing some sort of from_element(xmlreader, start_element), which would give a partial Deserializer object, which is my current favorite option.
Thoughts? Any better ways to do this?
The text was updated successfully, but these errors were encountered: