Parsing 20MB file using from_reader is slow #160
It takes 750 ms to deserialize these types, while json.load in Python takes 300 ms. Reported by @mitsuhiko in IRC.

---
These are some smaller files from Sentry that should show the same behavior. The files I'm working with are about four to six times the size, but unfortunately I cannot publicly share them.

---
I took the larger of the two files in the zip (vendor.js.map) and extended the "mappings" and "sourcesContent" to each be four copies of themselves. The result is 23 MiB: vendor2.zip.

Parsing unbuffered directly from the file takes 4710ms. We expect this to be slow.

```rust
let f = File::open(path).unwrap();
serde_json::from_reader(f).unwrap()
```

Parsing from a buffered reader takes 562ms. I assume this is what @mitsuhiko was running.

```rust
let br = BufReader::new(File::open(path).unwrap());
serde_json::from_reader(br).unwrap()
```

Parsing from a string (including reading the file to a string!) takes 55ms. This is the case that I optimized a while back.

```rust
let mut s = String::new();
File::open(path).unwrap().read_to_string(&mut s).unwrap();
serde_json::from_str(&s).unwrap()
```

Parsing from a vec is the same at 55ms.

```rust
let mut bytes = Vec::new();
File::open(path).unwrap().read_to_end(&mut bytes).unwrap();
serde_json::from_slice(&bytes).unwrap()
```

Note that in all of these cases, parsing to RawSourceMap vs parsing to serde_json::Value takes exactly the same time, because the JSON is dominated by large strings.

Parsing in Python takes 248ms in Python 2.7.12 and 186ms in Python 3.5.2. Both Pythons read the file into memory as a string first, so Python is doing a slower version of what Rust is doing in 55ms.

```python
with open(path) as f:
    json.load(f)
```

I also tried the other json crate for good measure, which takes 77ms (still impressive compared to Python).

```rust
let mut s = String::new();
File::open(path).unwrap().read_to_string(&mut s).unwrap();
json::parse(&s).unwrap()
```

And of course I tried RapidJSON, which may be the fastest C/C++ JSON parser. Don't mind the nasty but actually really fast reading of the file into a std::string; it only takes 6ms. Using clang++ 3.8.0 with -O3 the parse takes 110ms, and using g++ 5.4.0 with -O3 it takes 67ms.

```cpp
std::ifstream in(path, std::ios::in | std::ios::binary);
std::string s;
in.seekg(0, std::ios::end);
s.resize(in.tellg());
in.seekg(0, std::ios::beg);
in.read(&s[0], s.size());
in.close();

rapidjson::Document d;
d.Parse(s.c_str());
```

Conclusion: […]
---
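For reference, a minimal harness along these lines reproduces this style of measurement. It is a sketch, not the benchmark code used above: it assumes serde_json as a dependency and a local copy of vendor2.json, and timings will vary by machine.

```rust
use std::fs::File;
use std::io::{BufReader, Read};
use std::time::Instant;

fn main() {
    // Assumed local copy of the 23 MiB test file discussed above.
    let path = "vendor2.json";

    // Buffered-reader path.
    let t = Instant::now();
    let br = BufReader::new(File::open(path).unwrap());
    let _: serde_json::Value = serde_json::from_reader(br).unwrap();
    println!("from_reader(BufReader): {:?}", t.elapsed());

    // Read-to-string path.
    let t = Instant::now();
    let mut s = String::new();
    File::open(path).unwrap().read_to_string(&mut s).unwrap();
    let _: serde_json::Value = serde_json::from_str(&s).unwrap();
    println!("read_to_string + from_str: {:?}", t.elapsed());
}
```

---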
For those wondering, bincode takes 14ms.

```rust
let mut br = BufReader::new(File::open(path).unwrap());
let lim = bincode::SizeLimit::Infinite;
bincode::serde::deserialize_from(&mut br, lim).unwrap()
```

---
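The `SizeLimit` API above is from an old bincode release; against the bincode 1.x API the equivalent read looks roughly like the sketch below. The helper name, `path`, and the target type are assumptions, not from the original comment.

```rust
use std::fs::File;
use std::io::BufReader;

// Sketch against the bincode 1.x API, where deserialize_from takes
// just a reader and size limits moved to the Options builder.
fn read_bincode<T>(path: &str) -> Result<T, Box<dyn std::error::Error>>
where
    T: serde::de::DeserializeOwned,
{
    let mut br = BufReader::new(File::open(path)?);
    Ok(bincode::deserialize_from(&mut br)?)
}
```

---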
Comments from @erickt in IRC:
---
Depending on your timeline for improving the speed of `from_reader` […] I just encountered this problem with a 45MB JSON file that was taking about 25 seconds to load using `from_reader`. I don't think the […]

edit: if you agree this is a good idea, I'll be glad to submit a PR.

---
The problem with `from_reader` is that the deserializer pulls its input through the `io::Read` interface one byte at a time, so even with a `BufReader` in front it pays per-byte call overhead, and it can never borrow string data directly out of the input the way `from_str`/`from_slice` can.

---
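A toy illustration of that per-byte overhead, isolated from serde_json entirely (this is not serde_json's internals, just the `Read` trait cost in the two access patterns):

```rust
use std::io::{BufReader, Read};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // 20 MiB of dummy input; &[u8] implements Read.
    let data = vec![b'x'; 20 * 1024 * 1024];

    // One Read::read call per byte, the access pattern of a
    // byte-oriented parser.
    let start = Instant::now();
    let mut r = BufReader::new(&data[..]);
    let mut byte = [0u8; 1];
    let mut count = 0usize;
    while r.read(&mut byte)? == 1 {
        count += 1;
    }
    println!("byte-at-a-time: {} bytes in {:?}", count, start.elapsed());

    // A single bulk read of the same input.
    let start = Instant::now();
    let mut src: &[u8] = &data;
    let mut buf = Vec::new();
    src.read_to_end(&mut buf)?;
    println!("bulk read: {} bytes in {:?}", buf.len(), start.elapsed());
    Ok(())
}
```

---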
We could use specialization for `Seek` and `BufReader`. With `Seek` we can detect the size and then choose between slice, buf, or read processing.

---
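On stable, without specialization, the `Seek` half of that idea can be sketched as a standalone helper. The name `from_seekable` is hypothetical, not a serde_json API:

```rust
use std::error::Error;
use std::io::{Read, Seek, SeekFrom};

// Hypothetical helper: use Seek to learn the total size, read into one
// pre-sized buffer, then take the fast slice path.
fn from_seekable<R, T>(mut rdr: R) -> Result<T, Box<dyn Error>>
where
    R: Read + Seek,
    T: serde::de::DeserializeOwned,
{
    let len = rdr.seek(SeekFrom::End(0))?;
    rdr.seek(SeekFrom::Start(0))?;
    let mut bytes = Vec::with_capacity(len as usize);
    rdr.read_to_end(&mut bytes)?;
    Ok(serde_json::from_slice(&bytes)?)
}
```

---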
I'm taking a stab at implementing the BufRead approach using specialization, which would make it nightly-only for now, although I guess we could add a `from_bufread`. I'll create a PR for discussion when I have something to show.

---
All right, I have an absolutely terrible but working PoC. It required a lot of 'open-heart surgery' on the project to make all the lifetimes and stuff work (you can't return a buffer from a BufRead), but you can look at the result here: https://github.com/bouk/json/tree/buf-read-keys (I think a rewrite of the whole read.rs file would be the most prudent course of action). Again, it's a PoC, the code is 💩. Anyways, for the result, with this script:

```rust
extern crate serde_json;

use std::fs::File;
use std::io::BufReader;

fn main() {
    let br = BufReader::new(File::open("vendor2.json").unwrap());
    let _: serde_json::Value = serde_json::from_reader(br).unwrap();
}
```

I get 450ms parse time on the current master, but on my branch it's brought down to ~100ms with the buffer optimizations. So a 4-5x speedup is what we can expect here. Like I mentioned before, a lot of assumptions need to be rethought, like the Reference enum, which I couldn't get working properly and which doesn't even lead to improvements in the default json Value parser, as borrowed strings aren't used (but they could be useful for other types I guess). So, to conclude: definitely possible and worthwhile; look at my untested and broken implementation for inspiration, but there is more work required.

EDIT: OK I take it back, it's slightly nicer now. Not much, but some.

---
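Until something along those lines lands, the workaround available today is to do the buffering yourself and hit the slice path. A minimal sketch; the helper name is hypothetical, not a serde_json API:

```rust
use std::io::Read;

// Hypothetical stand-in for from_reader: buffer the entire input once,
// then parse with the fast slice-based deserializer.
fn from_reader_buffered<R, T>(mut rdr: R) -> Result<T, Box<dyn std::error::Error>>
where
    R: Read,
    T: serde::de::DeserializeOwned,
{
    let mut bytes = Vec::new();
    rdr.read_to_end(&mut bytes)?;
    Ok(serde_json::from_slice(&bytes)?)
}
```

---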
`serde_json::from_reader` is considerably slow for some reason. serde-rs/json#160

---
See serde-rs/json#160 (comment). As the comment by @dtolnay is 2.5 years old, I re-did some measurements. It seems not much has changed.

PROJECT 1:
- 692 ms -> str
- 722 ms -> buffered reader
- 2,120 s -> bare reader

PROJECT 2 (servo):
- 4.230 s -> str
- 9.885 s -> buffered reader (using std::io::BufReader)
- 5m14.607 s -> bare reader

---
This allows us to switch back to `serde_json::from_slice` instead of `serde_json::from_reader`, because the latter is significantly slower. See serde-rs/json#160.

---
I need to parse a 2.4GB JSON file and found the following to be the fastest:

```rust
let file = File::open(options.input)?;
// MmapOptions is from the memmap crate; the mapping is unsafe because
// the underlying file could change while the map is live.
let mmap = unsafe { MmapOptions::new().map(&file)? };
let mut deserializer = serde_json::Deserializer::from_slice(&mmap);
for_each(&mut deserializer, |obj: MyObject| todo!())?;
```

My JSON file is a giant array of huge objects. AFAIK, this is the only way to parse a huge file in a single pass without copying the content to memory.

Implementation of `for_each`: it is adapted from the [stream-array example](https://serde.rs/stream-array.html). It is very generic, so feel free to include it somewhere if it is useful.

```rust
use std::fmt;
use std::marker::PhantomData;

use serde::de::{Deserialize, Deserializer, SeqAccess, Visitor};

fn for_each<'de, D, T, F>(deserializer: D, f: F) -> Result<(), D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
    F: FnMut(T),
{
    struct SeqVisitor<T, F>(F, PhantomData<T>);

    impl<'de, T, F> Visitor<'de> for SeqVisitor<T, F>
    where
        T: Deserialize<'de>,
        F: FnMut(T),
    {
        type Value = ();

        fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
            formatter.write_str("a nonempty sequence")
        }

        fn visit_seq<A>(mut self, mut seq: A) -> Result<(), A::Error>
        where
            A: SeqAccess<'de>,
        {
            while let Some(value) = seq.next_element::<T>()? {
                self.0(value)
            }
            Ok(())
        }
    }

    let visitor = SeqVisitor(f, PhantomData);
    deserializer.deserialize_seq(visitor)
}
```

---
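A small usage example of the `for_each` helper above, assuming serde with the derive feature; the `Record` type and inline JSON are illustrative, not from the original comment:

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Record {
    id: u64,
    name: String,
}

fn main() -> Result<(), serde_json::Error> {
    let json = r#"[{"id":1,"name":"a"},{"id":2,"name":"b"}]"#;
    let mut de = serde_json::Deserializer::from_str(json);
    // Each Record is handed to the closure as soon as it is parsed;
    // the array is never collected into a Vec.
    for_each(&mut de, |r: Record| println!("{:?}", r))
}
```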