
Parsing 20MB file using from_reader is slow #160

Open
dtolnay opened this issue Oct 12, 2016 · 10 comments

@dtolnay
Member

dtolnay commented Oct 12, 2016

It takes 750 ms to deserialize these types while json.load in Python takes 300 ms.

Reported by @mitsuhiko in IRC.

@mitsuhiko

These are some smaller files from Sentry that should show the same behavior. The files I'm working with are about four to six times the size, but unfortunately I cannot publicly share them.

maps.zip

@dtolnay
Member Author

dtolnay commented Oct 13, 2016

I took the larger of the two files in the zip (vendor.js.map) and extended the "mappings" and "sourcesContent" to be each four copies of themselves. The result is 23 MiB. vendor2.zip

Parsing unbuffered directly from the file takes 4710ms. We expect this to be slow.

let f = File::open(path).unwrap();
serde_json::from_reader(f).unwrap()

Parsing from a buffered reader takes 562ms. I assume this is what @mitsuhiko was running.

let br = BufReader::new(File::open(path).unwrap());
serde_json::from_reader(br).unwrap()

Parsing from a string (including reading the file to a string!) takes 55ms. This is the case that I optimized a while back.

let mut s = String::new();
File::open(path).unwrap().read_to_string(&mut s).unwrap();
serde_json::from_str(&s).unwrap()

Parsing from a vec is the same at 55ms.

let mut bytes = Vec::new();
File::open(path).unwrap().read_to_end(&mut bytes).unwrap();
serde_json::from_slice(&bytes).unwrap()

Note that in all of these cases parsing to RawSourceMap vs parsing to serde_json::Value takes exactly the same time because the JSON is dominated by large strings.

Parsing in Python takes 248ms in Python 2.7.12 and 186ms in Python 3.5.2. Both Pythons are reading the file into memory as a string first. The read happens here. So Python is doing a slower version of what Rust is doing in 55ms.

with open(path) as f:
    json.load(f)

I also tried the other json crate for good measure which takes 77ms (still impressive compared to Python).

let mut s = String::new();
File::open(path).unwrap().read_to_string(&mut s).unwrap();
json::parse(&s).unwrap()

And of course I tried RapidJSON, which may be the fastest C/C++ JSON parser. Don't mind the nasty but actually really fast reading of the file into a std::string; it only takes 6ms. Parsing takes 110ms with clang++ 3.8.0 at -O3 and 67ms with g++ 5.4.0 at -O3.

std::ifstream in(path, std::ios::in | std::ios::binary);
std::string s;
in.seekg(0, std::ios::end);
s.resize(in.tellg());
in.seekg(0, std::ios::beg);
in.read(&s[0], s.size());
in.close();

rapidjson::Document d;
d.Parse(s.c_str());

Conclusion

  • The core code is fast as evidenced by being 2x faster than RapidJSON+LLVM.
  • We comfortably beat json.load in Python by 3x when making a fair comparison.
  • I know we can do better without resorting to reading into memory. My previous optimizations were not focused on this case. The next thing I would like to try is specializing for readers that implement std::io::Seek, which both File and BufReader do.

@dtolnay dtolnay changed the title Parsing 20 MB RawSourceMap is slow Parsing 20MB file using from_reader is slow Oct 13, 2016
@dtolnay dtolnay self-assigned this Oct 13, 2016
@dtolnay
Member Author

dtolnay commented Oct 13, 2016

For those wondering, bincode takes 14ms.

let mut br = BufReader::new(File::open(path).unwrap());
let lim = bincode::SizeLimit::Infinite;
bincode::serde::deserialize_from(&mut br, lim).unwrap()

@dtolnay
Member Author

dtolnay commented Oct 13, 2016

Comments from @erickt in IRC:

One idea I had is we could change serde_json to do its own buffering
I did that for a separate project implementing the google snappy compression algorithm and it turned out to be quite fast
Since we would mainly be reading directly from a slice instead of a reader
The problem is that I had to write a state machine to handle things like lists or strings that were across the slice boundaries

@dimfeld

dimfeld commented Nov 5, 2016

Depending on your timeline for improving the speed of from_reader, what do you think about mentioning this in the docs as a first step?

I just encountered this problem with a 45MB JSON file that was taking about 25 seconds to load using from_reader. Since I'm new to Rust I didn't think to use a BufReader at first, and that brought the time down to about 1.5 seconds. But as you mentioned here, reading it all into memory in advance and using from_slice was faster still, at 350ms or so.

I don't think the BufReader technique necessarily needs to be documented here since that's not specific to this crate and is more of a newbie thing, but the vast speed difference between from_slice and from_reader seems worth mentioning if it's not going to change soon. Any thoughts?

edit: if you agree this is a good idea, I'll be glad to submit a PR.

@mitsuhiko

The problem with BufRead is that Rust does not support a way to tell a Read apart from a BufRead in a generic interface. Ideally serde could auto-wrap in a BufReader if only a Read is supplied :(

@oli-obk
Member

oli-obk commented Nov 5, 2016

We could use specialization for Seek and BufReader. With Seek we can detect the size and then choose between slice, buffered, or plain read processing.

@bouk
Contributor

bouk commented Nov 29, 2017

I'm taking a stab at implementing the BufRead path using specialization, which would make it nightly-only for now, although I guess we could add a from_bufread. I think with fill_buf we could do without any copying in the (presumably) common case where a string fits in the buffer of the BufReader.

I'll create a PR for discussion when I have something to show.

@bouk
Contributor

bouk commented Dec 7, 2017

All right I have an absolutely terrible but working PoC. It required a lot of 'open-heart surgery' on the project to make all the lifetimes and stuff work (you can't return a buffer from a BufRead) but you can look at the result here: https://github.com/bouk/json/tree/buf-read-keys (I think a rewrite of the whole read.rs file would be the most prudent course of action). Again, it's a PoC, the code is 💩.

Anyways, for the result: with this script:

extern crate serde_json;

use std::fs::File;
use std::io::Read;
use std::io::BufReader;

fn main() {
  let br = BufReader::new(File::open("vendor2.json").unwrap());
  let _: serde_json::Value = serde_json::from_reader(br).unwrap();
}

I get 450ms parse time on the current master, but on my branch it's brought down to ~100ms with the buffer optimizations. So, a 4-5x speed up is what we can expect here. Like I mentioned before, a lot of assumptions need to be rethought, like the Reference enum which I couldn't get working properly and which doesn't even lead to improvements in the default json Value parser, as borrowed strings aren't used (but they could be useful for other types I guess).

So, to conclude: definitely possible and worthwhile, look at my untested and broken implementation for inspiration, but there is more work required.

EDIT: OK I take it back, it's slightly nicer now. Not much, but some

messense added a commit to messense/Rocket that referenced this issue Jan 22, 2018
`serde_json::from_reader` is considerably slow for some reason.

serde-rs/json#160
estebank added a commit to estebank/rust-postgres that referenced this issue Jan 15, 2019
est31 added a commit to est31/warnalyzer that referenced this issue May 30, 2019
See serde-rs/json#160 (comment)

As the comment by @dtolnay is 2.5 years old, I re-did some
measurements. Seems nothing much has changed.

PROJECT 1:

692 ms -> str
722 ms -> buffered reader
2.120 s -> bare reader

PROJECT 2 (servo):

4.230s -> str
9.885s -> buffered reader (using std::io::BufReader)
5m14.607s -> bare reader
jplatte added a commit to ruma/ruma that referenced this issue Apr 13, 2021
This allows us to switch back to serde_json::from_slice instead of
serde_json::from_reader, because the latter is significantly slower.

See serde-rs/json#160
@recmo

recmo commented May 14, 2021

I need to parse a 2.4GB JSON file and found the following to be the fastest:

let file = File::open(options.input)?;
// Safety: the underlying file must not be modified while the map is alive.
let mmap = unsafe { MmapOptions::new().map(&file)? }; // memmap2::MmapOptions
let mut deserializer = serde_json::Deserializer::from_slice(&mmap);
for_each(&mut deserializer, |obj: MyObject| todo!())?;

My JSON file is a giant array of huge objects. The deserializer is used with for_each, a custom Visitor that handles the array items streaming one at a time. Each item gets deserialized into objects with zero-copy &str references.

AFAIK, this is the only way to parse a huge file in a single pass without copying the content to memory.

The implementation of `for_each` is adapted from the [stream-array example](https://serde.rs/stream-array.html). It is very generic, so feel free to include it somewhere if it is useful.
use std::fmt;
use std::marker::PhantomData;

use serde::de::{Deserialize, Deserializer, SeqAccess, Visitor};

fn for_each<'de, D, T, F>(deserializer: D, f: F) -> Result<(), D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
    F: FnMut(T),
{
    struct SeqVisitor<T, F>(F, PhantomData<T>);

    impl<'de, T, F> Visitor<'de> for SeqVisitor<T, F>
    where
        T: Deserialize<'de>,
        F: FnMut(T),
    {
        type Value = ();

        fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
            formatter.write_str("a nonempty sequence")
        }

        fn visit_seq<A>(mut self, mut seq: A) -> Result<(), A::Error>
        where
            A: SeqAccess<'de>,
        {
            while let Some(value) = seq.next_element::<T>()? {
                self.0(value)
            }
            Ok(())
        }
    }
    let visitor = SeqVisitor(f, PhantomData);
    deserializer.deserialize_seq(visitor)
}
