Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reverse_complement spends most of its time reading input #43

Closed
mbrubeck opened this issue Feb 24, 2017 · 5 comments
Closed

reverse_complement spends most of its time reading input #43

mbrubeck opened this issue Feb 24, 2017 · 5 comments

Comments

@mbrubeck
Copy link
Contributor

mbrubeck commented Feb 24, 2017

On my system, reverse_complement < input25000000.txt spends 67% of its time on the read_to_end call here. According to the profile, that time is almost all spent in memmove.

    let mut data = Vec::with_capacity(1024 * 1024);
    stdin.lock().read_to_end(&mut data).unwrap();

Increasing the initial buffer size to fit the entire dataset cuts the read_to_end time in half, and reduces the total execution time by about 30%, but obviously it also wastes memory if the data is small. (Since this program reads from stdin, it doesn't known the total size before it starts reading.)

@mbrubeck
Copy link
Contributor Author

The C++ code uses ftell to get the input size before allocating a buffer:

      long start = ftell( stdin );
      fseek( stdin, 0, SEEK_END );
      size = ftell( stdin ) - start;

mbrubeck added a commit to mbrubeck/benchmarksgame-rs that referenced this issue Feb 24, 2017
mbrubeck added a commit to mbrubeck/benchmarksgame-rs that referenced this issue Feb 24, 2017
mbrubeck added a commit to mbrubeck/benchmarksgame-rs that referenced this issue Feb 24, 2017
@llogiq
Copy link
Contributor

llogiq commented Feb 24, 2017

The C version uses a buffer that is realloc'd to double size on overflow. Perhaps we could have something similar in a crate? Reading arbitrary-sized input is a common use case after all.

@TeXitoi
Copy link
Owner

TeXitoi commented Feb 24, 2017

@llogiq Vec already does that automatically, no?

@mbrubeck
Copy link
Contributor Author

mbrubeck commented Feb 24, 2017

To avoid wasting memory, read_to_end does not double the buffer; instead it grows it by up to DEFAULT_BUF_SIZE (8 KB) at a time. When reading 250 MB of data, that's a lot of reallocations!

https://github.com/rust-lang/rust/blob/08230775a026c955873ba557e624b7f665661f37/src/libstd/io/mod.rs#L351-L354

Also, with a 250 MB buffer, even a single reallocation can be a huge cost. I had to tack an extra byte onto the buffer size to avoid a single reallocation at the end of the file.

@llogiq
Copy link
Contributor

llogiq commented Feb 24, 2017

@TeXitoi also depending on alignment and allocator used, RawVec may create just another allocation and move the contents there (interesting realloc implementation)..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants