
simplify shootout-reverse-complement.rs #18357


Merged: 2 commits into rust-lang:master on Oct 30, 2014

Conversation

@TeXitoi (Contributor) commented Oct 26, 2014

Simpler, safer, and shorter, in the same spirit as the current version, and with the same performance.

@mahkoh please review; I don't think I changed anything performance-related.
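
For context, the program under discussion is the Benchmarks Game reverse-complement task: read FASTA input on stdin and print the reverse complement of each sequence. The following is only a rough sketch of that core transform in present-day Rust, not the PR's code (the real program also handles FASTA headers and output line wrapping):

// Rough sketch of the benchmark's core transform (not the PR's code).
fn complement_table() -> [u8; 256] {
    // Identity by default so bytes like '\n' pass through unchanged.
    let mut tbl = [0u8; 256];
    for i in 0..256 {
        tbl[i] = i as u8;
    }
    // IUPAC nucleotide codes and their complements, upper- and lower-case.
    for (&from, &to) in b"ACGTUMRWSYKVHDBN".iter().zip(b"TGCAAKYWSRMBDHVN".iter()) {
        tbl[from as usize] = to;
        tbl[from.to_ascii_lowercase() as usize] = to;
    }
    tbl
}

fn reverse_complement_in_place(seq: &mut [u8], tbl: &[u8; 256]) {
    seq.reverse();
    for b in seq.iter_mut() {
        *b = tbl[*b as usize];
    }
}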

Review thread on this change:

const CHUNK: uint = 64 * 1024;

-let mut vec = Vec::with_capacity(1024 * 1024);
+let mut vec = Vec::with_capacity(CHUNK);
Contributor:

This has to be 1MB for the code below to work.

Contributor Author:

I've rewritten this function to be as close as possible to Reader::read_to_end(). This value will just add exactly one more allocation, or is there something I didn't understand?

Contributor:

One more allocation can be very significant with jemalloc. But in the case of 64k -> 1M it looks insignificant.

Contributor:

64k -> 1M isn't a huge reallocation, so the mremap issue isn't relevant.
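
As a reference point for the discussion above, here is a minimal sketch of the chunked-read idea in present-day std::io terms (the PR itself targets the pre-1.0 Reader trait, so the names below are illustrative, not the PR's code): the buffer starts at CHUNK capacity and grows as data arrives, so Vec's growth policy amortizes the reallocations instead of relying on a fixed 1MB guess.

use std::io::{self, Read};

const CHUNK: usize = 64 * 1024;

// Grow the buffer chunk by chunk until EOF; Vec doubles its allocation as
// needed, so no fixed up-front size is required.
fn read_to_end_chunked<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut vec = Vec::with_capacity(CHUNK);
    loop {
        let start = vec.len();
        vec.resize(start + CHUNK, 0);          // make room for one more chunk
        let n = r.read(&mut vec[start..])?;    // read into the new tail
        vec.truncate(start + n);               // keep only what was filled
        if n == 0 {
            return Ok(vec);                    // EOF
        }
    }
}

On a file or pipe this behaves like a plain read-to-end, just with an explicit chunk size bounding each read call.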

@mahkoh (Contributor) commented Oct 26, 2014

What do your benchmarks say?

@TeXitoi (Contributor, Author) commented Oct 27, 2014

I can't see a noticeable difference in speed between the two (i.e., over several runs, the timing range is the same).

@mahkoh (Contributor) commented Oct 27, 2014

Performance looks good here too.

@alexcrichton (Member) commented

Nice work @TeXitoi, thanks!

@mahkoh (Contributor) commented Oct 27, 2014

Have you compared the performance with the C and C++ solutions?

@TeXitoi (Contributor, Author) commented Oct 27, 2014

Rust:

real    0m1.069s
user    0m0.760s
sys     0m0.592s

C #2:

real    0m1.185s
user    0m1.412s
sys     0m0.452s

C++ #4:

real    0m1.252s
user    0m1.424s
sys     0m0.520s

@TeXitoi (Contributor, Author) commented Oct 27, 2014

Rust with Reader::read_to_end():

real    0m1.176s
user    0m0.792s
sys     0m0.676s

@mahkoh (Contributor) commented Oct 27, 2014

You're not using Linux, are you?

@TeXitoi (Contributor, Author) commented Oct 27, 2014

I am:

texitoi@vaio:~/dev/rust$ cat /proc/cpuinfo | grep 'model name'
model name  : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
model name  : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
model name  : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
model name  : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
texitoi@vaio:~/dev/rust$ uname -a
Linux vaio 3.16-2-amd64 #1 SMP Debian 3.16.3-2 (2014-09-20) x86_64 GNU/Linux
texitoi@vaio:~/dev/rust$ 

@TeXitoi (Contributor, Author) commented Oct 27, 2014

Compiled with:

rustc -C lto -C target-cpu=core2 --opt-level 3 shootout-reverse-complement.rs -o rc

@mahkoh (Contributor) commented Oct 27, 2014

How are you benchmarking and what input are you using?

@TeXitoi (Contributor, Author) commented Oct 27, 2014

The input is generated with /tmp/fasta 100000000 > /tmp/fasta-out.txt.

I benchmark by running time /tmp/rc < /tmp/fasta-out.txt > /dev/null about 10 times and taking the best result.

@mahkoh (Contributor) commented Oct 27, 2014

Please use the 25M fasta output used in the official benchmark.

@TeXitoi (Contributor, Author) commented Oct 27, 2014

Input: /tmp/fasta 25000000 > /tmp/fasta-out.txt

Rust in this PR:

real    0m0.207s
user    0m0.164s
sys     0m0.112s

Rust + Reader::read_to_end():

real    0m0.299s
user    0m0.184s
sys     0m0.184s

C #2:

real    0m0.297s
user    0m0.356s
sys     0m0.116s

C++ #4:

real    0m0.323s
user    0m0.384s
sys     0m0.112s

Review thread on this excerpt from the new code:

/// Reads all remaining bytes from the stream.
fn read_to_end<R: Reader>(r: &mut R) -> IoResult<Vec<u8>> {
    // FIXME: this method is a temporary workaround of a slowness
Contributor:

It's not a performance bug in jemalloc or Rust, and it's not a performance bug in the Linux kernel. I've proposed a new feature for mremap in the Linux kernel (MREMAP_RETAIN) to allow jemalloc to take advantage of it, but there is no guarantee of that landing. I don't think you should include a FIXME for something that's not a bug and has no guarantee of ever being fixed.

You're free to use mmap, mremap and munmap directly, which would be faster than working around the fact that you're doing massive copies by using a very large growth multiple. It's not possible for jemalloc to do this because mremap causes virtual memory fragmentation by unmapping the source.

Contributor:

The glibc allocator doesn't attempt to eliminate virtual memory fragmentation, so it's able to use mremap as it exists today. However, it's significantly slower than just using mmap, mremap and munmap directly anyway.

Contributor Author:

Then you suggest that I just remove the comment? Using mremap directly would be Linux-only, no?

Contributor:

The performance of huge reallocations on Linux is already better than on other platforms. If MREMAP_RETAIN does get accepted upstream, then huge reallocations will be blazing fast on new Linux kernels with jemalloc, but that won't impact the performance on other platforms.

If you want code that's special-cased for Linux, you can already do that by calling mmap, mremap and munmap. It will be faster than what you're doing here because it will eliminate the huge copies rather than just making them less frequent.
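
For the curious, here is a rough sketch of that Linux-only route, assuming the external libc crate (the helper names are hypothetical, nothing from this PR or the standard library): an anonymous mapping is created with mmap and grown with mremap(MREMAP_MAYMOVE), so the kernel moves page tables instead of copying the buffer's contents.

// Hypothetical Linux-only helpers (assumes the external `libc` crate); they
// only illustrate the mmap/mremap approach described above.
#[cfg(target_os = "linux")]
unsafe fn new_mapping(len: usize) -> *mut u8 {
    let p = libc::mmap(
        std::ptr::null_mut(),
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert!(p != libc::MAP_FAILED, "mmap failed");
    p as *mut u8
}

#[cfg(target_os = "linux")]
unsafe fn grow_mapping(ptr: *mut u8, old_len: usize, new_len: usize) -> *mut u8 {
    // MREMAP_MAYMOVE lets the kernel relocate the mapping without copying its
    // contents, which is what avoids the huge memcpy on every reallocation.
    let p = libc::mremap(ptr as *mut libc::c_void, old_len, new_len, libc::MREMAP_MAYMOVE);
    assert!(p != libc::MAP_FAILED, "mremap failed");
    p as *mut u8
}

#[cfg(target_os = "linux")]
unsafe fn free_mapping(ptr: *mut u8, len: usize) {
    let _ = libc::munmap(ptr as *mut libc::c_void, len);
}

A real program would still track its own length and capacity on top of these calls; this only shows which syscalls are involved.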

@mahkoh (Contributor) commented Oct 27, 2014

It looks like read_to_end is significantly faster on your machine than on mine. I'm seeing 0.36 vs 0.52 here.

@TeXitoi (Contributor, Author) commented Oct 27, 2014

@mahkoh are you compiling with -C lto -C target-cpu=core2 --opt-level 3? I suspect LTO is useful with Reader.

@mahkoh (Contributor) commented Oct 27, 2014

LTO gives me no significant improvement.

@TeXitoi force-pushed the simplify-reverse-complement branch from bf16d62 to 7017fb0 on October 28, 2014 at 21:14
@TeXitoi (Contributor, Author) commented Oct 28, 2014

@thestinger I've rephrased the comment. OK?

bors added a commit that referenced this pull request Oct 29, 2014
@bors closed this on Oct 30, 2014
@bors merged commit 7017fb0 into rust-lang:master on Oct 30, 2014