Consider disabling compression for rlibs and bytecode files #37086
Maybe we should switch to a faster compression algorithm - e.g. lz4?
Something that could be considered is an algorithm that is seekable, so that we also could decompress on-demand (as we use absolute/relative positioning in the metadata blob).
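Deflate itself isn't seekable, but the effect can be approximated by compressing fixed-size chunks independently and keeping an index, so only the chunk covering a requested offset needs decompressing. A minimal sketch of that idea, using the `flate2` crate purely for illustration (none of this is rustc's actual code; the names are made up):

```rust
use std::io::{Read, Write};

use flate2::read::ZlibDecoder;
use flate2::write::ZlibEncoder;
use flate2::Compression;

const CHUNK: usize = 64 * 1024; // uncompressed bytes per independently-compressed chunk

/// Chunks compressed independently; chunk i covers bytes [i*CHUNK, (i+1)*CHUNK).
struct SeekableBlob {
    chunks: Vec<Vec<u8>>, // each entry is a self-contained zlib stream
}

fn compress_seekable(data: &[u8]) -> SeekableBlob {
    let chunks = data
        .chunks(CHUNK)
        .map(|c| {
            let mut enc = ZlibEncoder::new(Vec::new(), Compression::fast());
            enc.write_all(c).unwrap();
            enc.finish().unwrap()
        })
        .collect();
    SeekableBlob { chunks }
}

/// Decompress only the chunk containing `pos`; the rest of the blob stays compressed.
fn chunk_at(blob: &SeekableBlob, pos: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(CHUNK);
    ZlibDecoder::new(&blob.chunks[pos / CHUNK][..])
        .read_to_end(&mut out)
        .unwrap();
    out
}
```

The cost is a slightly worse compression ratio, since each chunk starts with an empty dictionary.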
How much larger do the rlibs become, though? 30%?
For reference, from #6902 (comment):

> To be honest, I don't think that the speed gains listed above warrant turning compression off. How hard would it be to compress things on a background thread? Compression could be done in parallel with codegen, maybe?
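Overlapping compression with codegen is mostly a matter of handing the finished bytecode buffer to a worker thread and joining only when the archive is actually written. A rough sketch of that shape (a hypothetical flow, not the compiler's real pipeline), again using `flate2`:

```rust
use std::io::Write;
use std::thread;

use flate2::write::ZlibEncoder;
use flate2::Compression;

fn main() {
    // Stand-in for a module's emitted LLVM bytecode.
    let bytecode: Vec<u8> = vec![0x42; 1 << 20];

    // Start compressing immediately; the main thread is free to keep working.
    let worker = thread::spawn(move || {
        let mut enc = ZlibEncoder::new(Vec::new(), Compression::default());
        enc.write_all(&bytecode).unwrap();
        enc.finish().unwrap()
    });

    // ... codegen for the remaining modules would overlap with the worker here ...

    // Block only at the point where the compressed bytes must go into the rlib.
    let compressed = worker.join().unwrap();
    println!("compressed 1 MiB of bytecode down to {} bytes", compressed.len());
}
```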
rlibs are already uncompressed except bitcode. Metadata in dylibs is compressed. I think we turned it off in rlibs not because of the one-time compression cost, but because of decompression in each user crate (it also means no mmap is possible).
As @eddyb mentioned, rlib metadata isn't compressed, as it's intended to be mmap'd. So that makes me curious: where is this compression being called from? A standard compilation should not take a look at the bytecode (e.g. it shouldn't need to decompress it); it's only there for LTO builds. Or that's at least the state of the world as of a few years ago when I implemented LTO... @nnethercote are you sure that this function is mostly being called from decompression of metadata and/or bytecode? I could imagine this showing up with compressing bytecode but not during a normal compile...
Oh ok, I think I misread. Then yes, I think that this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs). We can likely stomach larger rlibs (the size rarely comes up), but to truly fix this we in theory want to disable bytecode-in-rlib entirely. It's only used for LTO, which is almost never used, so we should arguably require a flag to opt in to LTO-able rlibs, which Cargo passes by default if need be.
As the first comment says, compression occurs in two places. More specifically, here:

Both of them are significant, though how significant varies by benchmark. E.g. the former affects

leb128 encoding is also hot. E.g. see #37083 where a tiny improvement to
Any suggestions on how to move forward here? The speed-ups are large, but the disk space trade-off is such that it will be easy for inertia to win here, and I'd like to avoid that. I can take file measurements if someone suggests good examples to measure.
I suppose my personal ideal world would look a little like:
That way the Rust installation stays the same size, yet everyone would get the benefits of not having to compress bytecode. Cargo build directories in general don't matter as much for size as the Rust installation itself, I've found at least.
I'd prefer to have a future-proof concept before doing anything here. Unless I'm overlooking something, there are only two crates where this makes a difference of more than a second, and that's for debug builds of big crates. In a release build, I would suspect that the difference is well under 1% of total build time. So I don't see an urgent need for action here. (Sorry, @nnethercote, I don't want to put a damper on your enthusiasm. It's great that you are looking into this, I just don't want to needlessly rush things.) Some questions that I'd like to have answered before proceeding:
A relatively easy change that might help would be to simply reduce the compression level. There are some flags that can be changed, see here, or here. Maybe there is a better tradeoff that could be found without making massive changes. Another thing that might be worth noting is that the `deflate` function, as it's written now, will make miniz allocate memory (I haven't counted the exact size, but it may be closer to 1 MB) for the compressor each time it's called, which may not be ideal if it's used a number of times in short succession.
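For illustration, both tweaks look something like this with the `flate2` crate (which wraps miniz): pick a cheaper level, and build one `Compress` state up front instead of rebuilding it per call. A sketch, not the compiler's actual call site:

```rust
use flate2::{Compress, Compression, FlushCompress};

fn main() {
    let data = b"some highly repetitive bytecode stand-in ".repeat(1000);

    // Levels trade time for size: Compression::fast() is level 1,
    // default() is 6, best() is 9.
    // Allocate the (roughly megabyte-sized) compressor state once, not per call.
    let mut compressor = Compress::new(Compression::fast(), true);

    for _ in 0..3 {
        // compress_vec writes into the Vec's spare capacity, so reserve enough
        // for the worst case (deflate can slightly expand incompressible input).
        let mut out = Vec::with_capacity(data.len() + data.len() / 10 + 64);
        compressor
            .compress_vec(&data, &mut out, FlushCompress::Finish)
            .unwrap();
        println!("{} -> {} bytes", data.len(), out.len());
        compressor.reset(); // reuse the same state for the next independent stream
    }
}
```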
I don't think we want to make LTO opt-in; Rust should support it by default, without needing to rebuild the world with it.
Not sure how much it helps, but deflate can be parallelized.
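One straightforward way: deflate each chunk of the input on its own thread and keep the chunk boundaries so a reader can reassemble. Note the result is a framed sequence of independent zlib streams rather than one standard stream (pigz does extra work to emit a single stream), and a real implementation would use a thread pool rather than a thread per chunk. A sketch using `flate2` and today's `std::thread::scope`:

```rust
use std::io::Write;
use std::thread;

use flate2::write::ZlibEncoder;
use flate2::Compression;

/// Deflate each chunk on its own thread. The caller must keep the chunk
/// boundaries (e.g. length-prefix each piece) to be able to decompress.
fn parallel_deflate(data: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut enc = ZlibEncoder::new(Vec::new(), Compression::default());
                    enc.write_all(chunk).unwrap();
                    enc.finish().unwrap()
                })
            })
            .collect();
        // Join in order; each entry is an independent zlib stream.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```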
I disagree that LTO is almost never used. But, also, in the long run it should be used more. So, I'd rather have all libraries support LTO by default.
I personally am OK with rebuilding the world to switch from non-LTO to LTO builds, as long as it doesn't require every library to opt into anything manually. Perhaps, when building an application, Cargo could inspect the configuration, and if any configuration enables LTO then it could build every library with LTO support; otherwise it wouldn't. I don't know if this would need to inspect just the application's Cargo.toml or if it would need to recursively inspect all the Cargo.tomls, but this seems like just a detail.
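For what it's worth, Cargo only honors `[profile.*]` sections from the top-level manifest: setting `lto = true` under `[profile.release]` in the application's Cargo.toml already applies to the entire dependency graph, and profile sections in dependencies are ignored. So the inspection wouldn't need to be recursive.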
Just to point out the obvious: using something different from deflate has already been suggested (e.g. lz4). How hard would it be to plug any of those in?
zstd is supposedly a faster, equal-compression-rate version of zlib. Someone should check that.
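A quick way to sanity-check that claim, using today's `flate2` and `zstd` crates (the file name is just a placeholder for any extracted bytecode blob):

```rust
use std::io::Write;
use std::time::Instant;

use flate2::write::ZlibEncoder;
use flate2::Compression;

fn main() {
    // Placeholder path: any bytecode blob to test against.
    let data = std::fs::read("extracted.bc").expect("test input");

    let t = Instant::now();
    let mut enc = ZlibEncoder::new(Vec::new(), Compression::default());
    enc.write_all(&data).unwrap();
    let deflated = enc.finish().unwrap();
    println!("deflate: {:>9} bytes in {:?}", deflated.len(), t.elapsed());

    let t = Instant::now();
    let zstded = zstd::encode_all(&data[..], 3).unwrap(); // 3 is zstd's default level
    println!("zstd:    {:>9} bytes in {:?}", zstded.len(), t.elapsed());
}
```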
@eddyb worked out a fix for the metadata side (see the `write_metadata` commit below). That still leaves the bytecode compression, which is the larger part of the potential speed-up. I tried disabling bytecode compression and the size of the rlibs for
Optimize `write_metadata`. `write_metadata` currently generates metadata unnecessarily in some cases, and also compresses it unnecessarily in some cases. This commit fixes that. It speeds up three of the rustc-benchmarks by 1-4%. r? @eddyb, who deserves much of the credit because he (a) identified the problem from the profile data I provided in rust-lang#37086, and (b) explained to me how to fix it. Thank you, @eddyb!
I did a small bit of analysis about compression algorithms and such. I extracted all bytecode from the standard distribution rlibs (e.g. everything we're shipping). I decompressed it and then recompressed it with a bunch of algorithms, keeping track of how long everything took as well as the compression ratios. The raw data is here, where each section is the statistics for one particular piece of bytecode. The final entry is the summation of everything previous. I tested:
Basically, what I think this tells me is that zstd is blazingly fast and gets better compression than deflate (what we're using today) at lower levels. Otherwise, xz is super slow (surprise surprise) and brotli also isn't faring too well on this data set. Now that being said, this seems like a tiny portion of compiles. The deflate times for any particular piece of bytecode are in the handfuls of milliseconds at most, it looks like. If we really want speed, though, zstd seems the way to go.
Thank you for the analysis, @alexcrichton. What parameters define the "fast", "default", and "best" modes for deflate?
For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...
Based on the tests done by @alexcrichton (maybe you could put up the test in a GitHub repo or something?), it seems that lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while deciding on (and possibly implementing) a change to a different compression algorithm.
Is that still true now that metadata is no longer compressed in rlibs?
I think a good first step towards improvement here would be to allow for LLVM bitcode to be either compressed or not. The way we store bitcode already contains a small header that tells us about the format, so this would be easy to implement in a backwards compatible way. With that implemented we can just forgo compression in some scenarios (like debug builds or incr. comp.) in a transparent way. Adding support for zstd would be nice too.
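To make the header idea concrete, here is one possible shape: a magic number, a wrapper version, and an encoding field saying whether (and how) the payload is compressed. This is a hypothetical layout for illustration, not rustc's actual bytecode wrapper format:

```rust
use std::convert::TryInto;

const MAGIC: &[u8; 4] = b"RBCW"; // hypothetical "Rust bytecode wrapper" magic

#[derive(Clone, Copy)]
enum Encoding {
    Uncompressed = 0,
    Deflate = 1,
    // room to grow, e.g. Zstd = 2
}

/// Prepend a 12-byte header: magic, wrapper version, payload encoding.
fn write_wrapper(payload: &[u8], encoding: Encoding) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + payload.len());
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&1u32.to_le_bytes()); // wrapper format version
    out.extend_from_slice(&(encoding as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns (encoding, payload), or None if the blob isn't in wrapper format.
fn read_wrapper(blob: &[u8]) -> Option<(u32, &[u8])> {
    if blob.len() < 12 || !blob.starts_with(MAGIC) {
        return None;
    }
    let _version = u32::from_le_bytes(blob[4..8].try_into().ok()?);
    let encoding = u32::from_le_bytes(blob[8..12].try_into().ok()?);
    Some((encoding, &blob[12..]))
}
```

An old compiler reading a new rlib can reject unknown encodings cleanly, and a new compiler can skip decompression entirely when the encoding field says `Uncompressed`.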
The fast/default/best correspond to what's in flate2. It's pretty janky, but the script I used is here. Agreed we should support uncompressed bytecode! I'd also be fine with it behind a flag that we disable by default and only enable for our own releases.
All said and done, my Visual Studio + Windows SDK setup is ~5GB. I wouldn't blink at the Rust installation doubling in size if it meant compilation was even slightly (a few percent) faster and/or if the toolchain was easier to maintain (i.e. if it didn't need to support any compression at all).
This is exactly what I've done in #37298. Thank you for the suggestion.
Use a faster `deflate` setting

In #37086 we have considered various ideas for reducing the cost of LLVM bytecode compression. This PR implements the simplest of these: use a faster `deflate` setting. It's very simple and reduces the compression time by almost half while increasing the size of the resulting rlibs by only about 2%.

I looked at using zstd, which might be able to halve the compression time again. But integrating zstd is beyond my Rust FFI integration abilities at the moment -- it consists of a few dozen C files, has a non-trivial build system, etc. I decided it was worth getting a big chunk of the possible improvement with minimum effort.

The following table shows the before and after percentages of instructions executed during compression while doing debug builds of some of the rustc-benchmarks with a stage1 compiler.

```
html5ever-2016-08-25      1.4% -> 0.7%
hyper.0.5.0               3.8% -> 2.4%
inflate-0.1.0             1.0% -> 0.5%
piston-image-0.10.3       2.9% -> 1.8%
regex.0.1.30              3.4% -> 2.1%
rust-encoding-0.3.0       4.8% -> 2.9%
syntex-0.42.2             2.9% -> 1.8%
syntex-0.42.2-incr-clean 14.2% -> 8.9%
```

The omitted ones spend 0% of their time in compression. And here are actual timings:

```
futures-rs-test  4.110s vs  4.102s --> 1.002x faster (variance: 1.017x, 1.004x)
helloworld       0.223s vs  0.226s --> 0.986x faster (variance: 1.012x, 1.022x)
html5ever-2016-  4.218s vs  4.186s --> 1.008x faster (variance: 1.008x, 1.010x)
hyper.0.5.0      4.746s vs  4.661s --> 1.018x faster (variance: 1.002x, 1.016x)
inflate-0.1.0    4.194s vs  4.143s --> 1.012x faster (variance: 1.007x, 1.006x)
issue-32062-equ  0.317s vs  0.316s --> 1.001x faster (variance: 1.013x, 1.005x)
issue-32278-big  1.811s vs  1.825s --> 0.992x faster (variance: 1.014x, 1.006x)
jld-day15-parse  1.412s vs  1.412s --> 1.001x faster (variance: 1.019x, 1.008x)
piston-image-0. 11.058s vs 10.977s --> 1.007x faster (variance: 1.008x, 1.039x)
reddit-stress    2.331s vs  2.342s --> 0.995x faster (variance: 1.019x, 1.006x)
regex.0.1.30     2.294s vs  2.276s --> 1.008x faster (variance: 1.007x, 1.007x)
rust-encoding-0  1.963s vs  1.924s --> 1.020x faster (variance: 1.009x, 1.006x)
syntex-0.42.2   29.667s vs 29.391s --> 1.009x faster (variance: 1.002x, 1.023x)
syntex-0.42.2-i 15.257s vs 14.148s --> 1.078x faster (variance: 1.018x, 1.008x)
```

r? @alexcrichton
One of the hottest functions in rustc is `tdefl_compress`, which is called from `deflate_bytes`. It's used in two places: crate metadata in rlibs, and LLVM bytecode files.

If we simply turned off compression in these two places we would get sizeable speed-ups. The following numbers are for a proof-of-concept patch, doing debug builds with a stage1 compiler. `regex` and `syntex-incr` are the biggest wins.

The obvious downside is that the size of the relevant files is larger. So we need to decide if we are happy with this trade-off, accepting larger files for faster compilation.