Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop some unsafes - the compiler now optimizes equivalent safe code #43

Merged
merged 18 commits into from
Jun 26, 2018

Conversation

Shnatsel
Copy link
Contributor

This PR drops some unsafe code that was introduced in place of safe code as an optimization. It is no longer needed: on Rust 1.27 (current stable) performance degradation from this change is within measurement noise even when taking 10x the normal number of samples using Criterion.

cargo bench has 2% variance by itself, so it couldn't measure anything below that, so I had to switch to Criterion and then jack up the number of samples to make sure there is no degradation at all. You can find the benchmarking setup here. It's a bit messy, but I can clean it up and open a PR if you're interested.

…surement noise even when taking 10x the normal number of samples using Criterion.
@Shnatsel
Copy link
Contributor Author

Travis check failed because rustup is currently broken for nightly, it fails to download the compiler. I have this on my local machine too.

…ith potentially uninitialized contents. I have tried and failed to actually exploit this to get an information leak.
src/lib.rs Outdated
}

for i in self.pos as usize..pos_end as usize {
self.buffer[i] = self.buffer[i + forward as usize]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation here is still too to the left. Also, the semicolon missing.

Copy link
Contributor Author

@Shnatsel Shnatsel Jun 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, fixed in the very next commit. Didn't want to put unrelated things to the same commit. Nevermind, it's a different line that looked similar. Thanks for the catch, and thanks for the review!

src/lib.rs Outdated
if self.pos < dist && pos_end > self.pos {
return Err("invalid run length in stream".to_owned());
}

if self.buffer.len() < pos_end as usize {
unsafe {
self.buffer.set_len(pos_end as usize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if push is efficient enough, compared to assigning.
The capacity should be fixed, maybe just needs an assert!(pos_end as usize <= self.buffer.capacity()) to hint to LLVM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, replacing if self.buffer.len() < pos_end as usize {...} (and the original loop) with:

        for i in self.pos as usize..self.buffer.len().min(pos_end as usize) {
            self.buffer[i] = self.buffer[i - dist as usize];
        }
        assert!(pos_end as usize <= self.buffer.capacity());
        while self.buffer.len() < pos_end as usize {
            let x = self.buffer[self.buffer.len() - dist as usize];
            self.buffer.push(x);
        }

One interesting question would be whether we are even validating that dist != 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still iterating on this part, so I didn't include it in this PR. I will try this and report the performance impact. Thanks!

I'm also going to try extend_from_slice just to see how that works. Probably poorly because the slice belongs to the same buffer, and since the vector might be reallocated the slice could be invalidated, so I'd have to clone it. Now if I had a fixed-size vector that would be guaranteed not to reallocate that might have been faster than push().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extend_from_slice is not really relevant, because this is RLE, not just "copy from the past", so if dist > len, you start repeating by reading bytes you've just written.
Also note that length of the vector is only increased once, then the sliding window stays constantly sized, so the push loop is less performance-sensitive.

You'd need a different loop structure with something like memcpy inside, and skipping dist-long runs, if you wanted to actually take advantage of distances larger than a few bytes, but I have my suspicions that it will be significantly faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, the code you've suggested incurs a 9-11% performance penalty on decompressing entire files, depending on the file. I have tweaked it a bit and got it to incur 8-10% penalty, here's the code:

        let upper_bound = self.buffer.len().min(pos_end as usize);
        for i in self.pos as usize..upper_bound {
            self.buffer[i] = self.buffer[i - dist as usize];
        }
        assert!(pos_end as usize <= self.buffer.capacity());
        let initial_buffer_len = self.buffer.len();
        for i in initial_buffer_len..pos_end as usize {
            let x = self.buffer[i - dist as usize];
            self.buffer.push(x);
        }

Presence or absence of assert() has no effect (in this code, I haven't tested the variant with while without assert).

I also got 10% performance overhead simply by replacing unsafe {self.buffer.set_len(pos_end as usize);} with self.buffer.resize(pos_end as usize, 0u8); in the original code, which results in a slightly more concise code than your solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for investigating! This is probably the single best response to me filing a security issue I've ever seen, and I didn't even have a proof of concept this time.

As for the commit message - function run_len_dist() when taken in isolation is vulnerable; the rest of the code just so happens to never call it with dist set to 0, so the crate as a whole is not vulnerable. I would prefer to note that in the commit history, and update only the "It is unclear..." part to reflect that the crate as a whole is not vulnerable. But I will drop the mention of the vulnerability if you insist.

Also, here's my memcpy-based prototype.

        if self.buffer.len() < pos_end as usize {
            self.buffer.resize(pos_end as usize, 0u8);
        }

        fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));
        fn fill_slice_with_subslice(slice: &mut[u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
            let (source, destination) = if dest_from >= source_from {slice.split_at_mut(dest_from)} else {slice.split_at_mut(source_from)};
            let source = &mut source[source_from..source_to];
            let destination = &mut destination[..dest_to-dest_from];
            for i in (0..( (destination.len()) / source.len() )).map(|x| x * source.len()) {
                destination[i..source.len()+i].copy_from_slice(&source);
            }
        }

It fails some tests and I have been trying to understand why to no avail, so I'm afraid I won't be able to complete it.

I'm afraid I'm not familiar with assembler or LLVM IR, so I will not be able to inspect it in any meaningful way. Sorry. I will benchmark your suggested changes with panic=abort on nightly and report the results.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I don't mean panic=abort, there's a "nightly" feature which changes what abort() does. Although panic=abort might itself cause further improvements, so there's probably a few combinations to test.

I wouldn't have done the copy_from_slice with a for loop, or at least not with division/multiplication - there's away to step ranges, although I'd also try a while loop that updates state instead.

But anyway, dist is not a multiple of len, you're missing something to handle len % dist (again, it should ideally be done without using %, unless you can find a way to use division that is a performance boost, but I doubt that's possible here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've conjured up a function based on copy_from_slice() that is more efficient than a per-element loop. Using while made it a lot more readable, so thanks for the tip!

        if self.buffer.len() < pos_end as usize {
            self.buffer.resize(pos_end as usize, 0u8);
        }
        fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));
        fn fill_slice_with_subslice(slice: &mut[u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
            let (source, destination) = slice.split_at_mut(dest_from); //TODO: allow destination to be lower than source
            let source = &source[source_from..source_to];
            let destination = &mut destination[..(dest_to - dest_from)];
            let mut offset = 0;
            while offset + source.len() < destination.len() {
                destination[offset..source.len()+offset].copy_from_slice(&source);
                offset += source.len();
            }
            let remaining_chunk = destination.len()-offset;
            &mut destination[offset..].copy_from_slice(&source[..remaining_chunk]);
        }

It offsets some of the costs of safe memory initialization so that switching to safe initialization would only create 5% overhead instead of 10%. If we switch the loop above to something like this too, then we'd have an entirely safe crate with the same performance as before.

I have a branch with all changes from this PR plus the optimized loop: https://github.com/Shnatsel/inflate/tree/safe-with-optimized-loop

Sadly, this function still fails one test - the line with its invocation causes an interger overflow on test "issue_30_realworld", i.e. self.pos is actually less than dist. How did it work in a simple loop that did self.buffer[i - dist as usize] with i initialized to self.pos but doesn't work here is beyond me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this code could be much simpler if it was duplicated between forward and backward, just like the old code - in fact, it would be very similar to the old code, just doing more than one element at a time.

Also, is there a benefit to always using the same source subslice, or is it just as efficient / more efficient to always copy the last dist bytes? There may be some cache effects here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always copying the last dist bytes seems to be exactly as efficient as aways copying the same slice.

The code I ended up with looks like this:

        let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
        let source = &source[source.len() - dist as usize..];
        let mut offset = 0;
        while offset + source.len() < destination.len() {
            destination[offset..source.len()+offset].copy_from_slice(&source);
            offset += source.len();
        }
        let remaining_chunk = destination.len()-offset;
        &mut destination[offset..].copy_from_slice(&source[..remaining_chunk]);

Which is a bit more readable. This nets 3% to 7% performance improvement.

Surprisingly, putting the same code in the other copying loop in this function actually hurts performance by 1% on my samples.

I've also tried an iterator-based version, which is concise but as slow as copying byte-by-byte:

        let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
        let source = &source[source.len() - dist as usize..];
        for (d,s) in destination.chunks_mut(dist as usize).zip(source.chunks(dist as usize).cycle()) {
            let d_len = d.len(); // last chunk has a size lower than we've specified
            d.copy_from_slice(&s[..d_len]);
        }

However, I've realized that this function can return pretty much any garbage and tests won't fail. Since this creates regression potential, optimizing copying falls out of scope of this PR.

src/lib.rs Outdated
}

for i in self.pos as usize..pos_end as usize {
self.buffer[i] = self.buffer[i + forward as usize]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semicolon still missing.

@eddyb
Copy link
Member

eddyb commented Jun 25, 2018

Btw, there is still the use of the abort intrinsic, could that be removed perhaps?
Or does unwinding still affect benchmarks?

Also, I hope both #35 and this were tested with both the "nightly" feature and without. It's possible that if unwinding prevents optimizations, changes that don't affect benchmarks with unwinding actually do affect benchmarks without. cc @RazrFalcon

…re vulnerability in this function. This is not exploitable in practice, because run_len_dist() is not exposes in the public API and is never called with dist=0 from this crate, even given deliberately malformed inputs.
src/lib.rs Outdated
@@ -65,7 +65,7 @@
//! }
//! ```

#![cfg_attr(feature = "unstable", feature(core))]
#![cfg_attr(feature = "unstable", feature(core_intrinsics))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, for how long has this been broken?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: dunno.
The only relevant changelog entry I could find was "Declaration of lang items and intrinsics are now feature-gated by default." in 0.11.0 (2014-07-02), but that's probably about feature(core) not feature(core_intrinsics). I could not find anything about core_intrinsics on Rust bug tracker.

@Shnatsel
Copy link
Contributor Author

Current code with unsafe:

  • rustc 1.27 stable: baseline
  • rustc 1.28.0-nightly (60efbdead 2018-06-23): +3% regression
  • "unstable" feature, rustc 1.28.0-nightly (60efbdead 2018-06-23): +3% regression

Safe code with while suggested by @eddyb:

  • rustc 1.27 stable: +11% regression
  • #[inline(never)], rustc 1.27 stable: +12% regression
  • rustc 1.28.0-nightly (60efbdead 2018-06-23): +13% regression
  • #[inline(never)], rustc 1.28.0-nightly (60efbdead 2018-06-23): +13% regression
  • "unstable" feature, assert() rewritten as if .. {abort()} rustc 1.28.0-nightly (60efbdead 2018-06-23): +13% regression

I could not benchmark panic=abort because Criterion dependency pest_derive fails to build with "error: the linked panic runtime panic_unwind is not compiled with this crate's panic strategy abort", and 3% differences are not observable with simpler tools.

Apparently the abort intrinsic is no longer needed. Shall I drop it in this PR?

@eddyb
Copy link
Member

eddyb commented Jun 25, 2018

Apparently the abort intrinsic is no longer needed. Shall I drop it in this PR?

Sounds good!

@Shnatsel
Copy link
Contributor Author

Done. I hope removing a feature from Cargo.toml is not a breaking change?

src/lib.rs Outdated
@@ -404,7 +386,7 @@ impl CodeLengthReader {
self.result.push(0);
}
}
_ => abort(),
_ => panic!(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to use unreachable!() macro? It would provide more meaningful name.

Cargo.toml Outdated
@@ -9,7 +9,6 @@ keywords = ["deflate", "decompression", "compression", "piston"]

[features]
default = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed at all? I think all it did was disable unstable by default.

@eddyb
Copy link
Member

eddyb commented Jun 25, 2018

@Shnatsel I don't know about breaking change. cc @bvssvni

@bvssvni
Copy link
Contributor

bvssvni commented Jun 25, 2018

I think removing a feature is a breaking change. Bump it to 0.5.0.

@eddyb
Copy link
Member

eddyb commented Jun 25, 2018

@bvssvni IMO we should avoid breaking changes, to ensure that any improvements are used in the ecosystem. But YMMV.

@bvssvni
Copy link
Contributor

bvssvni commented Jun 25, 2018

@eddyb It's always better to release a new version than messing up existing projects.

@Shnatsel
Copy link
Contributor Author

I would like the safety improvements done so far not to warrant a version bump so that they would naturally propagate through the ecosystem. I'm on board with reverting the switch from abort() to assertions if that's what's needed to make it happen.

On the other hand, "unstable" feature was only available on nightly where breakage is kind of expected, and it was already broken to boot, so all people notice would be that it code with that feature enabled has actually started compiling.

IDK. Your call.

@bvssvni
Copy link
Contributor

bvssvni commented Jun 26, 2018

OK, if it was broken then we can bump it to 0.4.3.

@Shnatsel
Copy link
Contributor Author

I haven't managed to get rid of the unsafe block without losing performance, so instead I've covered it in asserts. Asserts actually improve performance by roughly 2%, depending on the workload.

I am actually in the market for an entirely safe inflate, perhaps even one with #![forbid(unsafe_code)], but coercing llvm into optimizing safe code to the same extent has proven to be beyond my abilities.

This PR is now as good as I can get it. Please conduct final review and merge it or request changes.

@bvssvni
Copy link
Contributor

bvssvni commented Jun 26, 2018

Merging.

@bvssvni bvssvni merged commit 728bd49 into image-rs:master Jun 26, 2018
@bvssvni
Copy link
Contributor

bvssvni commented Jun 26, 2018

@eddyb Do you want to test your access to publish? Or do you want me to do it?

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Jun 27, 2018

Thank you!

I'll probably work on png crate next. My fuzzers have found a panic on malformed input in it, I'll fix that to unblock further fuzzing. Plus it's got unsafes, perhaps I can do something about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants