Drop some unsafes - the compiler now optimizes equivalent safe code #43
Conversation
Travis check failed because …
…ith potentially uninitialized contents. I have tried and failed to actually exploit this to get an information leak.
src/lib.rs
Outdated
}

for i in self.pos as usize..pos_end as usize {
    self.buffer[i] = self.buffer[i + forward as usize]
Indentation here is still too far to the left. Also, the semicolon is missing.
Yup, fixed in the very next commit. I didn't want to put unrelated things in the same commit. Never mind, it's a different line that looked similar. Thanks for the catch, and thanks for the review!
src/lib.rs
Outdated
if self.pos < dist && pos_end > self.pos {
    return Err("invalid run length in stream".to_owned());
}

if self.buffer.len() < pos_end as usize {
    unsafe {
        self.buffer.set_len(pos_end as usize);
I wonder if push is efficient enough, compared to assigning. The capacity should be fixed; maybe it just needs an assert!(pos_end as usize <= self.buffer.capacity()) to hint to LLVM.
That is, replacing if self.buffer.len() < pos_end as usize {...} (and the original loop) with:
for i in self.pos as usize..self.buffer.len().min(pos_end as usize) {
    self.buffer[i] = self.buffer[i - dist as usize];
}
assert!(pos_end as usize <= self.buffer.capacity());
while self.buffer.len() < pos_end as usize {
    let x = self.buffer[self.buffer.len() - dist as usize];
    self.buffer.push(x);
}
One interesting question would be whether we are even validating that dist != 0.
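For illustration, a minimal sketch of what such a check might look like, mirroring the error style of the existing length check (the placement and message are assumptions, not code from this PR):
if dist == 0 {
    return Err("invalid distance of 0 in stream".to_owned());
}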
I am still iterating on this part, so I didn't include it in this PR. I will try this and report the performance impact. Thanks!
I'm also going to try extend_from_slice just to see how that works. Probably poorly, because the slice belongs to the same buffer, and since the vector might be reallocated the slice could be invalidated, so I'd have to clone it. If I had a fixed-size vector that was guaranteed not to reallocate, that might have been faster than push().
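For what it's worth, here is a sketch of that clone-based extend_from_slice idea - copying the dist-sized tail out of the buffer first so the borrow doesn't outlive a potential reallocation (an illustration under those assumptions, not code from this PR):
let tail: Vec<u8> = self.buffer[self.buffer.len() - dist as usize..].to_vec();
while self.buffer.len() < pos_end as usize {
    // extend by at most one tail-sized chunk, truncating the final one
    let remaining = pos_end as usize - self.buffer.len();
    let n = remaining.min(tail.len());
    self.buffer.extend_from_slice(&tail[..n]);
}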
extend_from_slice is not really relevant, because this is RLE, not just "copy from the past", so if dist > len, you start repeating by reading bytes you've just written.
Also note that the length of the vector is only increased once; after that the sliding window stays constantly sized, so the push loop is less performance-sensitive.
You'd need a different loop structure, with something like memcpy inside and skipping dist-long runs, if you wanted to actually take advantage of distances larger than a few bytes, but I have my suspicions that it will be significantly faster.
Sadly, the code you've suggested incurs a 9-11% performance penalty on decompressing entire files, depending on the file. I have tweaked it a bit and got it down to an 8-10% penalty; here's the code:
let upper_bound = self.buffer.len().min(pos_end as usize);
for i in self.pos as usize..upper_bound {
    self.buffer[i] = self.buffer[i - dist as usize];
}
assert!(pos_end as usize <= self.buffer.capacity());
let initial_buffer_len = self.buffer.len();
for i in initial_buffer_len..pos_end as usize {
    let x = self.buffer[i - dist as usize];
    self.buffer.push(x);
}
Presence or absence of the assert() has no effect (in this code; I haven't tested the variant with while and without the assert).
I also got a 10% performance overhead simply by replacing unsafe {self.buffer.set_len(pos_end as usize);} with self.buffer.resize(pos_end as usize, 0u8); in the original code, which results in slightly more concise code than your solution.
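For reference, that resize-based variant dropped into the original structure reads like this (just the snippet above in context, nothing new):
if self.buffer.len() < pos_end as usize {
    // resize() zero-fills the newly grown region instead of leaving it
    // uninitialized, at the cost of writing those bytes twice
    self.buffer.resize(pos_end as usize, 0u8);
}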
Thank you for investigating! This is probably the single best response to me filing a security issue I've ever seen, and I didn't even have a proof of concept this time.
As for the commit message - the function run_len_dist(), when taken in isolation, is vulnerable; the rest of the code just so happens to never call it with dist set to 0, so the crate as a whole is not vulnerable. I would prefer to note that in the commit history, and update only the "It is unclear..." part to reflect that the crate as a whole is not vulnerable. But I will drop the mention of the vulnerability if you insist.
Also, here's my memcpy-based prototype.
if self.buffer.len() < pos_end as usize {
    self.buffer.resize(pos_end as usize, 0u8);
}
fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));

fn fill_slice_with_subslice(slice: &mut [u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
    let (source, destination) = if dest_from >= source_from { slice.split_at_mut(dest_from) } else { slice.split_at_mut(source_from) };
    let source = &mut source[source_from..source_to];
    let destination = &mut destination[..dest_to - dest_from];
    for i in (0..(destination.len() / source.len())).map(|x| x * source.len()) {
        destination[i..source.len() + i].copy_from_slice(&source);
    }
}
It fails some tests and I have been trying to understand why, to no avail, so I'm afraid I won't be able to complete it.
I'm afraid I'm not familiar with assembler or LLVM IR, so I will not be able to inspect it in any meaningful way. Sorry. I will benchmark your suggested changes with panic=abort on nightly and report the results.
Note that I don't mean panic=abort; there's a "nightly" feature which changes what abort() does. Although panic=abort might itself cause further improvements, so there are probably a few combinations to test.
I wouldn't have done the copy_from_slice with a for loop, or at least not with division/multiplication - there's a way to step ranges, although I'd also try a while loop that updates state instead.
But anyway, dist is not a multiple of len; you're missing something to handle len % dist (again, it should ideally be done without using %, unless you can find a way to use division that is a performance boost, but I doubt that's possible here).
I've conjured up a function based on copy_from_slice() that is more efficient than a per-element loop. Using while made it a lot more readable, so thanks for the tip!
if self.buffer.len() < pos_end as usize {
    self.buffer.resize(pos_end as usize, 0u8);
}
fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));

fn fill_slice_with_subslice(slice: &mut [u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
    let (source, destination) = slice.split_at_mut(dest_from); // TODO: allow destination to be lower than source
    let source = &source[source_from..source_to];
    let destination = &mut destination[..(dest_to - dest_from)];
    let mut offset = 0;
    while offset + source.len() < destination.len() {
        destination[offset..source.len() + offset].copy_from_slice(&source);
        offset += source.len();
    }
    let remaining_chunk = destination.len() - offset;
    destination[offset..].copy_from_slice(&source[..remaining_chunk]);
}
It offsets some of the costs of safe memory initialization, so that switching to safe initialization would only create a 5% overhead instead of 10%. If we switch the loop above to something like this too, then we'd have an entirely safe crate with the same performance as before.
I have a branch with all changes from this PR plus the optimized loop: https://github.com/Shnatsel/inflate/tree/safe-with-optimized-loop
Sadly, this function still fails one test - the line with its invocation causes an integer overflow in the test "issue_30_realworld", i.e. self.pos is actually less than dist. How it worked in a simple loop that did self.buffer[i - dist as usize] with i initialized to self.pos, but doesn't work here, is beyond me.
I still think this code could be much simpler if it was duplicated between forward and backward, just like the old code - in fact, it would be very similar to the old code, just doing more than one element at a time.
Also, is there a benefit to always using the same source subslice, or is it just as efficient / more efficient to always copy the last dist bytes? There may be some cache effects here.
Always copying the last dist bytes seems to be exactly as efficient as always copying the same slice.
The code I ended up with looks like this:
let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
let source = &source[source.len() - dist as usize..];
let mut offset = 0;
while offset + source.len() < destination.len() {
    destination[offset..source.len() + offset].copy_from_slice(&source);
    offset += source.len();
}
let remaining_chunk = destination.len() - offset;
destination[offset..].copy_from_slice(&source[..remaining_chunk]);
Which is a bit more readable. This nets 3% to 7% performance improvement.
Surprisingly, putting the same code in the other copying loop in this function actually hurts performance by 1% on my samples.
I've also tried an iterator-based version, which is concise but as slow as copying byte-by-byte:
let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
let source = &source[source.len() - dist as usize..];
for (d, s) in destination.chunks_mut(dist as usize).zip(source.chunks(dist as usize).cycle()) {
    let d_len = d.len(); // the last chunk has a smaller size than we've specified
    d.copy_from_slice(&s[..d_len]);
}
However, I've realized that this function can return pretty much any garbage and the tests won't fail. Since that creates regression potential, optimizing the copying falls out of scope of this PR.
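To make that concrete, here is a sketch of the kind of standalone regression test that could pin the copying behaviour down before any further optimization; the free functions and their names are assumptions for illustration, not code from this branch:
// Reference implementation: the obviously-correct byte-by-byte copy.
fn fill_run_naive(buffer: &mut Vec<u8>, pos: usize, pos_end: usize, dist: usize) {
    buffer.resize(pos_end, 0u8);
    for i in pos..pos_end {
        buffer[i] = buffer[i - dist];
    }
}

// Chunked version mirroring the copy_from_slice loop above.
fn fill_run_chunked(buffer: &mut Vec<u8>, pos: usize, pos_end: usize, dist: usize) {
    buffer.resize(pos_end, 0u8);
    let (source, destination) = buffer.split_at_mut(pos);
    let source = &source[source.len() - dist..];
    let mut offset = 0;
    while offset + source.len() < destination.len() {
        destination[offset..offset + source.len()].copy_from_slice(source);
        offset += source.len();
    }
    let remaining = destination.len() - offset;
    destination[offset..].copy_from_slice(&source[..remaining]);
}

#[test]
fn overlapping_run_matches_naive_copy() {
    // dist (2) is shorter than the run (10 bytes), so the copy overlaps itself
    let mut expected = vec![1u8, 2, 3];
    let mut actual = expected.clone();
    fill_run_naive(&mut expected, 3, 13, 2);
    fill_run_chunked(&mut actual, 3, 13, 2);
    assert_eq!(expected, actual);
}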
src/lib.rs
Outdated
}

for i in self.pos as usize..pos_end as usize {
    self.buffer[i] = self.buffer[i + forward as usize]
Semicolon still missing.
Btw, there is still the use of the … Also, I hope both #35 and this were tested with both the …
…re vulnerability in this function. This is not exploitable in practice, because run_len_dist() is not exposed in the public API and is never called with dist=0 from this crate, even given deliberately malformed inputs.
src/lib.rs
Outdated
@@ -65,7 +65,7 @@
 //! }
 //! ```

-#![cfg_attr(feature = "unstable", feature(core))]
+#![cfg_attr(feature = "unstable", feature(core_intrinsics))]
Wow, for how long has this been broken?
TL;DR: dunno.
The only relevant changelog entry I could find was "Declaration of lang items and intrinsics are now feature-gated by default." in 0.11.0 (2014-07-02), but that's probably about feature(core), not feature(core_intrinsics). I could not find anything about core_intrinsics on the Rust bug tracker.
Current code with unsafe:
Safe code with
I could not benchmark … Apparently the abort intrinsic is no longer needed. Shall I drop it in this PR?
Sounds good!
… compiler. Replace its use with more idiomatic assertions.
Done. I hope removing a feature from Cargo.toml is not a breaking change?
src/lib.rs
Outdated
@@ -404,7 +386,7 @@ impl CodeLengthReader {
         self.result.push(0);
     }
 }
-_ => abort(),
+_ => panic!(),
Wouldn't it be better to use the unreachable!() macro? It would provide a more meaningful name.
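For illustration, the fallback arm would then read roughly like this (the surrounding match is elided; only the arm shown in the diff above is in question):
// panics with "internal error: entered unreachable code", which documents
// the intent better than a bare panic!()
_ => unreachable!(),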
Cargo.toml
Outdated
@@ -9,7 +9,6 @@ keywords = ["deflate", "decompression", "compression", "piston"]

 [features]
-default = []
Is this needed at all? I think all it did was disable unstable by default.
I think removing a feature is a breaking change. Bump it to 0.5.0.
@bvssvni IMO we should avoid breaking changes, to ensure that any improvements are used in the ecosystem. But YMMV.
@eddyb It's always better to release a new version than to mess up existing projects.
I would like the safety improvements done so far not to warrant a version bump, so that they propagate naturally through the ecosystem. I'm on board with reverting the switch from … On the other hand, the "unstable" feature was only available on nightly, where breakage is kind of expected, and it was already broken to boot, so all people would notice is that code with that feature enabled has actually started compiling. IDK. Your call.
OK, if it was broken then we can bump it to 0.4.3.
…ly, this actually *improves* performance slightly.
…rt below for it anyway.
I haven't managed to get rid of the unsafe block without losing performance, so instead I've covered it in asserts. The asserts actually improve performance by roughly 2%, depending on the workload. I am actually in the market for an entirely safe … This PR is now as good as I can get it. Please conduct a final review and merge it or request changes.
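For context, "covered it in asserts" presumably means something along these lines - the exact expressions are an assumption on my part, not a quote from the branch:
// Validate the bound the unsafe length change relies on up front; per the
// discussion above, such asserts can also give LLVM a bound it can reuse
// to elide later bounds checks.
assert!(pos_end as usize <= self.buffer.capacity());
unsafe {
    self.buffer.set_len(pos_end as usize);
}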
Merging.
@eddyb Do you want to test your access to publish? Or do you want me to do it?
Thank you! I'll probably work on …
This PR drops some unsafe code that was introduced in place of safe code as an optimization. It is no longer needed: on Rust 1.27 (current stable), the performance degradation from this change is within measurement noise even when taking 10x the normal number of samples using Criterion.
cargo bench has a 2% variance by itself, so it couldn't measure anything below that; I had to switch to Criterion and then jack up the number of samples to make sure there is no degradation at all. You can find the benchmarking setup here. It's a bit messy, but I can clean it up and open a PR if you're interested.
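For anyone wanting to reproduce the measurement, a minimal Criterion setup along these lines should work; the fixture path and the exact sample count are assumptions, and inflate_bytes_zlib is the crate's existing public entry point:
use criterion::{criterion_group, criterion_main, Criterion};

fn decompress_whole_file(c: &mut Criterion) {
    // Any representative zlib-compressed sample will do here.
    let data = std::fs::read("benches/data/sample.zlib").unwrap();
    c.bench_function("inflate_bytes_zlib", |b| {
        b.iter(|| inflate::inflate_bytes_zlib(&data).unwrap())
    });
}

criterion_group! {
    name = benches;
    // 10x Criterion's default of 100 samples, to shrink measurement noise
    config = Criterion::default().sample_size(1000);
    targets = decompress_whole_file
}
criterion_main!(benches);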