Consider relaxing initialisation requirement of scratch buffers #105
Comments
This is quite similar to #79 |
Interesting. Haven't thought about the case of the
I'm not sure I understand the burden. Note, it won't change the
to:
Note the
This double API idea is also a good idea for folks who DO have an initialised array, perhaps a buffer that gets re-used as the output. I may be missing the large amounts of boilerplate and documentation required. Won't you be able to define the one in terms of the other? Something like:
The compiler should be able to optimise out the clone... But that's yet to be tested. |
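The elided sketch from that comment might have looked roughly like this (hypothetical function names, with `f32` standing in for `Complex<T>` to keep it dependency-free): the initialised-scratch entry point clones its contents into a `MaybeUninit` buffer and forwards to the uninitialised variant.

```rust
use std::mem::MaybeUninit;

// Hypothetical: the existing initialised-scratch API defined in terms of
// a MaybeUninit-based one, as suggested above.
fn process_with_scratch(scratch: &mut [f32]) {
    // Clone the scratch into a MaybeUninit buffer; ideally the compiler
    // optimises this copy away, since scratch contents are irrelevant.
    let mut uninit: Vec<MaybeUninit<f32>> =
        scratch.iter().map(|&x| MaybeUninit::new(x)).collect();
    process_with_scratch_uninit(&mut uninit);
}

fn process_with_scratch_uninit(scratch: &mut [MaybeUninit<f32>]) {
    // An FFT would write every element before reading it; model that here.
    for slot in scratch.iter_mut() {
        slot.write(0.0);
    }
}
```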
I think the speed benefit alone is too small to motivate a change. Anyone who cares about performance should be reusing the same buffers for every call, so the initialization is only done once. I would also guess that allocating the space for the buffer takes more time than clearing it. And compared to creating an FFT instance it's nothing. |
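To make the reuse argument concrete, here is a minimal sketch (a generic stand-in, not rustfft's actual API): the scratch buffer is allocated and zeroed exactly once, so the initialisation cost is amortised over all subsequent calls.

```rust
// Hypothetical processing step using a caller-provided scratch buffer;
// `f32` stands in for `Complex<T>`.
fn process(data: &mut [f32], scratch: &mut [f32]) {
    scratch.copy_from_slice(data); // scratch is fully overwritten before use
    for (d, s) in data.iter_mut().zip(scratch.iter()) {
        *d = s + 1.0;
    }
}

fn main() {
    // Allocate and zero the scratch exactly once...
    let mut scratch = vec![0.0_f32; 4];
    let mut data = vec![1.0_f32; 4];
    // ...then reuse it for every call: no per-call allocation or zeroing.
    for _ in 0..1000 {
        process(&mut data, &mut scratch);
    }
    assert_eq!(data[0], 1001.0);
}
```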
Is it allowed to cheat with unsafe?
This should skip the clearing of the new vector. I'm also assuming that anyone who wants to squeeze the last tiny drops of performance from their code isn't afraid of a little unsafe :) |
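The elided snippet presumably resembled this sketch (my reconstruction, not the original): reserve capacity and extend the length without writing anything, which is exactly the pattern whose soundness the rest of this thread debates.

```rust
fn make_scratch_unzeroed(len: usize) -> Vec<f32> {
    let mut scratch: Vec<f32> = Vec::with_capacity(len);
    unsafe {
        // SAFETY (disputed below): the elements are NOT initialised here;
        // this relies on the FFT writing every element before reading it.
        scratch.set_len(len);
    }
    scratch
}
```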
Unfortunately this requires an extra Default trait bound which results in requiring a major version change for fftconvolve. Practically it's not an issue since all elements that you'd want to convolve would implement Default, but it's a bit annoying. Default is required for initializing the scratch buffer, which is strictly not necessary but requires changes to `rustfft`. It's currently pending this issue in rustfft: ejmahler/RustFFT#105
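A hypothetical sketch (not fftconvolve's real code) of how a `Default` bound leaks into a wrapper's public signature purely because of scratch-buffer initialisation:

```rust
// Hypothetical wrapper showing why the Default bound propagates into the
// public API: it exists only so the scratch buffer can be pre-filled.
fn convolve_like<T: Default + Clone>(len: usize) -> Vec<T> {
    // The sole use of Default: needless initialisation of scratch space.
    let scratch = vec![T::default(); len];
    // ... an FFT-based convolution would use `scratch` here ...
    scratch
}

fn main() {
    let s: Vec<f32> = convolve_like(8);
    assert_eq!(s.len(), 8);
    assert_eq!(s[0], 0.0);
}
```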
AFAIK the answer is no. The documentation provides two invariants that need to hold:
The first one is pretty obvious: if the allocator didn't allocate enough memory for the requested length, you'd be touching memory you don't own. The second item is more subtle. It's quite obvious that the items won't be initialised, but why is that a problem? Well, the compiler doesn't know what type of items we're working with. Some items, like `bool` or references, have bit patterns that are not valid values, and merely producing such a value is undefined behaviour.
From what I can dig up, it applies to floats too: rust-lang/unsafe-code-guidelines#71 (comment) |
Went down a bit of a rabbit hole to find out WHY these rules around UB are so strict. The above was enough to convince me that the spec says it requires all values to be valid, but I was still not sure what reason it had to use such strict rules. This blog post by Ralf Jung gave me a bit more insight in exactly how invalid byte representations of types can lead to UB. I'm not yet convinced that an uninitialised scratch buffer definitely WILL lead to UB when actually executed, but the post was enough to show that there is subtle nuance to the subject of UB and that the pointy-hatted compiler folk have some good reasons for making things complicated. |
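As a concrete illustration of the validity rules discussed above (my own example, not from the thread): some types have bit patterns that are simply not allowed, so the compiler is free to assume they never occur, and `MaybeUninit` is the sanctioned way to hold bytes that might violate those rules.

```rust
use std::mem::MaybeUninit;

fn main() {
    // Every bit pattern is a valid u8, so reinterpreting bytes as u8 is fine:
    let byte: u8 = u8::from_ne_bytes([0xFF]);
    assert_eq!(byte, 255);

    // `bool`, by contrast, must be exactly 0 or 1; materialising any other
    // byte as a bool is immediate UB, even if never branched on:
    // let bad: bool = unsafe { std::mem::transmute(0xFF_u8) }; // UB!

    // MaybeUninit is the sanctioned container for possibly-invalid bytes:
    let mut slot: MaybeUninit<bool> = MaybeUninit::uninit(); // fine: no value yet
    slot.write(true); // must initialise before assume_init
    assert!(unsafe { slot.assume_init() });
}
```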
I used
This is what I meant with cheating. RustFFT will never read from any location in the scratch or output buffer that it hasn't already written to. So there should never be a situation where the possibly garbage content gets read. |
Have a look at that blog post. It outlines situations where you don't necessarily need to read from uninitialised variables for UB to occur. The compiler can do optimisations that rely on the assumption that all values not wrapped in `MaybeUninit` are valid. Perhaps there aren't any such optimisations today for numbers or arrays, but it's a hairy part of the compiler and it'll be difficult to say for sure. Either way, it's currently not part of the spec, so it could be an issue in the future. |
Yes I read it. My main issue with his whole thing is that people who are really interested in performance will be really careful to always reuse their buffers to avoid allocations. Then this becomes a lot of extra complications in order to slightly speed up a one-time operation. |
I think the crucial difference is that the rust compiler would emit different metadata to LLVM about the variable while it's wrapped in a `MaybeUninit`.
Sure, that's valid. |
The important thing is that just assigning an uninitialized value to a variable of type `MaybeUninit<T>` is fine. In contrast, assigning an uninitialized value to a variable of plain type `T` is undefined behaviour on its own, even if it's never read. The solution using `MaybeUninit` makes this distinction explicit in the type. Leaving the current API as-is makes people who want to skip initialisation reach for `unsafe` workarounds. |
Is there a way to use MaybeUninit for the output of an out-of-place transform which doesn't require callers to use unsafe code to extract the result? |
Yes, I showed an example of the API I'd propose in a previous message:

```rust
fn process_outofplace_with_scratch_uninit(
    &self,
    input: &mut [Complex<T>],
    output: &mut [MaybeUninit<Complex<T>>],
    scratch: &mut [MaybeUninit<Complex<T>>]
) -> &mut [Complex<T>]
```

Creating the scratch buffer and output buffer would be as simple as:

```rust
use std::mem::MaybeUninit;

// Statically sized
let scratch: &[MaybeUninit<f32>] = &[MaybeUninit::uninit(); 100];

// Dynamically sized
let scratch_buffer_len = 100;
let scratch: Box<[MaybeUninit<f32>]> = vec![MaybeUninit::uninit(); scratch_buffer_len].into();
```

Similar code can be written for the output buffer. There's no need for the client to delve into `unsafe`.

I've been feeling a bit unsatisfied about why these uninitialised values are UB. I started a thread in the rust language zulip forums to see what the rust folks say. My conclusion is that this is on the edge of the classification. "Officially" it's UB but in practice, it's likely to not exhibit any issues. There are no examples of optimisations folks can give that would break this case. MiniRust even currently considers it defined behaviour. |
I agree that creating the buffer can be done in entirely safe code by the caller. What about reading the result? Will the caller have to call `assume_init` themselves? |
Nope. This logic can be inside the method itself. |
I think you may be missing that the initialised buffer is returned from the method. |
Ah I see, I missed the return value. |
If you then want to reuse the output buffer, what do you do then? |
I guess you'd just let the returned output buffer go out of scope, which would release the borrow of the MaybeUninit version of the buffer, letting you reuse it again. I haven't tried it but it seems like the compiler would be perfectly happy with that. To that end, I suppose the signature would have to be more specifically:

```rust
fn process_outofplace_with_scratch_uninit<'a>(
    &self,
    input: &'_ mut [Complex<T>],
    output: &'a mut [MaybeUninit<Complex<T>>],
    scratch: &'_ mut [MaybeUninit<Complex<T>>]
) -> &'a mut [Complex<T>]
```

Is this correct? |
You can just wrap the output buffer in the |
Notice my suggestion doesn't borrow the buffer. It would "consume" the buffer. I'm not too familiar with the borrowing semantics to know if your suggestion would work. I would consider it as a different suggestion for the same API. EDIT:
I think this may work but you'll require another method on |
Looking at the API, i'm realizing that assume_init et al work on single elements. To assume a slice of MaybeUninit are all initialized, you have to call `slice_assume_init_mut`, which is currently unstable. |
We can use `transmute` to convert `&mut [MaybeUninit<Complex>]` to `&mut [Complex]`. |
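For completeness, the same conversion can be done without `transmute` by reconstructing the slice from its raw parts (a common sketch; the standard library's slice-level helper was nightly-only at the time of this thread):

```rust
use std::mem::MaybeUninit;

/// Convert a fully initialised `&mut [MaybeUninit<T>]` into `&mut [T]`
/// without `transmute`, by rebuilding the slice from its raw parts.
///
/// SAFETY: the caller must guarantee that every element has been written.
unsafe fn assume_init_slice_mut<T>(s: &mut [MaybeUninit<T>]) -> &mut [T] {
    std::slice::from_raw_parts_mut(s.as_mut_ptr() as *mut T, s.len())
}

fn main() {
    let mut buf = [MaybeUninit::<f32>::uninit(); 4];
    for slot in buf.iter_mut() {
        slot.write(1.5); // initialise everything first
    }
    let initialised: &mut [f32] = unsafe { assume_init_slice_mut(&mut buf) };
    assert_eq!(initialised, &mut [1.5_f32; 4][..]);
}
```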
Can someone show an example of a project where implementing this would make a measurable improvement? |
Not sure about specific applications, but I can get some rough numbers for my specific machine:

```rust
use std::mem::MaybeUninit;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut buffer: Box<[f32; 1_000_000_000]> = Box::new([0_f32; 1_000_000_000]);
    let allocation_duration = start.elapsed();
    println!(
        "Time elapsed allocating with zeroing is: {:?}",
        allocation_duration
    );

    let start = Instant::now();
    for val in buffer.iter_mut() {
        *val = 1.0;
    }
    let write_duration = start.elapsed();
    println!(
        "Time elapsed to write ones with zeroing is: {:?}",
        write_duration
    );
    println!(
        "Total time elapsed with zeroing is: {:?}",
        allocation_duration + write_duration
    );

    let start = Instant::now();
    #[allow(invalid_value)]
    let mut buffer: Box<[f32; 1_000_000_000]> =
        unsafe { Box::new(MaybeUninit::uninit().assume_init()) };
    let allocation_duration = start.elapsed();
    println!(
        "Time elapsed allocating without zeroing is: {:?}",
        allocation_duration
    );

    let start = Instant::now();
    for val in buffer.iter_mut() {
        *val = 1.0;
    }
    let write_duration = start.elapsed();
    println!(
        "Time elapsed to write ones without zeroing is: {:?}",
        write_duration
    );
    println!(
        "Total time elapsed without zeroing is: {:?}",
        allocation_duration + write_duration
    );
}
```

This results in the following output when run with
My interpretation is that the first allocation is also doing the page faulting. The second allocation is 13.056µs, i.e. next to no time, because there is almost nothing to do other than just reserving the memory. To get an accurate measurement you need to write to each byte so you can ensure the entire buffer is faulted in. For this specific situation (my machine and working with a billion samples) I would save 35% of the time the application would otherwise have spent allocating. Which is on the order of a few hundreds of milliseconds. Hard to say if it's relevant for initialisation of some applications. This percentage gets smaller as your buffer size decreases. For a million samples you get ~20% speedup and for 1000 samples you get ~8% speedup (although this is pretty susceptible to noise). |
Here is a variation of the same, with an FFT instead to give a more realistic comparison.
I get quite a large variation in the results, but averaging the last five runs I get: It's pretty consistent that it runs faster with zeroing. |
I also initially got a couple of runs where the non-zeroing version is slower. Weird. Not really sure what's happening there. Could be noise. CPU heating up. Warming up some cache somewhere. Not sure. I ran |
Not sure how relevant this is, but when I was doing all the benchmarking for rustfft's AVX code, one important thing I found for consistency is disabling turbo boost. The easiest way I've found to do that is via the BIOS. Without turbo boost, your benchmarks will be slower across the board, but more consistent in exchange. |
I tried out @ejmahler's suggestion and it doesn't seem to make much of a difference to consistency. But since I re-attempted the test I noticed I was running with the I downloaded the newest |
The `Fft` trait requires scratch buffers to be initialised by defining their type as `&mut [Complex<T>]`. This type carries an implicit requirement that the memory be initialised. Since the fft algorithm should ALWAYS write to the scratch buffer before reading from it, it does not really require the memory to be initialised. To capture this in the type, I think we'd need `&mut MaybeUninit<[Complex<T>]>` or perhaps `&mut [MaybeUninit<Complex<T>>]`.

The benefit would be a slight performance improvement for applications. These applications won't be required to fill the scratch buffer with a needless default value anymore, and it would allow wrapping libraries such as `easyfft` to implement fft operations on slices and arrays without requiring the `Default` trait bound on the elements.

I haven't looked at the implementation of the fft algorithms to see if it's practical to express. It could be that it gets in the way of the logic as it's currently expressed. It would also carry another major version change.