Add parallel versions of dct functions #3
Conversation
Thanks for the addition! Looks good, but I'll try to look at it more tonight or in the coming days.
I only now realize the slice module is supposed to mirror nalgebra. Of course I can add methods to that module as well. Taking a quick glance at the rayon docs, I found [this FAQ](https://github.com/rayon-rs/rayon/blob/master/FAQ.md); it seems the thread count is controlled through a global variable. Do you prefer defining additional functions with a different name (as in the PR code) or re-using the same functions, just changing the body when the "parallel" feature is enabled (as per the PR "alternative" description)?
I definitely prefer having both the non-parallel and the parallel one as distinct functions, as it's done now.
Also, a test comparing the values from the non-parallel and the parallel versions would be great, if you can find the time to add that.
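For illustration, such a comparison test could look roughly like the following. This is a hedged sketch, not code from the PR: it assumes `dct_2d` and `par_dct_2d` both take and return a `DMatrix<f64>`, and that a simple element-wise tolerance check is acceptable.

```rust
#[cfg(all(test, feature = "parallel"))]
mod parallel_tests {
    use super::*;
    use nalgebra::DMatrix;

    #[test]
    fn par_dct_2d_matches_dct_2d() {
        // Deterministic input so the test is reproducible.
        let mat = DMatrix::from_fn(64, 48, |i, j| ((i * 31 + j * 17) % 255) as f64);

        let sequential = dct_2d(mat.clone());
        let parallel = par_dct_2d(mat);

        // Both versions should agree up to floating-point rounding.
        for (a, b) in sequential.iter().zip(parallel.iter()) {
            assert!((a - b).abs() < 1e-9, "mismatch: {a} vs {b}");
        }
    }
}
```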
The tests are run with …
@mpizenberg you probably overlooked the PR update. It's ready for review. There is no rush; I don't need this merged in any particular time frame. Take your time, and it's fine if you have more important matters to attend to. It's just that GitHub doesn't make it easy to know you have something pending, so I'm making sure you are aware.
Yeah, it's still on my radar. I had snoozed the email for tomorrow as I have a tournament today. I might get time later today actually, we'll see.
I'm trying the parallel version of the dct in the normal integration example and seeing a much lower speedup than I'd expect.

```rust
rayon::ThreadPoolBuilder::new()
    .num_threads(8)
    .build_global()
    .unwrap();

let f_float = f.map(|x| x as f64);
let start = Instant::now();
let mut f_cos = par_dct_2d(f_float);
let duration = start.elapsed();
println!("Time elapsed: {:?}", duration);
```

With this code, on my machine (Intel i7-10750H), I'm seeing 6 ms for the non-parallel version and 3.2 ms for the 8-thread version (3.5 ms for 4 threads). It's a bit strange to see so little acceleration for something that is supposed to be very easy to parallelize (each row is independent). For reference, the 1-thread parallel version takes 7.5 ms, compared to 6 ms for the non-parallel version. You said that it "dramatically increases the performance" in your use case. Do you mind sharing some results of this change?
I have a 14-core processor, and the speedup sadly isn't the 14x I would have expected; it's more like 2x. (But halving the runtime of anything is already great.) I've tested this on normal map textures of size 512x512 and 1024x1024. So either there is additional overhead or I'm not parallelizing most of the computation. It did look like the two for loops I replaced with rayon's iterators were where most of the computation happened, and I also doubt rayon's overhead is this large.
Okay, I'm looking at Rayon, seeing if I can use the …
I'm not sure what's going on, but I'm making a mistake somewhere. Would you mind having a look at the following and letting me know if you spot something? (The test is failing; thanks for the test!)

```rust
// your impl
let mut planner = DctPlanner::new();
let dct_dim1 = planner.plan_dct2(height);
mat.as_mut_slice()
    .par_chunks_exact_mut(height)
    .for_each(|buffer_dim1| dct_dim1.process_dct2(buffer_dim1));

// my failing attempt
let thread_count = rayon::max_num_threads();
let blocks_width = width / thread_count;
let mut planner = DctPlanner::new();
let dct_dim1 = planner.plan_dct2(height);
mat.as_mut_slice()
    .par_chunks_mut(blocks_width * height)
    .for_each(|chunk| {
        let mut scratch = vec![0.0; dct_dim1.get_scratch_len()];
        for buffer_dim1 in chunk.chunks_exact_mut(height) {
            dct_dim1.process_dct2_with_scratch(buffer_dim1, &mut scratch);
        }
    });
```
Ok, I found my mistake. I'm continuing the experiment and will report soon.
Well, I've attempted the parallel version that works on big blocks instead of per column, and the result isn't very conclusive. The acceleration falls off in basically the same way as for the simple parallel loop. For example, I got the following results.

```
// 1 thread
Time during dct_dim1: 38.707426ms
Time during transpose: 4.586836ms
Time during dct_dim2: 31.471128ms
Time during slice copy: 1.571495ms
Total Time elapsed: 76.72951ms

// 2 threads
Time during dct_dim1: 20.34982ms
Time during transpose: 4.922647ms
Time during dct_dim2: 17.125235ms
Time during slice copy: 1.533194ms
Total Time elapsed: 44.326799ms

// 4 threads
Time during dct_dim1: 11.949718ms
Time during transpose: 4.962508ms
Time during dct_dim2: 14.913591ms
Time during slice copy: 1.952508ms
Total Time elapsed: 34.218193ms
```

Returns diminish quickly, both because the parallel dct itself scales sub-linearly (dct_dim2 only drops from 31.5 ms at 1 thread to 14.9 ms at 4 threads) and because the transposition and slice-copy overhead stays constant regardless of thread count. So let's keep the code simple. Also, it might be interesting to follow ejmahler/RustFFT#117.
Thanks @nicopap!
Awesome! Thank you for taking the time to look at this, and so carefully. No, I don't have any other features planned for fft2d.
Alright, I'll take some time during the week to publish a new version. Don't hesitate to ping me if it's not done by next weekend. Cheers.
Alright, version 0.1.1 is now published with your improvements, @nicopap.
Add `par_*` functions to the `nalgebra` module for a parallel version of the algorithms, using rayon. This dramatically increases performance on multicore processors, at the cost of using far more memory, since we don't share the scratch buffer between runs. The normal integration example goes twice as fast on my machine with this small change.
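For context, a minimal sketch of the shape such a `par_*` function could take. This is illustrative rather than the PR's exact code: the function name `par_dct_2d_pass` is made up, it only shows one of the two passes of the 2D transform, and it assumes `rustdct`'s `DctPlanner` plus rayon's parallel slice iterators.

```rust
use nalgebra::DMatrix;
use rayon::prelude::*;
use rustdct::DctPlanner;

/// One pass of a parallel 2D DCT-II over the columns of `mat`.
/// (The full 2D transform also needs a transpose and a second pass.)
fn par_dct_2d_pass(mut mat: DMatrix<f64>) -> DMatrix<f64> {
    let height = mat.nrows();
    let mut planner = DctPlanner::new();
    let dct = planner.plan_dct2(height);

    // nalgebra matrices are column-major, so each `height`-sized chunk of the
    // backing slice is one column. Each rayon task transforms one column, and
    // `process_dct2` allocates its own scratch space, so nothing is shared
    // between tasks: more memory, but no synchronization.
    mat.as_mut_slice()
        .par_chunks_exact_mut(height)
        .for_each(|column| dct.process_dct2(column));

    mat
}
```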
Alternatives
Another option is not to define different `par_*` functions when the "parallel" feature is enabled, but instead to have alternative code selected by the feature inside the `dct_2d` function itself. Then, instead of calling a separate `par_dct_2d`, you would keep calling `dct_2d`, and enabling the feature would switch its body to the parallel version. A sketch of what that could look like follows below.
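A hedged sketch of that alternative, under the same assumptions as the sketch above (illustrative names, only one of the two passes shown):

```rust
use nalgebra::DMatrix;
use rustdct::DctPlanner;

/// One pass of dct_2d, with the loop body selected by the "parallel"
/// feature instead of exposing a separate par_* function.
fn dct_2d_pass(mut mat: DMatrix<f64>) -> DMatrix<f64> {
    let height = mat.nrows();
    let mut planner = DctPlanner::new();
    let dct_dim1 = planner.plan_dct2(height);

    #[cfg(feature = "parallel")]
    {
        use rayon::prelude::*;
        // Parallel body: one rayon task per column, each call allocating
        // its own scratch internally.
        mat.as_mut_slice()
            .par_chunks_exact_mut(height)
            .for_each(|column| dct_dim1.process_dct2(column));
    }

    #[cfg(not(feature = "parallel"))]
    {
        // Sequential body: a single scratch buffer reused across columns.
        let mut scratch = vec![0.0; dct_dim1.get_scratch_len()];
        for column in mat.as_mut_slice().chunks_exact_mut(height) {
            dct_dim1.process_dct2_with_scratch(column, &mut scratch);
        }
    }

    mat
}
```

With this approach the caller keeps writing `dct_2d(f_float)`; the parallel path is chosen by enabling the "parallel" feature in `Cargo.toml`.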