
Add parallel versions of dct functions #3

Merged
4 commits merged into mpizenberg:main from the parallel branch on Apr 23, 2023

Conversation

@nicopap (Contributor) commented Apr 17, 2023

Add par_* functions to the nalgebra module providing parallel versions of the algorithms, using rayon. This dramatically increases performance on multicore processors, at the cost of using far more memory, since we no longer share the scratch buffer between runs.

The normal integration example goes twice as fast on my machine with this small change.
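For illustration, here is roughly what the par_ variant of dct_2d looks like (simplified sketch, not the exact diff; the exact signature and the final transpose/copy-back in the real code may differ):

    use nalgebra::DMatrix;
    use rayon::prelude::*;
    use rustdct::{Dct2, DctPlanner};

    pub fn par_dct_2d(mut mat: DMatrix<f64>) -> DMatrix<f64> {
        let (height, width) = mat.shape();
        let mut planner = DctPlanner::new();

        // First pass: DCT-II over each column. nalgebra stores matrices
        // column-major, so every contiguous chunk of `height` values is a column.
        let dct_dim1 = planner.plan_dct2(height);
        mat.as_mut_slice()
            .par_chunks_exact_mut(height)
            .for_each(|col| dct_dim1.process_dct2(col));

        // Second pass: transpose, then DCT-II over the former rows, which are
        // now the contiguous columns of length `width`.
        let mut transposed = mat.transpose();
        let dct_dim2 = planner.plan_dct2(width);
        transposed
            .as_mut_slice()
            .par_chunks_exact_mut(width)
            .for_each(|row| dct_dim2.process_dct2(row));

        // Back to the original orientation.
        transposed.transpose()
    }

Without the shared scratch buffer, rustdct allocates scratch space inside each process_dct2 call, which is where the extra memory use comes from.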

Alternatives

Another option is not to define separate par_* functions when the "parallel" feature is enabled, but instead to swap in an alternative code path based on the feature.

Inside the dct_2d function, it would look like this:

        #[cfg(feature = "parallel")]
        use rayon::iter::ParallelIterator;

        #[cfg(not(feature = "parallel"))]
        let chunk_function = <[f32]>::chunks_exact_mut;
        #[cfg(feature = "parallel")]
        let chunk_function = rayon::slice::ParallelSliceMut::par_chunks_exact_mut;

Then, instead of

        for buffer_dim2 in transposed.as_mut_slice().chunks_exact_mut(width) {
            dct_dim2.process_dct2_with_scratch(buffer_dim2, &mut scratch);
        }

you would have

        chunk_function(transposed.as_mut_slice(), width)
            .for_each(|buffer| dct_dim2.process_dct2(buffer));
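Put together (still just a sketch, reusing the names from the snippets above and dropping the scratch-buffer reuse of the serial path), the second pass would read:

        // Bring `for_each` on rayon's parallel iterators into scope only when
        // the "parallel" feature is enabled.
        #[cfg(feature = "parallel")]
        use rayon::iter::ParallelIterator;

        // Pick the chunking function at compile time.
        #[cfg(not(feature = "parallel"))]
        let chunk_function = <[f32]>::chunks_exact_mut;
        #[cfg(feature = "parallel")]
        let chunk_function = rayon::slice::ParallelSliceMut::par_chunks_exact_mut;

        // One row per chunk; both the serial and the parallel iterator expose
        // `for_each`, so the processing code is only written once.
        chunk_function(transposed.as_mut_slice(), width)
            .for_each(|buffer| dct_dim2.process_dct2(buffer));

This keeps a single public API, at the cost of a small behavior difference: the serial path no longer reuses a scratch buffer either.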

@mpizenberg (Owner)

Thanks for the addition! Looks good, but I'll try to look at it more tonight or in the coming days.
Is it possible to add parallel options for the Vec API as well?
I've not really used rayon yet; can people control how many threads are spawned? For example, if there are diminishing returns in compute speed while memory increases linearly, someone with a 16-core CPU might want to limit the number of threads spawned. Is that something the caller can control, or does it have to be embedded in the API?

@nicopap (Contributor, Author) commented Apr 17, 2023

I only now realize the slice module is supposed to mirror nalgebra. Of course I can add methods to that module as well. Taking a quick glance at the rayon docs, I found this FAQ. It seems to be controlled globally, through an environment variable or rayon's global thread pool.
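So the caller can control it without any change to our API, either by setting the RAYON_NUM_THREADS environment variable or by configuring the global pool once at startup, along these lines (sketch):

    fn main() {
        // Limit rayon's global thread pool. This must run before the first
        // parallel operation; the global pool can only be configured once.
        rayon::ThreadPoolBuilder::new()
            .num_threads(4)
            .build_global()
            .expect("global thread pool already initialized");

        // ... then call the par_* functions as usual.
    }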

Do you prefer defining additional functions with a different name (as in the PR code), or re-using the same functions and just changing the body when the "parallel" feature is enabled (as in the "Alternatives" section above)?

@mpizenberg (Owner) commented Apr 17, 2023 via email

@mpizenberg (Owner) commented Apr 17, 2023 via email

@nicopap (Contributor, Author) commented Apr 19, 2023

The tests are run with cargo test --features 'rustdct parallel nalgebra'. Hmm, CI is left as an exercise for the reader 😉

@nicopap (Contributor, Author) commented Apr 22, 2023

@mpizenberg you probably overlooked the PR update. It's ready for review. There's no rush; I don't need this merged in any particular time frame. Take your time, and it's fine if you have more important matters to attend to. It's just that GitHub doesn't make it easy to know you have something pending, so I wanted to make sure you're aware.

@mpizenberg (Owner)

Yeah, it's still on my radar. I had snoozed the email for tomorrow as I have a tournament today. I might get time later today actually, we'll see.

@mpizenberg (Owner) commented Apr 23, 2023

I'm trying the parallel version of the DCT in the normal integration example and I'm seeing much less speedup than I'd expect.

    rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .unwrap();
    let f_float = f.map(|x| x as f64);
    let start = Instant::now();
    let mut f_cos = par_dct_2d(f_float);
    let duration = start.elapsed();
    println!("Time elapsed: {:?}", duration);

With this code, on my machine (Intel i7-10750H), I'm seeing 6 ms for the non-parallel version, 3.2 ms for the 8-thread version, and 3.5 ms for 4 threads. It's a bit odd to see so little speedup for something that is supposed to be very easy to parallelize (each row is independent).

For reference, the 1-thread parallel version takes 7.5 ms, compared to 6 ms for the non-parallel version.

You said that it "dramatically increases performance" in your use case. Do you mind sharing some results from this change?

@nicopap (Contributor, Author) commented Apr 23, 2023

I have a 14-core processor, and the speedup sadly isn't the 14x I would have expected; it's more like 2x (though halving the runtime of anything is already great). I've tested this on normal-map textures of size 512x512 and 1024x1024.

So there is either additional overhead, or I'm not parallelizing most of the computation. It did look like the two for loops I replaced with rayon's iterators were where most of the computation happened, and I doubt rayon's own overhead is this large.

@mpizenberg (Owner)

Okay, I'm looking at rayon to see if I can use the scope API to improve things. I'll let you know what I find.

@mpizenberg (Owner)

I'm not sure what's going on, but I'm making a mistake somewhere. Would you mind having a look at the following and letting me know if you spot something? (The test is failing; thanks for the test!)

// your impl
        let mut planner = DctPlanner::new();
        let dct_dim1 = planner.plan_dct2(height);
        mat.as_mut_slice()
            .par_chunks_exact_mut(height)
            .for_each(|buffer_dim1| dct_dim1.process_dct2(buffer_dim1));

// my failing attempt
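// intent: split the columns into ~thread_count contiguous blocks, so that each
// parallel task reuses a single scratch buffer for all the columns in its block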
        let thread_count = rayon::max_num_threads();
        let blocks_width = width / thread_count;

        let mut planner = DctPlanner::new();
        let dct_dim1 = planner.plan_dct2(height);
        mat.as_mut_slice()
            .par_chunks_mut(blocks_width * height)
            .for_each(|chunk| {
                let mut scratch = vec![0.0; dct_dim1.get_scratch_len()];
                for buffer_dim1 in chunk.chunks_exact_mut(height) {
                    dct_dim1.process_dct2_with_scratch(buffer_dim1, &mut scratch);
                }
            });

@mpizenberg (Owner)

Ok, I found my mistake. I'm continuing the experiment and will report soon.

@mpizenberg (Owner)

Well, I've attempted a parallel version that works on big blocks instead of per column, and the result isn't very conclusive: the scaling falls off in basically the same way as with the simple parallel loop. For example, I got the following results.

// 1 thread
Time during dct_dim1: 38.707426ms
Time during transpose: 4.586836ms
Time during dct_dim2: 31.471128ms
Time during slice copy: 1.571495ms
Total Time elapsed: 76.72951ms

// 2 threads
Time during dct_dim1: 20.34982ms
Time during transpose: 4.922647ms
Time during dct_dim2: 17.125235ms
Time during slice copy: 1.533194ms
Total Time elapsed: 44.326799ms

// 4 threads
Time during dct_dim1: 11.949718ms
Time during transpose: 4.962508ms
Time during dct_dim2: 14.913591ms
Time during slice copy: 1.952508ms
Total Time elapsed: 34.218193ms

Returns diminish quickly, due both to diminishing returns within the parallel DCT passes themselves and to the fact that the transposition and slice-copy overhead is there anyway. So let's keep the code simple.
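As a rough back-of-the-envelope check on those numbers (treating the transpose and slice copy as purely serial):

    parallel part (dct_dim1 + dct_dim2, 1 thread): 38.7 + 31.5 ≈ 70.2 ms
    serial part (transpose + slice copy):          4.6 + 1.6 ≈ 6.2 ms
    ideal 4-thread total: 70.2 / 4 + 6.2 ≈ 23.8 ms   (measured: 34.2 ms)

So part of the gap is the fixed serial overhead, and the rest is the DCT passes themselves only reaching about 2-3x (e.g. dct_dim2 goes from 31.5 ms to 14.9 ms on 4 threads).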

Also, it might be interesting to follow ejmahler/RustFFT#117

@mpizenberg merged commit 99ac5fe into mpizenberg:main on Apr 23, 2023
@mpizenberg (Owner)

Thanks @nicopap!
Are there more things you're interested in getting in, or should I publish a new version?

@nicopap (Contributor, Author) commented Apr 23, 2023

Awesome! Thank you for taking the time to look at this, and so carefully. I don't have any other features planned for fft2d, no.

@mpizenberg (Owner)

Alright, I'll take some time during the week to publish a new version. Don't hesitate to ping me if it's not done by next weekend. Cheers

@mpizenberg (Owner)

Alright, version 0.1.1 is now published with your improvements @nicopap

@nicopap deleted the parallel branch on August 30, 2023, 13:45