
Add parallel versions of dct functions #3

Merged
4 commits merged into mpizenberg:main from the parallel branch on Apr 23, 2023

Conversation

@nicopap (Contributor) commented Apr 17, 2023

Add par_* functions to the nalgebra module providing parallel versions of the algorithms, using rayon. This dramatically increases performance on multicore processors, at the cost of using far more memory, since we no longer share the scratch buffer between runs.

The normal integration example goes twice as fast on my machine with this small change.
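For illustration, here is roughly what the par_ variant of dct_2d looks like (simplified sketch, not the exact diff; the exact signature and the final transpose/copy-back in the real code may differ):

    use nalgebra::DMatrix;
    use rayon::prelude::*;
    use rustdct::{Dct2, DctPlanner};

    pub fn par_dct_2d(mut mat: DMatrix<f64>) -> DMatrix<f64> {
        let (height, width) = mat.shape();
        let mut planner = DctPlanner::new();

        // First pass: DCT-II over each column. nalgebra stores matrices
        // column-major, so every contiguous chunk of `height` values is a column.
        let dct_dim1 = planner.plan_dct2(height);
        mat.as_mut_slice()
            .par_chunks_exact_mut(height)
            .for_each(|col| dct_dim1.process_dct2(col));

        // Second pass: transpose, then DCT-II over the former rows, which are
        // now the contiguous columns of length `width`.
        let mut transposed = mat.transpose();
        let dct_dim2 = planner.plan_dct2(width);
        transposed
            .as_mut_slice()
            .par_chunks_exact_mut(width)
            .for_each(|row| dct_dim2.process_dct2(row));

        // Back to the original orientation.
        transposed.transpose()
    }

Without the shared scratch buffer, rustdct allocates scratch space inside each process_dct2 call, which is where the extra memory use comes from.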

Alternatives

Another option is not to define separate par_* functions when the "parallel" feature is enabled, but instead to swap in an alternative code path based on the feature.

Inside the dct_2d function, it would look like this:

        #[cfg(feature = "parallel")]
        use rayon::iter::ParallelIterator;

        #[cfg(not(feature = "parallel"))]
        let chunk_function = <[f32]>::chunks_exact_mut;
        #[cfg(feature = "parallel")]
        let chunk_function = rayon::slice::ParallelSliceMut::par_chunks_exact_mut;

Then, instead of

        for buffer_dim2 in transposed.as_mut_slice().chunks_exact_mut(width) {
            dct_dim2.process_dct2_with_scratch(buffer_dim2, &mut scratch);
        }

you would have

        chunk_function(transposed.as_mut_slice(), width)
            .for_each(|buffer| dct_dim2.process_dct2(buffer));
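Put together (still just a sketch, reusing the names from the snippets above and dropping the scratch-buffer reuse of the serial path), the second pass would read:

        // Bring `for_each` on rayon's parallel iterators into scope only when
        // the "parallel" feature is enabled.
        #[cfg(feature = "parallel")]
        use rayon::iter::ParallelIterator;

        // Pick the chunking function at compile time.
        #[cfg(not(feature = "parallel"))]
        let chunk_function = <[f32]>::chunks_exact_mut;
        #[cfg(feature = "parallel")]
        let chunk_function = rayon::slice::ParallelSliceMut::par_chunks_exact_mut;

        // One row per chunk; both the serial and the parallel iterator expose
        // `for_each`, so the processing code is only written once.
        chunk_function(transposed.as_mut_slice(), width)
            .for_each(|buffer| dct_dim2.process_dct2(buffer));

This keeps a single public API, at the cost of a small behavior difference: the serial path no longer reuses a scratch buffer either.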

@mpizenberg (Owner)

Thanks for the addition! Looks good, but I'll try to look at it more tonight or in the coming days.
Is it possible to add parallel options for the Vec API as well?
I've not really used rayon yet; can people control how many threads are spawned? For example, if there are diminishing returns in compute speed while memory increases linearly, someone with a 16-core CPU might want to limit the number of threads spawned. Is that something the caller can control, or does it have to be embedded in the API?

@nicopap (Contributor, Author) commented Apr 17, 2023

I only now realize the slice module is supposed to mirror nalgebra. Of course I can add methods to that module as well. Taking a quick glance at the rayon docs, I found this FAQ. It seems to be controlled globally, through an environment variable or rayon's global thread pool.
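So the caller can control it without any change to our API, either by setting the RAYON_NUM_THREADS environment variable or by configuring the global pool once at startup, along these lines (sketch):

    fn main() {
        // Limit rayon's global thread pool. This must run before the first
        // parallel operation; the global pool can only be configured once.
        rayon::ThreadPoolBuilder::new()
            .num_threads(4)
            .build_global()
            .expect("global thread pool already initialized");

        // ... then call the par_* functions as usual.
    }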

Do you prefer defining additional functions with a different name (as in the PR code), or re-using the same functions and just changing the body when the "parallel" feature is enabled (as in the "Alternatives" section above)?

@mpizenberg (Owner) commented Apr 17, 2023 via email

@mpizenberg (Owner) commented Apr 17, 2023 via email

@nicopap (Contributor, Author) commented Apr 19, 2023

The tests are run with cargo test --features 'rustdct parallel nalgebra'. Hmm, CI is left as an exercise for the reader 😉

@nicopap (Contributor, Author) commented Apr 22, 2023

@mpizenberg you probably overlooked the PR update. It's ready for review. There's no rush; I don't need this merged in any particular time frame. Take your time, and it's fine if you have more important matters to attend to. It's just that GitHub doesn't make it easy to know you have something pending, so I wanted to make sure you're aware.

@mpizenberg (Owner)

Yeah, it's still on my radar. I had snoozed the email for tomorrow as I have a tournament today. I might get time later today actually, we'll see.

@mpizenberg (Owner) commented Apr 23, 2023

I'm trying the parallel version of the DCT in the normal integration example and I'm seeing much less speedup than I'd expect.

    rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .unwrap();
    let f_float = f.map(|x| x as f64);
    let start = Instant::now();
    let mut f_cos = par_dct_2d(f_float);
    let duration = start.elapsed();
    println!("Time elapsed: {:?}", duration);

With this code, on my machine (Intel i7-10750H), I'm seeing 6 ms for the non-parallel version, 3.2 ms for the 8-thread version, and 3.5 ms for 4 threads. It's a bit odd to see so little speedup for something that is supposed to be very easy to parallelize (each row is independent).

For reference, the 1-thread parallel version takes 7.5 ms, compared to 6 ms for the non-parallel version.

You said that it "dramatically increases performance" in your use case. Do you mind sharing some results from this change?

@nicopap (Contributor, Author) commented Apr 23, 2023

I have a 14-core processor, and the speedup sadly isn't the 14x I would have expected; it's more like 2x (though halving the runtime of anything is already great). I've tested this on normal-map textures of size 512x512 and 1024x1024.

So there is either additional overhead, or I'm not parallelizing most of the computation. It did look like the two for loops I replaced with rayon's iterators were where most of the computation happened, and I doubt rayon's own overhead is this large.

@mpizenberg (Owner)

Okay, I'm looking at rayon to see if I can use the scope API to improve things. I'll let you know what I find.

@mpizenberg (Owner)

I'm not sure what's going on, but I'm making a mistake somewhere. Would you mind having a look at the following and letting me know if you spot something? (The test is failing; thanks for the test!)

// your impl
        let mut planner = DctPlanner::new();
        let dct_dim1 = planner.plan_dct2(height);
        mat.as_mut_slice()
            .par_chunks_exact_mut(height)
            .for_each(|buffer_dim1| dct_dim1.process_dct2(buffer_dim1));

// my failing attempt
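// intent: split the columns into ~thread_count contiguous blocks, so that each
// parallel task reuses a single scratch buffer for all the columns in its block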
        let thread_count = rayon::max_num_threads();
        let blocks_width = width / thread_count;

        let mut planner = DctPlanner::new();
        let dct_dim1 = planner.plan_dct2(height);
        mat.as_mut_slice()
            .par_chunks_mut(blocks_width * height)
            .for_each(|chunk| {
                let mut scratch = vec![0.0; dct_dim1.get_scratch_len()];
                for buffer_dim1 in chunk.chunks_exact_mut(height) {
                    dct_dim1.process_dct2_with_scratch(buffer_dim1, &mut scratch);
                }
            });

@mpizenberg (Owner)

Ok, I found my mistake. I'm continuing the experiment and will report soon.

@mpizenberg (Owner)

Well, I've attempted a parallel version that works on big blocks instead of per column, and the result isn't very conclusive: the scaling falls off in basically the same way as with the simple parallel loop. For example, I got the following results.

// 1 thread
Time during dct_dim1: 38.707426ms
Time during transpose: 4.586836ms
Time during dct_dim2: 31.471128ms
Time during slice copy: 1.571495ms
Total Time elapsed: 76.72951ms

// 2 threads
Time during dct_dim1: 20.34982ms
Time during transpose: 4.922647ms
Time during dct_dim2: 17.125235ms
Time during slice copy: 1.533194ms
Total Time elapsed: 44.326799ms

// 4 threads
Time during dct_dim1: 11.949718ms
Time during transpose: 4.962508ms
Time during dct_dim2: 14.913591ms
Time during slice copy: 1.952508ms
Total Time elapsed: 34.218193ms

Returns diminish quickly, due both to diminishing returns within the parallel DCT passes themselves and to the fact that the transposition and slice-copy overhead is there anyway. So let's keep the code simple.
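As a rough back-of-the-envelope check on those numbers (treating the transpose and slice copy as purely serial):

    parallel part (dct_dim1 + dct_dim2, 1 thread): 38.7 + 31.5 ≈ 70.2 ms
    serial part (transpose + slice copy):          4.6 + 1.6 ≈ 6.2 ms
    ideal 4-thread total: 70.2 / 4 + 6.2 ≈ 23.8 ms   (measured: 34.2 ms)

So part of the gap is the fixed serial overhead, and the rest is the DCT passes themselves only reaching about 2-3x (e.g. dct_dim2 goes from 31.5 ms to 14.9 ms on 4 threads).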

Also, it might be interesting to follow ejmahler/RustFFT#117

@mpizenberg merged commit 99ac5fe into mpizenberg:main on Apr 23, 2023
@mpizenberg (Owner)

Thanks @nicopap!
Are there more things you're interested in getting in, or should I publish a new version?

@nicopap (Contributor, Author) commented Apr 23, 2023

Awesome! Thank you for taking the time to look at this, and so carefully. I don't have any other features planned for fft2d, no.

@mpizenberg (Owner)

Alright, I'll take some time during the week to publish a new version. Don't hesitate to ping me if it's not done by next weekend. Cheers

@mpizenberg (Owner)

Alright, version 0.1.1 is now published with your improvements @nicopap

@nicopap deleted the parallel branch on August 30, 2023, 13:45