Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Is it possible to impl IntoIterator for ParallelIterator? #1179

Closed
ZhennanWu opened this issue Jul 7, 2024 · 3 comments
Closed

Comments

@ZhennanWu
Copy link

ZhennanWu commented Jul 7, 2024

This is a realistic need surfaced during development of EPGI, a parallel GUI library using rayon. I'll explain the motivation step-by-step

Motivation

Variable parallelism in GUI framework

EPGI uses rayon in various pipeline stages, and the most common way to use rayon is to visit a node's children in parallel. However, child count across the node tree is hugely variable. It is known that different rayon parallelism primitive (or simply non-rayon serial code) have vastly different performance characteristics regarding child count.

For dynamic children widgets, it calls for a runtime selection mechanism for parallelism. A Column widget painting a layout may have two children but get nested 100 times deep, while another Column widget painting a data science visualziation may have no nesting but have 10k children instead. Both of these cases are computationally heavy and need respective optimization.

Limitation of runtime selection mechanism API designs

Given a Vec, a straightforward runtime selection mechanism is impl-ed as:

    fn par_for_each_vec<T: Send, F: Fn(T) + Send + Sync>(threadpool: &ThreadPool, mut vec: Vec<T>, f: F) {
        match vec.len() {
            0 => {}
            1 => f(vec.remove(0)),
            2..=16 => {
                self.scope(|s| {
                    let f_ref = &f;
                    for elem in vec {
                        s.spawn(move |_| f_ref(elem));
                    }
                });
            }
            _ => threadpool.install(|| vec.into_par_iter().for_each(f)),
        };
    }

We can duplicate this for for_each/map/unzip and for Vec/&Vec/&mut Vec. No problem for now.

However, problem surfaces as we went into zip and MultiZip. We will provide an example below

Suppose we have a widget that could have a huge number of children. During its implementation we need to use rayon's MultiZip to zip three &Vecs. Now we want to introduce a runtime selection version

    fn par_zip3_ref_for_each_vec<
        T1: Sync,
        T2: Sync,
        T3: Sync,
        F: Fn(&T1, &T2, &T3) + Send + Sync,
    >(
        threadpool: &ThreadPool
        vec1: &Vec<T1>,
        vec2: &Vec<T2>,
        vec3: &Vec<T3>,
        f: F,
    ) {
        let len = std::cmp::min(std::cmp::min(vec1.len(), vec2.len()), vec3.len());
        match len {
            0 => {}
            1 => f(&vec1[0], &vec2[0], &vec3[0]),
            2..16 => {
                threadpool.scope(|s| {
                    let f_ref = &f;
                    for ((elem1, elem2), elem3) in std::iter::zip(std::iter::zip(vec1, vec2), vec3)
                    {
                        s.spawn(move |_| f_ref(elem1, elem2, elem3));
                    }
                });
            }
            _ => threadpool.install(|| {
                (vec1, vec2)
                    .into_par_iter()
                    .for_each(|(elem1, elem2)| f(elem1, elem2))
            })
        }
    }

Seems fine. But this only handles (&Vec, &Vec, &Vec) case. What about (Vec, &mut Vec, &Vec)? We get an combinatorial explosion! And it is just for a zip-3!

As a legit real-world exmaple, here are two pieces of code in EPGI that is waiting to be parallelized. They both use different versions of zip-3.
https://github.com/ZhennanWu/epgi/blob/850ed7556bb8d619b430f337a0c90da05de72de8/epgi-common/src/basic/stack.rs#L216-L219

https://github.com/ZhennanWu/epgi/blob/850ed7556bb8d619b430f337a0c90da05de72de8/epgi-common/src/basic/stack.rs#L238-L241

Switching between Iterator and ParallelIterator

One way to circumvent this issue is to somehow merge the capabilities of serial iterators and parallel iterators. Requiring IntoIterator as the supertrait of IntoParallelIterator seems implausible, since rayon's IntoParallelIterator is implemented for much more primitive types than std's IntoIterator, and we can't force feed the std. However, I think it is quite possible to make ParallelIterator: IntoIterator.

Then we can (psuedocode)

let it = data.into_par_iter();
if len < 16 {
    threadpool.scope(|s|{
        it.into_iter().for_each(|elem| s.spawn(move |_|f(elem)))
    })
} else {
    threadpool.install(|| it.for_each(f));
}

I wonder if this change is possible or is there any other practical concerns with regard to parallel iterators designs

(Note: of course the above examples can be worked around by introducing intermediate Vecs. But that would lose the purpose of tuning the performance in the first place. This is an experimental project so I want to see how far it can go w.r.t. zero-cost abstraction).

@cuviper
Copy link
Member

cuviper commented Jul 8, 2024

So to be clear, you are not asking for a parallel iterator with serialized output, like #210 or #858. You just want to avoid parallel overhead for smaller inputs?

IMO, the difference between rayon_spawn and rayon_par_iter at the low end is not significant enough to worry about, at least as far as your linked graphs show. I would just use par_iter for all the parallel work.

However, I think it is quite possible to make ParallelIterator: IntoIterator.

It would certainly be a breaking change.

For IndexedParallelIterator, you can get fully serialized behavior by calling .with_min_len(usize::MAX). This lowers to a regular Iterator internally once it gets to Producer::into_iter, but you won't be interacting at this level to use any other Iterator APIs.

You can tune that however you like. Maybe you'll set a known serial/parallel threshold like .with_min_len(4) -- this would have the effect that it won't split parallel at all up to len 7, and then len 8 will only split once, etc. Or you could make it more dynamic like:

indexed_par_iter
    .with_min_len(if len < 16 {
        usize::MAX // no parallelism
    } else {
        1 // full parallelism
    })
    .for_each(f);

... or some gradient therein, as you find makes sense for your workload.

To some extent, you can write your zip variations based on I: IndexedParallelIterator, as long as you also use F: Fn(I::Item) without trying to expand the zipped tuple as separate arguments. Even apart from rayon, Rust doesn't have a good way to deal with that otherwise, lacking support in generics for variadics and different mutability.

@ZhennanWu
Copy link
Author

ZhennanWu commented Jul 8, 2024

IMO, the difference between rayon_spawn and rayon_par_iter at the low end is not significant enough to worry about,

This would be true for most of the workloads. But GUI library with a virtual-dom layer is another area. The most notorious case is perhaps Google's Flutter. In realistic Flutter apps, you will frequently see widget tree up to hundreds of levels deep. If at each node (there would be numerous of nodes throughout the tree), it spins up the whole par_iter subdivide saga, then it would be a problem.

But to be fair, I currently don't have the test case to test this theory. So it may not be that impactful in the end

you can get fully serialized behavior

This is not entirely what I want. The reason to get a serialized iterator (besides switching to fully serialized behavior) is switching to rayon::spawn. rayon::spawn spawns in a serailized fashion, you know.

The reason to use spawn is that GUI widget tree is inherently very unbalanced. And parallel iterator is known to cause blocking when its tasks are unbalanced (#1054) . So par_iter is not very suitable for most of the case. But, we would like to use it for the worst possible child count. Hence the switching.

.with_min_len(if len < 16 {
    usize::MAX // no parallelism
} else {
    1 // full parallelism
})

Brilliant! I'll certainly employ this idea to my other projects.

@cuviper
Copy link
Member

cuviper commented Jul 8, 2024

If at each node (there would be numerous of nodes throughout the tree), it spins up the whole par_iter subdivide saga, then it would be a problem.

I think that par_iter "saga" is not actually as heavy as you imagine, while spawn requires an allocation (HeapJob) for every single call, as well as the counter overhead shown in #1054.

Thanks for the reminder about those completion issues though. We should find a way to make that opt-in for_each_spawn idea work in a "clean" way, or at least "not too dirty"...

For a deep widget tree, I wonder if it's really worth parallelizing the deep/smaller parts at all? It might be fine to let inner parts run sequentially while the outer parts are divided in parallel. This will mostly happen naturally anyway with both par_iter nesting and scope+spawn, until work-stealing gets to the inner stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants