
Consistent API for embarrassingly parallel routines between levels of parallelism #17887

Open
ChrisRackauckas opened this issue Aug 8, 2016 · 11 comments
Labels: multithreading (Base.Threads and related functionality), parallelism (Parallel or distributed computation)


@ChrisRackauckas
Member

It seems like it would be natural for @threads loops to allow for a reduction parameter, matching what's done for @parallel. In fact, it seems natural enough that the documentation has to make a specific mention that there isn't one. I propose that it be pretty much the same as @parallel, except over threads.
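In today's Julia such a threaded reduction can be written by hand; the sketch below (the function name `threaded_sum` and the chunking policy are illustrative assumptions, not the proposed macro) shows roughly what a reduction parameter on `@threads` would generate:

```julia
# A hand-rolled threaded sum: split the index range into one chunk per
# thread, reduce each chunk on its own task, then combine the partials.
function threaded_sum(xs::AbstractVector)
    isempty(xs) && return zero(eltype(xs))
    nchunks = min(Threads.nthreads(), length(xs))
    chunks = Iterators.partition(eachindex(xs), cld(length(xs), nchunks))
    # Reduce each chunk concurrently; fetch and combine the partial sums.
    tasks = [Threads.@spawn sum(view(xs, c)) for c in chunks]
    return sum(fetch, tasks)
end

threaded_sum(1:100)  # == 5050
```

A reduction-aware `@threads (+) for ...` could lower to essentially this pattern, just as `@parallel (+) for ...` does across workers.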

@amitmurthy
Contributor

There is also a need for a higher level API that works across process and threads. Just thinking out loud here:

  • @threads parallelizes using threads
  • @parallel parallelizes using workers across nodes
  • @parfor is a new macro that first splits the range over workers and within each worker further uses threads. The split is dependent on the number of workers and the number of threads in each worker. For nprocs()==1, @parfor is equivalent to @threads. For Threads.nthreads()==1 on the master and workers, it is equivalent to @parallel.

User code will only ever use @parfor and it will leverage both workers and threads as the case may be.
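The outer splitting step such a macro would need can be sketched as follows (the name `worker_chunks` and the even-split policy are assumptions for illustration; each worker would then subdivide its chunk across its threads):

```julia
# Split a range into `nworkers` near-equal contiguous chunks,
# distributing the remainder one extra element at a time.
function worker_chunks(r::UnitRange, nworkers::Int)
    len, rem = divrem(length(r), nworkers)
    chunks = UnitRange{Int}[]
    lo = first(r)
    for w in 1:nworkers
        hi = lo + len - 1 + (w <= rem ? 1 : 0)  # first `rem` chunks get one extra
        push!(chunks, lo:hi)
        lo = hi + 1
    end
    return chunks
end

worker_chunks(1:10, 3)  # == [1:4, 5:7, 8:10]
```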

@kshyatt kshyatt added the multithreading label Aug 8, 2016
@ChrisRackauckas
Member Author

ChrisRackauckas commented Aug 8, 2016

That would be amazing: a simple abstraction over both threading and multiprocessing.

Maybe this should be expanded to be about standardized tooling for embarrassingly parallel routines. For multiprocessing we have @parallel and pmap. Would a pbroadcast be reasonable as well (that would require some kind of memory sharing like a SharedArray, though)? In the same sense, we have @threads. I think it would be helpful to have tmap and tbroadcast (it would make #1802 easy for one to implement on their own even if it wasn't the standard base way). And then, as mentioned, have @parfor plus map and broadcast variants that are smart enough to split evenly across workers and threads as you describe, using the threading and multiprocessing constructs. Having all of these with similar APIs would make it easy to move between the levels.

For naming, I think that instead of @parfor, the top user-facing macro which builds off of both should be called @parallel. It's more intuitive. It would make sense for @parallel and p to mean this nicely abstracted parallelism, @threads and t to mean thread level, and @workers w (or @multiprocess and m) to mean multi-process level.

@ChrisRackauckas ChrisRackauckas changed the title Reductions for @threads Consistent API for embarrassingly parallel routines between levels of parallelism Aug 8, 2016
@StefanKarpinski
Member

StefanKarpinski commented Aug 9, 2016

It would be an annoying deprecation, but I would propose that this is a much better naming scheme:

  • @threaded: parallelizes using threads
  • @distributed: distributes across worker nodes
  • @parallel: threaded and distributed

But I lost an argument about calling distributed stuff "parallel" a long time ago, and now I'm not sure it would be worth going through the multi-version deprecation and renaming this would require.

@Sacha0
Member

Sacha0 commented Aug 9, 2016

Do I understand correctly that the distinction is between threads and processes rather than threads and nodes? If so, alongside 'thread' might some form of the word 'process' be more accurate than 'distribute', processes not necessarily being distributed across nodes? Forgive my ignorance. Best!

@ChrisRackauckas
Member Author

Yes, it's more of a distinction between threads and processes. You can have multiple independent processes running on the same computer (or node), so it's not necessarily what is usually meant by distributed (although it can do distributed).

But the word "process" wouldn't be smart if we want to extend the map and broadcast functions to each level, and use a naming scheme like I proposed (appending one character in front of map and broadcast). For example, would pmap be parallel map or process map?

@eschnett
Contributor

eschnett commented Aug 9, 2016

In the far future (say, a year from now), threading will work out of the box and will be efficient. I assume people will then basically want to use threading all the time when they are using distributed computing, e.g. to handle latencies. Thus the case "distributed, but not threaded" doesn't seem terribly important -- it is important now, but probably won't be in the future.

This would then lead to people using threaded and parallel, but in practice never using distributed.

I'd thus suggest going for threaded and distributed, where distributed implies threaded whenever threading is enabled.

@oxinabox
Contributor

oxinabox commented Aug 15, 2016

Beyond bikeshedding (I personally like @threaded, @distributed, and @parallelized), the implementation of this is fairly simple.
It is fairly easy to turn the current implementation of @threads for into a mapreduce.
And from there it is just the cascading.

Things that are needed to make it easier

  • a way to set the number of threads in a worker process via addprocs (I don't think we can do this right now without some hacks around ssh; it might be nicer to make the number of threads an argument to the julia program, defaulting to the env var JULIA_NUM_THREADS). (Setting env. variables like JULIA_NUM_THREADS for remote workers #18074)
  • default to 1 process per machine, with many threads.
  • a way to know how many threads a process has (possibly just an alias for remotecall_fetch(()->ENV["JULIA_NUM_THREADS"], pid), possibly just something that is saved when addprocs is done).

The last point is needed because we probably want to support asymmetric clusters, at least in terms of the number of processors (if not in terms of speed). I know my normal cluster is 12-core + 12-core + 4-core,
and my old cluster of lab machines was 4+4+4+4+8+16.
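For the last point, one possible sketch (the dictionary approach is an assumption for illustration, not an existing API) is to ask each worker directly:

```julia
using Distributed

# Record how many threads each worker process has, so a scheduler can
# split work unevenly across an asymmetric cluster.
thread_counts = Dict(pid => remotecall_fetch(Threads.nthreads, pid)
                     for pid in procs())
```

With no extra workers, `procs()` is just `[1]` and the dictionary maps the master pid to its own thread count.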

@oschulz
Contributor

oschulz commented Nov 17, 2016

Thus the case "distributed, but not threaded" doesn't seem terribly important -- it is important now, but probably won't be in the future.

I don't think that's true in all cases: sure, many user applications will just want stuff to be run in parallel using both multiple hosts and multiple threads. But more complex applications will sometimes need more control over what is done via threads and what is done distributed: for example, data partitioning/placement may have to be taken into account. Or a complex algorithm may choose to distribute an outer loop (one that is not sensitive to latency), but run an inner loop (e.g. a latency-sensitive one) on threads, with several layers of code in between.
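The nested pattern described above can be sketched with the existing building blocks (the function name `work` is a placeholder assumption; with real remote workers it would need to be defined with @everywhere):

```julia
using Distributed

# Placeholder for the real per-element computation.
work(chunk_id, i) = chunk_id * i

# Outer loop distributed across worker processes via pmap;
# inner loop run on each worker's threads.
results = pmap(1:3) do chunk_id
    vals = Vector{Int}(undef, 1000)
    Threads.@threads for i in 1:1000
        vals[i] = work(chunk_id, i)  # each iteration writes its own slot: race-free
    end
    sum(vals)
end
```

On a single process `pmap` simply runs locally, so the same code covers both the "distributed, but not threaded" and the fully nested case.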

@robsmith11
Contributor

robsmith11 commented Dec 12, 2019

Now that Threads has matured a bit, has there been any more thought to supporting functionality similar to Distributed's?

For example, I (perhaps naively) am surprised not to see an equivalent Threads function for Distributed's pmap. I've been using this simple function, which seems to work well enough for my purposes:

function tmap(f, xs::AbstractArray)
    # Build a generator only to infer the output element type,
    # the same way `map` does.
    g = Base.Generator(f, xs)
    et = Base.@default_eltype(g)
    a = Array{et}(undef, length(xs))
    # Apply `f` elementwise across the available threads.
    Threads.@threads for i in 1:length(xs)
        a[i] = f(xs[i])
    end
    a
end

@tkf
Member

tkf commented Jan 9, 2020

FYI, Transducers.jl supports "two-level" parallelism as of v0.4.11; i.e., each worker process uses multiple threads for executing reduce. It then gives us a superset of pmap that can be fused with arbitrary stateless processing like filtering and flattening. I think this already gives us a uniform API for (not so embarrassingly) parallel (and sequential) computations executed on different backends (Base.Threads and Distributed).

See also:

@ViralBShah ViralBShah added the parallelism label Jul 3, 2020
@ViralBShah
Member

Seems like this issue is still relevant.
