multi-threaded (@threads) dotcall/broadcast? #19777
I don't think this is something that should go into Base itself. The beauty of Simon's approach is that it is general and works for distributed arrays, GPU arrays, and native arrays. I concede that it might feel a bit unnatural for native arrays, and I wouldn't be opposed to adding a macro that transforms dot calls into threaded broadcasts.
Regardless of the name of the macro, it would be nice to have something that just involved a decorator and didn't involve re-allocating or wrapping all of your arrays in some other type. This is the big distinction between threads and GPUs or distributed memory: with threads, you don't need to decide in advance to put your data on a GPU or in another process.
I'd suggest having a macro similar to `@fastmath` that replaces broadcast calls with multi-threaded versions.
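As a rough illustration of what such a macro could lower to, here is a minimal threaded kernel for the in-place unary case (the name `tbroadcast!` and the same-shape restriction are assumptions of this sketch, not an existing API):

```julia
# Sketch: a threaded in-place "broadcast" for the unary, same-shape case.
# A hypothetical macro could rewrite `dest .= f.(src)` into a loop like this.
function tbroadcast!(f, dest::AbstractArray, src::AbstractArray)
    axes(dest) == axes(src) || throw(DimensionMismatch("arrays must have matching axes"))
    Threads.@threads for i in eachindex(dest, src)
        @inbounds dest[i] = f(src[i])
    end
    return dest
end

x = collect(1.0:8.0)
y = similar(x)
tbroadcast!(abs2, y, x)
```

This assumes `f` is thread-safe, exactly as the proposal in this issue does.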
Would it be too crazy for broadcast to be implicitly parallel once threading is stable?
Could this get a 1.0 milestone? At least some kind of macro would be very useful, if not implicit parallelism (or a mixture: default implicit parallelism, which can be overridden with a macro). Since broadcasting is such a cool feature, this would really complete the story.
cc @lkuper
If anyone wants to tackle this (developing this outside of Base at first is probably a good idea), my roadmap/ideas would be:
For other inspiration, take a look at parallel collections in Scala http://docs.scala-lang.org/overviews/parallel-collections/overview.html (which were the inspiration for Java 8).
Just as an FYI: I'm already getting good speedups out of this.
I like the idea of letting the user choose.
Thank you.
It would be interesting if the heuristic could be applied to arbitrary loops via some macro as well.
@ChrisRackauckas, parallelizing arbitrary loops is precisely what `Threads.@threads` does. @vchuravy, I'm not sure I like the idea of a special array type, vs. just a macro.
I was asking if there could be a way to apply whatever implicit parallelism heuristic to a loop. Essentially a macro for "multithread this if the size of the array is greater than x", or whatever is involved in the heuristic, with tweakable options. Then broadcast would essentially just apply that with the defaults. Your proposal just has a macro, but I'm wondering if implicit parallelism can be added as well. Otherwise I could see applications wanting a bunch of conditionals to check whether multithreading should be run. That last part depends on the overhead of multithreading (which I found to be measurable in many small problems, but the benchmarks may be mixed up with #15276 issues).
@ChrisRackauckas, it's not the size of the array, but the expense of the loop iterations that matters. There's also the issue of load balancing if the loop iterations have unequal cost. I agree that you want to automate this (both deciding how many threads to use and how to load balance) to the extent possible. My understanding is that Cilk (and subsequently OpenMP) mostly solved this issue. Anyway, I see that as orthogonal to this issue.
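One rough sketch of the tweakable-threshold idea discussed above (the names `TBCAST_CUTOFF` and `maybe_threaded_map!` are hypothetical, and the cutoff value is an arbitrary illustration, not a measured number):

```julia
# Sketch of a size heuristic: thread only when the work is large enough
# to amortize scheduling overhead. A Ref makes the cutoff tunable at runtime.
const TBCAST_CUTOFF = Ref(2^12)

function maybe_threaded_map!(f, dest::AbstractArray, src::AbstractArray)
    if length(dest) >= TBCAST_CUTOFF[] && Threads.nthreads() > 1
        Threads.@threads for i in eachindex(dest, src)
            @inbounds dest[i] = f(src[i])
        end
    else
        # Small problem (or single thread): plain serial loop, no task overhead.
        @inbounds for i in eachindex(dest, src)
            dest[i] = f(src[i])
        end
    end
    return dest
end

y = maybe_threaded_map!(sqrt, zeros(4), [1.0, 4.0, 9.0, 16.0])
```

As the reply above notes, iteration cost matters as much as array size, so a single length cutoff is only a crude proxy.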
Ref. #18278 (comment) |
Related #1802 (or is this a duplicate?) |
@stevengj, there is a lot to consider in deciding whether or not to multithread a loop.
It would be nice to initially have a simple-case version of the macro as explained in the first post, for cases like:

```julia
function threadedcos(x::AbstractArray)
    out = similar(x)
    Threads.@threads for i in eachindex(x)
        out[i] = cos(x[i])
    end
    return out
end
```
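Generalizing the `threadedcos` pattern to an arbitrary function is straightforward; a sketch (the name `tmap` is hypothetical, and `Base.promote_op` is an internal detail used here to pick the output element type):

```julia
# Sketch: threaded out-of-place map over an array, generalizing threadedcos.
function tmap(f, x::AbstractArray)
    out = similar(x, Base.promote_op(f, eltype(x)))
    Threads.@threads for i in eachindex(x)
        @inbounds out[i] = f(x[i])
    end
    return out
end

z = tmap(cos, zeros(16))
```

With this, `threadedcos(x)` is just `tmap(cos, x)`.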
Using GPUArrays you can automatically accelerate broadcast.
I wonder if it wouldn't be possible to have some sort of option to switch out which function is used for `.` broadcasting.
Ref. the discussion around #16285 (comment). Best! |
Do note that this should not be the default or a global setting, at least not before we require every function to be thread-safe. No guarantee on execution order is a much weaker requirement than thread safety.
It should probably be noted here, for those looking for this feature, that such a macro has been implemented in Strided.jl.
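For reference, usage along the lines of the Strided.jl documentation looks roughly like the following (assumes the third-party package is installed; check its README for the exact, current API):

```julia
using Strided  # third-party package providing the @strided macro

A = randn(1000, 1000)
B = similar(A)
# The macro wraps the arrays so the broadcast runs over strided views,
# which Strided.jl can evaluate blockwise and multithreaded.
@strided B .= (A .+ A') ./ 2
```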
How hard would this be to do? It would be really nice to get a ~10x speedup with a macro for lots of the easy cases. |
It would be nice to be able to put `@threads` in front of a dot call, e.g. `@threads X .= f.(Y)`, and have it call a multi-threaded version of `broadcast` (which would assume `f` is thread-safe).
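A minimal sketch of what such a macro could look like for just this one pattern (the name `@tbroadcast` is hypothetical; a real implementation would have to handle the full range of broadcast expressions, multiple arguments, and fusion):

```julia
# Hypothetical macro: rewrites the single pattern `dest .= f.(src)`
# into an explicitly threaded loop. Anything else is an error.
macro tbroadcast(ex)
    if ex isa Expr && ex.head === Symbol(".=") &&
            ex.args[2] isa Expr && ex.args[2].head === Symbol(".")
        dest = esc(ex.args[1])            # X in `X .= f.(Y)`
        f    = esc(ex.args[2].args[1])    # f
        src  = esc(ex.args[2].args[2].args[1])  # Y (first dot-call argument)
        return quote
            let d = $dest, s = $src
                Threads.@threads for i in eachindex(d, s)
                    @inbounds d[i] = ($f)(s[i])
                end
                d
            end
        end
    end
    return :(error("@tbroadcast only supports the pattern `dest .= f.(src)`"))
end

X = zeros(8)
Y = collect(1.0:8.0)
@tbroadcast X .= abs2.(Y)
```

Like the proposal itself, this simply assumes `f` is thread-safe and makes no guarantee about execution order.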