Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel Computing Planning Issue #9167

Closed
12 tasks
alanedelman opened this issue Nov 26, 2014 · 53 comments
Closed
12 tasks

Parallel Computing Planning Issue #9167

alanedelman opened this issue Nov 26, 2014 · 53 comments
Labels
parallelism Parallel or distributed computation

Comments

@alanedelman
Copy link
Contributor

As we talk about parallel computing in various contexts, it might be handy to keep in mind
all the high level moving parts. Here are some that come to mind, just to get the ball rolling.

Hardware

Tools

  • Nice Graphical Performance Tools and Instrumentation (My first wish: just
    have Green (working) / Red (idle) on every processor easily every time I do parallel julia)

Programming Models

  • Persistent Data as in Global Array Syntax Darrays (Star-P and others)
  • CILK/Spawns
  • Map/reduce etc.

Communication Layers

  • ZMQ, MPI, sockets

Schedulers

  • ??
  • Does one master know all state?
  • Can our serial model be in conflict with various parallel models?

Libraries

  • Scalapack, parallel FFTW?, sparse matrices, PETSc , (star-p like to compare
    with more pure julia methods?)
@ViralBShah ViralBShah added the parallelism Parallel or distributed computation label Nov 26, 2014
@tknopp
Copy link
Contributor

tknopp commented Nov 26, 2014

Regarding shared memory multithreading the threads branch seems to be the closest thing. #6741 was my experiment on multithreading and this work is superseded by the threads branch.

@andreasnoack
Copy link
Member

Faster data movement

This is partly about communication layers and partly about libraries. I'd like to write pure julia implementations of large scale parallel dense linear algebra functions, but it is difficult to get good perfomance in the present state of Julia's parallel functionality because the communication is slow. The graphs below show timings for moving data to a remote process as a function of array size for present @spawn, @spawn with the changes proposed in #6876 and send and Send from MPI.jl.
muck svg
Notice the log-scale which means that right now @spawns are ten times slower than MPI for large arrays and even slower for small arrays, although the variance is very large for small array @spawns.
Also relevant: #9046.

@elextr
Copy link

elextr commented Nov 27, 2014

I'd add one to the hardware section - NUMA aware memory strategies

@amitmurthy
Copy link
Contributor

addprocs on demand. What I mean by this is that any user should be able to summon thousands of cores, run a parallel algorithm and pay only for the minutes used.

This will be in a separate package, but changes will be required in Base to support this.

A possible model could be JuliaLang backed docker images that can be run on various cloud infrastructures. The package accepts the users credentials and manages the launch, tear down and billing of julia workers used.

@eschnett
Copy link
Contributor

I'm slightly confused about how Julia's talking model will evolve in the presence of multi-threading. I always assumed that Julia's tasks, which are currently executed serially, would be executed in parallel in the future. However, when I look at the threading branch that may not be the case. Has there been a discussion about this?

@tkelman
Copy link
Contributor

tkelman commented Nov 27, 2014

@andreasnoack that is excellent data, thanks for collecting that

I'll note that sadly none of the open-source MPI implementations can build on non-posix platforms right now, but I've got some back-burner plans that involve filling in whatever's missing possibly involving libuv along the way and trying to fix them. Almost noone uses Windows for HPC, but it's mildly annoying that I can't run MPI libraries on 8 local cores if I happen to be using Windows. There is a Microsoft MPI implementation but I doubt many computational MPI libraries work with it. We should still do as much as possible to make Julia play well with MPI, cross-platform or no.

To the Libraries list I'd like to add https://github.com/elemental/Elemental, @poulson has expressed interest in adding Julia bindings and it would be good to know if we need to change anything to make that happen.

@eschnett
Copy link
Contributor

Microsoft's MPI seems to be built on MPICH, one of the mainstream MPI implementations. (I'm reading the description, I haven't tried it.) MPI implementations tends to be very portable (i.e. very close to the standard), and I would thus expect this implementation to work out of the box with any MPI-using application under Windows.

@poulson
Copy link

poulson commented Nov 27, 2014

@tkelman I would assume that I can manually expose an IJulia interface in the same way that I did for IPython (see http://web.stanford.edu/~poulson/notes/2014/11/11/Elemental-IPython.html).

I'm currently wrapping up initial dense and sparse-direct distributed-memory implementations of Mehotra Predictor-Corrector Interior Point Methods for LPs (though I am not yet specially handling small numbers of "dense" rows/columns when forming the normal form of the (ideally) sparse KKT system). I hope to add distributed dense versions of QP IPMs soon, with distributed dense and sparse-direct SDPs being my ultimate goal.

@ViralBShah
Copy link
Member

Why do the libraries not work with Microsoft MPI? What have they done?

@ViralBShah
Copy link
Member

I wonder if much of the time spent is in spawning rather than data movement, at small sizes.

@tkelman
Copy link
Contributor

tkelman commented Nov 27, 2014

@poulson the IPython approach does look interesting, did not know about that (see http://ipython.org/ipython-doc/2/parallel/index.html for anyone else who's curious). We should see how much of that also works through IJulia - maybe all of it does?

Regarding the linear algebra details and optimization applications, do you have a roadmap issue? I'm hoping you won't limit yourself to convex problems, the nonlinear programming solvers of the world are desperately in need of redistributable parallel sparse linear algebra libraries, of which Elemental might be the first to actually fit the bill.

@ViralBShah I'm mostly worried about build systems not knowing how to find and use Microsoft's MPI implementation. It's also not open source so I can't cross-compile it so using WinRPM for binaries is out of the picture.

@poulson
Copy link

poulson commented Nov 27, 2014

@tkelman I apologize for taking this discussion a bit off-topic. But, somewhat in order, my current roadmap is robust sequential/distributed dense/sparse implementations of LPs, QPs, SOCPs, and SDPs which can make use of accelerators on each node.

On the subject of parallelization with IJulia: I have little doubt that it would be straightforward.

@jakebolewski
Copy link
Member

Has the parallel support in IPython been refactored enough to be python agnostic? Last time I used it the implementation was very much tied to using the python kernel.

@poulson
Copy link

poulson commented Nov 27, 2014

@jakebolewski I haven't checked. I'm not saying that it should be blackbox, but that it is conceptually straightforward. I would assume that a complete IJulia implementation would duplicate said functionality with Julia workers instead of Python workers. If someone is seriously interested in this, I'm more than happy to discuss further (and possibly considering adding said functionality myself).

@amitmurthy
Copy link
Contributor

@poulson regular addprocs should work within an IJulia session too, right? And my understanding is that the workers will be around till the kernel is shutdown.

@poulson
Copy link

poulson commented Nov 27, 2014

@amitmurthy I was honestly not familiar with that function, and it looks like it might. But I wouldn't want to promise anything until I have a working prototype.

@poulson
Copy link

poulson commented Nov 27, 2014

@ViralBShah MPICH makes use of pthreads, which Windows does not support.

@tkelman
Copy link
Contributor

tkelman commented Nov 27, 2014

winpthreads works reasonably well, to be fair, and it's a small permissively licensed library. LLVM might force us off the win32-threads version of MinGW at some point anyway.

@ViralBShah
Copy link
Member

@amitmurthy 's recent work on allowing different transports is worth trying out here on this simple benchmark.

@andreasnoack Can you create a set of benchmarks in test/perf, and put this benchmark in there? We will certainly want to add to that list, and it would be great if everyone can run these.

@andreasnoack
Copy link
Member

@ViralBShah I'll try to compile some tests for this.

@wildart
Copy link
Member

wildart commented Dec 3, 2014

Here is an article by John Langford, Allreduce (or MPI) vs. Parameter server approaches, where he reviewed various parallelization approaches used in machine learning.

@classner
Copy link

classner commented Feb 2, 2015

Hi everybody!
As Julia is making great progress, it's becoming really interesting for image processing. (Local) threading is really important here, though. I dug through the various discussions and found this discussion and the "threads" branch.

Apparently, the implementation in that branch is working. The basic tests are passing, though one test relying on the "Images" module currently doesn't work (since the Images module requires functionality available only in the latest Julia version, and I didn't go through a merge since it couldn't be done automatically). What are the plans regarding local parallelism? Are there any plans to merge the "threads" branch? It looks promising!

@timholy
Copy link
Member

timholy commented Feb 2, 2015

Other than a few experiments back in the very earliest days, I have not been involved in the threads branch at all. So take this with a big grain of salt.

I think it's fair to say that there certainly are plans to merge the branch. Guessing here, but presumably what it needs is (1) some free time for its principal developers (including one who needs to defend his thesis) to finish not-yet-implemented features or bug fixes, and (2) people kicking the tires. There's a chance you might be able to help the process along by doing that merge you described above and submitting a PR against the threads branch. If you are brave and willing to start using it in regular work, continued PRs against the threads branch would surely go a long ways to speeding the merger.

@StefanKarpinski
Copy link
Member

The plan is to merge the threading work into master soon, but leave it off by default since it's still quite experimental. The main purpose of merging is to avoid continually rebasing when people make changes to the internals on master that break the threading functionality. That should make it easier for people to try out threading functionality even though it will still be experimental during the 0.4 cycle.

@classner
Copy link

classner commented Feb 2, 2015

That sounds very good! I had a quick look on what the main merge problems were. A critical part of changes concerns the garbage collection, which is at the very heart of Julia and cannot be merged without quite some reading/knowledge of the internals.

@StefanKarpinski
Copy link
Member

Yes, it's a pretty non-trivial merge.

@ViralBShah
Copy link
Member

The GC patch was ready and tested, and merging it first meant that work remains to make it thread safe. As said before, we expect to have this in 0.4, even if disabled by default.

There is a threading branch that is more recent than the threads branch, which is updated to just before the GC merge. Cc: @kpamnany

@jakebolewski
Copy link
Member

Does the threading branch work on OSX / Windows or is it Linux only for the time being?

@ViralBShah
Copy link
Member

I don't think anyone has tried it on OS X. I wouldn't even dare try windows. I personally have only tried it on Linux, and I also see a few segfaults.

@kpamnany
Copy link
Contributor

kpamnany commented Feb 3, 2015

Haven't tried OS X or Windows, but the former shouldn't be too hard to get running. The threading infrastructure mostly needs only pthreads and atomics. There's an LLVM dependency, for thread-locals, but @Keno has a functional hack in an llvm-svn branch (I don't know if this is Linux-specific though). Getting this functionality into LLVM is another TODO.

Reentrancy in the runtime (specifically in the new GC) is the main block right now. And there are, of course, lots of functions in the standard library that need to be made thread-safe, or thread-safe versions added. Most crashes you'll find in the threading branch relate to such problems.

Lots more functionality to add, but it mostly works, and scales quite nicely. :-) See test/perf/threads/laplace3d.

@Keno
Copy link
Member

Keno commented Feb 3, 2015

The LLVM patch is Linux-only at this point, but give me a week or two.

@classner
Copy link

classner commented Feb 3, 2015

Yes, the scaling is near perfect and the functionality looks very promising and impressive!

Do you have plans already for the variable sharing policy? I.e., what variables will be treated thread-local and when are variables syncronized between threads.

@kpamnany
Copy link
Contributor

kpamnany commented Feb 3, 2015

Threads run Julia functions (or blocks that are converted into functions). Block-scoped variables are on the stack, hence thread-local. Variables in an outer scope will be shared. Directive-based privatization a la OpenMP is not high on the priority list (to be honest, it isn't on the list at all, but reductions are). Is this important?

Synchronization support will initially be explicit, i.e. locks and atomics. These should eventually be used to build a library of concurrent data structures.

@kpamnany
Copy link
Contributor

kpamnany commented Feb 3, 2015

In response to @eschnett (from a while ago--sorry, only saw this when I was @mentioned), the interaction between tasks and threads has been the subject of much discussion. Cilk-style parallelism (i.e. fork/join with work stealing) will work best for this, but alternative shared memory parallel programming models are still some ways out. This is simple loop parallelism, and it isn't clear how tasks can/should interact with this model.

@classner
Copy link

classner commented Feb 3, 2015

Thanks for the detailed response! Having a possibility to synchronize, and the simple loop parallelism will be perfectly sufficient. I think the OpenMP directives do make sense to simplify data handling on NUMA architectures. At the same time, they're way out of scope for an experimental feature...

@StefanKarpinski
Copy link
Member

off-topic: @kpamnany, I love your avatar :-)

@kpamnany
Copy link
Contributor

kpamnany commented Feb 3, 2015

@StefanKarpinski, I figured it was the done thing for Julia devs, given yours and Jeff's. :-)

@skariel
Copy link
Contributor

skariel commented Feb 19, 2015

please consider nanomsg it is by the same author and supercedes zeroMQ

@pao
Copy link
Member

pao commented Feb 19, 2015

Julia already supports ZMQ--it's needed to communicate with IPython (Jupyter), so it's sensible to pursue. I think the transports are supposed to be pluggable, so if you're interested in a nanomsg transport, you're welcome to build it!

@garborg
Copy link
Contributor

garborg commented Feb 19, 2015

@skariel A sensible place to start: https://github.com/quinnj/Nanomsg.jl

@ChrisRackauckas
Copy link
Member

I was wrong! Good work!

@IainNZ
Copy link
Member

IainNZ commented Apr 9, 2015

@ChrisRackauckas do you mean fusing multiple linear algebra operations? Because aren't all the basic linear algebra operations parallel already? As for for loops, we have @parallel already but soon enough we'll have @threads too.

@ChrisRackauckas
Copy link
Member

Wow, turns out I was just wrong. When I last used Julia I don't remember this, but I just ran a simple test with multiplying two random matrices of size 10,000 x 10,000 and it did max all my cores. I think that it should be noted in the documentation (what other functions are parallel?).

Also, did not know about @parallel.

FYI, hearing this (and noticing that MATLAB's native parallelization doesn't come close to using the full power of my machine...) finally makes me a convert. The only thing I feel like I am missing is easy CUDA support (i.e. as simple as send the array over to the GPU and now all matrix operations and standard functions are performed on the GPU) but that's nice but not truly necessary.

@IainNZ
Copy link
Member

IainNZ commented Apr 9, 2015

Not sure which operations exactly, but basically "linear algebra on Float64s" for sure. Definitely check out the manual, it covers the parallel computing functionality of Julia pretty well - PRs welcome for improvements as well. CUDA support isn't a base Julia issue (IMO) given its dependent on an external closed source library, but there is JuliaGPU/CUBLAS.jl which is getting there for higher level abstractions: (also contributions welcome there, I'm sure)

    A = rand(elty,m,n)
    d_A = CudaArray(A)
    # test y = alpha*(A*x)
    x = rand(elty,n)
    d_x = CudaArray(x)
    y1 = alpha*(A*x)
    y2 = A*x
    d_y1 = CUBLAS.gemv('N',alpha,d_A,d_x)
    d_y2 = CUBLAS.gemv('N',d_A,d_x)
    h_y1 = to_host(d_y1)
    h_y2 = to_host(d_y2)
    @test_approx_eq(y1,h_y1)
    @test_approx_eq(y2,h_y2)

@ChrisRackauckas
Copy link
Member

The only mention I could find of that functionality is hinted at in the peakflops function on this page: http://julia.readthedocs.org/en/latest/stdlib/linalg/. There is no mention in the parallel computing page http://julia.readthedocs.org/en/latest/manual/parallel-computing/ or the linear algebra page http://julia.readthedocs.org/en/latest/manual/linear-algebra/.

@johnmyleswhite
Copy link
Member

Arguably this isn't something that should be documented as Julia's behavior since it's really the behavior of OpenBLAS.

@StefanKarpinski
Copy link
Member

@ChrisRackauckas, where did the impression that Julia was not maxing out a system's cores when doing things like matmul come from? Doing so is a matter of using a well-tuned multithreaded BLAS – which we always have, albeit a different one from Matlab (OpenBLAS vs. MKL). There are other vectorized operations where Matlab has hand-coded implementations in C that use multiple cores, while Julia's implementation, being written in Julia, uses a single core. Often that deficit can be compensated for by devectorizing the code and improving the serial performance. In the future, Julia will support threading natively, and that issue will go away too. The distributed computing issue is largely orthogonal to all of this, but it also important if one wants to do HPC-style work.

@ViralBShah
Copy link
Member

Once we have multi-threaded Julia, it won't be only hand coded internal functions in C that will be multi-threaded, but the entire Julia Base library. So while we have some distance to cover, we will have something great in the 0.5 timeframe.

@ChrisRackauckas
Copy link
Member

Thank you for the explanation, that clears it up. I look forward to multi-threaded Julia.

@ViralBShah
Copy link
Member

This planning issue seems too broad. Does it break into smaller tasks that can have their own issues? I am tempted to close this, and have specific issues dealing with specific problems.

@ViralBShah
Copy link
Member

I am closing this as very generic. A bunch of specific things are being done already.

@classner
Copy link

classner commented Sep 8, 2015

Where will the status of the threading branch be tracked? And what is its current status? That one was very promising! :)

@ViralBShah
Copy link
Member

#1790

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallelism Parallel or distributed computation
Projects
None yet
Development

No branches or pull requests