-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel Computing Planning Issue #9167
Comments
Regarding shared memory multithreading the |
Faster data movementThis is partly about communication layers and partly about libraries. I'd like to write pure julia implementations of large scale parallel dense linear algebra functions, but it is difficult to get good perfomance in the present state of Julia's parallel functionality because the communication is slow. The graphs below show timings for moving data to a remote process as a function of array size for present |
I'd add one to the hardware section - NUMA aware memory strategies |
This will be in a separate package, but changes will be required in Base to support this. A possible model could be JuliaLang backed docker images that can be run on various cloud infrastructures. The package accepts the users credentials and manages the launch, tear down and billing of julia workers used. |
I'm slightly confused about how Julia's talking model will evolve in the presence of multi-threading. I always assumed that Julia's tasks, which are currently executed serially, would be executed in parallel in the future. However, when I look at the threading branch that may not be the case. Has there been a discussion about this? |
@andreasnoack that is excellent data, thanks for collecting that I'll note that sadly none of the open-source MPI implementations can build on non-posix platforms right now, but I've got some back-burner plans that involve filling in whatever's missing possibly involving libuv along the way and trying to fix them. Almost noone uses Windows for HPC, but it's mildly annoying that I can't run MPI libraries on 8 local cores if I happen to be using Windows. There is a Microsoft MPI implementation but I doubt many computational MPI libraries work with it. We should still do as much as possible to make Julia play well with MPI, cross-platform or no. To the Libraries list I'd like to add https://github.com/elemental/Elemental, @poulson has expressed interest in adding Julia bindings and it would be good to know if we need to change anything to make that happen. |
Microsoft's MPI seems to be built on MPICH, one of the mainstream MPI implementations. (I'm reading the description, I haven't tried it.) MPI implementations tends to be very portable (i.e. very close to the standard), and I would thus expect this implementation to work out of the box with any MPI-using application under Windows. |
@tkelman I would assume that I can manually expose an IJulia interface in the same way that I did for IPython (see http://web.stanford.edu/~poulson/notes/2014/11/11/Elemental-IPython.html). I'm currently wrapping up initial dense and sparse-direct distributed-memory implementations of Mehotra Predictor-Corrector Interior Point Methods for LPs (though I am not yet specially handling small numbers of "dense" rows/columns when forming the normal form of the (ideally) sparse KKT system). I hope to add distributed dense versions of QP IPMs soon, with distributed dense and sparse-direct SDPs being my ultimate goal. |
Why do the libraries not work with Microsoft MPI? What have they done? |
I wonder if much of the time spent is in spawning rather than data movement, at small sizes. |
@poulson the IPython approach does look interesting, did not know about that (see http://ipython.org/ipython-doc/2/parallel/index.html for anyone else who's curious). We should see how much of that also works through IJulia - maybe all of it does? Regarding the linear algebra details and optimization applications, do you have a roadmap issue? I'm hoping you won't limit yourself to convex problems, the nonlinear programming solvers of the world are desperately in need of redistributable parallel sparse linear algebra libraries, of which Elemental might be the first to actually fit the bill. @ViralBShah I'm mostly worried about build systems not knowing how to find and use Microsoft's MPI implementation. It's also not open source so I can't cross-compile it so using WinRPM for binaries is out of the picture. |
@tkelman I apologize for taking this discussion a bit off-topic. But, somewhat in order, my current roadmap is robust sequential/distributed dense/sparse implementations of LPs, QPs, SOCPs, and SDPs which can make use of accelerators on each node. On the subject of parallelization with IJulia: I have little doubt that it would be straightforward. |
Has the parallel support in IPython been refactored enough to be python agnostic? Last time I used it the implementation was very much tied to using the python kernel. |
@jakebolewski I haven't checked. I'm not saying that it should be blackbox, but that it is conceptually straightforward. I would assume that a complete IJulia implementation would duplicate said functionality with Julia workers instead of Python workers. If someone is seriously interested in this, I'm more than happy to discuss further (and possibly considering adding said functionality myself). |
@poulson regular |
@amitmurthy I was honestly not familiar with that function, and it looks like it might. But I wouldn't want to promise anything until I have a working prototype. |
@ViralBShah MPICH makes use of pthreads, which Windows does not support. |
winpthreads works reasonably well, to be fair, and it's a small permissively licensed library. LLVM might force us off the win32-threads version of MinGW at some point anyway. |
@amitmurthy 's recent work on allowing different transports is worth trying out here on this simple benchmark. @andreasnoack Can you create a set of benchmarks in test/perf, and put this benchmark in there? We will certainly want to add to that list, and it would be great if everyone can run these. |
@ViralBShah I'll try to compile some tests for this. |
Here is an article by John Langford, Allreduce (or MPI) vs. Parameter server approaches, where he reviewed various parallelization approaches used in machine learning. |
Hi everybody! Apparently, the implementation in that branch is working. The basic tests are passing, though one test relying on the "Images" module currently doesn't work (since the Images module requires functionality available only in the latest Julia version, and I didn't go through a merge since it couldn't be done automatically). What are the plans regarding local parallelism? Are there any plans to merge the "threads" branch? It looks promising! |
Other than a few experiments back in the very earliest days, I have not been involved in the threads branch at all. So take this with a big grain of salt. I think it's fair to say that there certainly are plans to merge the branch. Guessing here, but presumably what it needs is (1) some free time for its principal developers (including one who needs to defend his thesis) to finish not-yet-implemented features or bug fixes, and (2) people kicking the tires. There's a chance you might be able to help the process along by doing that merge you described above and submitting a PR against the threads branch. If you are brave and willing to start using it in regular work, continued PRs against the threads branch would surely go a long ways to speeding the merger. |
The plan is to merge the threading work into master soon, but leave it off by default since it's still quite experimental. The main purpose of merging is to avoid continually rebasing when people make changes to the internals on master that break the threading functionality. That should make it easier for people to try out threading functionality even though it will still be experimental during the 0.4 cycle. |
That sounds very good! I had a quick look on what the main merge problems were. A critical part of changes concerns the garbage collection, which is at the very heart of Julia and cannot be merged without quite some reading/knowledge of the internals. |
Yes, it's a pretty non-trivial merge. |
The GC patch was ready and tested, and merging it first meant that work remains to make it thread safe. As said before, we expect to have this in 0.4, even if disabled by default. There is a |
Does the |
I don't think anyone has tried it on OS X. I wouldn't even dare try windows. I personally have only tried it on Linux, and I also see a few segfaults. |
Haven't tried OS X or Windows, but the former shouldn't be too hard to get running. The threading infrastructure mostly needs only pthreads and atomics. There's an LLVM dependency, for thread-locals, but @Keno has a functional hack in an llvm-svn branch (I don't know if this is Linux-specific though). Getting this functionality into LLVM is another TODO. Reentrancy in the runtime (specifically in the new GC) is the main block right now. And there are, of course, lots of functions in the standard library that need to be made thread-safe, or thread-safe versions added. Most crashes you'll find in the threading branch relate to such problems. Lots more functionality to add, but it mostly works, and scales quite nicely. :-) See |
The LLVM patch is Linux-only at this point, but give me a week or two. |
Yes, the scaling is near perfect and the functionality looks very promising and impressive! Do you have plans already for the variable sharing policy? I.e., what variables will be treated thread-local and when are variables syncronized between threads. |
Threads run Julia functions (or blocks that are converted into functions). Block-scoped variables are on the stack, hence thread-local. Variables in an outer scope will be shared. Directive-based privatization a la OpenMP is not high on the priority list (to be honest, it isn't on the list at all, but reductions are). Is this important? Synchronization support will initially be explicit, i.e. locks and atomics. These should eventually be used to build a library of concurrent data structures. |
In response to @eschnett (from a while ago--sorry, only saw this when I was @mentioned), the interaction between tasks and threads has been the subject of much discussion. Cilk-style parallelism (i.e. fork/join with work stealing) will work best for this, but alternative shared memory parallel programming models are still some ways out. This is simple loop parallelism, and it isn't clear how tasks can/should interact with this model. |
Thanks for the detailed response! Having a possibility to synchronize, and the simple loop parallelism will be perfectly sufficient. I think the OpenMP directives do make sense to simplify data handling on NUMA architectures. At the same time, they're way out of scope for an experimental feature... |
off-topic: @kpamnany, I love your avatar :-) |
@StefanKarpinski, I figured it was the done thing for Julia devs, given yours and Jeff's. :-) |
please consider nanomsg it is by the same author and supercedes zeroMQ |
Julia already supports ZMQ--it's needed to communicate with IPython (Jupyter), so it's sensible to pursue. I think the transports are supposed to be pluggable, so if you're interested in a nanomsg transport, you're welcome to build it! |
@skariel A sensible place to start: https://github.com/quinnj/Nanomsg.jl |
I was wrong! Good work! |
@ChrisRackauckas do you mean fusing multiple linear algebra operations? Because aren't all the basic linear algebra operations parallel already? As for for loops, we have |
Wow, turns out I was just wrong. When I last used Julia I don't remember this, but I just ran a simple test with multiplying two random matrices of size 10,000 x 10,000 and it did max all my cores. I think that it should be noted in the documentation (what other functions are parallel?). Also, did not know about @parallel. FYI, hearing this (and noticing that MATLAB's native parallelization doesn't come close to using the full power of my machine...) finally makes me a convert. The only thing I feel like I am missing is easy CUDA support (i.e. as simple as send the array over to the GPU and now all matrix operations and standard functions are performed on the GPU) but that's nice but not truly necessary. |
Not sure which operations exactly, but basically "linear algebra on Float64s" for sure. Definitely check out the manual, it covers the parallel computing functionality of Julia pretty well - PRs welcome for improvements as well. CUDA support isn't a base Julia issue (IMO) given its dependent on an external closed source library, but there is JuliaGPU/CUBLAS.jl which is getting there for higher level abstractions: (also contributions welcome there, I'm sure) A = rand(elty,m,n)
d_A = CudaArray(A)
# test y = alpha*(A*x)
x = rand(elty,n)
d_x = CudaArray(x)
y1 = alpha*(A*x)
y2 = A*x
d_y1 = CUBLAS.gemv('N',alpha,d_A,d_x)
d_y2 = CUBLAS.gemv('N',d_A,d_x)
h_y1 = to_host(d_y1)
h_y2 = to_host(d_y2)
@test_approx_eq(y1,h_y1)
@test_approx_eq(y2,h_y2) |
The only mention I could find of that functionality is hinted at in the peakflops function on this page: http://julia.readthedocs.org/en/latest/stdlib/linalg/. There is no mention in the parallel computing page http://julia.readthedocs.org/en/latest/manual/parallel-computing/ or the linear algebra page http://julia.readthedocs.org/en/latest/manual/linear-algebra/. |
Arguably this isn't something that should be documented as Julia's behavior since it's really the behavior of OpenBLAS. |
@ChrisRackauckas, where did the impression that Julia was not maxing out a system's cores when doing things like matmul come from? Doing so is a matter of using a well-tuned multithreaded BLAS – which we always have, albeit a different one from Matlab (OpenBLAS vs. MKL). There are other vectorized operations where Matlab has hand-coded implementations in C that use multiple cores, while Julia's implementation, being written in Julia, uses a single core. Often that deficit can be compensated for by devectorizing the code and improving the serial performance. In the future, Julia will support threading natively, and that issue will go away too. The distributed computing issue is largely orthogonal to all of this, but it also important if one wants to do HPC-style work. |
Once we have multi-threaded Julia, it won't be only hand coded internal functions in C that will be multi-threaded, but the entire Julia Base library. So while we have some distance to cover, we will have something great in the 0.5 timeframe. |
Thank you for the explanation, that clears it up. I look forward to multi-threaded Julia. |
This planning issue seems too broad. Does it break into smaller tasks that can have their own issues? I am tempted to close this, and have specific issues dealing with specific problems. |
I am closing this as very generic. A bunch of specific things are being done already. |
Where will the status of the threading branch be tracked? And what is its current status? That one was very promising! :) |
As we talk about parallel computing in various contexts, it might be handy to keep in mind
all the high level moving parts. Here are some that come to mind, just to get the ball rolling.
Hardware
Tools
have Green (working) / Red (idle) on every processor easily every time I do parallel julia)
Programming Models
Communication Layers
Schedulers
Libraries
with more pure julia methods?)
The text was updated successfully, but these errors were encountered: