Speed of data movement in @spawn #9992
Comments
If possible, could you share your benchmarking code?
while the MPI code was calling `MPI_Send`, which, according to http://www.mcs.anl.gov/research/projects/mpi/sendmode.html, only needs to block until the buffer can be reused. So I changed the test to do an echo of the sent buffer. The changed tests are here: https://github.com/amitmurthy/ParallelBenchmarks.jl
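For the native-transport side, a minimal sketch of such an echo test (assumed names and setup, using the current `Distributed` API with a single extra worker of id 2; the actual changed tests are in the ParallelBenchmarks.jl repository linked above) could look like this — the worker sends the buffer back, so the timing covers the transfer in both directions:

```julia
using Distributed
addprocs(1)

@everywhere echo(x) = x   # hypothetical helper, defined on every process

function echo_time(n; reps = 10)
    x = rand(Float64, n)
    remotecall_fetch(echo, 2, x)   # warm-up: compilation and connection setup
    best = Inf
    for _ in 1:reps
        best = min(best, @elapsed remotecall_fetch(echo, 2, x))
    end
    return best
end

echo_time(10^6)   # best round-trip time for a 10^6-element Vector{Float64}
```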
Okay. That might be more fair, but I think the relative timings appear very similar. Notice that I've removed the "MPI_serialize" series from my last plot. Two questions: do you see any possibilities for improvement in the left part of the graph, and how much of this speedup is possible to achieve if a more complicated object, e.g. a factorization, is moved?
The MPI library may be using gather-send and scatter-recv to send its header+data in a single socket call, avoiding an intermediate buffer. I don't see a straightforward way of doing this via libuv currently. As for complicated objects, I think it is an issue we will see with both MPI.jl as well as `@spawn`. I'll add a plot of only serialization times to the above graph; that should give us an idea of the serialization overhead.
Added a timing of serializing-deserializing the request (basically the values). The timings are all minimum values in this plot.
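For anyone who wants to reproduce that kind of number, a rough sketch of timing only the serialize/deserialize step in memory, with no socket involved (this uses the current `Serialization` stdlib and is not the exact script behind the plot):

```julia
using Serialization

function serdes_time(x; reps = 100)
    io = IOBuffer()
    best = Inf
    for _ in 1:reps
        t = @elapsed begin
            seekstart(io); serialize(io, x)   # write the value into the buffer
            seekstart(io); deserialize(io)    # read it back out
        end
        best = min(best, t)                   # minimum over repetitions, as in the plot
    end
    return best
end

serdes_time(rand(Float64, 1000))
```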
Wow, anonymous functions have a 30x higher overhead for serialization/deserialization. It seems that both serialization and deserialization are equally to blame.
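A quick way to reproduce this kind of comparison is to look at serialized sizes directly. This is only a sketch on current Julia, where the details of function serialization differ from the 0.3-era code discussed in this thread:

```julia
using Serialization

function serialized_length(x)
    io = IOBuffer()
    serialize(io, x)
    return length(take!(io))
end

serialized_length(sin)            # named function: essentially just a reference by name
serialized_length(x -> sin(x))    # anonymous function: its definition is shipped as well
serialized_length(rand(1000))     # 1000 Float64s, for comparison
```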
I guess this makes sense. In the case of a regular function, the serialization just sends the symbol name, whereas in the case of an anonymous function, it sends over the entire AST, which even for
Not really, since serializing-deserializing even an array of 1000 floats costs considerably less than the anonymous function does.
The time is all going in serializing LambdaStaticData, which spends all its time uncompressing the AST.
I guess we could cache the uncompressed ASTs in serialization. Perhaps the benchmark example is only good for benchmarking; for real usage, for now, one can avoid using anonymous functions.
Caching anonymous functions does not make sense. For small arrays, even if we use the pattern of calling only defined functions via
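The pattern mentioned above — calling only functions that are already defined on the workers — can be contrasted with shipping an anonymous closure. A minimal sketch of the two styles, assuming the current `Distributed` API and a single extra worker:

```julia
using Distributed, LinearAlgebra
addprocs(1)
@everywhere using LinearAlgebra

x = rand(1000)

# Closure style: @spawnat wraps the call in an anonymous thunk, so the function
# itself is serialized along with the captured array.
fetch(@spawnat 2 norm(x))

# Defined-function style: only a reference to `norm` plus the arguments are sent.
remotecall_fetch(norm, 2, x)
```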
Is it possible the type ambiguity in the specification of `Worker` plays a role here? I.e. something like

```julia
type Worker{T<:AsyncStream}
    id::Int
    r_stream::T
    w_stream::T
    ...
end
```
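For background on why such a change could matter, here is a generic sketch (not the actual `Worker` code; the type names below are made up): a field annotated with an abstract type forces dynamic dispatch on every access, whereas a type parameter keeps the field type concrete.

```julia
# Hypothetical illustration of abstract vs. parametric field types.
abstract type AbstractStream end
struct FakeStream <: AbstractStream
    data::Vector{UInt8}
end

struct LooseWorker                      # field has an abstract type
    r_stream::AbstractStream
end

struct TightWorker{T<:AbstractStream}   # field type fixed by the parameter
    r_stream::T
end

payload(w) = length(w.r_stream.data)

s = FakeStream(rand(UInt8, 10))
payload(LooseWorker(s))   # field access goes through a dynamic lookup
payload(TightWorker(s))   # field type is concrete, so the access is fully typed
```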
Closing issue as stale. Please open new issues as appropriate.
UPDATE 9 February: After merging #10073, the speed of `@spawn` for large arrays has improved a lot, so I have updated the plots below.

This is part of #9167, but it deserves a separate issue because it is a well defined problem, whereas some of the other bullet points are less specific. In short, the issue is that we move data between processes slowly compared to MPI. I don't know how and if this can be fixed, so my best bet is to provide data that illustrates the issue. I hope that you can give input on improvements; then I can offer to run benchmarks.

The essence of the issue is in this graph, which is an updated version of the graph in #9167. It shows the time of parallel data movement against the size of a `Vector{Float64}` for three different schemes, of which the first is our `@spawn` and the other two are MPI based. In contrast to the plot in #9167, I have now included timings where I force MPI to use TCP instead of shared memory for the data transport.

- `MPI-TCP` uses `MPI.jl`'s `Send` and `Recv!`. TCP is used for data transport.
- `MPI-SM` is the same, but here the data transport is over shared memory instead of TCP.

This overhead makes it difficult to benefit from our parallel functionality, e.g. in parallel linear algebra where significant data movement is unavoidable. Below are some further comments on the graph.
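For reference, a rough sketch of the kind of measurement behind the `@spawn` series (assumptions: current `Distributed` API, a single extra worker with id 2; this is not the plotted benchmark code). The array is captured by the remote closure, so timing the fetch measures moving it to the worker:

```julia
using Distributed
addprocs(1)

function move_time(n; reps = 10)
    x = rand(Float64, n)
    fetch(@spawnat 2 sum(x))   # warm up: compilation and connection setup
    best = Inf
    for _ in 1:reps
        best = min(best, @elapsed fetch(@spawnat 2 sum(x)))
    end
    return best
end

[(n, move_time(n)) for n in 10 .^ (2:6)]   # times across array sizes, as in the plot
```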
Large arrays
The relative timing between our `@spawn` and `MPI-TCP` is now approximately 2x for large arrays. The difference between `@spawn` and MPI when working within a node with shared memory is over 3x for the largest arrays, and the difference grows as the arrays become smaller.

Small arrays
When the array has fewer than approximately 1000 elements, the size doesn't have an effect on the time it takes to move the array. I don't know exactly where this time is spent, as it is difficult to profile parallel code. However, the bottom line is that `@spawn` is 10x slower than MPI when using TCP and over 40x slower when MPI is using shared memory for the transport.

Example: Symmetric tridiagonal solver
@alanedelman, his PhD student Eka, and I have made some implementations of parallel symmetric tridiagonal solvers in Julia. For the same parallel algorithm, Eka did an implementation with `DArray`s and I did an implementation based on `MPI.jl`. For a problem of dimension 100000 solved on 1, 2, 4, and 8 processors, a graph of the timings showed that the fitted scaling models were

- `DArray`: time = 0.0082 * nprocs^0.4
- `MPI.jl`: time = 0.0033 * nprocs^(-0.78)

where MPI uses TCP for transport. Notice that the exponent for the `DArray` implementation is positive, i.e. the overhead dominates the benefit from parallelization, in contrast to `MPI.jl`, where the problem scales as expected in the number of processors.

cc: @ViralBShah, @amitmurthy
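For concreteness, evaluating the two fitted models at the tested process counts shows the `DArray` time growing with the number of processes while the `MPI.jl` time shrinks (a small sketch; the numbers are model predictions from the coefficients above, not new measurements):

```julia
# Predictions from the fitted scaling models quoted above.
darray_time(p) = 0.0082 * p^0.4
mpi_time(p)    = 0.0033 * p^(-0.78)

for p in (1, 2, 4, 8)
    println("nprocs = $p:  DArray ≈ ", round(darray_time(p), sigdigits = 3),
            ",  MPI.jl ≈ ", round(mpi_time(p), sigdigits = 3))
end
```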