Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed of data movement in @spawn #9992

Closed
andreasnoack opened this issue Feb 1, 2015 · 16 comments
Closed

Speed of data movement in @spawn #9992

andreasnoack opened this issue Feb 1, 2015 · 16 comments
Labels
parallelism Parallel or distributed computation performance Must go faster

Comments

@andreasnoack
Copy link
Member

UPDATE 9 February: After merging #10073, the speed of @spawn for large arrays has improved a lot, so I have updated the plots below.

This is part of #9167, but deserves a separate issue as it is a well defined problem whereas some the other bullet points are more unspecific. In short, the issue is that we are moving data slowly between processes compared to MPI.

I don't know how and if this can be fixed, so my best bet is to provide data that illustrates the issue. I hope that you can give input to improvements then I can offer to run benchmarks.

The essence of the issue is in this graph
timings
which is an updated version of the graph in #9167. It shows the time of parallel data movement against the size of an Vector{Float64} for three different schemes of which the first is our @spawn and the two others are MPI based. In contrast to the plot in #9167, I have now included timings where I force MPI to use TCP instead of shared memory for the data transport.

  • MPI-TCP uses MPI.jl's Send and Recv!. TCP is used for data transport.
  • MPI-SM is the same, but here the data transport is over shared memory instead of TCP.

This overhead makes it difficult to benefit from our parallel functionality e.g. in parallel linear algebra where significant data movement is unavoidable. Below are some further comments to the graph.

Large arrays

The relative timings between our @spawn and MPI-TCP is now approximately 2x for large arrays.

The difference between @spawn and MPI when working within a node with shared memory is over 3x for the largest arrays and the difference grows as the arrays become smaller.

Small arrays

When the array has fever than approximately 1000 elements the size doesn't have an effect on the time it takes to move the array. I don't know exactly where this time is spent as it is difficult to profile parallel code. However, the bottomline is that @spawn is 10x slower than MPI when using TCP and over 40x slower when MPI is using shared memory for the transport.

Example: Symmetric tridiagonal solver

@alanedelman, his PhD student Eka and I have made some implementations of parallel symmetric tridiagonal solvers in Julia. For the same parallel algorithm, Eka did an implementation with DArray's and I did an implementation based on MPI.jl. For a problem of dimension 100000 solved on 1, 2, 4, and 8 processors a graph of the timings showed
symtri

the an exponential model for the scalings were

  • DArray: time = 0.0082 * nprocs^0.4
  • MPI.jl: time = 0.0033 * nprocs^(-0.78)

where MPI uses TCP for transport. Notice that the exponent for the DArray is positive, i.e. overhead dominates the benefit from parallelization in contrast to MPI.jl where the problem scales as expected in the number of processors.

cc: @ViralBShah, @amitmurthy

@jiahao jiahao added the parallelism Parallel or distributed computation label Feb 1, 2015
@amitmurthy
Copy link
Contributor

I'll put together a PR that combines #6876 and #9181. That should address some of the issues.

@amitmurthy
Copy link
Contributor

If possible, could you share your benchmarking code?

@andreasnoack
Copy link
Member Author

@amitmurthy
Copy link
Contributor

@sync @spawnat in the benchmarking code results in both 1) transfer of array to remote (the @spawant and 2) waiting for an acknowledgement that the data has indeed been transferred (the @sync)

while the MPI code was calling MPI_Send which according to http://www.mcs.anl.gov/research/projects/mpi/sendmode.html only needs to block till the buffer can be reused.

So, I changed the test to do an echo of the sent buffer - remotecall_fetch of the same array for Julia and a send-recv combination for MPI.

The changed tests are here - https://github.com/amitmurthy/ParallelBenchmarks.jl

The results I get with an echo test (single worker) are :
figure_1

@andreasnoack
Copy link
Member Author

Okay. That might be more fair, but I think the relative timings appear very similar. Notice that I've removed the "MPI_serialize" series from my last plot.

Two questions, do you see any possibilities for improvements in the left part of the graph and how much of this speed up is possible to achieve if more complicated object like e.g. a factorization is moved.

@amitmurthy
Copy link
Contributor

The MPI library may be using gather-send and scatter-recv to send its header+data in a single socket call avoiding an intermediate buffer. I don't see a straightforward way of doing this via libuv currently.

As for complicated objects, I think it is an issue we will see with both MPI.jl as well as @spawn. Unless the types return isbits true, serialization will be an overhead. Maybe #7568 will help in having complicated bits types?

I'll add a plot of only serialization times to the above graph, That should gives us an idea of serialization overhead.

@amitmurthy
Copy link
Contributor

figure_2

Added a timing of serializing-deserializing the request (basically the values - :call_fetch, Base.next_id(), x->x, a twice. This is what a remotecall_fetch sends over the wire. It is interesting that the bulk of the overhead is not on the network side but on serialization.

The timings are all minimum values in this plot.

@amitmurthy
Copy link
Contributor

@spawn and typical usages of remotecall* all serialize anonymous functions which seems to be the culprit:

function ser_timings(x)
    io=PipeBuffer()
    serialize(io, x)
    deserialize(io)

    @elapsed for n in 1:10^5
        serialize(io, x)
        deserialize(io)
    end
end

function all_timings()
    ser_timings(1)
    anon_func = x->x

    for t in (1, 1.0, "Hello", 'c', :a_symbol, x->x, anon_func, myid, (1,1), ones(1), ones(10), ones(1000), fill(1, 1), fill(1, 10), fill(1,1000))
        println(isa(t, Array) ? string(typeof(t),":", length(t)) : typeof(t), "   : ", ser_timings(t))
    end
end

all_timings()

prints

Int64   : 0.01174819
Float64   : 0.05571949
ASCIIString   : 0.112928149
Char   : 0.037121222
Symbol   : 0.057779137
Function   : 3.080522616
Function   : 3.094634032
Function   : 0.100687586
(Int64,Int64)   : 0.059134811
Array{Float64,1}:1   : 0.13216678
Array{Float64,1}:10   : 0.123837175
Array{Float64,1}:1000   : 0.545716207
Array{Int64,1}:1   : 0.110177278
Array{Int64,1}:10   : 0.111104982
Array{Int64,1}:1000   : 0.476917077

@ViralBShah
Copy link
Member

Wow, anonymous functions have a 30x higher overhead for serialization/deserialization. Seems that both serialization and deserialization are equally to blame.

@ViralBShah
Copy link
Member

I guess this makes sense. In the case of a regular function, the serialization just sends the symbol name, whereas in the case of an anonymous function, it sends over the entire AST, which even for x->x is 20-30x bigger than a symbol name.

@amitmurthy
Copy link
Contributor

Not really, since serializing-deserializing even an array of 1000 floats is quite less compared to the anonymous function.

@ViralBShah
Copy link
Member

The time is all going in serializing LambdaStaticData, which spends all its time in uncompressed_ast(). serialize_array_data seems to have much lesser work it has to do in comparison.

@ViralBShah
Copy link
Member

I guess we could cache the uncompressed ASTs in serialization. Perhaps the benchmark example is only good for benchmarking, and for real usage, for now, one can avoid using anonymous functions.

@amitmurthy
Copy link
Contributor

Caching anonymous functions does not make sense. For small arrays, even if we use the pattern of calling only defined functions via @everywhere foo(x)=x; remotecall_fetch(p, foo, a), the overhead of serializing defined functions will be quite large compared to an MPI model where the remote code that works on the sent array is necessarily part of the program flow.

julia> io=PipeBuffer()

julia> @elapsed for n in 1:10^5; serialize(io, x->x); deserialize(io); end
3.560338764

julia> echo(x)=x
echo (generic function with 1 method)

julia> @elapsed for n in 1:10^5; serialize(io, echo); deserialize(io); end
0.197402999

julia> @elapsed for n in 1:10^5; serialize(io, myid); deserialize(io); end
0.118048503

@ggggggggg
Copy link
Contributor

Is it possible the type ambiguity in the specification of Worker is part of the speed issue? r_stream and w_stream are both of type AsyncStream which is abstract. Is there a reason it isn't more like

type Worker{T<:AsyncStream}
    id::Int
    r_stream::T
    w_stream::T
    ...
end

@vtjnash
Copy link
Member

vtjnash commented Jan 8, 2019

Closing issue as stale. Please re-open new issues as appropriate.

@vtjnash vtjnash closed this as completed Jan 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallelism Parallel or distributed computation performance Must go faster
Projects
None yet
Development

No branches or pull requests

7 participants