Improve performance of network I/O in comparison to MPI #12133
To get at the effect of libuv vs Julia, what about a benchmark in C that uses libuv to implement the direct socket test?
Did you send these between two computers over TCP? You need to do some work on the MPI side to make sure that you are doing an apples-to-apples comparison if you are on a single physical node.
@malmaud that is a good idea - I'll do that. @jakebolewski No, it is all local. What do you mean by "work on the MPI side"? The benchmark mimics the `put!`/`take!` code in the sense that both sides do a non-blocking `Irecv!` and `Isend` before doing a `Waitall!` on those two calls.
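To make that pattern concrete, here is a minimal sketch (not the benchmark gist itself) of a symmetric non-blocking exchange between two ranks, written against the positional MPI.jl API from around the time of this thread; newer MPI.jl releases spell several of these calls differently (keyword arguments, `Waitall` without the bang):

```julia
# Hypothetical sketch: each of the two ranks posts a non-blocking receive and
# send, then waits on both requests before starting the next iteration.
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
other = 1 - rank              # assumes exactly two ranks

N     = 1000                  # Float64s per message
sbuf  = rand(N)
rbuf  = zeros(N)
iters = 10_000

for i in 1:iters
    rreq = MPI.Irecv!(rbuf, other, 0, comm)
    sreq = MPI.Isend(sbuf, other, 0, comm)
    MPI.Waitall!([rreq, sreq])
end

MPI.Finalize()
```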
MPI implementations will often optimize interprocess communication on a physical node by communicating over shared memory segments. You need to make sure that you force the implementation to use the network layer if you are benchmarking communication on the same node. You can usually do this through command-line flags.
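With Open MPI, for example, launching the two ranks as `mpirun --mca btl tcp,self -np 2 julia mpi_bench.jl` restricts the byte-transfer layer to TCP (plus loopback self-sends), so same-node ranks cannot take the shared-memory path. The exact flag differs between MPI implementations and versions, and `mpi_bench.jl` is only a placeholder name here.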
Although same-node interprocess communication is important to optimize, these benchmarks should really be run in a true distributed setting.
This benchmark is just to get an idea of the Julia overhead and where it can be optimized. Until we have true multithreading, a lot of distributed Julia will continue to be run on many-core machines via the multi-process model we have today. Yeah, the MPI code may be using shmem; I'll re-run forcing TCP and update the numbers.
Same as #9992?
Similar, but I wanted to narrow the focus in this issue to just the socket layer in the stack.
I benchmarked the time taken to transfer Float64 arrays of lengths 1, 10, 100, 1000, 10,000 and 100,000, a total of 10,000 times each.
Julia MPI code is here - https://gist.github.com/amitmurthy/6a3dda483f2008e2a4b7 . In each iteration, the processes asynchronously send/recv data and then wait for this step to complete before the next iteration.
Julia put!/take! using RemoteRefs is here - https://gist.github.com/amitmurthy/50aaa18bb65487773fa4
In each iteration, the `put!`s are asynchronous, but the processes synchronize on the `take!`s, given the single-value-store nature of `RemoteRef`s.
Julia direct socket read/write code is here - https://gist.github.com/amitmurthy/c583f4dbf19e02498e61
The server just echoes the data sent, so we need to halve the timings to get the overhead of a one-sided data transfer.
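For the `put!`/`take!` benchmark above, a minimal sketch of the pattern (not the linked gist) might look like the following; it uses the 0.4-era `RemoteRef` API, which later Julia releases replaced with `RemoteChannel`/`Future` and a reordered `remotecall`:

```julia
# Hypothetical sketch: the master pushes an array to the worker through one
# single-value RemoteRef and waits for the echo on another.
addprocs(1)
wpid = workers()[1]

@everywhere function echo_loop(inref, outref, iters)
    for i in 1:iters
        put!(outref, take!(inref))
    end
end

to_worker = RemoteRef(wpid)   # store lives on the worker
to_master = RemoteRef(1)      # store lives on the master

iters = 10_000
data  = rand(1000)

done = remotecall(wpid, echo_loop, to_worker, to_master, iters)

@time for i in 1:iters
    put!(to_worker, data)     # send one value to the worker's store
    take!(to_master)          # block until the worker has replied
end
wait(done)
```

And for the direct socket case, a single-process echo sketch (again hypothetical, not the gist) that runs on current Julia with the `Sockets` stdlib; in the 0.4-era code these functions lived in Base:

```julia
# Hypothetical sketch: a local TCP echo server plus a timed client. The client
# measures round trips, so the reported time is halved for the one-way cost.
using Sockets

const PORT  = 8765
const N     = 1000            # Float64s per message
const ITERS = 10_000

server = listen(PORT)

# Server task: read a full N-element array, echo it straight back.
@async begin
    sock = accept(server)
    buf  = zeros(N)
    for i in 1:ITERS
        read!(sock, buf)
        write(sock, buf)
    end
end

# Client: write the array, block until the echo arrives, repeat.
client = connect(PORT)
data   = rand(N)
echo   = zeros(N)

@time for i in 1:ITERS
    write(client, data)
    read!(client, echo)
end
```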
Timings
As can be seen, the data transport overhead in Julia is high even in the direct socket transfer case, i.e., without the overhead of serialization or the RemoteRef implementation. The fixed per-message overhead also seems quite high, as is apparent for transfers of up to 1,000 floats.
While this has been tangentially noted in other issues, I hope the simple benchmark against MPI will help us narrow down the causes of this slowness, especially at the network I/O / libuv layer.