Improve performance of network I/O in comparison to MPI #12133

Open
amitmurthy opened this issue Jul 13, 2015 · 8 comments
Labels: io (Involving the I/O subsystem: libuv, read, write, etc.), parallelism (Parallel or distributed computation), performance (Must go faster)

Comments

@amitmurthy
Contributor

I benchmarked the time taken to transfer Float64 arrays of lengths 1, 10, 100, 1000, 10,000 and 100,000, with 10,000 transfers at each size.

Julia MPI code is here: https://gist.github.com/amitmurthy/6a3dda483f2008e2a4b7. In each iteration, the processes asynchronously send/recv data and then wait for that step to complete before the next iteration.
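
For reference, the core loop looks roughly like this (a minimal sketch against the MPI.jl nonblocking API of the time; `n` and `iters` are illustrative parameters, not the gist's exact code):

```julia
# Sketch of the MPI exchange benchmark: both ranks post a nonblocking
# receive and send each iteration, then wait on both requests.
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
other = 1 - rank                # assumes exactly two ranks, 0 and 1

n, iters = 1000, 10000
sendbuf = rand(n)
recvbuf = Array(Float64, n)     # 0.3/0.4-era array constructor

tic()
for i in 1:iters
    rreq = MPI.Irecv!(recvbuf, other, 0, comm)
    sreq = MPI.Isend(sendbuf, other, 0, comm)
    MPI.Waitall!([rreq, sreq])
end
rank == 0 && println("$n floats: $(toq()) seconds")

MPI.Finalize()
```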

Julia put!/take! code using RemoteRefs is here: https://gist.github.com/amitmurthy/50aaa18bb65487773fa4
In each iteration, the put!s are asynchronous, but the processes synchronize on the take!s given the single-value store nature of RemoteRefs.
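
The shape of that benchmark is roughly the following (a sketch against the 0.3/0.4-era API, i.e. RemoteRef and remotecall(pid, f, args...); the echo helper and variable names are illustrative, not the gist's):

```julia
# Sketch of the RemoteRef benchmark: the master put!s into a ref owned
# by the worker, the worker echoes into a ref owned by the master, and
# each side synchronizes on take! (a RemoteRef holds a single value).
addprocs(1)

@everywhere function echo(rr_in, rr_out, iters)
    for i in 1:iters
        put!(rr_out, take!(rr_in))
    end
end

n, iters = 1000, 10000
data = rand(n)

rr_to_worker = RemoteRef(2)     # value stored on worker 2
rr_to_master = RemoteRef(1)     # value stored on the master

remotecall(2, echo, rr_to_worker, rr_to_master, iters)

tic()
for i in 1:iters
    put!(rr_to_worker, data)    # blocks only if the previous value hasn't been taken
    take!(rr_to_master)         # synchronize on the echoed value
end
println("$n floats: $(toq()) seconds")
```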

Julia direct socket read/write code is here - https://gist.github.com/amitmurthy/c583f4dbf19e02498e61
The server just echoes the data sent, so we need to halve the timings to get the overhead of just one-sided data transfer.
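
That benchmark boils down to roughly this (a sketch against the Base socket API of that era; for brevity the echo server is shown as an @async task in the same process, whereas the gist may run it as a separate process, and the port number is illustrative):

```julia
# Sketch of the echo benchmark over a local TCP socket: the client
# writes n Float64s and reads the echo back, so the timings cover a
# full round trip (halve them for one-way transfer).
n, iters = 1000, 10000

server = listen(8000)           # illustrative port
@async begin
    sock = accept(server)
    buf = Array(Float64, n)
    for i in 1:iters
        read!(sock, buf)        # read n floats...
        write(sock, buf)        # ...and echo them straight back
    end
end

client  = connect(8000)
data    = rand(n)
recvbuf = Array(Float64, n)

tic()
for i in 1:iters
    write(client, data)
    read!(client, recvbuf)
end
println("$n floats (round trip): $(toq()) seconds")
```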

Timings (seconds for 10,000 iterations of send/recv):

| number of floats | MPI (mpich) | put!/take! | Direct socket read/write |
| --- | --- | --- | --- |
| 1 | 0.062286847 | 1.194750299 | 0.193145016 |
| 10 | 0.024415091 | 0.699608227 | 0.205704565 |
| 100 | 0.029516504 | 0.735959668 | 0.329878817 |
| 1000 | 0.063592197 | 0.893945718 | 0.361115851 |
| 10000 | 0.378325158 | 1.119156809 | 0.88611087 |
| 100000 | 4.200537498 | 4.078035043 | 4.382292916 |

As can be seen, the data transport overhead in Julia is high even in the direct socket case, i.e. without the overhead of serialization or the RemoteRef implementation. The fixed overhead also seems quite high, as can be seen for transfers of up to 1000 floats.

While this has been tangentially noted in other issues, I hope this simple benchmark against MPI will help us narrow down the causes of this slowness, especially at the network I/O / libuv layer.

@amitmurthy added the performance (Must go faster) and io (Involving the I/O subsystem: libuv, read, write, etc.) labels Jul 13, 2015
@malmaud
Contributor

malmaud commented Jul 13, 2015

To get at the effect of libuv vs julia, what about a benchmark in C that uses libuv to implement the direct socket test?

@jakebolewski
Member

Did you send these between two computers over TCP? You need to do some work on the MPI side to make sure that you are doing an apples to apples comparison if you are on a single physical node.

@amitmurthy
Contributor Author

@malmaud that is a good idea - I'll do that.

@jakebolewski No, it is all local. What do you mean by "work on the MPI side"? The benchmark mimics the put!/take! code in the sense that both sides do a non-blocking Irecv! and Isend before doing a Waitall! on those two calls.

@jakebolewski
Member

MPI implementations will often optimize interprocess communication on a physical node by communicating over shared memory segments. You need to make sure that you force the implementation to use the network layer if you are benchmarking communication on the same node. You can usually do this through command line flags.

@jakebolewski
Member

Although same node interprocess communication is important to optimize, these benchmarks should really be run in a true distributed setting.

@amitmurthy
Contributor Author

This benchmark is just to get an idea of the Julia overhead and where it can be optimized. Until we have true multithreading, a lot of distributed Julia will continue to be run on many-core machines via the multi-process model we have today.

Yeah, the MPI code may be using shmem, I'll re-run forcing TCP and update the numbers.

@JeffBezanson
Member

Same as #9992?

@amitmurthy
Contributor Author

Similar, but I wanted to narrow the focus in this issue to just the socket layer in the stack.

@ViralBShah added the parallelism (Parallel or distributed computation) label Apr 19, 2020