
PMIx performance issue #394

Closed
nkogteva opened this issue Feb 13, 2015 · 35 comments

@nkogteva

We compared the execution time of the following intervals of mpi_init:

  • time from start to completion of rte_init
  • time from completion of rte_init to modex
  • time to execute modex
  • time from modex to first barrier
  • time to execute barrier
  • time from barrier to complete mpi_init

for the 1.8 and trunk versions (configured with --enable-timing), for the vpid-0 process, using the c_hello test.

We noticed that the time from modex to first barrier is excessive for the trunk version. We also compared the time spent in the main PMIx procedures using the pmix test (orte/test/mpi/pmix.c), and PMIx_get appears to be the most time-consuming procedure.

Further investigation is needed.
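
For reference, below is a minimal sketch of the kind of timing harness being described, written against the standalone PMIx client API (PMIx_Put / PMIx_Commit / PMIx_Fence / PMIx_Get). The actual orte/test/mpi/pmix.c goes through OMPI's internal modex wrappers rather than these calls, so the key name ("test-key"), structure, and API level here are illustrative assumptions, not the test itself:

```c
/* Hypothetical timing harness (not orte/test/mpi/pmix.c itself), using the
 * standalone PMIx client API. Intervals mirror the plotted quantities:
 * modex_send (Put/Commit), fence, and modex_recv (Get from every peer). */
#include <stdio.h>
#include <stdbool.h>
#include <time.h>
#include <pmix.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    pmix_proc_t myproc, peer, wild;
    pmix_value_t val, *ret;
    pmix_info_t info;
    bool collect = true;
    uint32_t nprocs = 0;
    double t;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* ask the server for the job size */
    wild = myproc;
    wild.rank = PMIX_RANK_WILDCARD;
    if (PMIX_SUCCESS == PMIx_Get(&wild, PMIX_JOB_SIZE, NULL, 0, &ret)) {
        nprocs = ret->data.uint32;
        PMIX_VALUE_RELEASE(ret);
    }

    /* "modex_send": a purely local pack of my contribution */
    t = now();
    val.type = PMIX_UINT32;
    val.data.uint32 = myproc.rank;
    PMIx_Put(PMIX_GLOBAL, "test-key", &val);
    PMIx_Commit();
    printf("modex_send %f\n", now() - t);

    /* "fence": the collective exchange (full modex when data is collected) */
    t = now();
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    PMIx_Fence(NULL, 0, &info, 1);
    printf("fence %f\n", now() - t);

    /* "modex_recv": fetch every peer's contribution */
    t = now();
    peer = myproc;
    for (uint32_t r = 0; r < nprocs; r++) {
        peer.rank = r;
        if (PMIX_SUCCESS == PMIx_Get(&peer, "test-key", NULL, 0, &ret)) {
            PMIX_VALUE_RELEASE(ret);
        }
    }
    printf("modex_recv %f\n", now() - t);

    PMIx_Finalize(NULL, 0);
    return 0;
}
```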

@nkogteva nkogteva added this to the Open MPI 1.9 milestone Feb 13, 2015
@jladd-mlnx
Member

@nkogteva Please include the data you collected on the TACC system.

@nkogteva
Author

mpi_init intervals (c_hello test: release vs. trunk): [attached plots: time from barrier to complete mpi_init; time from completion of rte_init to modex; time from modex to first barrier; time to execute barrier; time to execute modex]

time utility output (c_hello test: release vs. trunk): [attached plots: real; sys; user]

pmix intervals (pmix test: trunk only): [attached plots: fence; modex_recv; modex_send; total]

@hppritcha
Member

@nkogteva in collecting the data, how many MPI ranks/node were you using?

@jladd-mlnx
Member

@jsquyres @rhc54 @hppritcha Adding other folks to this issue. This data was collected on the Stampede system at TACC. There are two sets of experimental data here:

  1. Launching a simple MPI hello world application.
  2. Launching a native PMIx unit test with mpirun and invoking PMIx calls directly. The experiments following "pmix intervals (pmix test: trunk only):" are the second such set of tests.

Nadia told me this morning that all of the runs had 8 processes per node, although looking at the Stampede system overview, I'm thinking she may be mistaken. Each Stampede node has dual 8-core Xeon E5 processors. @nkogteva, please confirm the number of cores per node that you used in these experiments, or better yet, provide a representative command line.

@rhc54
Contributor

rhc54 commented Feb 18, 2015

Looking at the numbers, it might be good to check that you have coll/ml turned "off" here, as it will otherwise bias the data.

@rhc54
Contributor

rhc54 commented Feb 18, 2015

Looking at your unit test, there is clearly something wrong in the measurements. The "modex_send" call cannot be a function of the number of nodes - all it does is cause the provided value to be packed into a local buffer, and then it returns. So there is nothing about modex_send that extends beyond the proc itself, and the time has to be a constant (and pretty darned small, as the scale on your y-axis shows).

I'm working on this area right now, so perhaps it is worth holding off until we get the revised pmix support done. I'm bringing in the pmix-1.0.0 code and instrumenting it with Artem's timing support, so hopefully we'll get some detailed data.

@jladd-mlnx
Member

We have always set -mca coll ^ml, but @nkogteva, can you confirm that you did have ml turned off? @rhc54, I'm not sure we can conclude that the modex sends are really scaling like O(N) as opposed to O(1), or whether that's just noisy data. How reproducible is this, @nkogteva, and how many trials were done per data point?

@nkogteva
Author

@hppritcha @jladd-mlnx @rhc54

I re-ran the test with the -mca coll ^ml option (it was missing previously).

My command line for the 128-node launch:
/usr/bin/time -o <time_utility_output_file_name> -p <path_to_ompi>/mpirun -bind-to core --map-by node -display-map -mca mtl '^mxm' -mca btl tcp,sm,self -mca btl_openib_if_include mlx4_0:1 -mca plm rsh -mca coll '^ml' -np 1024 -mca oob tcp -mca grpcomm_rcd_priority 100 -mca orte_direct_modex_cutoff 100000 -mca dstore hash -mca ompi_timing true <path_to_test>/c_hello

New results for pmix test:
[attached plots: fence, modex_recv, modex_send, total]

As you can see, MODEX_SEND is not exactly a function of the number of nodes; it varies from run to run. In any case, the MODEX_SEND time is small. MODEX_RECV, however, still shows very high execution times.

Please also note that this data is from a single run. I'm still gathering data to plot average times over at least 10 runs. The single-run plots were attached to draw attention to the MODEX_RECV problem.

@rhc54
Contributor

rhc54 commented Feb 19, 2015

Thanks Nadia! That cleared things up nicely.

I don't find anything too surprising about the data here. I suspect we are doing a collective behind the scenes (i.e., the fence call returns right away, but a collective is done to exchange the data), and that is why modex_recv looks logarithmic. I wouldn't be too concerned about it for now.

As for it "taking too long" - I would suggest that this isn't necessarily true. It depends on the algorithm and mode we are using. Remember, at this time the BTLs are demanding data for every process in the job, so regardless of direct modex or not, the total time for startup is going to be much longer than we would like.

Frankly, until the BTLs change, the shape of the curves you are seeing is about as good as we can do. We can hopefully reduce the time, but the scaling law itself won't change.

One other thing to check: be sure that you build with --disable-debug and some degree of optimization (e.g., -O2), as that will significantly impact performance.

@hppritcha
Member

Isn't the real issue here the delta between the 1.8 and master timings? I think we need to understand the results in Nadia's "real" (wall-clock time) graph. I'd be interested in --disable-debug builds for both 1.8 and master.

@rhc54
Contributor

rhc54 commented Feb 19, 2015

As I've said before, we know the current pmix server implementation in ORTE isn't optimized, and I'm working on it now. So I'm not really concerned at the moment.

@rhc54
Contributor

rhc54 commented Feb 20, 2015

I saw that Elena committed some changes to the grpcomm components that sounded like they directly impact the results here. Thanks Elena!

@nkogteva when you get a chance, could you please update these graphs?

@nkogteva
Author

nkogteva commented Mar 3, 2015

Looks like we have a significant improvement in modex performance (especially in modex_recv) after the fixes in grpcomm. Total pmix test execution time was reduced by more than a factor of 20:

[attached plots: fence, modex_recv, modex_send, total]

But we still have problems with mpi_init time: some processes spend a long time between the modex and the first barrier, and the remaining processes spend that time waiting in the barrier for them. For example, with 1024 processes on 128 nodes it takes 18-20 sec.

[attached plots: time from barrier to complete mpi_init, time from completion of rte_init to modex, time from modex to first barrier, time to execute barrier, time to execute modex]

As a result, the real (wall-clock) time was roughly halved by the pmix fixes, but it is still high. It seems, however, that this is not a pmix-related issue.
[attached plots: real, sys, user]

@jladd-mlnx
Member

Great data!!! Great work, Nadia and Elena!!


@rhc54
Contributor

rhc54 commented Mar 3, 2015

Indeed - thanks! The time from modex to barrier would indicate that someone is doing something with a quadratic signature during their add_procs or component init procedures. We should probably instrument MPI_Init more thoroughly to identify the precise step that is causing the problem.
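
Until that instrumentation exists, a coarse application-side check can at least show whether a few straggler ranks dominate the interval. The sketch below is only an assumption-laden illustration using plain MPI timing around MPI_Init; it is not the --enable-timing machinery used elsewhere in this thread, and it cannot break out the individual steps inside MPI_Init:

```c
/* Coarse per-rank MPI_Init timing (illustrative only): reports the slowest
 * and average rank, which is where straggler-induced barrier waits show up. */
#include <stdio.h>
#include <time.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    struct timespec t0, t1;
    double local, tmax, tsum;
    int rank, size;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    MPI_Init(&argc, &argv);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    local = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (0 == rank) {
        printf("MPI_Init: max %.3f s, avg %.3f s over %d ranks\n",
               tmax, tsum / size, size);
    }
    MPI_Finalize();
    return 0;
}
```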

@bosilca
Member

bosilca commented Mar 3, 2015

For the sake of completeness, here are some results from 2011 [1]. They are based on an Open MPI 1.6 spin-off (revision 24321). We compared it with MPICH2 (1.3.2p1, with Hydra over rsh and slurm) as well as MVAPICH (1.2) using the ScELA launcher. Some of these results used vanilla Open MPI, and some (the ones marked prsh) used an optimized launcher developed at UTK.

These results are interesting because, for most of the cases (with the exception of the ML collective), little has changed in the initialization stages in terms of modex exchange or process registration, so this is almost a fair, one-to-one comparison with the results posted in this thread. Moreover, these results measure an MPI application consisting only of MPI_Init and MPI_Finalize, so we are talking about a full modex plus the required synchronizations.

[attached plots: mvapich_vs_ompi, mpich_vs_ompi]

[1] Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J., "On Scalability for MPI Runtime Systems," Proceedings of the 2011 IEEE International Conference on Cluster Computing, IEEE Computer Society, Austin, TX, pp. 187-195, September 2011. http://icl.cs.utk.edu/news_pub/submissions/06061054.pdf

@jladd-mlnx
Member

Thanks, George. This is very enlightening.


@rhc54
Contributor

rhc54 commented Mar 4, 2015

Actually, what @bosilca said isn't quite true. The old ORTE used a binomial tree as its fanout. However, we switched that in the trunk to use the radix tree with a default fanout of 64, so we should look much more like the routed delta-ary tree curve.
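
As a toy illustration of why the fanout matters (this is not ORTE's routed code, just back-of-the-envelope arithmetic), the relay depth of a radix-k daemon tree grows like log base k of the daemon count, so a fanout of 64 stays at two levels even for thousands of daemons, versus roughly log2(N) levels for the old binomial tree:

```c
/* Illustrative only: hops needed to reach N daemons through a radix-k tree. */
#include <stdio.h>

static int tree_depth(int ndaemons, int k)
{
    int depth = 0;
    long reach = 1;            /* daemons reachable within 'depth' hops */
    while (reach < ndaemons) {
        reach *= k;
        depth++;
    }
    return depth;
}

int main(void)
{
    const int sizes[] = {128, 512, 900, 4096};
    for (int i = 0; i < 4; i++) {
        printf("%4d daemons: depth %d at radix 64, depth %d for a binomial (radix-2) tree\n",
               sizes[i], tree_depth(sizes[i], 64), tree_depth(sizes[i], 2));
    }
    return 0;
}
```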

@bosilca
Member

bosilca commented Mar 4, 2015

@rhc54 If ORTE had, at any moment, exhibited the scalable behavior you describe, we would not be here discussing it.

@rhc54
Contributor

rhc54 commented Mar 4, 2015

Ah now, George - don't be that way. My point was only that we haven't used the binomial fanout for quite some time now, and so your graphs are somewhat out-of-date.

ORTE has exhibited very good scaling behavior in the past, despite your feelings about it. According to your own data, we do quite well when launching in a Slurm environment, which means that the modex algorithms in ORTE itself were good - we launch the daemons using Slurm, but we use our own internal modex in that scenario. We also beat the pants off of MPICH2 hydra at the upper scales in that scenario, as shown in your own graph.

So it is only the rsh launcher that has been in question. We have put very little effort into optimizing it for one simple reason: nobody cares about it at scale, and it works fine at smaller scales.

Since the time of your publication, we also revamped the rsh launcher so it should be more parallel. I honestly don't know what it would look like now at scale, nor do I particularly care. I only improved it as part of switching to the state machine implementation.

We are here discussing this today because we are trying to characterize the current state of the PMIx implementation, which modified the collective algorithms behind the modex as well as the overall data exchange.

@hppritcha
Member

George, for the data in the cited paper, were the apps dynamically or statically linked? Also, which BTLs were enabled?

@bosilca
Member

bosilca commented Mar 5, 2015

Hopefully there has been significant progress on the ORTE design and infrastructure, with a palpable impact on its scalability. I might have failed to read @nkogteva's results correctly, but based on them I can hardly quantify all the improvements you are mentioning.

Superpose @nkogteva's results on the graphs presented above. What I see from the new scalability graphs is that at 120 processes one needs more than 4 seconds for the release version and over 10 seconds for the trunk. I'll let you do the fit and see how this would behave at 900 nodes, but without spoiling the surprise, it looks pretty grim.

Now, let's address the prsh. That ORTE using Slurm had similar (but still worse) performance compared with a launcher based on ssh is nothing to be excited about. Of course, one can pat oneself on the back about how good ORTE is compared with other MPI implementations (MPICH or MVAPICH). What I see instead is that there were issues in ORTE regarding application startup and the modex, and that, based on the last set of results, these issues still remain to be addressed.

@bosilca
Member

bosilca commented Mar 5, 2015

@hppritcha the cluster was heterogeneous, more like a cluster of clusters. Some of these clusters were IB-based, some others MX, and a few had only 1G Ethernet. We didn't restrict the BTLs for the second graph (the one going up to 900 nodes). The applications were dynamically linked, but the binaries were staged on the nodes before launching the application (no NFS in our experimental setting). The first graph (the one going up to 140) is from an IB cluster, with only the IB, SM, and SELF BTLs enabled.

@jladd-mlnx
Member

@bosilca I think it might be worthwhile for my team to try to reproduce your experiments. It should be pretty straightforward, as @nkogteva has automated the process. What version of OMPI was used?

@bosilca
Member

bosilca commented Mar 5, 2015

r24321. We used a Debian 5.0.8 that came with SLURM 1.3.6.


@hppritcha
Member

I've gotten some data off of the Edison system at NERSC, a 5500-node Cray XC30. It's not practical to get more than about 500 nodes except during DST (dedicated system time), though. I've managed to get data comparing job launch/shutdown times for a simple hello world (no communication) using the native aprun launcher, with 8 ranks per node. I'm not seeing any difference in performance between master and the 1.8 branch using aprun. I configured with --disable-dlopen --disable-debug for these runs, otherwise no special configure options (well, for master; for 1.8 it's a configuration nightmare). coll/ml is disabled. Times are the output of "time", i.e. time aprun -n X -N 8 ./a.out. BTL selection was restricted to self,vader,ugni.

[attached plot: launch times, master vs. 1.8, using aprun]

I'm endeavoring to get data using mpirun, but I'm seeing hangs above 128 hosts, or at least the job hasn't completed launching within the 10 minutes I give the job script.

Note that the relatively high startup time even at small node counts is most likely an artifact of the machine not being in dedicated mode, as well as of the size of the machine. Managing the RDMA credentials required for all jobs adds overhead as the number of jobs running on the system grows; ALPS manages these credentials. Generally there are around 500-1000 jobs running on the system at any one time.

@rhc54
Contributor

rhc54 commented Mar 12, 2015

I'm afraid that doesn't really tell us anything except to characterize Cray PMI's performance. When you direct launch with aprun, OMPI just uses the available PMI for all barriers and modex operations. So none of the internal code in question (i.e., PMIx) is invoked.

The real question here is why you are having hangs with mpirun at such small scale. We aren't seeing that elsewhere, so it is a concern.

@hppritcha
Member

The data is useful in helping to sort out whether any performance difference between 1.8 and master is due to the BTLs. I'd say that, at least on the system where this data was collected, the answer appears to be no.

@jladd-mlnx
Member

Nadia fixed the last remaining big performance bug in #482. With this fix, we are seeing end-to-end hello world times on the order of 90 seconds with 16K ranks (~800 nodes). We are now focusing on ratcheting down the shared memory performance.

@rhc54
Contributor

rhc54 commented Mar 18, 2015

I have a favor to ask: please quit touching the pmix native and server code in master. I'm trying to complete the pmix integration and have to fight through a bunch of conflicts every time you touch it. The only solution is for me to just dump whatever you changed, as the code is changing too much and your changes are too out-of-date to be relevant.

@nkogteva
Author

Latest results (1 run, 0-vpid, hello_c):
[attached plots: time from completion of rte_init to modex, time to execute modex, time from modex to first barrier, time to execute barrier, time from barrier to complete mpi_init, real, sys, user]

@jladd-mlnx
Member

Nice, @nkogteva. Looks like you resolved the performance issues. Great work! Now let's help @elenash drill into and analyze the shared memory DB work.

@hppritcha
Member

can this be closed now?

@rhc54
Contributor

rhc54 commented Apr 20, 2015

I think so, but I'll let Josh be the final decider


@jladd-mlnx
Member

This can be closed.

jsquyres pushed a commit to jsquyres/ompi that referenced this issue Sep 21, 2016
Default allocated nodes to the UP state