PMIx performance issue #394
@nkogteva Please include the data you collected on the TACC system.
@nkogteva in collecting the data, how many MPI ranks/node were you using?
@jsquyres @rhc54 @hppritcha Adding other folks to this issue. This data was collected on the Stampede system at TACC. There are two sets of experimental data here:
Nadia told me this morning that all of the runs had 8 processes per node, although looking at the Stampede system overview, I'm thinking she may be mistaken. Each Stampede node has dual 8-core Xeon E5 processors - @nkogteva please confirm the number of cores-per-node that you ran in these experiments, or better yet, please provide a representative command line.
Looking at the numbers, it might be good to check that you have coll/ml turned "off" here, as it will otherwise bias the data.
Looking at your unit test, there is clearly something wrong in the measurements. The "modex_send" call cannot be a function of the number of nodes - all it does is cause the provided value to be packed into a local buffer, and then it returns. So there is nothing about modex_send that extends beyond the proc itself, and the time has to be a constant (and pretty darned small, as the scale on your y-axis shows). I'm working on this area right now, so perhaps it is worth holding off until we get the revised pmix support done. I'm bringing in the pmix-1.0.0 code and instrumenting it with Artem's timing support, so hopefully we'll get some detailed data.
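To make the local-vs-remote distinction concrete, below is a minimal timing sketch of the put/commit/fence/get pattern being discussed. It is written against the later standardized PMIx client API rather than the pmix-1.0.0 internals referenced above, and the key name, value, and peer choice are invented for illustration: the point is that Put/Commit only touch the local cache and local server, while the Fence and the first Get of a peer's key are where any inter-node cost would show up.

```c
/* Illustrative sketch only: standardized PMIx client API, not the
 * pmix-1.0.0 code being instrumented in this issue. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <pmix.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *ret;
    double t;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* "modex_send": pack a value into the local cache and push it to the
     * local server - no inter-node traffic happens here. */
    val.type = PMIX_STRING;
    val.data.string = "btl-endpoint-info";          /* made-up payload */
    t = now();
    PMIx_Put(PMIX_GLOBAL, "mybtl.info", &val);      /* made-up key */
    PMIx_Commit();
    printf("put/commit (node-local): %g s\n", now() - t);

    /* The fence is where the inter-node collective actually happens. */
    t = now();
    PMIx_Fence(NULL, 0, NULL, 0);
    printf("fence (collective):      %g s\n", now() - t);

    /* "modex_recv": fetch a peer's value; remote for everyone but rank 0. */
    PMIX_PROC_CONSTRUCT(&peer);
    strncpy(peer.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    peer.rank = 0;
    t = now();
    if (PMIX_SUCCESS == PMIx_Get(&peer, "mybtl.info", NULL, 0, &ret)) {
        PMIX_VALUE_RELEASE(ret);
    }
    printf("get (modex_recv):        %g s\n", now() - t);

    PMIx_Finalize(NULL, 0);
    return 0;
}
```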
We have always set -mca coll ^ml, but @nkogteva can you confirm that you did have ml turned off? @rhc54 I'm not sure that we can conclude that the modex sends are really scaling like O(N) as opposed to O(1), or if that's just some noisy data. How reproducible is this, @nkogteva - how many trials were done per data point?
I re-ran the test with the -mca coll ^ml option (it was missed previously). My command line for 128 launched nodes: As you can see, MODEX_SEND is not exactly a function of the number of nodes; it varies between runs. In any case, the MODEX_SEND time is small. As for MODEX_RECV, it still shows very high execution times. Please also note that this data is from a single run; I'm still gathering data to plot average times over at least 10 runs. Plots for one run were attached in order to focus attention on the MODEX_RECV problem.
Thanks Nadia! That cleared things up nicely. I don't find anything too surprising about the data here. I suspect we are doing a collective behind the scenes (i.e., the fence call returns right away, but a collective is done to exchange the data), and that is why modex_recv looks logarithmic. I wouldn't be too concerned about it for now. As for it "taking too long" - I would suggest that this isn't necessarily true. It depends on the algorithm and mode we are using. Remember, at this time the BTLs are demanding data for every process in the job, so regardless of direct modex or not, the total time for startup is going to be much longer than we would like. Frankly, until the BTLs change, the shape of the curves you are seeing is actually about as good as we can do. We can hopefully reduce the time, but the scaling law itself won't change. One other thing to check: be sure that you build with --disable-debug and with some degree of optimization set (e.g., -O2), as that will significantly impact the performance.
Isn't the real issue here the delta between the 1.8 and master timings? I think we need to understand the results in Nadia's real graph. I'd be interested in --disable-debug numbers for both 1.8 and master.
As I've said before, we know the current pmix server implementation in ORTE isn't optimized, and I'm working on it now. So I'm not really concerned at the moment.
I saw that Elena committed some changes to the grpcomm components that sounded like they directly impact the results here. Thanks Elena! @nkogteva when you get a chance, could you please update these graphs?
Indeed - thanks! The time from modex to barrier would indicate that someone is doing something with a quadratic signature during their add_procs or component init procedures. We should probably instrument MPI_Init more thoroughly to identify the precise step that is causing the problem.
For the sake of completeness, here are some results from 2011 [1]. They are based on an Open MPI 1.6 spin-off (revision 24321). We compared it with MPICH2 (1.3.2p1 with Hydra over rsh and slurm), as well as MVAPICH (1.2) using the ScELA launcher. Some of these results were using vanilla Open MPI, and some (the ones marked with prsh) were using an optimized launcher developed at UTK. These results are interesting because for most of the cases (with the exception of the ML collective) little has changed during the initialization stages in terms of modex exchange or process registration, so this is almost a fair, one-to-one comparison with the results posted in this thread. Moreover, these results compare an MPI application composed of MPI_Init & MPI_Finalize, so we are talking about a full modex plus the required synchronizations.
[1] Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J., "On Scalability for MPI Runtime Systems," Proceedings of the 2011 IEEE International Conference on Cluster Computing, IEEE Computer Society, Austin, TX, pp. 187-195, September 2011.
Thanks, George. This is very enlightening.
Actually, what @bosilca said isn't quite true. The old ORTE used a binomial tree as its fanout. However, we switched that in the trunk to use the radix tree with a default fanout of 64, so we should look much more like the routed delta-ary tree curve.
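A rough back-of-the-envelope comparison of the two fanouts (illustrative numbers, not data from this thread): a binomial tree over N daemons is about ceil(log2 N) levels deep, so 4096 daemons sit 12 levels down, whereas a radix tree with fanout 64 reaches 64 daemons in one level, 64^2 = 4096 in two, and 64^3 ≈ 262,000 in three - which is why the trunk should track the delta-ary curve rather than the binomial one.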
@rhc54 If ORTE had ever had the scalable behavior you describe, we would not be here discussing it.
Ah now, George - don't be that way. My point was only that we haven't used the binomial fanout for quite some time now, and so your graphs are somewhat out-of-date. ORTE has exhibited very good scaling behavior in the past, despite your feelings about it. According to your own data, we do quite well when launching in a Slurm environment, which means that the modex algorithms in ORTE itself were good - we launch the daemons using Slurm, but we use our own internal modex in that scenario. We also beat the pants off of MPICH2 Hydra at the upper scales in that scenario, as shown in your own graph. So it was only the rsh launcher that has been in question. We have put very little effort into optimizing it for one simple reason: nobody cares about that one at scale, and it works fine at smaller scale. Since the time of your publication, we also revamped the rsh launcher, so it should be more parallel. I honestly don't know what it would look like now at scale, nor do I particularly care. I only improved it as part of switching to the state machine implementation. We are here discussing this today because we are trying to characterize the current state of the PMIx implementation, which modified the collective algorithms behind the modex as well as the overall data exchange.
George, for the data in the cited paper, were the apps dynamically or statically linked? Also, which BTLs were enabled?
Hopefully there has been significant progress on the ORTE design and infrastructure, with a palpable impact on its scalability. I might have failed to correctly read @nkogteva's results, but based on them I can hardly quantify all the improvements you are mentioning. Superimpose @nkogteva's results on the graphs presented above. What I see from the new scalability graphs is that at 120 processes one needs more than 4 seconds for the release version and over 10 seconds for the trunk. I'll let you do the fit and see how this would behave at 900 nodes, but without spoiling the surprise, it is looking pretty grim. Now, let's address prsh. That ORTE using Slurm had similar (but still worse) performance compared with a launcher based on ssh is nothing to be excited about. Of course, one can pat oneself on the back about how good ORTE is compared with other MPIs (MPICH or MVAPICH). What I see instead is that there were some issues in ORTE regarding application startup and the modex, and that, based on the last set of results, these issues are still to be addressed.
@hppritcha the cluster was a heterogeneous cluster, more like a cluster of clusters. Some of these clusters were IB-based, some others MX, and a few had only 1G Ethernet. We didn't restrict the BTLs on the second graph (the one going up to 900 nodes). The applications were dynamically linked, but the binaries were staged on the nodes before launching the application (no NFS in our experimental setting). The first graph (the one going up to 140) is an IB cluster, with only the IB, SM, and SELF BTLs enabled.
r24321. We used Debian 5.0.8, which came with SLURM 1.3.6.
I've gotten some data off of the Edison system at NERSC, a 5500-node Cray XC30. It's not practical to get more than about 500 nodes or so except during DST, though. I've managed to get data comparing job launch/shutdown times for a simple hello world - no communication - using the native aprun launcher, with 8 ranks per node. I'm not seeing any difference in performance between master and the 1.8 branch using aprun. I configured with --disable-dlopen --disable-debug for these runs, otherwise no special configure options (well, for master; for 1.8 it's a configuration nightmare). coll ml is disabled. Times are output from using "time", i.e. time aprun -n X -N 8 ./a.out. BTL selection was restricted to self,vader,ugni. I'm endeavoring to get data using mpirun, but I'm seeing hangs above 128 hosts, or at least the job hasn't completed launching within the 10 minutes I give the job script. Note that the relatively high startup time even at small node counts is most likely an artifact of the machine not being in dedicated mode, as well as the size of the machine. There are complications with the management of the RDMA credentials required for all jobs that add some overhead as the number of jobs running on the system grows. ALPS manages the RDMA credentials. Generally there are around 500-1000 jobs running on the system at any one time.
I'm afraid that doesn't really tell us anything except to characterize Cray PMI's performance. When you direct launch with aprun, OMPI just uses the available PMI for all barriers and modex operations. So none of the internal code in question (i.e., PMIx) is invoked. The real question here is why you are having hangs with mpirun at such small scale. We aren't seeing that elsewhere, so it is a concern.
The data is useful in helping to sort out whether any performance …
Nadia fixed the last remaining big performance bug in #482. With this fix, we are seeing end-to-end hello world times on the order of 90 seconds at 16K ranks (~800 nodes). We are now focusing on ratcheting down the shared memory performance.
I have a favor to ask - please quit touching the pmix native and server code in the master. I'm trying to complete the pmix integration, and I have to fight through a bunch of conflicts every time you touch it. The only solution is for me to just dump whatever you changed, as the code is changing too much and your changes are too out-of-date to be relevant.
Can this be closed now?
I think so, but I'll let Josh be the final decider.
This can be closed.
We compared the execution time of the following intervals of mpi_init:
for the 1.8 and trunk versions (configured with --enable-timing), for the vpid-0 process, using the c_hello test.
We noticed that the time from modex to the first barrier is far too long for the trunk version. We also compared the time of the main PMIx procedures using the pmix test (orte/test/mpi/pmix.c), and it seems that PMIx_get is the most time-consuming procedure.
Further investigation is needed.
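For reference, the kind of end-to-end number discussed in this issue can be approximated with a simple stopwatch around MPI_Init and a first barrier. The program below is a stand-in sketch rather than the actual c_hello source, and it is only an external measurement, not the internal --enable-timing breakdown referred to above:

```c
/* Hello-world with external timing; a stand-in for the c_hello test,
 * not the instrumented (--enable-timing) measurement used in this issue. */
#include <stdio.h>
#include <time.h>
#include <mpi.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    double t0 = now();
    MPI_Init(&argc, &argv);
    double t_init = now() - t0;            /* includes launch-side setup, modex, etc. */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t1 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    double t_barrier = MPI_Wtime() - t1;   /* a first explicit barrier after init */

    if (0 == rank) {                       /* report from the vpid-0 process, as above */
        printf("MPI_Init: %.3f s, first barrier: %.3f s\n", t_init, t_barrier);
    }

    MPI_Finalize();
    return 0;
}
```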