Slow-ish Init in ATM for hires run with large number of MPI's on cori-knl #1578
Comments
Hi @ndkeen, what create_newcase command were you using? I'd like to look at this on Titan, for comparison. |
Think that I found it in the performance_archive - please verify though. |
Well, it was a coupled-hires case that used the run_acme script, with changes made after the case was created, of course. |
Got it. Thanks. |
Cost on Titan is also "high" (40 min. is a lot, though not the 68 min. seen on Cori-KNL), but the timer in question is 1/5 the cost on Titan compared to Cori-KNL (unless this is capturing some load imbalance from elsewhere). FYI. |
Titan experiments may not be completely relevant to Cori-KNL, but moving from 86400x1 to 43200x2 in ATM and making similar changes in other components decreased A_WCYCL initialization cost from 2040 seconds to 740 seconds. Similarly, changing from 86400x1 to 43200x1 decreased the cost to 1023 seconds, so some of the high initialization cost for 86400x1 on Titan was due to using 16 MPI tasks per node, but most was an algorithmic scaling issue. I'll see if I can identify where this is coming from. Again, Titan does not see a high pio_rearrange_create_box, so some of the Cori-KNL overhead is Cori-KNL-specific. |
@ndkeen, what is the latest on this on Cori-KNL? Does using threading (and fewer MPI processes) eliminate the problem? Are you still interested in diagnosing the 86400x1 ATM initialization performance issue? My Titan jobs are taking a while to get scheduled, so I will jump to Cori-KNL as well if you can provide a reproducer. I'm guessing that an F case would be sufficient? In any case, please advise. |
Looking over several runs at the timer
To reproduce, it should be as simple as creating the same case. The current default sets -c 4; to try other values, you have to change config_machines.xml (see the sketch below).
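For a quick experiment with the -c setting outside of CIME, one could also launch the already-built executable by hand. This is only a sketch: the task/node counts and executable name below are placeholders rather than values taken from this case.

```bash
# Sketch only: task/node counts and executable name are placeholders.
srun -n 43200 -N 675 -c 4 --cpu_bind=cores ./acme.exe   # current default binding (-c 4)
srun -n 43200 -N 675 -c 2 --cpu_bind=cores ./acme.exe   # alternative binding to compare
```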
Looking at the raw timings for 86400x1, around half of the comp_init_cc_atm time is unattributed. I'll need to run this case myself with some additional timers added. @ndkeen , I'll repeat your earlier run, using the latest master with some additional instrumentation tweaks. |
I tried a few experiments with nothing gained. |
Pat pointed out to me that the experiments I tried with a different MPI version seem to still be using cray-mpich. I tried a few other things, then realized NERSC needs us to use a different method than the Cray wrappers. Pat, would you mind describing in a sentence or two what you think you are seeing regarding the slowdown? Something we can send to NERSC just in case they have seen this before, or know something to try? Is it that this extra time is being seen at the first MPI point-to-point? |
A little strong - just noted that in my experiments, following @ndkeen 's lead, I can't tell what version of MPI I am using based on the build and run logs. Since performance in my one experiment (using openmpi) was very similar to that when using cray-mpich, I just wanted to verify that I was comparing what I thought I was comparing. Noel, I'll provide a short description of what I am seeing in the near future. I am traveling at the moment. |
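As a quick sanity check on which MPI library a given executable actually picked up, something along these lines can help. The executable name and build-log path are placeholders here, and on Cray systems static linking can make ldd uninformative, in which case grepping the build log for the link line is the fallback.

```bash
# Sketch: executable name and log path are placeholders.
ldd ./acme.exe | grep -i mpi                        # dynamic case: shows libmpich/libmpi/etc.
grep -iE 'mpich|openmpi|impi' ./build.log | tail    # static case: inspect the link line
```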
I managed to get a run going with Intel MPI. It got stuck shutting down and failed, but not before it wrote out a timing.tar file, which allows me to see the ATM init time. |
I determined that my OpenMPI jobs were not actually using OpenMPI. I tried @ndkeen 's modifications to use Intel MPI. This built (and was accessing the Intel MPI include files). Run died pretty quickly with:
I'll give up on the other MPI libraries for the moment. |
@ndkeen, not a sentence or two, but here is a summary of the latest. I have lots more details, but this should be enough for you to broach the topic with NERSC if you feel that it is worthwhile to pursue. I added some more instrumentation into ACME to help isolate the overhead, and ran at:
270 nodes (17280 MPI processes, 64 per node)
450 nodes (28800 MPI processes, 64 per node)
675 nodes (43200 MPI processes, 64 per node)
It appears that all of the performance loss is in the MPI overhead, both for process 1 (a PIO process) and in the data over all processes (max and min). Note that only one call (the hypothesis is that it is the first) accounted for essentially all of this time. This communication operator was implemented with an MPI point-to-point algorithm, and an MPI collective implementation performs much better than the point-to-point implementation. Note that the point-to-point implementation was originally motivated by scalability concerns at very high process counts. |
Ran a 1350 node job. This reproduces @ndkeen's earlier results, but my runs also have some additional barriers, so this further verifies the scaling issues:
270 nodes (17280 MPI processes, 64 per node)
450 nodes (28800 MPI processes, 64 per node)
675 nodes (43200 MPI processes, 64 per node)
1350 nodes (86400 MPI processes, 64 per node)
Note that CICE initialization (not MPAS-CICE) also grows, but ATM is the dominant cost. I have not looked at CICE closely (and hope not to have to), but I have no reason to believe that this is not also an MPI initialization issue. |
Since it was easy, I ran some ne120 F cases on edison (after the upgrade). The init times aren't nearly as bad. One run used 21600 MPI tasks and the other two used 43200 (I think I was testing the cpu_bind flag or something). |
I have a reproducer that we can pass on to NERSC and Cray if you want. I'll continue to refine it, and maybe make it even smaller. I'll also be using it to examine workarounds. It just measures the cost of the PIO communication pattern used during the data type creation in box_rearrange_create when there is one PIO process per node (stride 64, offset 1), for two successive calls, surrounded (and separated) by barriers. The time reported is the maximum over all processes. |
Hi Pat, I was distracted by the edison upgrade. You have a reproducer? Wow. Very well done. Certainly interested. Should we put it in the repo? I know that can be a lot of work. I did send emails to two folks at NERSC, but have not heard back. With stand-alone code, I can be much more confident to ask for help. Does it surprise you that edison is quicker? Note, it's free to run on edison until July 31st. |
I don't know what would be involved with putting it in the repo. Where would it go, and what would it be used for in the future (and how would it be maintained)? At the moment my driver code references pio_kinds.F90 in cime/src/externals/pio1/pio, and pio_spmd_utils.F90 from a build directory, generated from pio_spmd_utils.F90_in, also in cime/src/externals/pio1/pio . For this standalone reproducer, I just copied over these source files, and included instructions for building using these files based on the ACME Cori-KNL module loads and compiler options. |
@ndkeen and I had a brief email exchange on creating a performance test suite (e.g. as a cime test suite) mostly for benchmarking our primary configurations, but small test cases/reproducers that hit typically performance-sensitive bits like this would also be useful. |
This particular test is targeting an issue that seems to be somewhat peculiar to Cori-KNL, and only occurs during the initialization. A more typical PIO gather or scatter communication test would be different. (Actually that is what I started from - I modified it to use the initialization pattern.) @jayeshkrishna may already have PIO1 and PIO2 performance tests that include these MPI communication patterns? I'm not competent to modify PIO, or CIME, to add build or run support for these types of tests. I can continue to prototype them, if that would be useful. |
I suppose whether or not it goes into the repo (and where) is up to others, but I'm a fan of stand-alone tests as we can do more with them. I would certainly like to try it when you are ready to let me see it. |
@ndkeen, please grab the tar file from /global/u2/w/worleyph at NERSC. It expands into a directory containing the reproducer, and there is a Notes.txt file in that directory that should get you started. Further questions are probably best handled in private e-mails until you are ready to hand this off to NERSC. |
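A sketch of the hand-off steps, with a hypothetical tarball/directory name standing in for the one given above:

```bash
# Names below are hypothetical stand-ins for the actual tarball mentioned above.
cp /global/u2/w/worleyph/pio_init_reproducer.tar.gz .
tar xzf pio_init_reproducer.tar.gz
cd pio_init_reproducer
cat Notes.txt    # build and run instructions
```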
Hi @jayesh (and cc'ing @ndkeen), summarizing the above material: the latest on the poor initialization scaling on Cori-KNL is that this appears to occur the first time two processes communicate, and it shows up when calling swapm in compute_counts in box_rearrange_create (the first time this is called in ATM initialization). If both swapm calls in compute_counts are replaced with calls to MPI_Alltoallv, then this first-call overhead disappears from there and reappears in LND initialization in the first couple of calls to a2a_box_rear_io2comp_double or swapm_box_rear_io2comp_double, i.e. whether using MPI_Alltoallw or swapm.

I created two standalone test programs that either gather information from all "compute" tasks to "IO" tasks (comp2io) or scatter information from "IO" tasks to "compute" tasks (io2comp). This verifies the above: the first call when using swapm or MPI_Alltoallw has a high cost and scales poorly, and MPI_Alltoallv performs much better. This does not provide a real workaround, since we can't use MPI_Alltoallv for a2a_box_rear_io2comp_double - this truly is an MPI_Alltoallw operator unless we stop using MPI data types and pack and unpack buffers manually. Note that the performance comparison for subsequent calls is more complicated, but it does capture the "with handshaking" and "without handshaking" difference in performance for comp2io and io2comp. Since these standalone programs do not capture the actual comp2io and io2comp message patterns in ACME - e.g. not all compute tasks send to all io tasks - they are also not adequate for optimizing settings in ACME, and this still needs to be done in the model.

I started instrumenting compute_counts to output the different message patterns created in PIO so that I could summarize them for @ndkeen and perhaps further generalize the standalone test programs. However, I just remembered that you have been working on PIO test programs that also use extracted communication patterns - perhaps these are what Noel and I should be using, or perhaps you can help us with this testing on Cori-KNL.

At this point, the only solution I can think of is to tell NERSC and Cray about the high start-up cost in swapm and MPI_Alltoallw and ask them to see if they can do for these whatever magic they did for MPI_Alltoallv. However, having an accurate PIO test program may allow us to experiment with our own workarounds (since I am not holding my breath that Cray will address our request in a timely fashion). |
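If the CIME version in use exposes the PIO rearranger communication options in env_run.xml, the point-to-point vs. collective and handshaking choices discussed above can be flipped without code changes. The variable ids below are the later CIME spellings and are an assumption here (they may not exist in this vintage, or may be ignored by PIO1), so treat this purely as a sketch.

```bash
# Sketch: variable ids are assumptions; check env_run.xml first.
grep -i PIO_REARR env_run.xml
./xmlchange PIO_REARR_COMM_TYPE=coll                 # point-to-point (p2p) vs collective (coll)
./xmlchange PIO_REARR_COMM_ENABLE_HS_COMP2IO=FALSE   # handshaking on/off, comp->io
./xmlchange PIO_REARR_COMM_ENABLE_HS_IO2COMP=FALSE   # handshaking on/off, io->comp
```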
Thanks Pat. Indeed, I have been taking the stand-alone tests prepared by Pat and running them with various parameters (inputs, number of MPI tasks, MPI env flags, etc.), and I'm trying to narrow down which of these matter. I was derailed yesterday due to the power outage and now I need to work on some other things, but will soon get back to it. Jayesh: please let me know if there's something else I should be doing/trying. |
Using master from Oct 10, I ran several more F compsets. Pat already discovered that setting PIO_ROOT=0 was a benefit to the ATM init time. I was already using that in several other experiments, but not for these ATM init runs. So these are with a new repo and with PIO_ROOT=0. The ATM init time comes out to about 35 minutes, which is quite an improvement already.
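For reference, the PIO root can be changed per case without rebuilding. The exact variable id (a single PIO_ROOT entry vs. per-component entries) depends on the CIME version, so the command below is a sketch and the id is an assumption to verify against env_run.xml.

```bash
# Sketch: confirm the exact id(s) in env_run.xml before changing them.
grep -i PIO_ROOT env_run.xml
./xmlchange PIO_ROOT=0
```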
FYI PIO_ROOT == 0 seems to perform better on Titan and on Anvil as well, both low and high resolution cases, and both for initialization and for the run loop. |
That's great. Is @wlin7 using this change? |
@PeterCaldwell , I have a branch with two changes to address this. The first (setting PIO_ROOT to zero by default) is noncontroversial. The second may be specific to Cori-KNL. I'll split these into 2 branches and submit a pull request for the first one later today. |
Great, thanks for your work on this! |
@ndkeen, I have "found" another way to address this issue: I am running experiments with MPICH_GNI_DYNAMIC_CONN=disabled, and initialization appears to be much faster. Setting PIO_ROOT to zero still seems to be a good idea, but the above may be the real solution for us, especially since at high process counts we do not have a large memory footprint and can afford the extra MPI memory? One concern is that the initialization cost is pushed to MPI_Init, which we do not measure directly. The CaseStatus timing data also includes pre- and post-processing times, which makes it difficult to pull out just the srun time. I've created another github issue (#1857) to get this fixed, but don't have the fix available for these studies yet. Another concern is whether the RUN_LOOP performance is degraded by doing this. The discussion above does not imply that it would be, but we still need to be sure. You might try the above in some of your experiments as well (see the sketch below), to start evaluating the impact. The nice thing about this solution is that it is system-specific, and we would not impact performance on any other system. You should also ask your NERSC and Cray contacts about this, and why they did not suggest it to us when we first asked them about mitigation techniques. Sure seems to be relevant. |
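A minimal way to try this, as a sketch: export the variable in the batch/run script right before the launch (the more permanent home would be the machine's environment settings in CIME). The srun arguments below are placeholders for this case's layout.

```bash
# Sketch: srun arguments are placeholders for this case's layout.
export MPICH_GNI_DYNAMIC_CONN=disabled    # Cray MPICH; the default is enabled
srun -n 86400 -N 1350 -c 4 --cpu_bind=cores ./acme.exe
```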
Yea, sorry Pat, in fact, this was what Nathan W (of Cray) did suggest. I thought I had noted it somewhere but can't find it. He was not optimistic it would help because, as you noted, the time might just show up in MPI_Init(). It was still on my list of things to try. And, indeed the time to do whatever is happening before or after the acme.exe cost should be measured. This time is too high for runs on KNL and is wasting MPP's at high scales. I've not had any luck getting anyone to help with this. We just simply can't do those tasks on the compute nodes. The batch file should ONLY be the srun command. |
Okay, I'm trying now. Looks to be working, but I need to figure out how to determine the impact on MPI_Init. Allocating space and setup at once versus reacting every time a message comes from a new process should be faster, and perhaps much faster. I am still hopeful. |
@ndk, it appears to be somewhat advantageous to set MPICH_GNI_DYNAMIC_CONN=disabled. By looking at the timestamps in the job script output, you can calculate the total time to run the model (see the small example after this comment). For example, when MPICH_GNI_DYNAMIC_CONN=enabled, the above (18:10:18 - 17:21:34) time is just seconds larger than (Init Time + Run Time) from the acme timing summary file. In contrast, when MPICH_GNI_DYNAMIC_CONN=disabled, the two model cost measures can differ by many minutes (as shown below). So it would appear that the above is a reliable way to measure total model cost.

I ran experiments with both settings on Cori-KNL using 270, 675, and 1350 nodes, all with PIO_ROOT = 0. In all cases, using MPICH_GNI_DYNAMIC_CONN=disabled was faster, at least by a little bit. The biggest difference was with 1350 nodes (86400x1 ATM decomposition), as you would expect. I ran this twice with both MPICH_GNI_DYNAMIC_CONN settings. There was some performance variability, but 'disabled' was always better. The fastest runs for each are described below (one time step, no restart write):

MPICH_GNI_DYNAMIC_CONN = enabled (so current default)
MPICH_GNI_DYNAMIC_CONN = disabled
So, almost 12 minutes faster by using MPICH_GNI_DYNAMIC_CONN = disabled. It is annoying that the model cost is not all captured in the standard timing summary when using this setting, but the data is captured. I'll look into whether we can inject these data into the summary, but I would probably need help from CSEG. In any case, please run your own experiments. This is all using MPI-only (though that should not make any difference), and we should verify that there is no impact on performance of the lower resolution models (there should not be). This also makes my proposed pio_N2M replacement for pio_swapm in compute_counts irrelevant; setting MPICH_GNI_DYNAMIC_CONN = disabled is more effective as well. |
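For the timestamp arithmetic mentioned above, a small sketch (GNU date; both timestamps are assumed to fall on the same day, and the values are the ones quoted in the comment):

```bash
# Sketch: compute the wall-clock total from the two job-script timestamps.
start="17:21:34"; end="18:10:18"
secs=$(( $(date -ud "1970-01-01 $end" +%s) - $(date -ud "1970-01-01 $start" +%s) ))
printf '%d seconds (%dm %ds)\n' "$secs" $((secs/60)) $((secs%60))
# -> 2924 seconds (48m 44s)
```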
Hmm, I tried "MPICH_GNI_DYNAMIC_CONN = disabled" and it's a little slower, but more troubling is that my repos dating after Oct 16th are slower to init than before. On Oct 13th, I measured 687 seconds for comp_init_cc_atm -- cori-knl, 675 nodes, 43200 MPI's, 2 hyperthreads each. On Oct 16th, I measured 1082 s (and 1285 s with the above env var disabled). All of my runs use PIO_ROOT=0. I'll keep trying. |
Bummer - I thought that we had finished off this issue. I'll try to rerun some of my experiments as well. |
Just now ran this same case (675 nodes, 43200 MPI's, 2 hyperthreads each).
a) MPICH_GNI_DYNAMIC_CONN = enabled: Init Time of 15m 28s, total model cost of 17m 49s.
b) MPICH_GNI_DYNAMIC_CONN = disabled: total model cost of 13m 50s.
So, I still see an advantage, even compared to just the Init Time for (a). |
@ndkeen, just to be clear, my case was built from master (updated today). |
…1837)

High MPI overhead occurs on some systems the first time that two processes communicate. In typical usage there are two types of nonlocal message patterns: one based off of the root of each component, and one (in PIO) based off of root+1. By changing the default for PIO_ROOT from one to zero, this start-up overhead is approximately halved. Using a PIO_ROOT value of zero also allows the default performance timing settings to better capture PIO performance. Finally, a PIO_ROOT value of one has no special advantage over zero with current multi- and many-core processor architectures. This has been verified in recent performance benchmarking on multiple systems and multiple cases, with a PIO_ROOT value of zero performing better than a PIO_ROOT of one even in the RUN LOOP. This addresses issue #1578, but that issue can never really be solved, only mitigated. There may be more appropriate workarounds yet to come.

[BFB] [NML]

* worleyph/cime/pio_root_zero_as_default: Set PIO_ROOT defaults to be zero instead of one

Conflicts: cime/src/drivers/mct/cime_config/config_component.xml
I just merged #2026 into master, which sets PIO_ROOT=0 as the default. I suspect we are not done with this issue, but this hopefully helps (and doesn't cause any other issues). It's also possible that we might have to go back to using PIO_ROOT=1 for memory concerns. I currently still have it set to 0, but have been running into memory issues. |
Noting that this issue & PR also had an impact. |
It looks like we have solved the issue of long OCN init times; however, now I'm seeing that as I add MPI tasks to the ATM, the init time increases more than I would like. If there is anything obvious here, let me know.
For a run where I used 1350 nodes and 86400 MPI tasks, the total ATM init time is 2571 seconds.
This is a copy/paste of the top of model.timing.00000 (where the ATM and the ioprocs live).
This is with a PIO stride of 64, so 1 ioproc per node, which @mt5555 has been telling me is way too many. I will try some experiments using larger strides and see what happens (see the sketch below).
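A sketch of the stride experiment: with 86400 tasks, a stride of 64 means 86400/64 = 1350 I/O tasks (one per node), and doubling the stride halves that count. As with PIO_ROOT, the exact variable id is an assumption to verify against env_run.xml.

```bash
# Sketch: check the exact id(s) in env_run.xml; the stride value is just an example.
grep -i PIO_STRIDE env_run.xml
./xmlchange PIO_STRIDE=128    # 86400/128 = 675 I/O tasks instead of 1350
```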