Slow-ish Init in ATM for hires run with large number of MPI's on cori-knl #1578

Closed
ndkeen opened this issue Jun 7, 2017 · 41 comments

@ndkeen
Contributor

ndkeen commented Jun 7, 2017

It looks like we have solved the issue of long OCN init times; however, now I'm seeing that as I add MPI tasks to the ATM, the init time increases more than I would like. If there is anything obvious here, let me know.

For a run where I used 1350 nodes and 86400 MPI tasks, the total ATM init time is 2571 seconds.
This is a copy/paste of the top of model.timing.00000 (where the ATM and the ioprocs live).
This is with a PIO stride of 64, so 1 ioproc per node, which @mt5555 has been telling me is way too many. I will try some experiments using larger strides and see what happens.

  "CPL:INIT"                                                              -        2    -    4073.949219  4061.454346    12.494794         0.000002 
    "CPL:cesm_pre_init2"                                                  -        1    -      12.449738    12.449738    12.449738         0.000001 
    "cesm_init"                                                           -        1    -    4061.432617  4061.432617  4061.432617         0.000001 
      "CPL:init_comps"                                                    -        1    -    3810.102539  3810.102539  3810.102539         0.000001 
        "comp_init_pre_all"                                               -        1    -       0.083295     0.083295     0.083295         0.000001 
        "comp_init_cc_atm"                                                -        1    -    2571.797363  2571.797363  2571.797363         0.000001 
          "a_i:PIO:PIO_openfile"                                          -       61    -     158.213913    13.942003     0.336415         0.000049 
          "a_i:PIO:pio_get_var_1d_double"                                 -      113    -       5.425022     1.167913     0.022379         0.000090 
          "a_i:phys_grid_init"                                            -        1    -     133.904907   133.904907   133.904907         0.000001 
          "a_i:get_decomp"                                                -       39    -    1227.597168  1222.137451     0.000055         0.000031 
            "a_i:get_filemap"                                             -        5    -       0.093036     0.092973     0.000010         0.000004 
            "a_i:newdecomp"                                               -        5    -    1227.498535  1222.041626     0.997233         0.000004 
*             "a_i:PIO:PIO_initdecomp_dof"                                -        6    -    1228.539917  1222.022949     0.997227         0.000005 
                "a_i:PIO:pio_rearrange_create_box"                        -        6    -    1227.946411  1221.628906     0.987424         0.000005 
          "a_i:PIO:pio_read_darray"                                       -       39    -     595.134888   584.083130     0.153516         0.000031 
            "a_i:PIO:pio_read_nf"                                         -       39    -     593.281372   584.037231     0.063636         0.000031 
              "a_i:PIO:pio_read_nfdarray_double"                          -       39    -     593.261292   584.017578     0.063627         0.000031 
            "a_i:PIO:pio_rearrange_read"                                  -       39    -       1.829831     0.293735     0.018234         0.000031 
              "a_i:PIO:pio_rearrange_io2comp_double"                      -       39    -       1.829200     0.293728     0.018227         0.000031 
                "a_i:PIO:swapm_box_rear_io2comp_double"                   -       39    -       1.785178     0.292382     0.017234         0.000031 
          "a_i:prim_printstate"                                           -        1    -       0.344330     0.344330     0.344330         0.000001 
*           "a_i:shr_reprosum_int"                                        -       46    -       0.559446     0.295437     0.000402         0.000037 
              "a_i:repro_sum_loopa"                                       -       46    -       0.076683     0.076603     0.000001         0.000037 
              "a_i:repro_sum_allr_minmax"                                 -       46    -       0.338489     0.141277     0.000149         0.000037 
              "a_i:repro_sum_loopb"                                       -       46    -       0.000116     0.000009     0.000001         0.000037 
              "a_i:repro_sum_allr_i8"                                     -       46    -       0.060827     0.046707     0.000130         0.000037 
          "a_i:PIO:PIO_createfile"                                        -        1    -       0.369699     0.369699     0.369699         0.000001 
          "a_i:PIO:pio_write_darray"                                      -        1    -       1.252447     1.252447     1.252447         0.000001 
            "a_i:PIO:pio_rearrange_write"                                 -        1    -       0.274641     0.274641     0.274641         0.000001 
              "a_i:PIO:pre_pio_write_nf"                                  -        1    -       0.406326     0.406326     0.406326         0.000001 
                "a_i:PIO:pio_rearrange_comp2io_int"                       -        1    -       0.222040     0.222040     0.222040         0.000001 
                  "a_i:PIO:swapm_box_rear_comp2io_int"                    -        1    -       0.092857     0.092857     0.092857         0.000001 
            "a_i:PIO:pio_write_nf"                                        -        1    -       0.088327     0.088327     0.088327         0.000001 
              "a_i:PIO:pio_write_nfdarray_int"                            -        1    -       0.088257     0.088257     0.088257         0.000001 
            "a_i:PIO:post_pio_write_nf"                                   -        1    -       0.757628     0.757628     0.757628         0.000001 
              "a_i:PIO:allred_add_data_to_buf"                            -        1    -       0.755920     0.755920     0.755920         0.000001 
          "a_i:PIO:PIO_closefile"                                         -       37    -       3.712584     2.220797     0.006267         0.000029 
          "a_i:PIO:pio_get_var_0d_text"                                   -       18    -       0.660133     0.072124     0.024909         0.000014 
          "a_i:PIO:pio_get_var_5d_double"                                 -       16    -       0.800779     0.085852     0.039741         0.000013 
          "a_i:PIO:pio_get_var_2d_double"                                 -       33    -       1.209028     0.099566     0.027176         0.000026 
          "a_i:PIO:pio_get_var_0d_double"                                 -       61    -       1.637957     0.051915     0.022119         0.000049 
          "a_i:PIO:pio_get_vara_1d_double"                                -       13    -       0.561216     0.197447     0.024395         0.000010 
          "a_i:PIO:pio_get_var_1d_int"                                    -       26    -       5.091221     1.362160     0.028045         0.000021 
          "a_i:PIO:pio_get_var_3d_int"                                    -        1    -       0.173343     0.173343     0.173343         0.000001 
          "a_i:PIO:pio_get_var_3d_double"                                 -        2    -       0.295559     0.162099     0.133460         0.000002 
@worleyph
Contributor

worleyph commented Jun 8, 2017

Hi @ndkeen , What create_newcase command were you using? I'd like to look at this on Titan, for comparison.

@worleyph
Contributor

worleyph commented Jun 8, 2017

I think I found it in the performance_archive - please verify, though.

@ndkeen
Contributor Author

ndkeen commented Jun 9, 2017

Well, it was a coupled hi-res case that used the run_acme script.

README.case:2017-06-05 06:11:55: /global/cscratch1/sd/ndk/wacmy/ACME_code/2017-master-jun1/cime/scripts/create_newcase --case /global/cscratch1/sd/ndk/ACME_simulations/2017-master-jun1.hmod1650.master-jun1.nc.nr.st.ne120np4_oRRS18to6v3_ICG.cori-knl/case_scripts --mach cori-knl --compset A_WCYCL2000_H01AS --res ne120np4_oRRS18to6v3_ICG --walltime 8:00:00 --project acme --pecount M

With changes after the case was created, of course.

@worleyph
Contributor

worleyph commented Jun 9, 2017

Got it. Thanks.

@worleyph
Contributor

worleyph commented Jun 9, 2017

Cost on Titan is also "high" (40 min. is a lot, though not the 68 min. seen on Cori-KNL):

 "cesm_init"   -        1    -    2425.067627

but

 "a_i:PIO:pio_rearrange_create_box"    -        6    -     237.736053

is about one fifth as costly on Titan as it is on Cori-KNL (unless this is capturing some load imbalance from elsewhere). FYI.

@worleyph
Contributor

Titan experiments may not be completely relevant to Cori-KNL, but moving from 86400x1 to 43200x2 in ATM and making similar changes in other components decreased the A_WCYCL initialization cost from 2040 seconds to 740 seconds. Similarly, changing from 86400x1 to 43200x1 decreased the cost to 1023 seconds, so some of the high initialization cost for 86400x1 on Titan was due to using 16 MPI tasks per node, but most was an algorithmic scaling issue. I'll see if I can identify where this is coming from. Again, Titan does not see a high pio_rearrange_create_box, so some of the Cori-KNL overhead is Cori-KNL-specific.

@worleyph
Contributor

@ndkeen , what is the latest on this on Cori-KNL? Does using threading (and fewer MPI processes) eliminate the problem? Are you still interested in diagnosing the initialization performance issue with the 86400x1 ATM process layout?

My Titan jobs are taking a while to get scheduled, so I will jump to Cori-KNL as well if you can provide a reproducer. I'm guessing that an F case would be sufficient? In any case, please advise.

@ndkeen
Contributor Author

ndkeen commented Jul 12, 2017

Looking over several runs at the timer comp_init_cc_atm, I see the following.
The number of nodes is in the first column, and the other columns are labeled by MPI tasks per node x threads.
Certainly true that threads are having a large impact on init performance, which I had not realized until I made this table, although the two measurements with 675 nodes show a decrease with 2 threads over 1 thread.

nodes   64x1   64x2   64x4  32x4  32x8
85       83     102        
169      130    143   173
170                         227    281
318  
338      333    493
675     1003   1046
1350    3094   3216        1098   1100
2700                       4971 

To reproduce, it should be as simple as:
-compset FC5AV1C-H01A -res ne120_ne120

The current default sets -c 4. To try other values, you have to change config_machines.xml.

@worleyph
Contributor

Looking at the raw timings for 86400x1, around half of the comp_init_cc_atm time is unattributed. I'll need to run this case myself with some additional timers added.

@ndkeen , I'll repeat your earlier run, using the latest master with some additional instrumentation tweaks.

jgfouca pushed a commit that referenced this issue Jul 14, 2017
Rename DATM S1850 mode to CPLHIST

Use %CPLHIST for DATM coupler hist forcing

Coupler history forcing should have a %CPLHIST mode for any data component
that uses it. This change makes DATM conform to this convention.

It also fixes the invalid quotes in DROF valid values in #1577

Also changes strm_domdir to null for the cplhist streams: 
For CPLHIST you want to get the domain from the first stream
file. It's important that the stream domain match the stream data. So when
strm_domdir is null, shr_strdata ensures that this is the case. This is also
happening for the drof CPLHIST mode.

Also increases dtlimit to 3 in some places (it was 2 before): 
For the CPLHIST mode and the first time sample it
needs to be 3 - and we want this to work out of the box. 
It's okay that this loosens the error checking a bit:
it's only a very large value - e.g. 100 - that would be of concern.

Test suite: scripts_regression_tests
also verified that the following CESM spinup compset
1850_DATM%CPLHIST_SLND_CICE_POP2%ECO_DROF%CPLHIST_SGLC_WW3
gave the correct values
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1577 
Fixes #1576
User interface changes?: None
Code review: Bill Sacks
@ndkeen
Contributor Author

ndkeen commented Jul 18, 2017

I tried a few experiments with nothing gained.
I ran 3 more 675-node cases with 1 thread. Also note that I had the wrong results in the above table for 675 nodes: I had listed the total init time, not the ATM time only (I edited it). I will try to plot the time as a function of MPI tasks, which will probably tell the story better.

  1. I simply ran again with a more recent version of master: 867 s
  2. I ran that case again using sbcast -- a Slurm tactic to speed up getting the executable out to the ranks, though it may also be testing something else less well understood -- no real change: 925 s
  3. Ran again with Intel MPI (impi): 998 s

SMS.ne120_ne120.m39n675p43200t01.sbcast
SMS.ne120_ne120.m39n675p43200t01.impi2017

@ndkeen
Contributor Author

ndkeen commented Jul 19, 2017

Pat pointed out to me that the experiments I tried with a different MPI version seem to still be using cray-mpich. I tried a few other things, then realized NERSC needs us to use a different method. The Cray wrappers ftn, cc, and CC are, of course, going to point to Cray MPI. So to use Intel MPI (or OpenMPI), we need to use the compilers mpiifort, mpiicc, and mpiicpc, as well as some I_MPI_* env variables. So I went to the trouble of creating a new compiler definition, intel-impi, to try this experiment. It does at least build.

Pat, would you mind describing in a sentence or two what you think you are seeing regarding the slow down? Something we can send to NERSC just in case they have seen this before? Or know something to try? Is it that this extra time is being seen at the first MPI point-to-point?

@worleyph
Copy link
Contributor

Pat pointed out to me that the experiments I tried with a different MPI version seem to still be using cray-mpich.

A little strong - just noted that in my experiments, following @ndkeen 's lead, I can't tell what version of MPI I am using based on the build and run logs. Since performance in my one experiment (using openmpi) was very similar to that when using cray-mpich, I just wanted to verify that I was comparing what I thought I was comparing.

Noel, I'll provide a short description of what I am seeing in the near future. I am traveling at the moment.

@ndkeen
Contributor Author

ndkeen commented Jul 20, 2017

I managed to get a run going with Intel MPI. It got stuck at shutdown and failed, but not before it wrote out a timing.tar file, which allows me to see the time for comp_init_cc_atm. This was only a test problem using an ne30 F case, but the time in that timer is 2x larger using Intel MPI than Cray MPI. I can try this with an ne120 case, but it may not be worth it.

@worleyph
Contributor

I determined that my OpenMPI jobs were not actually using OpenMPI. I tried @ndkeen 's modifications to use Intel MPI. This built (and was accessing the Intel MPI include files). Run died pretty quickly with:

 ...
 28813: acme.exe: error: slurm_get_kvs_comm_set: Transport endpoint is not connected
 ...
 00611: acme.exe: error: slurm_send_kvs_comm_set: Connection timed out
 ...
 00611: Fatal error in MPI_Init: Other MPI error, error stack:
 00611: MPIR_Init_thread(805).......: fail failed
 00611: MPID_Init(1831).............: channel initialization failed
 00611: MPIDI_CH3_Init(147).........: fail failed
 00611: MPID_nem_ofi_post_init(1117): fail failed
 00611: getConnInfoKVS(905).........: PMI_KVS_Get failed
 00611: In: PMI_Abort(69253391, Fatal error in MPI_Init: Other MPI error, error stack:
 00611: MPIR_Init_thread(805).......: fail failed
 00611: MPID_Init(1831).............: channel initialization failed
 00611: MPIDI_CH3_Init(147).........: fail failed
 00611: MPID_nem_ofi_post_init(1117): fail failed
 00611: getConnInfoKVS(905).........: PMI_KVS_Get failed)

I'll give up on the other MPI libraries for the moment.

@worleyph
Contributor

@ndkeen ,

Pat, would you mind describing in a sentence or two what you think you are seeing regarding the slow down?

Not a sentence or two, but here is a summary of the latest. I have lots more details, but this should be enough for you to broach the topic with NERSC if you feel that it is worthwhile to pursue it.


I added some more instrumentation into ACME to help isolate the
sources of the poor scaling in the initialization overhead. For my
runs,

270 nodes (17280 MPI processes, 64 per node)

  "CPL:INIT"  543.889771 seconds

450 nodes (28800 MPI processes, 64 per node)

  "CPL:INIT"  874.999878 seconds

675 nodes (43200 MPI processes, 64 per node)

  "CPL:INIT" 1420.839233 seconds

It appears that all of the performance loss is in the MPI overhead,
apparently the first time that two processes exchange data for a
given communicator, for ATM, for LND, and for ICE in this F case.
For example, in the creation of a data type to use in subsequent
communication as part of the parallel I/O layer (PIO), sending data
from each process to the appropriate PIO read/write process, there is
a many-to-some communication pattern. For this pattern the cost for
the 675 node run was, for process 0 (not a PIO process)
(total over all calls, max per call)

                          Called              Total          Max
 "a_i:pre-CC B-bar"            6           0.042269     0.037482
 "a_i:CC B"                    6         280.114380   278.405823
 "a_i:post-CC B-bar"           6         173.111191   172.199188

for process 1 (a PIO process)

 "a_i:pre-CC B-bar"            6           0.041991     0.037327
 "a_i:CC B"                    6         281.709167   279.997375
 "a_i:post-CC B-bar"           6         171.516434   170.607605

while the data over all processes (max and min) was

                          wallmax (proc   thrd  )   wallmin (proc   thrd  )
 "a_i:pre-CC B-bar"         0.130 (  2945      0)     0.003 ( 37308      0)
 "a_i:CC B"               453.217 ( 40894      0)   280.114 (     0      0)
 "a_i:post-CC B-bar"      173.111 (     0      0)     0.021 ( 24641      0)

Note that only one call (the hypothesis is that it is the first) accounted
for almost all of the cost out of the 6 times it was called. There was
no load imbalance coming into the operator, and quite a bit coming out
of it. There were 675 PIO processes in this operator, one per node.

This communication operator was implemented with an MPI point-to-point
algorithm. If I instead used MPI_Alltoallv, the cost went to almost
zero for this particular timer. However the cost then moved to the
next point-to-point implementation of a non-local communication
operator. Also, with each replacement the overall cost of CPL:Init
generally increased. For example, when replacing all point-to-point
calls with collectives (MPI_Alltoallv, MPI_Alltoallw, MPI_Gather), the
CPL:Init cost increases to

  "CPL:INIT" 2426.001709 seconds
   "comp_init_cc_atm"   607.984802
   "comp_init_cc_lnd"  1129.624390
   "comp_init_cc_ice"   561.643188

as compared to the version with the point-to-point implementation.

  "CPL:INIT" 1420.839233 seconds
   "comp_init_cc_atm"   879.601685
   "comp_init_cc_lnd"    93.860229
   "comp_init_cc_ice"   350.327026

Note that the point-to-point implementation was motivated
by very poor performance with Cray-optimized versions of MPICH on the
Gemini network when MPI_Alltoall was used to implement this
many-to-some pattern, especially when there is load imbalance coming
into the operator. The assumption is that a very large number of
unexpected messages was causing the degraded performance,
even in the MPI library collectives. It also appears to be a useful
optimization on the NERSC systems, though not as dramatic as on Titan.
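
To make the pattern concrete, below is a minimal standalone sketch (in C, for illustration only - it is not the Fortran pio_swapm code) of the two variants being compared: a handshaked point-to-point gather from compute tasks to one I/O task per group of IO_STRIDE ranks, versus the same data movement expressed as a single MPI_Alltoallv. IO_STRIDE, COUNT, and the serialized handshaking loop are simplifying assumptions; the real code keeps a window of outstanding requests.

 /* many2some.c -- illustrative sketch only; this is NOT PIO's pio_swapm.
  * Contrasts a handshaked point-to-point gather (variant 0) with the same
  * many-to-some movement expressed as MPI_Alltoallv (variant 1).
  * Assumed build/run:  mpicc many2some.c -o many2some ; srun -n <N> ./many2some
  */
 #include <mpi.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

 #define IO_STRIDE 64   /* one "I/O" task per IO_STRIDE ranks (assumed value) */
 #define COUNT     64   /* ints contributed by each task (assumed value)      */

 int main(int argc, char **argv)
 {
     int rank, nprocs;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

     int io_root = (rank / IO_STRIDE) * IO_STRIDE;   /* my I/O task */
     int is_io   = (rank == io_root);
     int ngroup  = (io_root + IO_STRIDE <= nprocs) ? IO_STRIDE
                                                   : nprocs - io_root;

     int *sendbuf = malloc(COUNT * sizeof(int));
     int *recvbuf = malloc((size_t)(is_io ? ngroup : 1) * COUNT * sizeof(int));
     for (int i = 0; i < COUNT; i++) sendbuf[i] = rank;

     /* MPI_Alltoallv arguments: every rank sends only to its I/O task. */
     int *scnt = calloc(nprocs, sizeof(int)), *sdsp = calloc(nprocs, sizeof(int));
     int *rcnt = calloc(nprocs, sizeof(int)), *rdsp = calloc(nprocs, sizeof(int));
     scnt[io_root] = COUNT;
     if (is_io)
         for (int i = 0; i < ngroup; i++) {
             rcnt[io_root + i] = COUNT;
             rdsp[io_root + i] = i * COUNT;
         }

     for (int variant = 0; variant < 2; variant++) {
         MPI_Barrier(MPI_COMM_WORLD);
         double t0 = MPI_Wtime();

         if (variant == 0) {
             /* Handshaked point-to-point: the I/O task tells each sender when
              * to send, so no message arrives unexpected.  Serialized here for
              * brevity; the real code keeps a window of outstanding requests. */
             if (is_io) {
                 memcpy(recvbuf, sendbuf, COUNT * sizeof(int));  /* own data */
                 for (int i = 1; i < ngroup; i++) {
                     int go = 1;
                     MPI_Send(&go, 1, MPI_INT, io_root + i, 0, MPI_COMM_WORLD);
                     MPI_Recv(recvbuf + (size_t)i * COUNT, COUNT, MPI_INT,
                              io_root + i, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                 }
             } else {
                 int go;
                 MPI_Recv(&go, 1, MPI_INT, io_root, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
                 MPI_Send(sendbuf, COUNT, MPI_INT, io_root, 1, MPI_COMM_WORLD);
             }
         } else {
             /* Same many-to-some movement as a single collective call. */
             MPI_Alltoallv(sendbuf, scnt, sdsp, MPI_INT,
                           recvbuf, rcnt, rdsp, MPI_INT, MPI_COMM_WORLD);
         }

         double dt = MPI_Wtime() - t0, dtmax;
         MPI_Reduce(&dt, &dtmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
         if (rank == 0)
             printf("variant %d (%s): max %f s\n", variant,
                    variant ? "MPI_Alltoallv" : "handshaked p2p", dtmax);
         /* Caveat: variant 0 already establishes the connections in this run,
          * so variant 1 will not show a first-connection cost here; to compare
          * that cost, run each variant in a separate job. */
     }

     free(sendbuf); free(recvbuf);
     free(scnt); free(sdsp); free(rcnt); free(rdsp);
     MPI_Finalize();
     return 0;
 }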

@worleyph
Contributor

Ran a 1350 node job. This reproduces @ndkeen 's earlier results, but my runs also have some additional barriers, so this further verifies the scaling issues:

270 nodes (17280 MPI processes, 64 per node)

"CPL:INIT"  543.889771 seconds
     "comp_init_cc_atm"        282.020813
     "comp_init_cc_lnd"         59.025913
     "comp_init_cc_ice"        141.156219

450 nodes (28800 MPI processes, 64 per node)

"CPL:INIT"  874.999878 seconds
     "comp_init_cc_atm"        483.143494
     "comp_init_cc_lnd"         78.226891
     "comp_init_cc_ice"        228.905304       

675 nodes (43200 MPI processes, 64 per node)

"CPL:INIT"  1420.839233 seconds
     "comp_init_cc_atm"        879.601685
     "comp_init_cc_lnd"         93.860229
     "comp_init_cc_ice"        350.327026   

                          Called              Total          Max
  "a_i:pre-CC B-bar"            6           0.042269     0.037482
  "a_i:CC B"                    6         280.114380   278.405823
  "a_i:post-CC B-bar"           6         173.111191   172.199188

1350 nodes (86400 MPI processes, 64 per node)

"CPL:INIT"  4139.669922 seconds
     "comp_init_cc_atm"       2986.752197
     "comp_init_cc_lnd"        294.270386
     "comp_init_cc_ice"        693.185913      

                           Called              Total          Max
  "a_i:pre-CC B-bar"            6           0.037806     0.029725
  "a_i:CC B"                    6        1008.000854 1004.3925783
  "a_i:post-CC B-bar"           6         650.500305   648.610535

Note that CICE initialization (not MPAS-CICE) also grows, but ATM is the dominant cost. I have not looked at CICE closely (and hope not to have to), but I have no reason to believe that it is not also an MPI initialization issue.

@ndkeen
Contributor Author

ndkeen commented Jul 28, 2017

Since it was easy, I ran some ne120 F cases on Edison (after the upgrade). The init times aren't nearly as bad. One run used 21600 MPI tasks and the other two used 43200 (I think I was testing a cpu_bind flag or something).

SMS.ne120_ne120.mfedihotHn0900p21600t01/timing/acme_timing_stats.170727-120549.gz:"comp_init_cc_atm"                               21600    21600 2.160000e+04   2.685356e+06   124.430 ( 15575      0)   124.272 ( 12227      0)
SMS.ne120_ne120.mfedihotHn1800p43200t01/timing/acme_timing_stats.170727-172203.gz:"comp_init_cc_atm"                               43200    43200 4.320000e+04   1.047854e+07   242.902 ( 16478      0)   242.467 ( 15011      0)
SMS.ne120_ne120.mfedihotHn1800p43200t01b/timing/acme_timing_stats.170727-173615.gz:"comp_init_cc_atm"                               43200    43200 4.320000e+04   1.111384e+07   257.492 ( 19944      0)   257.155 ( 39246      0)

@worleyph
Contributor

worleyph commented Jul 28, 2017

I have a reproducer that we can pass on to NERSC and Cray if you want. I'll continue to refine this, and maybe make it even smaller. I'll also be using it to examine workarounds.

This just measures the cost of the PIO communication pattern used during the data type creation in box_rearrange_create when there is one PIO process per node (stride 64, offset 1) for two successive calls, surrounded (and separated) by barriers. Time reported is maximum over all processes.

 4 nodes (256 processes): 
  (a) 2.989506721496582E-002
  (b) 1.514911651611328E-003

 256 nodes (4096 processes):
  (a) 58.5745210647583
  (b) 0.119852066040039

 675 nodes (43200 processes):
  (a) 456.856745004654
  (b) 0.653519868850708

 1024 nodes (65536 processes):
  (a) 748.560154914856
  (b) 0.525825977325439

 1350 nodes (86400 processes):
  (a) 1633.26983404160
  (b) 1.03820204734802
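
For reference, the general shape of such a reproducer is sketched below (an illustrative C sketch with assumed parameters, not Pat's actual code): each group of STRIDE ranks exchanges a short message with its designated "I/O" rank twice, with barriers around and between the rounds, so that round (a) includes any first-connection setup cost and round (b) does not.

 /* first_touch.c -- illustrative sketch of a two-round timing harness; it is
  * not the actual reproducer.  The actual reproducer uses one PIO task per
  * node with stride 64 and offset 1; offset 0 is used here for simplicity.
  * Assumed build/run:  mpicc first_touch.c -o first_touch ; srun -n <N> ./first_touch
  */
 #include <mpi.h>
 #include <stdio.h>
 #include <stdlib.h>

 #define STRIDE 64   /* one target "I/O" rank per STRIDE ranks (assumed) */
 #define COUNT  64   /* ints per message (assumed)                       */

 int main(int argc, char **argv)
 {
     int rank, nprocs;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

     int root   = (rank / STRIDE) * STRIDE;
     int ngroup = (root + STRIDE <= nprocs) ? STRIDE : nprocs - root;
     int *buf   = calloc((size_t)ngroup * COUNT, sizeof(int));

     for (int round = 0; round < 2; round++) {          /* round (a), then (b) */
         MPI_Barrier(MPI_COMM_WORLD);
         double t0 = MPI_Wtime();

         if (rank == root) {                            /* "I/O" rank gathers  */
             MPI_Request *req = malloc((size_t)(ngroup > 1 ? ngroup - 1 : 1)
                                       * sizeof(MPI_Request));
             for (int i = 1; i < ngroup; i++)
                 MPI_Irecv(buf + (size_t)i * COUNT, COUNT, MPI_INT,
                           root + i, 0, MPI_COMM_WORLD, &req[i - 1]);
             if (ngroup > 1)
                 MPI_Waitall(ngroup - 1, req, MPI_STATUSES_IGNORE);
             free(req);
         } else {                                       /* compute rank sends  */
             MPI_Send(buf, COUNT, MPI_INT, root, 0, MPI_COMM_WORLD);
         }

         /* Round (a) pays any first-connection setup cost; round (b) reuses
          * the already-established connections.                              */
         double dt = MPI_Wtime() - t0, dtmax;
         MPI_Reduce(&dt, &dtmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
         if (rank == 0)
             printf("(%c) %f s\n", round ? 'b' : 'a', dtmax);
     }

     free(buf);
     MPI_Finalize();
     return 0;
 }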

@ndkeen
Contributor Author

ndkeen commented Jul 28, 2017

Hi Pat, I was distracted by the edison upgrade. You have a reproducer? Wow. Very well done. Certainly interested. Should we put it in the repo? I know that can be a lot of work. I did send emails to two folks at NERSC, but have not heard back. With stand-alone code, I can be much more confident to ask for help. Does it surprise you that edison is quicker? Note, it's free to run on edison until July 31st.

@worleyph
Contributor

I don't know what would be involved with putting it in the repo. Where would it go, and what would it be used for in the future (and how would it be maintained)?

At the moment my driver code references

pio_kinds.F90
pio_support.F90

in cime/src/externals/pio1/pio, and

pio_spmd_utils.F90

from a build directory, generated from pio_spmd_utils.F90_in, also in cime/src/externals/pio1/pio .

For this standalone reproducer, I just copied over these source files, and included instructions for building using these files based on the ACME Cori-KNL module loads and compiler options.

@philipwjones
Contributor

@ndkeen and I had a brief email exchange on creating a performance test suite (e.g. as a cime test suite) mostly for benchmarking our primary configurations, but small test cases/reproducers that hit typically performance-sensitive bits like this would also be useful.

@worleyph
Contributor

This particular test is targeting an issue that seems to be somewhat peculiar to Cori-KNL, and only occurs during the initialization. A more typical PIO gather or scatter communication test would be different. (Actually that is what I started from - I modified it to use the initialization pattern.)

@jayeshkrishna may already have PIO1 and PIO2 performance tests that include these MPI communication patterns? I'm not competent to modify PIO, or CIME, to add build or run support for these types of tests. I can continue to prototype them, if that would be useful.

@ndkeen
Contributor Author

ndkeen commented Jul 28, 2017

I suppose whether or not it goes into the repo (and where) is up to others, but I'm a fan of stand-alone tests as we can do more with them. I would certainly like to try it when you are ready to let me see it.

@worleyph
Contributor

@ndkeen , please grab

 ATM_initialization_cost_scaling_reproducer.tar.gz

from /global/u2/w/worleyph at NERSC. This expands to a directory called

 ATM_initialization_cost_scaling_reproducer

There is a Notes.txt file in the directory that should get you started. Further questions are probably best handled in private e-mails until you are ready to hand this off to NERSC.

@worleyph
Contributor

worleyph commented Aug 3, 2017

Hi @jayesh (and cc'ing @ndkeen ),

(Summarizing above material) The latest on the poor initialization scaling on Cori-KNL is that this appears to be occurring the first time two processes communicate, and is showing up when calling swapm in compute_counts in box_rearrange_create (the first time this is called in ATM initialization). If both swapm calls in compute_counts are replaced with calls to MPI_Alltoallv, then this first-call overhead disappears from there and reappears in LND initialization in the first couple of calls to a2a_box_rear_io2comp_double or swapm_box_rear_io2comp_double, i.e. whether using MPI_Alltoallw or swapm.

I created two standalone test programs that either gather information from all "compute" tasks to "IO" tasks (comp2io) or scatter information from "IO" tasks to "compute" tasks (io2comp). This verifies the above: the first call when using swapm or MPI_Alltoallw has a high cost and scales poorly, and MPI_Alltoallv performs much better. This does not provide a real workaround, since we can't use MPI_Alltoallv for a2a_box_rear_io2comp_double - this truly is an MPI_Alltoallw operator unless we stop using MPI data types and pack and unpack buffers manually.

Note that the performance comparison for subsequent calls is more complicated, but it does capture the "with handshaking" and "without handshaking" difference in performance for comp2io and io2comp. Since these standalone programs do not capture the actual comp2io and io2comp message patterns in ACME - e.g. not all compute tasks send to all io tasks - they are also not adequate for optimizing settings in ACME, and this still needs to be done in the model.

I started instrumenting compute_counts to output the different message patterns created in PIO so that I could summarize them for @ndkeen and perhaps further generalize the standalone test programs. However, I just remembered that you have been working on PIO test programs that also use extracted communication patterns - perhaps these are what Noel and I should be using, or perhaps you can help us with this testing on Cori-KNL.

At this point, the only solution I can think of is to tell NERSC and Cray about the high start-up cost in swapm and MPI_Alltoallw and ask them to see if they can do for these whatever magic they did for MPI_Alltoallv. However, having an accurate PIO test program may allow us to experiment with our own workarounds (since I am not holding my breath that Cray will address our request in a timely fashion).
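
To illustrate why a2a_box_rear_io2comp_double is "truly an MPI_Alltoallw operator": when each destination's data sits at a different, non-contiguous (e.g. strided) location in the I/O task's buffer, the scatter needs a different datatype or displacement per destination, which MPI_Alltoallw allows but MPI_Alltoallv does not - short of packing contiguous buffers by hand. A toy C sketch follows (the layout, group size, and message length are illustrative assumptions, not PIO's actual decomposition):

 /* io2comp_w.c -- toy sketch (not PIO code) of a scatter that needs a datatype
  * per destination: each I/O rank holds its group's data interleaved by
  * destination, so the slice for each compute rank is strided.  MPI_Alltoallw
  * can describe this directly; MPI_Alltoallv cannot, short of packing and
  * unpacking contiguous buffers by hand.
  * Assumed build/run:  mpicc io2comp_w.c -o io2comp_w ; srun -n <N> ./io2comp_w
  */
 #include <mpi.h>
 #include <stdio.h>
 #include <stdlib.h>

 #define IO_STRIDE 64   /* one I/O rank per IO_STRIDE ranks (assumed) */
 #define COUNT     8    /* doubles delivered to each compute rank     */

 int main(int argc, char **argv)
 {
     int rank, nprocs;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

     int io_root = (rank / IO_STRIDE) * IO_STRIDE;
     int is_io   = (rank == io_root);
     int ngroup  = (io_root + IO_STRIDE <= nprocs) ? IO_STRIDE
                                                   : nprocs - io_root;

     /* I/O buffer layout: element j for group member i is at j*ngroup + i. */
     double *iobuf   = malloc((size_t)(is_io ? ngroup : 1) * COUNT * sizeof(double));
     double *recvbuf = malloc(COUNT * sizeof(double));
     if (is_io)
         for (int i = 0; i < ngroup; i++)
             for (int j = 0; j < COUNT; j++)
                 iobuf[(size_t)j * ngroup + i] = io_root + i + 0.001 * j;

     /* Per-destination argument arrays for MPI_Alltoallw (displacements are
      * in bytes, and each destination may use its own datatype).            */
     int *scnt = calloc(nprocs, sizeof(int)), *sdsp = calloc(nprocs, sizeof(int));
     int *rcnt = calloc(nprocs, sizeof(int)), *rdsp = calloc(nprocs, sizeof(int));
     MPI_Datatype *stype = malloc((size_t)nprocs * sizeof(MPI_Datatype));
     MPI_Datatype *rtype = malloc((size_t)nprocs * sizeof(MPI_Datatype));
     for (int p = 0; p < nprocs; p++) { stype[p] = MPI_DOUBLE; rtype[p] = MPI_DOUBLE; }

     MPI_Datatype slice = MPI_DATATYPE_NULL;
     if (is_io) {
         /* One destination's slice: COUNT doubles, strided ngroup apart. */
         MPI_Type_vector(COUNT, 1, ngroup, MPI_DOUBLE, &slice);
         MPI_Type_commit(&slice);
         for (int i = 0; i < ngroup; i++) {
             scnt[io_root + i]  = 1;
             sdsp[io_root + i]  = i * (int)sizeof(double);   /* byte offset */
             stype[io_root + i] = slice;
         }
     }
     rcnt[io_root]  = COUNT;   /* every rank receives COUNT contiguous doubles */
     rdsp[io_root]  = 0;
     rtype[io_root] = MPI_DOUBLE;

     MPI_Alltoallw(iobuf, scnt, sdsp, stype,
                   recvbuf, rcnt, rdsp, rtype, MPI_COMM_WORLD);

     if (rank == nprocs - 1)
         printf("rank %d received %.3f .. %.3f\n",
                rank, recvbuf[0], recvbuf[COUNT - 1]);

     if (slice != MPI_DATATYPE_NULL) MPI_Type_free(&slice);
     free(iobuf); free(recvbuf);
     free(scnt); free(sdsp); free(rcnt); free(rdsp);
     free(stype); free(rtype);
     MPI_Finalize();
     return 0;
 }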

@ndkeen
Contributor Author

ndkeen commented Aug 3, 2017

Thanks Pat. Indeed I have been taking the stand-alone tests prepared by Pat and running them with various parameters (inputs, num MPI's, MPI env flags, etc). I'm trying to
a) learn more about this issue in general
b) try quick/easy experiments that could be great work-arounds if they work
c) "package up" in a way that we can send to NERSC for additional help
d) also trying to obtain more profiling data using craypat

I was derailed yesterday due to the power outage and now I need to work on some other things, but will soon get back to it. Jayesh: please let me know if there's something else I should be doing/trying.

@ndkeen
Contributor Author

ndkeen commented Oct 12, 2017

Using master from Oct 10, I ran several more F compsets. Pat already discovered that setting PIO_ROOT=0 was a benefit to the ATM init time. I was already using that in several other experiments, but not for these ATM init runs. So these are with a new repo and with:
./xmlchange --file env_run.xml PIO_ROOT=0
But it does NOT include the changes that Pat made in a branch.

675 nodes, 43200 MPI's, 2 threads:
"comp_init_cc_atm"                           43200    43200 4.320000e+04   3.303537e+07   764.829 (  3471      0)   764.613 ( 15367      0)

1350 nodes, 86400 MPI's, 2 threads:
"comp_init_cc_atm"                           86400    86400 8.640000e+04   1.844306e+08  2134.710 ( 62805      0)  2134.565 ( 74474      0)

1350 nodes, 86400 MPI's, 1 thread:
"comp_init_cc_atm"                           86400    86400 8.640000e+04   1.975688e+08  2286.824 ( 74283      0)  2286.572 ( 23764      0)

That's about 35 minutes, which is quite an improvement already.

@worleyph
Contributor

FYI, PIO_ROOT == 0 seems to perform better on Titan and on Anvil as well, for both low and high resolution cases, and both for initialization and for the run loop.

@PeterCaldwell
Contributor

That's great. Is @wlin7 using this change?

@worleyph
Contributor

@PeterCaldwell , I have a branch with two changes to address this. The first (setting PIO_ROOT to zero by default) is noncontroversial. The second may be specific to Cori-KNL. I'll split these into 2 branches and submit a pull request for the first one later today.

@PeterCaldwell
Contributor

Great, thanks for your work on this!

@worleyph
Contributor

worleyph commented Oct 19, 2017

@ndkeen , I have "found" another way to address this issue:

  MPICH_GNI_DYNAMIC_CONN
           By default, connections are set up on demand. This allows
           for optimal performance while minimizing memory
           requirements. If set to enabled, dynamic connections are
           enabled; if set to disabled, MPICH establishes all
           connections at job startup, which may require significant
           amounts of memory.

I am running experiments with

 setenv MPICH_GNI_DYNAMIC_CONN disabled

and initialization appears to be much faster. Setting PIO_ROOT to zero still seems to be a good idea, but the above may be the real solution for us, especially since at high process counts we do not have a large memory footprint and can afford the extra MPI memory?

One concern is that the initialization cost is pushed to MPI_Init, which we do not measure directly. The CaseStatus timing data also includes pre- and post-processing times, which makes it difficult to pull out just the srun time. I've created another github issue ( #1857 ) to get this fixed, but don't have the fix available for these studies yet.

Another concern is whether the RUN_LOOP performance is degraded by doing this. The discussion above does not imply that it would be, but we still need to be sure.

You might try the above in some of your experiments as well, to start evaluating the impact. The nice thing about this solution is that it is system-specific, and we would not impact performance on any other system.

You should also ask your NERSC and Cray contacts about this, and why they did not suggest it to us when we first asked them about mitigation techniques. Sure seems to be relevant.

@ndkeen
Contributor Author

ndkeen commented Oct 19, 2017

Yea, sorry Pat, in fact, this was what Nathan W (of Cray) did suggest. I thought I had noted it somewhere but can't find it. He was not optimistic it would help because, as you noted, the time might just show up in MPI_Init(). It was still on my list of things to try.

And, indeed, the time spent doing whatever is happening before or after acme.exe should be measured. This time is too high for runs on KNL and wastes MPPs at high scales. I've not had any luck getting anyone to help with this. We simply can't do those tasks on the compute nodes. The batch file should contain ONLY the srun command.

@worleyph
Contributor

Okay, I'm trying it now. It looks to be working, but I need to figure out how to determine the impact on MPI_Init. Allocating space and setting up all connections at once, versus reacting every time a message comes from a new process, should be faster, and perhaps much faster. I am still hopeful.

@worleyph
Contributor

worleyph commented Oct 21, 2017

@ndk, it appears to be somewhat advantageous to set MPICH_GNI_DYNAMIC_CONN=disabled. By looking in the job script output, e.g.

 2017-10-20 17:21:34 MODEL EXECUTION BEGINS HERE
 ...
 2017-10-20 18:10:18 MODEL EXECUTION HAS FINISHED

you can calculate total time to run the model.

For example, when MPICH_GNI_DYNAMIC_CONN=enabled, the above time (18:10:18 - 17:21:34) is just seconds larger than (Init Time + Run Time) from the acme timing summary file.

In contrast, when MPICH_GNI_DYNAMIC_CONN=disabled, the two model cost measures can differ by many minutes (as shown below). So it would appear that the job script output is the more reliable way to measure total model cost.

I ran experiments using

 -res ne120_ne120 -compset FC5AV1C-H01A 

on Cori-KNL using 270, 675, and 1350 nodes, all with PIO_ROOT = 0. In all cases, using MPICH_GNI_DYNAMIC_CONN=disabled was faster, at least by a little bit. The biggest difference was with 1350 nodes (86400x1 ATM decomposition), as you would expect.

Ran this twice with both MPICH_GNI_DYNAMIC_CONN settings. There was some performance variability, but 'disabled' was always better. Fastest runs for each are described below (one time step, no restart write):

MPICH_GNI_DYNAMIC_CONN = enabled (so current default)

 48m 4s by timers
   - 46m 25s in Init Time
   - 1m 39s in Run Time

 48m 44s by job script output

MPICH_GNI_DYNAMIC_CONN = disabled

 16m 34s by timers
    - 15m 58s in Init Time
    - 1m 36s in Run Time

 36m 49s by job script output

So, almost 12 minutes faster by using MPICH_GNI_DYNAMIC_CONN = disabled. It is annoying that the model cost is not all captured in the standard timing summary when using this setting, but the data is captured. I'll look into whether we can inject these data into the summary, but I would probably need help from CSEG.

In any case, please run your own experiments. This is all using MPI-only (though that should not make any difference), and we should verify that there is no impact on performance on the lower resolution models (should not be). This also makes my proposed pio_N2M replacement for pio_swapm in compute_counts irrelevant, and setting MPICH_GNI_DYNAMIC_CONN = disabled is more effective as well.

@ndkeen
Contributor Author

ndkeen commented Oct 25, 2017

Hmm, I tried "MPICH_GNI_DYNAMIC_CONN = disabled" and it's a little slower, but more troubling is that my repos dating after Oct 16th are slower to init than before. On Oct 13th, I measured 687 seconds for comp_init_cc_atm -- cori-knl, 675 nodes, 43200 MPI's, 2 hyperthreads each. On Oct 16th, I meausure 1082 s (and 1285 secs with the disabling of above env var). All of my runs use PIO_ROOT=0. I'll keep trying.

@worleyph
Contributor

Bummer - I thought that we had finished off this issue. I'll try to rerun some of my experiments as well.

@worleyph
Contributor

Just now ran this same case (675 nodes, 43200 MPI's, 2 hyperthreads each).

a) "MPICH_GNI_DYNAMIC_CONN = enabled"

Init Time   :     927.938 seconds

or 15m 28s

 2017-10-25 17:13:57 MODEL EXECUTION BEGINS HERE
 2017-10-25 17:31:46 MODEL EXECUTION HAS FINISHED

or 17m 49s

b) "MPICH_GNI_DYNAMIC_CONN = disabled"

Init Time   :     413.762 seconds

 2017-10-25 17:56:47 MODEL EXECUTION BEGINS HERE
 2017-10-25 18:10:37 MODEL EXECUTION HAS FINISHED

or 13m 50s

So, I still see an advantage, even compared to just Init Time for (a).

@worleyph
Contributor

@ndkeen , just to be clear, my case was

 -compset FC5AV1C-H01A -res ne120_ne120

using master (updated today).

jayeshkrishna added a commit that referenced this issue Nov 3, 2017
…1837)

High MPI overhead occurs on some systems the first time that
two processes communicate. In typical usage there are two types
of nonlocal message patterns: one based off of the root of each
component and one (in PIO) based off of root+1. By changing
the default for PIO_ROOT from one to zero, this start-up overhead
is approximately halved.

Using a PIO_ROOT value of zero also allows the default performance
timing settings to better capture PIO performance. Finally,
logically a PIO_ROOT value of one has no special advantage over
zero with current multi- and many-core processor architectures.
This has been verified in recent performance benchmarking on multiple
systems and multiple cases, with a PIO_ROOT value of zero performing
better than a PIO_ROOT of one even in the RUN LOOP.

This addresses issue #1578, but that issue can never really be solved,
only mitigated. There may be more appropriate workarounds
yet to come.

[BFB]
[NML]

* worleyph/cime/pio_root_zero_as_default:
  Set PIO_ROOT defaults to be zero instead of one

Conflicts:
	cime/src/drivers/mct/cime_config/config_component.xml
@ndkeen
Contributor Author

ndkeen commented Jan 24, 2018

I just merged #2026 into master, which sets:
setenv MPICH_GNI_DYNAMIC_CONN disabled
(only for cori-knl and only for cray-mpich)
I see a ~25% improvement in init when I use this for a 43200-way F-compset.

I suspect we are not done with this issue, but this hopefully helps (and doesn't cause any other issues).

It's also possible that we might have to go back to using PIO_ROOT=1 for memory concerns. I currently still have it set to 0, but have been running into memory issues.

@ndkeen
Contributor Author

ndkeen commented Jun 19, 2019

Noting that this issue & PR also had an impact.
#2496

@ndkeen ndkeen closed this as completed Jul 18, 2023