scalability issue in MCT #917
Current code looks like:
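(Sketched here from the description in this thread rather than quoted from m_GlobalSegMap.F90; pe_loc, the owning process of each segment, is an assumed name.)

```fortran
! Sketch of the existing O(ngseg*nprocs) scan: for every segment, search
! the list of PEs already seen before deciding whether to add a new one.
program quadratic_active_pes_sketch
  implicit none
  integer, parameter :: ngseg = 8
  integer :: pe_loc(ngseg) = (/ 0, 1, 0, 2, 1, 3, 2, 0 /)  ! owning PE of each segment
  integer :: temp_list(ngseg)   ! distinct PEs found so far
  integer :: count, n, i
  logical :: new_pe

  count = 0
  do n = 1, ngseg
     new_pe = .true.
     do i = 1, count              ! linear search over PEs already found
        if (pe_loc(n) == temp_list(i)) new_pe = .false.
     end do
     if (new_pe) then             ! first segment owned by this PE
        count = count + 1
        temp_list(count) = pe_loc(n)
     end if
  end do

  print *, 'number of active PEs =', count   ! prints 4 for this toy pe_loc
end program quadratic_active_pes_sketch
```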
For oRRS15to5 and 65536 processes, ngseg = 5778041 for sea ice (and probably the ocean?). This causes initialization to take over an hour on Titan. (For 32768 processes it takes 30 minutes.) Other parts of initialization are also expensive, but this is the one location that is scaling the worst.
My current proposed solution, still to be tested, is
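(Again a sketch with assumed names, not the exact patch: the inner search is replaced with a flag array indexed by PE rank, so the loop becomes a single pass over the segments.)

```fortran
! Sketch of the proposed linear scan: temp_list becomes a flag per PE rank.
! It is dimensioned by ngseg only because the GSMap does not carry the
! communicator size (see below), so this assumes every pe_loc value < ngseg.
program linear_active_pes_sketch
  implicit none
  integer, parameter :: ngseg = 8
  integer :: pe_loc(ngseg) = (/ 0, 1, 0, 2, 1, 3, 2, 0 /)  ! owning PE of each segment
  integer :: temp_list(0:ngseg-1)  ! -1 = rank not yet seen
  integer :: count, n

  temp_list = -1
  count = 0
  do n = 1, ngseg
     if (temp_list(pe_loc(n)) == -1) then   ! first segment owned by this rank
        count = count + 1
        temp_list(pe_loc(n)) = 1
     end if
  end do

  print *, 'number of active PEs =', count   ! prints 4, same as the quadratic version
end program linear_active_pes_sketch
```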
with a corresponding modification for determining the pe_list (an optional parameter). Note that temp_list is subsequently used to calculate the pe_list, but not for anything else. The above changes the logic to be linear in ngseg instead of O(ngseg*nprocs).
Would also be nice if GlobalSegMap included the size of the MPI communicator, so that temp_list could be dimensioned appropriately instead of as ngseg. This could be calculated locally in active_pes_, but for some reason it is not, and I am not familiar enough with MCT to understand the dependencies and expectations for any of this code.
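For illustration only, assuming a communicator were available to active_pes_ (it is not part of the current interface, so the comm argument below is hypothetical):

```fortran
! Hypothetical: if the communicator (or its size) were available here,
! the flag array could be sized by the number of ranks instead of ngseg.
subroutine alloc_pe_flags(comm, temp_list)
  use mpi
  implicit none
  integer, intent(in) :: comm
  integer, pointer    :: temp_list(:)
  integer :: nprocs, ierr

  call MPI_Comm_size(comm, nprocs, ierr)   ! number of ranks in the communicator
  allocate(temp_list(0:nprocs-1))          ! one flag per possible pe_loc value
  temp_list = -1                           ! -1 = rank not yet seen
end subroutine alloc_pe_flags
```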
Hey Pat, I believe there is a typo in the F90 filename in the initial post. In case anyone is interested, m_GlobalSegMap.F90 is the link to the relevant code snippet that Pat referred to above.
Thanks. Fixed.
Thanks Pat. I'll take a look at this.
Pat, after the do-loop, the original algorithm will produce valid values for determining the pe_list.
I thought that is what Pat meant by "with a corresponding modification for determining the pe_list".
I'll let @rljacob decide, but the rest of my fix would be
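(Sketched below as a complete routine combining both pieces, with assumed names and interface rather than the actual patch; scanning the flag array in rank order yields pe_list already sorted.)

```fortran
module active_pes_sketch_mod
  implicit none
contains
  ! Sketch only: counts the distinct PE ranks in pe_loc and, optionally,
  ! returns them as a sorted list.  Assumes every pe_loc value < size(pe_loc);
  ! carrying the communicator size in the GSMap would remove that assumption.
  subroutine active_pes_sketch(pe_loc, count, pe_list)
    integer, intent(in)  :: pe_loc(:)          ! owning PE of each segment
    integer, intent(out) :: count              ! number of distinct active PEs
    integer, pointer, optional :: pe_list(:)   ! sorted ranks of those PEs
    integer :: temp_list(0:size(pe_loc)-1)     ! flag per rank; -1 = not seen
    integer :: n, i

    temp_list = -1
    count = 0
    do n = 1, size(pe_loc)                     ! single linear pass over segments
       if (temp_list(pe_loc(n)) == -1) then
          count = count + 1
          temp_list(pe_loc(n)) = 1
       end if
    end do

    if (present(pe_list)) then                 ! optional output
       allocate(pe_list(count))
       i = 0
       do n = 0, ubound(temp_list, 1)          ! rank order, so already sorted
          if (temp_list(n) /= -1) then
             i = i + 1
             pe_list(i) = n
          end if
       end do
    end if
  end subroutine active_pes_sketch
end module active_pes_sketch_mod
```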
Sorry for not reading the entire comment. :(
No problem. I rarely make it to the bottom of e-mails and the like (so be forewarned :-) ). The above logic eliminates the need for the sorting logic in the current code, which may also be a performance improvement, though not guaranteed.
Update: Ran a (32768 processes / 2 threads per process) job and ngseg was approximately the same for both ocean and sea ice: NGSEG = 5777939, so this appears to be the source of the problem. Note that active_pes_ is called twice each for ocean and sea ice. Calculating this once and saving it with the GSMap (assuming that both calls are for the same GSMap) would be more efficient as well.
How many horizontal grid points do the ocean and sea ice have again at oRRS15to5? |
(for ocean)
(and for sea ice)
So 5778136 gridpoints and 5777939 segments. That's almost 1-to-1, which means there are very few contiguously numbered runs of gridpoints on a processor. The algorithm needs to be better, but some attention paid to the decomposition and grid numbering could lower that ngseg number.
@douglasjacobsen should speak up, but I read this as 5778136 cells but 11631265 gridpoints (so not quite as bad as 1-to-1). I believe that the MPAS development team has been looking at better partitioning strategies; I guess they should add this to their list of metrics?
True, I'm not sure if they're decomposing by cell or by vertex. Vertex would make that a little better. The lowest possible value is ngseg = n_mpiprocs; the worst possible value is ngseg = n_gridpoints.
@worleyph and @rljacob This is the issue I brought up during the all-hands SE / Perf speed dating session. I don't know if that conversation is documented anywhere, but here's a summary: MPAS meshes can be ordered arbitrarily, which could easily cause scaling issues during initialization while trying to determine how to communicate with other portions of the mesh. I don't know how the coupler does this currently, but it's possible that it's overly expensive for an MPAS mesh.

Based on this conversation, I'd guess (without really understanding the code you guys are talking about) that this is similar to what's happening. MPAS meshes are also decomposed using nCells, so as you guys mentioned it's close to the worst case (1 cell / contiguous segment). Changing the decomposition wouldn't change this, but changing the numbering of cells would. This change is relatively easy, but in the past we've run into issues that were hidden by having grids ordered this way (changing them to "random" ordering reveals alternative issues). Furthermore, the ordering as far as the coupler goes is somewhat arbitrary.

So, it's possible that with slight modifications within ocn_comp_mct.F, where we index the mesh (and maybe ordering the scrip file in a specific way), we could remove the issue without actually having to change MPAS mesh files at all. I'll spend some time seeing if I can come up with a modification on the MPAS side this next week, but it might be good to use something like Pat's proposed fix, after it's tested, just to help remove scalability issues within MCT.
@douglasjacobsen, thanks. I am hoping that my proposal (swapping a quadratic-complexity algorithm for a linear-complexity one) will solve the immediate problem. However, there are also data structures and MPI communications on these data structures (e.g. BCAST of vectors of length ngseg) that add memory and communication overhead, which will be decreased by minimizing the size of ngseg. So anything you can do will be useful.
My first run, using 32768 processes and 2 threads per process, worked scarily well. Total initialization cost decreased from 2019 seconds to 935 seconds, and the cost of the first loop in active_pes_ (described in comments #2 and #3) decreased from 954 seconds to 0.1 seconds (summed over all calls), which is 2-4 times smaller than the complexity decrease alone would have indicated. Perhaps the compiler likes the simpler code; memory accesses are more local as well.

I do not think that any of these calls use the pe_list optional parameter, so this does not test the second part of my changes. In this context, the only output from the routine is "count". I printed this out for process 0, and it is the same (32768 here) for both the old and new versions. Note however that I was not able to see whether the simulation output was the same or not - I only ran for 1 day, and there was no numerical diagnostic output to compare that I can see (for this GMPAS case). It did run, however, and the timing of the run loop was essentially the same for the two runs (748 seconds vs. 752 seconds).

My 65536x1 job is still in the queue. If it shows the same positive behavior, I'd like to see this modification put into a pull request (and tested more rigorously). It probably can't wait for the next CIME merge, as this impacts the high-resolution model runs being tested now. Given how cheap the new version is, further optimizations, like computing this only once and storing it in the GSMap data structure, can wait until the next CIME merge, especially since that change would be more intrusive. (I may also be completely misunderstanding what is going on, and perhaps these calls to active_pes_ can't be eliminated by precomputation.)
@worleyph please make a PR to https://github.com/MCSclimate/MCT and I'll do testing there and then bring it into ACME.
@rljacob, so my solution is an appropriate one? I didn't spend a lot of time on it, and this is code that I am not familiar with.
FYI. |
I believe that this optimization is now in the model. |
I identified a scalability issue in active_pes_ in m_GlobalSegMap.F90 that severely impacts MPAS-O and MPAS-CICE when using oRRS15to5 and 65536 or more processes. I developed a workaround (still to be tested), but MCT is part of CIME, and this issue and fix need to be owned by someone else. (My fix also probably does not go far enough - in my opinion the information being computed here should be computed in the initX_ routines and added to the GlobalSegMap type, but that is a more intrusive modification.)
I'll add the problem in the next comment, and my proposed fix (not yet tested) in a subsequent comment.