scalability issue in MCT #917

Closed
worleyph opened this issue Jun 17, 2016 · 25 comments

@worleyph
Contributor

worleyph commented Jun 17, 2016

I identified a scalability issue in active_pes_ in m_GlobalSegMap.F90 that severely impacts MPAS-O and MPAS-CICE when using oRRS15to5 and 65536 or more processes. I developed a workaround (still to be tested), but MCT is part of CIME, and this issue and fix need to be owned by someone else. (My fix also probably does not go far enough - in my opinion, the computed information should instead be generated in the initX_ routines and added to the GlobalSegMap type, but that is a more intrusive modification.)

I'll add the problem in the next comment, and my proposed fix (not yet tested) in a subsequent comment.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

Current code looks like:

   allocate(temp_list(ngseg), stat=ierr)
 ...
   do n=1,ngseg
      if(GSMap%pe_loc(n) >= 0) then ! a legitimate pe_location
         new = .true.
         do i=1, count
            if(GSMap%pe_loc(n) == temp_list(i)) new = .false.
         end do
         if(new) then
            count = count + 1
            temp_list(count) = GSMap%pe_loc(n)
         endif
      else
        (error abort)
      endif
   end do

For oRRS15to5 and 65536 processes, for sea ice (and probably the ocean as well?),

ngseg = 5778041
count = 65536

so the nested loop performs on the order of ngseg*count (roughly 4e11) equality tests per call. This causes initialization to take over an hour on Titan. (For 32768 processes it takes 30 minutes.) Other parts of initialization are also expensive, but this is the one location that scales the worst.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

My current proposed solution, still to be tested, is

   allocate(temp_list(0:ngseg), stat=ierr)
 ...
   do n=1,ngseg
      if(GSMap%pe_loc(n) >= 0) then ! a legitimate pe_location
         if (temp_list(GSMap%pe_loc(n)) == -1) then
            temp_list(GSMap%pe_loc(n)) = 1
            count = count + 1
         endif
      else
        ...
      endif
   enddo

with a corresponding modification for determining the pe_list (an optional parameter). Note that temp_list is subsequently used to calculate the pe_list, but not for anything else.

The above changes the logic to be linear in ngseg instead of O(ngseg*nprocs).
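For anyone who wants to see the two strategies side by side, here is a minimal, self-contained sketch (the sizes and the synthetic pe_loc data are made up for illustration, not taken from the runs above) contrasting the original membership-test loop with the proposed flag-array scan:

   program compare_active_pes_scans
      implicit none
      integer, parameter :: nseg = 200000   ! illustrative segment count (made up)
      integer, parameter :: npes = 1024     ! illustrative process count (made up)
      integer, allocatable :: pe_loc(:), temp_list(:)
      integer :: n, i, count
      logical :: new
      real    :: r

      ! synthetic data: each segment assigned to a random rank in 0..npes-1
      allocate(pe_loc(nseg))
      do n = 1, nseg
         call random_number(r)
         pe_loc(n) = int(r * npes)
      end do

      ! original approach: a linear search of temp_list per segment, O(nseg*count)
      allocate(temp_list(nseg))
      count = 0
      do n = 1, nseg
         new = .true.
         do i = 1, count
            if (pe_loc(n) == temp_list(i)) new = .false.
         end do
         if (new) then
            count = count + 1
            temp_list(count) = pe_loc(n)
         end if
      end do
      print *, 'quadratic scan: count =', count
      deallocate(temp_list)

      ! proposed approach: a flag array indexed by rank, O(nseg)
      ! (in active_pes_ the flag array is dimensioned 0:ngseg because the
      !  communicator size is not at hand there; see the next comment)
      allocate(temp_list(0:npes-1))
      temp_list = -1
      count = 0
      do n = 1, nseg
         if (temp_list(pe_loc(n)) == -1) then
            temp_list(pe_loc(n)) = 1
            count = count + 1
         end if
      end do
      print *, 'linear scan:    count =', count
   end program compare_active_pes_scans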

@worleyph
Contributor Author

It would also be nice if GlobalSegMap included the size of the MPI communicator, so that temp_list could be dimensioned by the number of processes instead of by ngseg. This could be calculated locally in active_pes_, but for some reason it is not, and I am not familiar enough with MCT to understand the dependencies and expectations for any of this code.
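If the communicator (or its size) were available here, sizing the scratch array by the number of ranks would be a small change. A hedged sketch of just that piece; the comm argument is an assumption about what would have to be passed in or stored with the GlobalSegMap, not something active_pes_ currently receives:

   subroutine alloc_pe_flags(comm, temp_list)
      use mpi
      implicit none
      integer, intent(in) :: comm                   ! hypothetical: passed in or stored with the GSMap
      integer, allocatable, intent(out) :: temp_list(:)
      integer :: npes, ierr

      call MPI_Comm_size(comm, npes, ierr)          ! number of ranks in the communicator
      allocate(temp_list(0:npes-1), stat=ierr)      ! one flag per rank instead of one per segment
      temp_list = -1                                ! -1 = rank not yet seen
   end subroutine alloc_pe_flags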

@bishtgautam
Contributor

Hey Pat, I believe there is a typo in the F90 filename in the initial post.

In case anyone is interested, m_GlobalSegMap.F90 is a link to the relevant code snippet that Pat referred to above.

@worleyph
Contributor Author

Thanks. Fixed.

@rljacob
Member

rljacob commented Jun 17, 2016

Thanks Pat. I'll take a look at this.

@bishtgautam
Contributor

Pat, after the do-loop, the original algorithm will have produced valid values in temp_list(1:count), but that won't be the case with your solution. Here is another solution:

   allocate(temp_list(0:ngseg), stat=ierr)
   allocate(temp_list2(0:ngseg), stat=ierr)
 ...
   do n=1,ngseg
      if(GSMap%pe_loc(n) >= 0) then ! a legitimate pe_location
         if (temp_list(GSMap%pe_loc(n)) == -1) then
            temp_list(GSMap%pe_loc(n)) = 1
            count = count + 1
            temp_list2(count) = GSMap%pe_loc(n)
         endif
      else
        ...
      endif
   enddo

  temp_list(:) = temp_list2(:)

@jayeshkrishna
Contributor

I thought that is what Pat meant by "with a corresponding modification for determining the pe_list"

@worleyph
Contributor Author

I'll let @rljacob decide, but the rest of my fix would be

   if(present(pe_list)) then
      allocate(pe_list(count), perm(count), stat=ierr)
      if (ierr /= 0) then
         call die(myname_,'allocate(pe_list...',ierr)
      endif

      i = 1
      do n=0,ngseg   ! temp_list is indexed by PE rank, which starts at 0
         if (temp_list(n) == 1) then
            pe_list(i) = n
            i = i+1
            if (i> count) exit
         endif
      enddo
   endif

@bishtgautam
Contributor

Sorry for not reading the entire comment. :(

@worleyph
Contributor Author

No problem. I rarely make it to the bottom of e-mails and the like (so be forewarned :-) ).

The above logic eliminates the need for the sorting logic in the current code, which may also be a performance improvement, though not guaranteed.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

Update: Ran a (32768 process / 2 threads per process) job and ngseg was approximately the same, for both ocean and sea ice:

NGSEG = 5777939

so this appears to be the source of the problem. Note that active_pes_ is called twice each for ocean and sea ice. Calculating this once and saving it with the GSMap (assuming that both calls are for the same GSMap) would be more efficient as well.

@rljacob
Member

rljacob commented Jun 17, 2016

How many horizontal grid points do the ocean and sea ice have again at oRRS15to5?

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

(for ocean)

  ----- reading dimensions from stream 'mesh' using file /lustre/atlas1/cli900/world-shared/cesm/inputdata/ocn/mpas-o/oRRS15to5/ocean.RRS.15-5km.151209.nc
                      nCells = 5778136
                      nEdges =17411251
                   nVertices =11631265
                         TWO =       2
                    maxEdges =       7
                   maxEdges2 =      14
                vertexDegree =       3
                 nVertLevels =     100

@worleyph
Contributor Author

(and for sea ice)

  ----- reading dimensions from stream 'input' using file /lustre/atlas1/cli900/world-shared/cesm/inputdata/ice/mpas-cice/oRRS15to5/seaice.RRS.15to5km.151209.nc
                      nCells = 5778136
                      nEdges =17411251
                   nVertices =11631265
                         TWO =       2
                    maxEdges =       7
                   maxEdges2 =      14
                vertexDegree =       3

@rljacob
Member

rljacob commented Jun 17, 2016

So 5778136 gridpoints and 5777939 segments. That's almost 1-to-1, which means there are very few contiguously numbered runs of gridpoints on a processor that are longer than a single point. The algorithm needs to be better, but some attention to the decomposition and grid numbering could also lower that ngseg number.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

@douglasjacobsen should speak up, but I read this as 5778136 cells and 11631265 gridpoints (so not quite as bad as 1-to-1). I believe that the MPAS development team has been looking at better partitioning strategies - I guess they should add this to their list of metrics?

@rljacob
Member

rljacob commented Jun 17, 2016

True, I'm not sure whether they're decomposing by cell or by vertex; vertex would make that a little better. The lowest possible value is ngseg = n_mpiprocs. The worst possible value is ngseg = n_gridpoints.
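To make those bounds concrete: for each process, ngseg counts the contiguously numbered runs among the global indices it owns, summed over processes. A small sketch, with a made-up index list, that counts the runs for one process:

   program count_segments
      implicit none
      integer, parameter :: npts = 8
      ! made-up global indices owned by one process, already sorted
      integer :: idx(npts) = (/ 4, 5, 6, 12, 13, 40, 41, 42 /)   ! 3 contiguous runs
      integer :: i, nruns

      nruns = 1
      do i = 2, npts
         if (idx(i) /= idx(i-1) + 1) nruns = nruns + 1   ! a gap starts a new segment
      end do
      print *, 'points =', npts, '  segments =', nruns
      ! best case:  every process owns one run  -> ngseg = n_mpiprocs
      ! worst case: every point is its own run  -> ngseg = n_gridpoints
   end program count_segments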

@douglasjacobsen
Member

@worleyph and @rljacob This is the issue I brought up during the all-hands SE / Perf speed dating session. I don't know if that conversation is documented anywhere, but here's a summary:

MPAS meshes can be ordered arbitrarily, which could easily cause scaling issues during initialization while trying to determine how to communicate with other portions of the mesh. I don't know how the coupler does this currently, but it's possible that it's overly expensive for an MPAS mesh.

Based on this conversation, I'd guess (without really understanding the code you guys are talking about) that this is similar to what's happening.

MPAS meshes are also decomposed using nCells, so, as you guys mentioned, it's close to the worst case (one cell per contiguous segment). Changing the decomposition wouldn't change this, but changing the numbering of cells would. That change is relatively easy, but in the past we've run into issues that were hidden by having grids ordered this way (changing them to a "random" ordering reveals other issues).

Furthermore, the ordering as far as the coupler is concerned is somewhat arbitrary. So it's possible that with slight modifications within ocn_comp_mct.F, where we index the mesh (and maybe by ordering the scrip file in a specific way), we could remove the issue without actually having to change the MPAS mesh files at all.

I'll spend some time seeing if I can come up with a modification on the MPAS side next week, but it might be good to use something like Pat's proposed fix, once it's tested, just to help remove scalability issues within MCT.

@worleyph
Contributor Author

@douglasjacobsen, thanks. I am hoping that my proposal (replacing a quadratic-complexity algorithm with a linear-complexity one) will solve the immediate problem. However, there are also data structures, and MPI communications on these data structures (e.g. broadcasts of vectors of length ngseg), that add memory and communication overhead that would be decreased by minimizing ngseg. So anything you can do will be useful.

@worleyph
Contributor Author

My first run, using 32768 processes and 2 threads per process, worked scarily well. Total initialization cost decreased from 2019 seconds to 935 seconds, and the cost of the first loop in active_pes_ (described in comments #2 and #3) decreased from 954 seconds to 0.1 seconds (summed over all calls), which is 2-4 times smaller than the complexity decrease alone would have indicated? Perhaps the compiler likes the simpler code, and memory accesses are more local as well, ...

I do not think that any of these calls use the pe_list optional parameter, so this does not test the second part of my changes. In this context, the only output from the routine is "count". I printed this out for process 0, and it is the same for both the old and new versions (32768 here).

Note however that I was not able to verify whether the simulation output was the same or not - I only ran for 1 day, and there was no numerical diagnostic output to compare that I could see (for this GMPAS case). However, it did run, and the timing of the run loop was essentially the same for the two runs (748 seconds vs. 752 seconds).

My 65536x1 job is still in the queue. If this shows the same positive behavior, I'd like to see this modification put into a pull request (and tested more rigorously). It probably can't wait for the next CIME merge, as this impacts testing of the high resolution model runs going on now.

Given how cheap the new version is, further optimizations like computing this only once and putting it into the GSMap data structure can wait until the next CIME merge, especially since this change would be more intrusive. (I may also be completely misunderstanding what is going on, and perhaps these calls to active_pes_ can't be eliminated by precomputation.)
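For what "compute once and reuse" might look like, here is a purely hypothetical sketch; the cache module and all of its names are invented for illustration and do not exist in MCT. As Pat says, the real change would instead store the result with the GlobalSegMap (or compute it in the initX_ routines):

   module gsmap_active_cache
      ! Hypothetical illustration only: none of these names exist in MCT.
      implicit none
      private
      public :: get_active_count

      logical, save :: cache_valid  = .false.
      integer, save :: cached_count = -1   ! a real version would key the cache on the GSMap itself

   contains

      function get_active_count(pe_loc) result(count)
         integer, intent(in) :: pe_loc(:)   ! stand-in for GSMap%pe_loc; assumed non-empty, ranks >= 0
         integer :: count
         integer, allocatable :: seen(:)
         integer :: n

         if (cache_valid) then              ! later calls reuse the first result
            count = cached_count
            return
         end if

         allocate(seen(0:maxval(pe_loc)))   ! flag array indexed by rank
         seen = -1
         count = 0
         do n = 1, size(pe_loc)
            if (seen(pe_loc(n)) == -1) then
               seen(pe_loc(n)) = 1
               count = count + 1
            end if
         end do

         cached_count = count
         cache_valid  = .true.
      end function get_active_count

   end module gsmap_active_cache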

@rljacob
Member

rljacob commented Jun 18, 2016

@worleyph please make a PR to https://github.com/MCSclimate/MCT and I'll do testing there and then bring it in to ACME.

@worleyph
Contributor Author

@rljacob , so my solution is an appropriate one? I didn't spend a lot of time on it, and this is code that I am not familiar with.
I'll put in my changes as soon as I can, but I may not be able to get to this until next week - I have to provide some content for a proposal due this week.

@worleyph
Contributor Author

@rljacob:

 > git push origin worleyph/active_pes_perf_fix
 ERROR: Permission to MCSclimate/MCT.git denied to worleyph.
 fatal: Could not read from remote repository.

 Please make sure you have the correct access rights
 and the repository exists.

FYI.

@worleyph
Contributor Author

worleyph commented Jul 4, 2017

I believe that this optimization is now in the model.

@worleyph worleyph closed this as completed Jul 4, 2017