scalability issue in MCT #917

Closed
worleyph opened this issue Jun 17, 2016 · 25 comments

@worleyph
Contributor

worleyph commented Jun 17, 2016

I identified a scalability issue in active_pes_ in m_GlobalSegMap.F90 that severely impacts MPAS-O and MPAS-CICE when using oRRS15to5 and 65536 or more processes. I developed a workaround (still to be tested), but MCT is part of CIME, and this issue and fix need to be owned by someone else. (My fix also probably does not go far enough - in my opinion, the computed information should instead be generated in the initX_ routines and added to the GlobalSegMap type, but that is a more intrusive modification.)

I'll add the problem in the next comment, and my proposed fix (not yet tested) in a subsequent comment.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

Current code looks like:

   allocate(temp_list(ngseg), stat=ierr)
 ...
   do n=1,ngseg
      if(GSMap%pe_loc(n) >= 0) then ! a legitimate pe_location
         new = .true.
         do i=1, count
            if(GSMap%pe_loc(n) == temp_list(i)) new = .false.
         end do
         if(new) then
            count = count + 1
            temp_list(count) = GSMap%pe_loc(n)
         endif
      else
        (error abort)
      endif
   end do

For oRRS15to5 and 65536 processes, for sea ice (and probably the ocean as well?),

ngseg = 5778041
count = 65536

so the nested loop performs on the order of ngseg*count (roughly 4e11) equality tests per call. This causes initialization to take over an hour on Titan. (For 32768 processes it takes 30 minutes.) Other parts of initialization are also expensive, but this is the one location that scales the worst.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

My current proposed solution, still to be tested, is

   allocate(temp_list(0:ngseg), stat=ierr)
 ...
   do n=1,ngseg
      if(GSMap%pe_loc(n) >= 0) then ! a legitimate pe_location
         if (temp_list(GSMap%pe_loc(n)) == -1) then
            temp_list(GSMap%pe_loc(n)) = 1
            count = count + 1
         endif
      else
        ...
      endif
   enddo

with a corresponding modification for determining the pe_list (an optional parameter). Note that temp_list is subsequently used to calculate the pe_list, but not for anything else.

The above changes the logic to be linear in ngseg instead of O(ngseg*nprocs).
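For anyone who wants to see the two strategies side by side, here is a minimal, self-contained sketch (the sizes and the synthetic pe_loc data are made up for illustration, not taken from the runs above) contrasting the original membership-test loop with the proposed flag-array scan:

   program compare_active_pes_scans
      implicit none
      integer, parameter :: nseg = 200000   ! illustrative segment count (made up)
      integer, parameter :: npes = 1024     ! illustrative process count (made up)
      integer, allocatable :: pe_loc(:), temp_list(:)
      integer :: n, i, count
      logical :: new
      real    :: r

      ! synthetic data: each segment assigned to a random rank in 0..npes-1
      allocate(pe_loc(nseg))
      do n = 1, nseg
         call random_number(r)
         pe_loc(n) = int(r * npes)
      end do

      ! original approach: a linear search of temp_list per segment, O(nseg*count)
      allocate(temp_list(nseg))
      count = 0
      do n = 1, nseg
         new = .true.
         do i = 1, count
            if (pe_loc(n) == temp_list(i)) new = .false.
         end do
         if (new) then
            count = count + 1
            temp_list(count) = pe_loc(n)
         end if
      end do
      print *, 'quadratic scan: count =', count
      deallocate(temp_list)

      ! proposed approach: a flag array indexed by rank, O(nseg)
      ! (in active_pes_ the flag array is dimensioned 0:ngseg because the
      !  communicator size is not at hand there; see the next comment)
      allocate(temp_list(0:npes-1))
      temp_list = -1
      count = 0
      do n = 1, nseg
         if (temp_list(pe_loc(n)) == -1) then
            temp_list(pe_loc(n)) = 1
            count = count + 1
         end if
      end do
      print *, 'linear scan:    count =', count
   end program compare_active_pes_scans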

@worleyph
Contributor Author

It would also be nice if GlobalSegMap included the size of the MPI communicator, so that temp_list could be dimensioned by the number of processes instead of by ngseg. This could be calculated locally in active_pes_, but for some reason it is not, and I am not familiar enough with MCT to understand the dependencies and expectations for any of this code.
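If the communicator (or its size) were available here, sizing the scratch array by the number of ranks would be a small change. A hedged sketch of just that piece; the comm argument is an assumption about what would have to be passed in or stored with the GlobalSegMap, not something active_pes_ currently receives:

   subroutine alloc_pe_flags(comm, temp_list)
      use mpi
      implicit none
      integer, intent(in) :: comm                   ! hypothetical: passed in or stored with the GSMap
      integer, allocatable, intent(out) :: temp_list(:)
      integer :: npes, ierr

      call MPI_Comm_size(comm, npes, ierr)          ! number of ranks in the communicator
      allocate(temp_list(0:npes-1), stat=ierr)      ! one flag per rank instead of one per segment
      temp_list = -1                                ! -1 = rank not yet seen
   end subroutine alloc_pe_flags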

@bishtgautam
Contributor

Hey Pat, I believe there is a typo in the F90 filename in the initial post.

In case anyone is interested, m_GlobalSegMap.F90 is a link to the relevant code snippet that Pat referred to above.

@worleyph
Contributor Author

Thanks. Fixed.

@rljacob
Member

rljacob commented Jun 17, 2016

Thanks Pat. I'll take a look at this.

@bishtgautam
Contributor

Pat, after the do-loop, the original algorithm will have produced valid values in temp_list(1:count), but that won't be the case with your solution. Here is another solution:

   allocate(temp_list(0:ngseg), stat=ierr)
   allocate(temp_list2(0:ngseg), stat=ierr)
 ...
   do n=1,ngseg
      if(GSMap%pe_loc(n) >= 0) then ! a legitimate pe_location
         if (temp_list(GSMap%pe_loc(n)) == -1) then
            temp_list(GSMap%pe_loc(n)) = 1
            count = count + 1
            temp_list2(count) = GSMap%pe_loc(n)
         endif
      else
        ...
      endif
   enddo

  temp_list(:) = temp_list2(:)

@jayeshkrishna
Contributor

I thought that is what Pat meant by "with a corresponding modification for determining the pe_list"

@worleyph
Contributor Author

I'll let @rljacob decide, but the rest of my fix would be

   if(present(pe_list)) then
      allocate(pe_list(count), perm(count), stat=ierr)
      if (ierr /= 0) then
         call die(myname_,'allocate(pe_list...',ierr)
      endif

      i = 1
      do n=0,ngseg   ! temp_list is indexed by PE rank, which starts at 0
         if (temp_list(n) == 1) then
            pe_list(i) = n
            i = i+1
            if (i> count) exit
         endif
      enddo
   endif

@bishtgautam
Contributor

Sorry for not reading the entire comment. :(

@worleyph
Contributor Author

No problem. I rarely make it to the bottom of e-mails and the like (so be forewarned :-) ).

The above logic eliminates the need for the sorting logic in the current code, which may also be a performance improvement, though not guaranteed.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

Update: Ran a (32768 process / 2 threads per process) job and ngseg was approximately the same, for both ocean and sea ice:

NGSEG = 5777939

so this appears to be the source of the problem. Note that active_pes_ is called twice each for ocean and sea ice. Calculating this once and saving it with the GSMap (assuming that both calls are for the same GSMap) would be more efficient as well.

@rljacob
Member

rljacob commented Jun 17, 2016

How many horizontal grid points do the ocean and sea ice have again at oRRS15to5?

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

(for ocean)

  ----- reading dimensions from stream 'mesh' using file /lustre/atlas1/cli900/world-shared/cesm/inputdata/ocn/mpas-o/oRRS15to5/ocean.RRS.15-5km.151209.nc
                      nCells = 5778136
                      nEdges =17411251
                   nVertices =11631265
                         TWO =       2
                    maxEdges =       7
                   maxEdges2 =      14
                vertexDegree =       3
                 nVertLevels =     100

@worleyph
Contributor Author

(and for sea ice)

  ----- reading dimensions from stream 'input' using file /lustre/atlas1/cli900/world-shared/cesm/inputdata/ice/mpas-cice/oRRS15to5/seaice.RRS.15to5km.151209.nc
                      nCells = 5778136
                      nEdges =17411251
                   nVertices =11631265
                         TWO =       2
                    maxEdges =       7
                   maxEdges2 =      14
                vertexDegree =       3

@rljacob
Member

rljacob commented Jun 17, 2016

So 5778136 gridpoints and 5777939 segments. That's almost 1-to-1, which means there are very few contiguously numbered runs of gridpoints on a processor that are longer than a single point. The algorithm needs to be better, but some attention to the decomposition and grid numbering could also lower that ngseg number.

@worleyph
Contributor Author

worleyph commented Jun 17, 2016

@douglasjacobsen should speak up, but I read this as 5778136 cells and 11631265 gridpoints (so not quite as bad as 1-to-1). I believe that the MPAS development team has been looking at better partitioning strategies - I guess they should add this to their list of metrics?

@rljacob
Member

rljacob commented Jun 17, 2016

True, I'm not sure whether they're decomposing by cell or by vertex; vertex would make that a little better. The lowest possible value is ngseg = n_mpiprocs. The worst possible value is ngseg = n_gridpoints.
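To make those bounds concrete: for each process, ngseg counts the contiguously numbered runs among the global indices it owns, summed over processes. A small sketch, with a made-up index list, that counts the runs for one process:

   program count_segments
      implicit none
      integer, parameter :: npts = 8
      ! made-up global indices owned by one process, already sorted
      integer :: idx(npts) = (/ 4, 5, 6, 12, 13, 40, 41, 42 /)   ! 3 contiguous runs
      integer :: i, nruns

      nruns = 1
      do i = 2, npts
         if (idx(i) /= idx(i-1) + 1) nruns = nruns + 1   ! a gap starts a new segment
      end do
      print *, 'points =', npts, '  segments =', nruns
      ! best case:  every process owns one run  -> ngseg = n_mpiprocs
      ! worst case: every point is its own run  -> ngseg = n_gridpoints
   end program count_segments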

@douglasjacobsen
Member

@worleyph and @rljacob This is the issue I brought up during the all-hands SE / Perf speed dating session. I don't know if that conversation is documented anywhere, but here's a summary:

MPAS meshes can be ordered arbitrarily, which could easily cause scaling issues during initialization while trying to determine how to communicate with other portions of the mesh. I don't know how the coupler does this currently, but it's possible that it's overly expensive for an MPAS mesh.

Based on this conversation, I'd guess (without really understanding the code you guys are talking about) that this is similar to what's happening.

MPAS meshes are also decomposed using nCells, so, as you guys mentioned, it's close to the worst case (one cell per contiguous segment). Changing the decomposition wouldn't change this, but changing the numbering of cells would. That change is relatively easy, but in the past we've run into issues that were hidden by having grids ordered this way (changing them to a "random" ordering reveals other issues).

Furthermore, the ordering as far as the coupler is concerned is somewhat arbitrary. So it's possible that with slight modifications within ocn_comp_mct.F, where we index the mesh (and maybe by ordering the scrip file in a specific way), we could remove the issue without actually having to change the MPAS mesh files at all.

I'll spend some time seeing if I can come up with a modification on the MPAS side next week, but it might be good to use something like Pat's proposed fix, once it's tested, just to help remove scalability issues within MCT.

@worleyph
Contributor Author

@douglasjacobsen, thanks. I am hoping that my proposal (replacing a quadratic-complexity algorithm with a linear-complexity one) will solve the immediate problem. However, there are also data structures, and MPI communications on these data structures (e.g. broadcasts of vectors of length ngseg), that add memory and communication overhead that would be decreased by minimizing ngseg. So anything you can do will be useful.

@worleyph
Contributor Author

My first run, using 32768 processes and 2 threads per process, worked scarily well. Total initialization cost decreased from 2019 seconds to 935 seconds, and the cost of the first loop in active_pes_ (described in comments #2 and #3) decreased from 954 seconds to 0.1 seconds (summed over all calls), which is 2-4 times smaller than the complexity decrease alone would have indicated? Perhaps the compiler likes the simpler code, and memory accesses are more local as well, ...

I do not think that any of these calls use the pe_list optional parameter, so this does not test the second part of my changes. In this context, the only output from the routine is "count". I printed this out for process 0, and it is the same for both the old and new versions (32768 here).

Note however that I was not able to verify whether the simulation output was the same or not - I only ran for 1 day, and there was no numerical diagnostic output to compare that I could see (for this GMPAS case). However, it did run, and the timing of the run loop was essentially the same for the two runs (748 seconds vs. 752 seconds).

My 65536x1 job is still in the queue. If this shows the same positive behavior, I'd like to see this modification put into a pull request (and tested more rigorously). It probably can't wait for the next CIME merge, as this impacts testing of the high resolution model runs going on now.

Given how cheap the new version is, further optimizations like computing this only once and putting it into the GSMap data structure can wait until the next CIME merge, especially since this change would be more intrusive. (I may also be completely misunderstanding what is going on, and perhaps these calls to active_pes_ can't be eliminated by precomputation.)
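For what "compute once and reuse" might look like, here is a purely hypothetical sketch; the cache module and all of its names are invented for illustration and do not exist in MCT. As Pat says, the real change would instead store the result with the GlobalSegMap (or compute it in the initX_ routines):

   module gsmap_active_cache
      ! Hypothetical illustration only: none of these names exist in MCT.
      implicit none
      private
      public :: get_active_count

      logical, save :: cache_valid  = .false.
      integer, save :: cached_count = -1   ! a real version would key the cache on the GSMap itself

   contains

      function get_active_count(pe_loc) result(count)
         integer, intent(in) :: pe_loc(:)   ! stand-in for GSMap%pe_loc; assumed non-empty, ranks >= 0
         integer :: count
         integer, allocatable :: seen(:)
         integer :: n

         if (cache_valid) then              ! later calls reuse the first result
            count = cached_count
            return
         end if

         allocate(seen(0:maxval(pe_loc)))   ! flag array indexed by rank
         seen = -1
         count = 0
         do n = 1, size(pe_loc)
            if (seen(pe_loc(n)) == -1) then
               seen(pe_loc(n)) = 1
               count = count + 1
            end if
         end do

         cached_count = count
         cache_valid  = .true.
      end function get_active_count

   end module gsmap_active_cache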

@rljacob
Member

rljacob commented Jun 18, 2016

@worleyph please make a PR to https://github.com/MCSclimate/MCT and I'll do testing there and then bring it in to ACME.

@worleyph
Contributor Author

@rljacob , so my solution is an appropriate one? I didn't spend a lot of time on it, and this is code that I am not familiar with.
I'll put in my changes as soon as I can, but I may not be able to get to this until next week - I have to provide some content for a proposal due this week.

@worleyph
Contributor Author

@rljacob:

 > git push origin worleyph/active_pes_perf_fix
 ERROR: Permission to MCSclimate/MCT.git denied to worleyph.
 fatal: Could not read from remote repository.

 Please make sure you have the correct access rights
 and the repository exists.

FYI.

@worleyph
Contributor Author

worleyph commented Jul 4, 2017

I believe that this optimization is now in the model.

@worleyph worleyph closed this as completed Jul 4, 2017