
Improve default PE layouts on Edison #1241

Merged: 3 commits into master on Mar 2, 2017
Conversation

@PeterCaldwell (Contributor)

Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell.

@mt5555 (Contributor) commented Jan 26, 2017

Thanks for doing this! Since no good deed goes unpunished, would you also be willing to add pages on these configurations to the performance archive on Confluence?

@PeterCaldwell (Contributor, Author)

@mt5555 - yup, I'll make confluence pages for these layouts.

@PeterCaldwell (Contributor, Author)

Hi @ndkeen - Is this PR in a zombie state, or is someone working on it? Do you need anything from me?

@ndkeen (Contributor) commented Feb 1, 2017

Yes, I just haven't gotten to it. I'm certainly fine with this, but I assume I need to test it out myself? Or can I just merge it? I fired off the acme_developer tests; the ne30np4 tests should use this.

Ok, 2 of the tests are asking for 114 nodes, which suggests the change you wanted is in place. However, do we really want tests in acme_developer to ask for this many nodes?

All acme_developer jobs finished... except for the two that are asking for 114 nodes. Those are still in the queue.

Of those 2 runs that use the updated default layout (i.e., 114 nodes with ne30np4), one is DEBUG. That one failed with a floating overflow. I doubled the number of nodes and tried again -- this gives each MPI task more memory -- and that did not solve the problem.

/global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intel.20170201_151526

0685:  about to call aerosol_vars%restart:        78727          15
0685: forrtl: error (72): floating overflow
0685: Image              PC                Routine            Line        Source             
0685: acme.exe           000000000BE73DC1  Unknown               Unknown  Unknown
0685: acme.exe           000000000BE72517  Unknown               Unknown  Unknown
0685: acme.exe           000000000BD80A64  Unknown               Unknown  Unknown
0685: acme.exe           000000000BD80876  Unknown               Unknown  Unknown
0685: acme.exe           000000000BCFA4C4  Unknown               Unknown  Unknown
0685: acme.exe           000000000BD05BE9  Unknown               Unknown  Unknown
0685: acme.exe           000000000B6E9470  Unknown               Unknown  Unknown
0685: acme.exe           0000000003058DE7  clubb_intr_mp_clu        2456  clubb_intr.F90
0685: acme.exe           0000000000E67531  physpkg_mp_tphysb        2362  physpkg.F90
0685: acme.exe           0000000000E37DB4  physpkg_mp_phys_r        1008  physpkg.F90
0685: acme.exe           000000000BE0F603  Unknown               Unknown  Unknown

cross-referencing issue #1263

@PeterCaldwell (Contributor, Author)

Hi @ndkeen - I just did a 2-day FC5AV1C-L run on Edison using 114 nodes with DEBUG=TRUE and it completed just fine. The run is at /global/cscratch1/sd/petercal/ACME_simulations/ACME.FC5AV1C-L.ne30_ne30.edison.test3-M/. Could you try the failing test one more time?

@ndkeen (Contributor) commented Feb 3, 2017

Well, that's good that for your case there was nothing for the compiler to catch. I can see inside that directory, but I cannot look at any of the files. I would recommend that you change the run_acme* scripts (at least for NERSC) to use umask 022 instead of the current default.

I had already run it twice (once normally and once with 2x more nodes), but I just submitted it again.

I also started a build from scratch using create_newcase (as opposed to launching it via create_test). This run failed in the same way.

These should be the steps:

create_newcase --case /global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intelb --res ne30_oEC --mach edison --compiler intel --compset F1850C5AV1C-02  --walltime 0:45:00
cd /global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intelb
xmlchange -i env_build.xml DEBUG=TRUE
case.setup
case.build
case.submit

/global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intelb

@PeterCaldwell (Contributor, Author)

Ok, I've verified that the new F-compset PE layout causes the SMS_D test (a 1-step smoke test compiled in debug mode) to crash with a "floating invalid" message in CLUBB, and that the code passed that test before I changed its default PE layout. Note, however, that when I ran an FC5AV1C-L case with my new layout and DEBUG=TRUE outside of the test suite, it ran fine. This makes me think there's something wrong in the nuances of the test rather than with the usefulness of the layout. In any case, I will keep debugging this (perhaps CLUBB doesn't like hyperthreading?), but I have removed the problematic layout from my branch for the purposes of this PR.

The revised PR consists of modifications to the default, M, and L PE layouts for the WCYCL experiment. Since there are no tests of the WCYCL case in the devel test suite, there's no point in re-running the test suite on my branch. And since I'm baffled as to why the F-compset change would cause SMS_D to fail, I can't promise that my WCYCL layouts won't cause similar problems. However, the WCYCL layouts are less avant-garde (they use no threading at all) and we've been using them for all the WCYCL runs we've done over the last year... so I think they're probably safe.

@mt5555 (Contributor) commented Feb 3, 2017

@ndkeen should comment on this. He recently found that on some machines, to improve vectorization, calculations are done on data whose results will be masked out. In debug mode, we trap floating-point errors and stop execution. In non-debug mode, if this were a real error, you would eventually see NaNs in your output. So probably everything is ok.
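
To illustrate that failure mode: in a hypothetical (non-ACME) Fortran fragment like the one below, the full vector expression is evaluated before the mask is applied, so a DEBUG build that traps floating-point exceptions can abort on lanes whose results would be discarded anyway:

   program masked_vector_demo
      use, intrinsic :: iso_fortran_env, only: real64
      implicit none
      real(real64) :: a(8), b(8), c(8)
      a = 1.0_real64
      b = [1, 2, 3, 4, 0, 0, 0, 0]   ! zeros stand in for "inactive" lanes
      ! merge() evaluates a/b over ALL lanes before applying the mask,
      ! so the divides in the inactive lanes still happen: a DEBUG build
      ! that traps floating-point exceptions (e.g. Intel -fpe0) aborts
      ! here, while an optimized build computes Inf and discards it.
      c = merge(a/b, 0.0_real64, b /= 0.0_real64)
      print *, c
   end program masked_vector_demo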

@ndkeen (Contributor) commented Feb 5, 2017

Mark is referring to a different issue (#1183) where I ran into some floating-point issues on cori-knl for DEBUG F-compset runs (both within the testing framework and outside it). The problems were happening inside a LAPACK routine from MKL, and we eventually decided to simply turn off floating-point checking when going into the routine and turn it back on when it comes out.
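
A minimal sketch of that bracketing pattern using the standard Fortran IEEE modules (hypothetical demo code, not the actual ACME change):

   program fp_trap_bracket
      use, intrinsic :: ieee_exceptions
      implicit none
      logical :: halting(size(ieee_usual))
      real :: x, y
      y = -1.0
      ! Save the current halting modes, then stop trapping the usual
      ! exceptions (overflow, divide-by-zero, invalid) around the
      ! sensitive call, as described for the MKL workaround in #1183.
      call ieee_get_halting_mode(ieee_usual, halting)
      call ieee_set_halting_mode(ieee_usual, .false.)
      x = sqrt(y)   ! raises IEEE_INVALID but no longer halts
      call ieee_set_halting_mode(ieee_usual, halting)   ! restore on the way out
      print *, x    ! NaN
   end program fp_trap_bracket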

Additionally, there is another issue (#1231) where I am trying to turn on floating-point checking for GNU builds. I found a test that failed there, but it ran OK with more memory, so it seems the additional checking simply pushed the memory use just above the limit.

However, in this case it's not stopping in LAPACK, and it's not affected by giving each MPI task more memory, so it still seems like it could be a real issue. Also, this is on edison with v15 of Intel. One thing I have not tried is running this same test on a more recent master (I assumed this was already fairly recent).

Here is the place in the code where it stops:

   call outfld( 'CLOUDFRAC_CLUBB',  alst,                    pcols, lchnk )
   call outfld( 'RCMINLAYER_CLUBB', rcm_in_layer*1000._r8,   pcols, lchnk )  ! <-- this line
   call outfld( 'CLOUDCOVER_CLUBB', cloud_frac,              pcols, lchnk )

I can think of a few other things to test to narrow this down. Is it complaining about rcm_in_layer itself or about the multiplication by 1000? Is it really an overflow, or were the values "messed up" before? If I simply comment out this line, will it continue running? (Or should it write 0's, if this is indeed sending data to the output file and a place-holder is expected?)

@gold2718 commented Feb 5, 2017

One thing I can think of is that the indicated line forces creation of a temporary array.
Could you try breaking this into two lines to see what happens?

rcm_in_layer = rcm_in_layer*1000._r8
call outfld( 'RCMINLAYER_CLUBB', rcm_in_layer,   pcols, lchnk )

(rcm_in_layer is not used after the outfld call so it is okay to change it here)

@PeterCaldwell (Contributor, Author)

Hi @ndkeen - I'd like the current version of my branch merged with master ASAP (so it can be used for new coupled tests we want to run next week). As currently implemented, this branch just makes the PE layouts the coupled team uses anyway into the defaults. Since there aren't any devel-suite tests of the coupled model, these changes will pass the devel suite. I've removed the 114 node F-compset layout from this PR - I'll make that more controversial change as a separate PR (after figuring out what's going on). Is that ok?

@ndkeen (Contributor) commented Feb 7, 2017

Yes, sounds good. I didn't get to it this weekend, and the NERSC machines (where my repos are) have been down all day yesterday and today. But it's good to know you just want to merge it ASAP, without the 114 node change. I was still planning on creating a github issue, as it may point to something we should fix and is relatively easy to reproduce.

@PeterCaldwell (Contributor, Author)

I like the idea of creating a github issue regarding the 114 node F-compset. Thanks for doing that! Thanks also (in advance) for merging the WCYCL PE layouts.

@ndkeen (Contributor) left a review comment

Looks fine. I can't currently run the acme_developer tests; however, these changes should not affect any of them.

ndkeen added a commit that referenced this pull request Feb 8, 2017
Replaces existing generic, poor-performing ne30 PE layouts on Edison
with better values. 173 node and 375 node A_WCYCL1850 layouts were
created by @worleyph a year or so ago.
@ndkeen (Contributor) commented Feb 8, 2017

I merged to next.

@ndkeen (Contributor) commented Feb 9, 2017

@gold2718: I have been meaning to try your suggestion. I ran into another similar floating-point issue at an outfld() call where there was another temporary array creation - also in the acme_developer tests, but stopping in a different F90 file. I tried out your suggestion there (doing the math before calling outfld); however, it failed in the same way at the math step:

!call outfld('NIMIX_IMM', niimm_bc+niimm_dst, pcols, lchnk) !<-- floating invalid _was_ here 
niimm_bc=niimm_bc+niimm_dst  !<-- floating invalid now here
call outfld('NIMIX_IMM', niimm_bc, pcols, lchnk)

Looking further, I see that these arrays are initialized to zero, but only up to ncol. So is it possible that there are some uninitialized values sitting above ncol that, when added, don't taste so great?

For example, I am trying:

!   nnuccc_dst(:ncol,:) = 0._r8   
  nnuccc_dst = 0._r8   

for all of the arrays in that section, and the run completed.
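
The pitfall reproduces outside the model. Here is a standalone sketch, with made-up array names and sizes, assuming a DEBUG build with floating-point trapping enabled (e.g. Intel -fpe0):

   program ncol_pcols_demo
      implicit none
      integer, parameter :: r8 = selected_real_kind(12)
      integer, parameter :: pcols = 16, pver = 72, ncol = 10
      real(r8) :: bc(pcols,pver), dst(pcols,pver), tmp(pcols,pver)
      bc(:ncol,:)  = 0._r8   ! only the active columns are zeroed,
      dst(:ncol,:) = 0._r8   ! as in the model code above
      ! The expression bc + dst is evaluated over the FULL pcols extent,
      ! so columns ncol+1:pcols contribute uninitialized values. With
      ! trapping enabled, any signaling garbage there aborts the run --
      ! whether it actually fires depends on what happens to be in memory.
      tmp = bc + dst
      print *, tmp(1,1)
   end program ncol_pcols_demo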

@gold2718 commented Feb 9, 2017

@ndkeen, you are correct in that not all chunks have ncol == pcols, so there are unused items at the top. In those cases, you can either fix the initialization (as you did in your example) or modify the outfld call, for example:
call outfld('NIMIX_IMM', niimm_bc(:ncol)+niimm_dst(:ncol), ncol, lchnk)
which is slightly more efficient (in that we never touch the unused columns).

@ndkeen (Contributor) commented Feb 10, 2017

Hmmm. I made the change, but it still stops on the same line.

   call outfld('NIMIX_IMM', niimm_bc(:pcols,:)+niimm_dst(:pcols,:), pcols, lchnk)  !ndk still stops here
   call outfld('NIMIX_CNT', nicnt_bc(:pcols,:)+nicnt_dst(:pcols,:), pcols, lchnk)
   call outfld('NIMIX_DEP', nidep_bc(:pcols,:)+nidep_dst(:pcols,:), pcols, lchnk)

The variables are only zero'd up to :ncol (as before). Is it possible pcols is larger than ncol and the values between ncol and pcols are not initialized?

   niimm_bc(:ncol,:) = 0._r8
   nicnt_bc(:ncol,:) = 0._r8
   nidep_bc(:ncol,:) = 0._r8

@gold2718

That's exactly what I said, which is why you have to use foo(:ncol), not foo(:pcols) (see my note above).

@ndkeen (Contributor) commented Feb 10, 2017

Oh, you meant to change the 3rd arg to outfld as well - I thought that was just an example. Also note that your example only lists one dimension, and these arrays have 2.

Also, this seems like a somewhat more serious change: all of the other calls to outfld are made with pcols. Since we've gotten off the original issue, I will make another issue that's more specific. But if we fix it (in all places), it could also fix the same issue that I saw with the 114-node layout.

#1263
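
For reference, a 2-D analogue of the slice-based call suggested above would presumably look like the line below -- a sketch that assumes outfld's third argument is the leading dimension of the field as actually passed:

   ! Hypothetical 2-D form of the suggested fix: pass only the active
   ! columns and report the slice's leading dimension (ncol).
   call outfld('NIMIX_IMM', niimm_bc(:ncol,:) + niimm_dst(:ncol,:), ncol, lchnk)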

@PeterCaldwell (Contributor, Author)

I noticed this went to next 15 days ago but still isn't on master... is this PR stalled out again?

@ndkeen (Contributor) commented Feb 27, 2017

I was waiting for master to be open for updates.

@ndkeen merged commit a3230b4 into master on Mar 2, 2017
ndkeen added a commit that referenced this pull request Mar 2, 2017
Replaces existing generic, poor-performing ne30 PE layouts on Edison
with better values. 173 node and 375 node A_WCYCL1850 layouts were
created by @worleyph a year or so ago and the 114 node F-compset
layout was created by @PeterCaldwell.

[BFB]
agsalin pushed a commit that referenced this pull request Apr 13, 2017
agsalin pushed a commit that referenced this pull request Apr 13, 2017
Mapping/update runoff to ocn

Updates to the runoff_to_ocean generator tool, with three goals in mind:

1. There was a bug in the ocean -> ocean smoothing

2. With the introduction of the estuary box model (EBM) in POP, we need a tool that merges the nearest neighbor map with the smooth map so that the result is a nearest neighbor map in open ocean but a smooth map in the marginal seas

3. The introduction of the EBM also added new requirements to the tool itself, namely the ability to map from runoff to coastal ocean points in the nearest neighbor map and then smooth from the coastal ocean points to the global ocean in the smooth map

Test suite: N/A
Test baseline: N/A
Test namelist changes: N/A
Test status: N/A

Fixes #1241 and #1243

User interface changes: the runoff map namelists are slightly different (though this code is backwards compatible with the old namelist)

Code review: @klindsay28 and I looked at how I implemented his NCL script in the code, and @kauff and I looked at the Fortran modifications.
@PeterCaldwell deleted the PeterCaldwell/cime/pe_layout1 branch on July 25, 2017 20:49
ndkeen added a commit that referenced this pull request Jul 28, 2017
…ext (PR #1637)

Currently, unless you explicitly specify a custom PE layout on Edison, F-compset runs will default to 192 tasks + 4 threads (=32 total nodes), yielding <2 SYPD. This PR specifies a reasonable default for ne30 F compsets.

Note that what is proposed here isn't an optimal layout. Hyperthreading can increase SYPD by 1 or more, but it isn't turned on by default anymore. This layout was chosen simply because it works and isn't horrible; it just allows newbies, or people who forget to specify the PE layout, to get something that isn't totally useless. Efforts to find a faster PE layout would still be worthwhile.

Note that this is a second attempt to do this. #1241 also proposed a similar F PE layout but failed to pass tests for unrelated issues documented here: #1263

(measured 10.9 sypd after edison upgrade)
jgfouca pushed a commit that referenced this pull request Jan 8, 2018
0f241db response to comments
1007a7a cannot predetermin ndims here
99ef07d Merge pull request #1241 from NCAR/free_new_allocs
29ed162 free recently allocated vars
fbc3584 Merge pull request #1239 from NCAR/dontuse_nc_max
63dee3d Merge pull request #1240 from NCAR/limitto2GiB
64f2492 limit to 2GiB due to romio bug
29aee05 dont use NC_MAX values
d831ad3 Merge pull request #1231 from mgduda/mpi_type_fix
e996bdb Merge pull request #1222 from NCAR/ejh_autoconf_logging
426af22 Partial fix for incorrect type of 'mpi_type' in pioc_support.c
7eb724f added enable-logging option to autotools build

git-subtree-dir: src/externals/pio2
git-subtree-split: 0f241db88cfee1912a2769a052dba0d2d79f83d5
jgfouca pushed a commit that referenced this pull request Feb 27, 2018
jgfouca pushed a commit that referenced this pull request Mar 14, 2018
rljacob pushed a commit that referenced this pull request Apr 16, 2021
rljacob pushed a commit that referenced this pull request May 6, 2021