
Improve default PE layouts on Edison #1241

Merged: 3 commits into master on Mar 2, 2017
Conversation

@PeterCaldwell (Contributor)

Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell.

@mt5555 (Contributor) commented Jan 26, 2017

Thanks for doing this! Since no good deed goes unpunished, would you also be willing to add pages on these configurations to the performance archive on Confluence?

@PeterCaldwell (Contributor, Author)

@mt5555 - yup, I'll make confluence pages for these layouts.

@PeterCaldwell (Contributor, Author)

Hi @ndkeen - Is this PR in a zombie state, or is someone working on it? Do you need anything from me?

@ndkeen (Contributor) commented Feb 1, 2017

Yes, I just haven't gotten to it. I'm certainly fine with this, but I assume I need to test it out myself? Or can I just merge it? I fired off the acme_developer tests; the ne30np4 tests should use this.

Ok, 2 of the tests are asking for 114 nodes, which suggests the change you wanted is in place. However, do we really want tests in acme_developer to ask for this many nodes?

All acme_developer jobs finished... except for the two that are asking for 114 nodes. Those are still in the queue.

Of those 2 runs that use the updated default layout (i.e., 114 nodes with ne30np4), one is DEBUG. That one failed with a floating overflow. I doubled the number of nodes and tried again -- this gives each MPI task more memory -- and that did not solve the problem.

/global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intel.20170201_151526

0685:  about to call aerosol_vars%restart:        78727          15
0685: forrtl: error (72): floating overflow
0685: Image              PC                Routine            Line        Source             
0685: acme.exe           000000000BE73DC1  Unknown               Unknown  Unknown
0685: acme.exe           000000000BE72517  Unknown               Unknown  Unknown
0685: acme.exe           000000000BD80A64  Unknown               Unknown  Unknown
0685: acme.exe           000000000BD80876  Unknown               Unknown  Unknown
0685: acme.exe           000000000BCFA4C4  Unknown               Unknown  Unknown
0685: acme.exe           000000000BD05BE9  Unknown               Unknown  Unknown
0685: acme.exe           000000000B6E9470  Unknown               Unknown  Unknown
0685: acme.exe           0000000003058DE7  clubb_intr_mp_clu        2456  clubb_intr.F90
0685: acme.exe           0000000000E67531  physpkg_mp_tphysb        2362  physpkg.F90
0685: acme.exe           0000000000E37DB4  physpkg_mp_phys_r        1008  physpkg.F90
0685: acme.exe           000000000BE0F603  Unknown               Unknown  Unknown

cross-referencing issue #1263

@PeterCaldwell (Contributor, Author)

Hi @ndkeen - I just did a 2-day FC5AV1C-L run on Edison using 114 nodes with DEBUG=TRUE and it completed just fine. The run is at /global/cscratch1/sd/petercal/ACME_simulations/ACME.FC5AV1C-L.ne30_ne30.edison.test3-M/. Could you try the failing test one more time?

@ndkeen (Contributor) commented Feb 3, 2017

Well, that's good that for your case there was nothing for the compiler to catch. I can see inside that directory, but I cannot look at any of the files. I would recommend that you change the run_acme* scripts (at least for NERSC) to use umask 022 instead of the current default.

I had already run it twice (once normally and once with 2x more nodes), but I just submitted it again.

I also started a build from scratch using create_newcase (as opposed to launching it via create_test). This run failed in the same way.

These should be the steps:

create_newcase --case /global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intelb --res ne30_oEC --mach edison --compiler intel --compset F1850C5AV1C-02  --walltime 0:45:00
cd /global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intelb
xmlchange -i env_build.xml DEBUG=TRUE
case.setup
case.build
case.submit

/global/cscratch1/sd/ndk/acme_scratch/edison/mPJC0/SMS_D_Ln1.ne30_oEC.F1850C5AV1C-02.edison_intelb

@PeterCaldwell (Contributor, Author)

Ok, I've verified that the new F-compset PE layout causes the SMS_D test (a 1-step smoke test compiled in debug mode) to crash with a "floating invalid" message in CLUBB, and that the code passed that test before I changed its default PE layout. Note, however, that when I ran an FC5AV1C-L case with my new layout and DEBUG=TRUE outside of the test suite, it ran fine. This makes me think there's something wrong in the nuances of the test rather than with the usefulness of the layout. In any case, I will keep debugging this (perhaps CLUBB doesn't like hyperthreading?), but I have removed the problematic layout from my branch for the purposes of this PR.

The revised PR consists of modifications to the default, M, and L PE layouts for the WCYCL experiment. Since there are no tests of the WCYCL case in the devel test suite, there's no point in re-running the test suite on my branch. And since I'm baffled as to why the F-compset change would cause SMS_D to fail, I can't promise that my WCYCL layouts won't cause similar problems. However, the WCYCL layouts are less avant-garde (they use no threading at all) and we've been using them for all the WCYCL runs we've done over the last year... so I think they're probably safe.

@mt5555 (Contributor) commented Feb 3, 2017

@ndkeen should comment on this. He recently found that on some machines, to improve vectorization, calculations are done on data whose results will be masked out. In debug mode, we trap floating-point errors and stop execution. In non-debug mode, if this were a real error, you would eventually see NaNs in your output. So probably everything is ok.
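
To illustrate that failure mode: in a hypothetical (non-ACME) Fortran fragment like the one below, the full vector expression is evaluated before the mask is applied, so a DEBUG build that traps floating-point exceptions can abort on lanes whose results would be discarded anyway:

   program masked_vector_demo
      use, intrinsic :: iso_fortran_env, only: real64
      implicit none
      real(real64) :: a(8), b(8), c(8)
      a = 1.0_real64
      b = [1, 2, 3, 4, 0, 0, 0, 0]   ! zeros stand in for "inactive" lanes
      ! merge() evaluates a/b over ALL lanes before applying the mask,
      ! so the divides in the inactive lanes still happen: a DEBUG build
      ! that traps floating-point exceptions (e.g. Intel -fpe0) aborts
      ! here, while an optimized build computes Inf and discards it.
      c = merge(a/b, 0.0_real64, b /= 0.0_real64)
      print *, c
   end program masked_vector_demo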

@ndkeen (Contributor) commented Feb 5, 2017

Mark is referring to a different issue (#1183) where I ran into some floating-point issues on cori-knl for DEBUG F-compset runs (both within the testing framework and outside it). The problems were happening inside a LAPACK routine from MKL, and we eventually decided to simply turn off floating-point checking when going into the routine and turn it back on when it comes out.
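
A minimal sketch of that bracketing pattern using the standard Fortran IEEE modules (hypothetical demo code, not the actual ACME change):

   program fp_trap_bracket
      use, intrinsic :: ieee_exceptions
      implicit none
      logical :: halting(size(ieee_usual))
      real :: x, y
      y = -1.0
      ! Save the current halting modes, then stop trapping the usual
      ! exceptions (overflow, divide-by-zero, invalid) around the
      ! sensitive call, as described for the MKL workaround in #1183.
      call ieee_get_halting_mode(ieee_usual, halting)
      call ieee_set_halting_mode(ieee_usual, .false.)
      x = sqrt(y)   ! raises IEEE_INVALID but no longer halts
      call ieee_set_halting_mode(ieee_usual, halting)   ! restore on the way out
      print *, x    ! NaN
   end program fp_trap_bracket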

Additionally, there is another issue (#1231) where I am trying to turn on floating-point checking for GNU builds. I found a test that failed there, but it ran OK with more memory, so it seems the additional checking simply pushed the memory use just above the limit.

However, in this case it's not stopping in LAPACK, and it's not affected by giving each MPI task more memory, so it still seems like it could be a real issue. Also, this is on edison with v15 of Intel. One thing I have not tried is running this same test on a more recent master (I assumed this was already fairly recent).

Here is the place in the code where it stops:

   call outfld( 'CLOUDFRAC_CLUBB',  alst,                    pcols, lchnk )
   call outfld( 'RCMINLAYER_CLUBB', rcm_in_layer*1000._r8,   pcols, lchnk )  ! <-- this line
   call outfld( 'CLOUDCOVER_CLUBB', cloud_frac,              pcols, lchnk )

I can think of a few other things to test to narrow this down. Is it complaining about rcm_in_layer itself or about the multiplication by 1000? Is it really an overflow, or were the values "messed up" before? If I simply comment out this line, will it continue running? (Or should it write 0's, if this is indeed sending data to the output file and a place-holder is expected?)

@gold2718 commented Feb 5, 2017

One thing I can think of is that the indicated line forces creation of a temporary array.
Could you try breaking this into two lines to see what happens?

rcm_in_layer = rcm_in_layer*1000._r8
call outfld( 'RCMINLAYER_CLUBB', rcm_in_layer,   pcols, lchnk )

(rcm_in_layer is not used after the outfld call so it is okay to change it here)

@PeterCaldwell (Contributor, Author)

Hi @ndkeen - I'd like the current version of my branch merged with master ASAP (so it can be used for new coupled tests we want to run next week). As currently implemented, this branch just makes the PE layouts the coupled team uses anyway into the defaults. Since there aren't any devel-suite tests of the coupled model, these changes will pass the devel suite. I've removed the 114 node F-compset layout from this PR - I'll make that more controversial change as a separate PR (after figuring out what's going on). Is that ok?

@ndkeen (Contributor) commented Feb 7, 2017

Yes, sounds good. I didn't get to it this weekend, and the NERSC machines (where my repos are) have been down all day yesterday and today. But it's good to know you just want to merge it ASAP, without the 114 node change. I was still planning on creating a github issue, as it may point to something we should fix and is relatively easy to reproduce.

@PeterCaldwell (Contributor, Author)

I like the idea of creating a github issue regarding the 114 node F-compset. Thanks for doing that! Thanks also (in advance) for merging the WCYCL PE layouts.

@ndkeen (Contributor) left a review comment

Looks fine. I can't currently run the acme_developer tests; however, these changes should not affect any of them.

ndkeen added a commit that referenced this pull request Feb 8, 2017
Replaces existing generic, poor-performing ne30 PE layouts on Edison
with better values. 173 node and 375 node A_WCYCL1850 layouts were
created by @worleyph a year or so ago.
@ndkeen (Contributor) commented Feb 8, 2017

I merged to next.

@ndkeen (Contributor) commented Feb 9, 2017

@gold2718: I have been meaning to try your suggestion. I ran into another similar floating-point issue at an outfld() call where there was another temporary array creation - also in the acme_developer tests, but stopping in a different F90 file. I tried out your suggestion there (doing the math before calling outfld); however, it failed in the same way at the math step:

!call outfld('NIMIX_IMM', niimm_bc+niimm_dst, pcols, lchnk) !<-- floating invalid _was_ here 
niimm_bc=niimm_bc+niimm_dst  !<-- floating invalid now here
call outfld('NIMIX_IMM', niimm_bc, pcols, lchnk)

Looking further, I see that these arrays are initialized to zero, but only up to ncol. So is it possible that there are some uninitialized values sitting above ncol that, when added, don't taste so great?

For example, I am trying:

!   nnuccc_dst(:ncol,:) = 0._r8   
  nnuccc_dst = 0._r8   

for all of the arrays in that section, and the run completed.
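
The pitfall reproduces outside the model. Here is a standalone sketch, with made-up array names and sizes, assuming a DEBUG build with floating-point trapping enabled (e.g. Intel -fpe0):

   program ncol_pcols_demo
      implicit none
      integer, parameter :: r8 = selected_real_kind(12)
      integer, parameter :: pcols = 16, pver = 72, ncol = 10
      real(r8) :: bc(pcols,pver), dst(pcols,pver), tmp(pcols,pver)
      bc(:ncol,:)  = 0._r8   ! only the active columns are zeroed,
      dst(:ncol,:) = 0._r8   ! as in the model code above
      ! The expression bc + dst is evaluated over the FULL pcols extent,
      ! so columns ncol+1:pcols contribute uninitialized values. With
      ! trapping enabled, any signaling garbage there aborts the run --
      ! whether it actually fires depends on what happens to be in memory.
      tmp = bc + dst
      print *, tmp(1,1)
   end program ncol_pcols_demo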

@gold2718 commented Feb 9, 2017

@ndkeen, you are correct in that not all chunks have ncol == pcols, so there are unused items at the top. In those cases, you can either fix the initialization (as you did in your example) or modify the outfld call, for example:
call outfld('NIMIX_IMM', niimm_bc(:ncol)+niimm_dst(:ncol), ncol, lchnk)
which is slightly more efficient (in that we never touch the unused columns).

@ndkeen (Contributor) commented Feb 10, 2017

Hmmm. I made the change, but it still stops on the same line.

   call outfld('NIMIX_IMM', niimm_bc(:pcols,:)+niimm_dst(:pcols,:), pcols, lchnk)  !ndk still stops here
   call outfld('NIMIX_CNT', nicnt_bc(:pcols,:)+nicnt_dst(:pcols,:), pcols, lchnk)
   call outfld('NIMIX_DEP', nidep_bc(:pcols,:)+nidep_dst(:pcols,:), pcols, lchnk)

The variables are only zero'd up to :ncol (as before). Is it possible pcols is larger than ncol and the values between ncol and pcols are not initialized?

   niimm_bc(:ncol,:) = 0._r8
   nicnt_bc(:ncol,:) = 0._r8
   nidep_bc(:ncol,:) = 0._r8

@gold2718

That's exactly what I said, which is why you have to use foo(:ncol), not foo(:pcols) (see my note above).

@ndkeen (Contributor) commented Feb 10, 2017

Oh, you meant to change the 3rd arg to outfld as well - I thought that was just an example. Also note that your example only lists one dimension, and these arrays have 2.

Also, this seems like a somewhat more serious change: all of the other calls to outfld are made with pcols. Since we've gotten off the original issue, I will make another issue that's more specific. But if we fix it (in all places), it could also fix the same issue that I saw with the 114-node layout.

#1263
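
For reference, a 2-D analogue of the slice-based call suggested above would presumably look like the line below -- a sketch that assumes outfld's third argument is the leading dimension of the field as actually passed:

   ! Hypothetical 2-D form of the suggested fix: pass only the active
   ! columns and report the slice's leading dimension (ncol).
   call outfld('NIMIX_IMM', niimm_bc(:ncol,:) + niimm_dst(:ncol,:), ncol, lchnk)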

@PeterCaldwell (Contributor, Author)

I noticed this went to next 15 days ago but still isn't on master... is this PR stalled out again?

@ndkeen (Contributor) commented Feb 27, 2017

I was waiting for master to be open for updates.

@ndkeen merged commit a3230b4 into master on Mar 2, 2017
ndkeen added a commit that referenced this pull request Mar 2, 2017
Replaces existing generic, poor-performing ne30 PE layouts on Edison
with better values. 173 node and 375 node A_WCYCL1850 layouts were
created by @worleyph a year or so ago and the 114 node F-compset
layout was created by @PeterCaldwell.

[BFB]
agsalin pushed a commit that referenced this pull request Apr 13, 2017
agsalin pushed a commit that referenced this pull request Apr 13, 2017
Mapping/update runoff to ocn

Updates to the runoff_to_ocean generator tool, with three goals in mind:

1. There was a bug in the ocean -> ocean smoothing

2. With the introduction of the estuary box model (EBM) in POP, we need a tool that merges the nearest neighbor map with the smooth map so that the result is a nearest neighbor map in open ocean but a smooth map in the marginal seas

3. The introduction of the EBM also added new requirements to the tool itself, namely the ability to map from runoff to coastal ocean points in the nearest neighbor map and then smooth from the coastal ocean points to the global ocean in the smooth map

Test suite: N/A
Test baseline: N/A
Test namelist changes: N/A
Test status: N/A

Fixes #1241 and #1243

User interface changes: the runoff map namelists are slightly different (though this code is backwards compatible with the old namelist)

Code review: @klindsay28 and I looked at how I implemented his NCL script in the code, and @kauff and I looked at the Fortran modifications.
@PeterCaldwell deleted the PeterCaldwell/cime/pe_layout1 branch on July 25, 2017 20:49
ndkeen added a commit that referenced this pull request Jul 28, 2017
…ext (PR #1637)

Currently, unless you explicitly specify a custom PE layout on Edison, F-compset runs will default to 192 tasks + 4 threads (=32 total nodes), yielding <2 SYPD. This PR specifies a reasonable default for ne30 F compsets.

Note that what is proposed here isn't an optimal layout. Hyperthreading can increase SYPD by 1 or more, but it isn't turned on by default anymore. This layout was chosen simply because it works and isn't horrible; it just allows newbies, or people who forget to specify the PE layout, to get something that isn't totally useless. Efforts to find a faster PE layout would still be worthwhile.

Note that this is a second attempt to do this. #1241 also proposed a similar F PE layout but failed to pass tests for unrelated issues documented here: #1263

(measured 10.9 sypd after edison upgrade)
jgfouca pushed a commit that referenced this pull request Jan 8, 2018
0f241db response to comments
1007a7a cannot predetermin ndims here
99ef07d Merge pull request #1241 from NCAR/free_new_allocs
29ed162 free recently allocated vars
fbc3584 Merge pull request #1239 from NCAR/dontuse_nc_max
63dee3d Merge pull request #1240 from NCAR/limitto2GiB
64f2492 limit to 2GiB due to romio bug
29aee05 dont use NC_MAX values
d831ad3 Merge pull request #1231 from mgduda/mpi_type_fix
e996bdb Merge pull request #1222 from NCAR/ejh_autoconf_logging
426af22 Partial fix for incorrect type of 'mpi_type' in pioc_support.c
7eb724f added enable-logging option to autotools build

git-subtree-dir: src/externals/pio2
git-subtree-split: 0f241db88cfee1912a2769a052dba0d2d79f83d5
jgfouca pushed a commit that referenced this pull request Feb 27, 2018
jgfouca pushed a commit that referenced this pull request Mar 14, 2018
rljacob pushed a commit that referenced this pull request Apr 16, 2021
rljacob pushed a commit that referenced this pull request May 6, 2021