-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve default PE layouts on Edison #1241
Conversation
Thanks for doing this! Since no good deed goes unpunished, would you also be willing to add pages on these configurations to the performancne archive on confluence? |
@mt5555 - yup, I'll make confluence pages for these layouts. |
Hi @ndkeen - Is this PR in a zombie state, or is someone working on it? Do you need anything from me? |
Yes, I just haven't gotten to it. I'm certainly fine with this, but assume I need to test it out myself? Or can I just merge it? Fired off acme_developer test. The ne30p4 tests should use this. Ok, 2 of the tests are asking for 114 nodes. Which sounds like the change you wanted is in place. However, do we really want tests in acme_developer to ask for this many nodes? All acme_developer jobs finished... except for the two that are asking for 114 nodes. Still in Q. Of those 2 runs that use this updated compset (ie using 114 nodes with ne30np4), one is DEBUG. That one failed with a floating overflow. I doubled the number of nodes and tried again -- this gives each MPI more memory -- and that did not solve the problem.
cross-referencing issue #1263 |
Hi @ndkeen - I just did a 2 day FC5AV1C-L run on Edison using 114 nodes with DEBUG=TRUE and it completed just fine. The run is at: /global/cscratch1/sd/petercal/ACME_simulations/ACME.FC5AV1C-L.ne30_ne30.edison.test3-M/ . Could you try the failing test one more time? |
Well that's good that for your case, there was nothing for the compiler to catch. I can see inside that directory, but cannot look at any of the files. I would recommend that you make a change in the run_acme* scripts (at least for NERSC) to use I had already ran it twice (once normally and once with 2x more nodes), but I just submitted it again. I also started a build from scratch using create_newcase (as opposed to the way it is launched with create_test). This run failed in the same way. Should be these steps below:
|
Ok, I've verified that the new F-compset PE layout causes the SMS_D test (1 step smoke test compiled in debug mode) to crash with "floating invalid" message in CLUBB and that the code passed that test before I changed its default PE layout. Note, however, that when I ran an FC5AV1C-L case with my new layout and DEBUG=TRUE outside of the test suite, it ran fine. This makes me think there's something wrong with the nuances of the test rather than with the usefulness of the layout. In any case, I will keep debugging this (perhaps CLUBB doesn't like hyperthreading?) but have removed the problematic layout from my branch for the purposes of this PR. The revised PR consists of modifications to the default, M, and L PE layouts for the WCYCL experiment. Since there are no tests of the WCYCL case in the devel test suite, there's no point in re-running the test suite on my branch. Since I'm baffled why the F-compset change would cause SMS_D to fail, I can't promise that my WCYCL layouts won't cause similar problems. However, the WCYCL layouts are less avant-garde (since they use no threading at all) and we've been using them for all the WCYCL runs we've done for the last year... so I think they're probably safe. |
@ndkeen should comment on this. He recently found on some machines, to improve vectorization, calculations will be done on data where the result will be masked out. In debug mode, we trap floating point errors and stop execution. In non-debug mode, if this was a real error, you would eventually see NaNs in your output. So probably everything is ok. |
Mark is referring to a different issue (#1183) where I ran into some floating-point issues on cori-knl for DEBUG, F-compset runs (within testing framework and outside). The problems were happening inside a lapack routine from MKL and we eventually decided to simply turn off floating-point checking when going into the routine, turning it back on when it comes out. Additionally, there is another issue (#1231) where I am trying to turn on floating-point checking for GNU builds and I found a test that failed, but it ran OK with more memory, so it just seems like the additional checking pushed the memory use just above the limit. However, in this case, it's not stopping in lapack and it's not affected by giving each MPI more memory. So it still seems like it could be a real issue. Also, this is on edison with v15 of Intel. One thing I have not tried is running this same test on a more recent master (I assumed this was already fairly recent). Here is the place code in the code where it stops:
I can think of a few other things to test to verify more. Is it complaining about rcm_in_layer itself or the multiplication by 1000? Is it really an overflow or were the values "messed up" before? If I simply comment this line will it continue running? (or write 0's if this is indeed sending data to output file and it might be expecting at least a place-holder). |
One thing I can think of is that the indicated line forces creation of a temporary array.
( |
Hi @ndkeen - I'd like the current version of my branch merged w/ master ASAP (so it can be used for new coupled tests we want to do next week). As currently implemented, this branch just makes the PE layouts used by the coupled team anyways into defaults. Since there aren't any devel suite tests of the coupled model, these changes will pass the devel suite. I've removed the 114 node F-compset layout from this PR - I'll make that more controversial change as a separate PR (after figuring out what's going on). Is that ok? |
Yes, sounds good. I didn't get it this weekend and NERSC machines (where my repos are) have been down all day yesterday and today. But good to know you just want to merge it without the 114 node change and asap. I was still planning on creating a github issue as it may point to something we should fix and is relatively easy to reproduce. |
I like the idea of creating a github issue regarding the 114 node F-compset. Thanks for doing that! Thanks also (in advance) for merging the WCYCL PE layouts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine. I can't currently run the acme_developer tests, however, these changes should not affect any of these tests.
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so.
I merged to next |
@gold2718 : I have been meaning to try your suggestion. I ran into another similar floating point issue at an outfld() call where there was another temporary array creation. In the acme_developer tests, but stopped in different F90 file. I tried out your suggestion there (to do the math before calling outfld), however, it failed in same way at the math step:
Looking further I see that these arrays are initialized to zero, but only for the first ncol. So is it possible that there are some un-initialized values sitting above ncol that, when added, don't taste so great? For example, I am trying:
for all of the arrays in that section, and the run completed. |
@ndkeen , you are correct in that not all chunks have ncol == pcols so there are unused items at the top. In those cases, you can either fix the initialization (as you did in your example) or modify the outfld call, for example: |
Hmmm. I made the change, but still stops on same line.
The variables are only zero'd to :ncol (as it was). Is it possible pcols is larger than ncol and the values between pcols and ncol are not initialized?
|
That's exactly what I said which is why you have to use foo(:ncol), not foo(:pcols) (see my note above). |
Oh you meant to change the 3rd arg to outfld as well. I thought it was just an example. Also, this seems like a little more serious of a change. All of the other calls to outfld are called with pcols. Since we've gotten off of the original issue, I will make another issue that's more specific. But if we fix it (in all places), it could also fix the same issue that I saw with the 114-node layout. |
I noticed this went to next 15 days ago but still isn't on master... is this PR stalled out again? |
Was waiting for master to be open for updates. ? |
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]
Mapping/update runoff to ocn Updates to the runoff_to_ocean generator tool, with three goals in mind: 1. There was a bug in the ocean -> ocean smoothing 2. With the introduction of the estuary box model (EBM) in POP, we need a tool that merges the nearest neighbor map with the smooth map so that the result is a nearest neighbor map in open ocean but a smooth map in the marginal seas 3. The introduction of the EBM also added new requirements to the tool itself, namely the ability to map from runoff to coastal ocean points in the nearest neighbor map and then smooth from the coastal ocean points to the global ocean in the smooth map Test suite: N/A Test baseline: N/A Test namelist changes: N/A Test status: N/A Fixes #1241 and #1243 User interface changes: the runoff map namelists are slightly different (though this code is backwards compatible with the old namelist) Code review: @klindsay28 and I looked at how I implemented his NCL script in the code, and @kauff and I looked at the Fortran modifications.
…ext (PR #1637) Currently, unless you explicitly specify a custom PE layout on Edison, F-compset runs will default to 192 tasks + 4 threads (=32 total nodes), yielding <2 SYPD. This PR specifies a reasonable default for ne30 F compsets. Note that what is proposed here isn't an optimal layout. Hyperthreading can increase SYPD by 1 or more, but it isn't turned on by default any more. This layout was chosen simply because it works and isn't horrible. just allows newbies or people who forget to specify the PE layout to get something that isn't totally useless. Efforts to get a faster PE layout would be useful. Note that this is a second attempt to do this. #1241 also proposed a similar F PE layout but failed to pass tests for unrelated issues documented here: #1263 (measured 10.9 sypd after edison upgrade)
0f241db response to comments 1007a7a cannot predetermin ndims here 99ef07d Merge pull request #1241 from NCAR/free_new_allocs 29ed162 free recently allocated vars fbc3584 Merge pull request #1239 from NCAR/dontuse_nc_max 63dee3d Merge pull request #1240 from NCAR/limitto2GiB 64f2492 limit to 2GiB due to romio bug 29aee05 dont use NC_MAX values d831ad3 Merge pull request #1231 from mgduda/mpi_type_fix e996bdb Merge pull request #1222 from NCAR/ejh_autoconf_logging 426af22 Partial fix for incorrect type of 'mpi_type' in pioc_support.c 7eb724f added enable-logging option to autotools build git-subtree-dir: src/externals/pio2 git-subtree-split: 0f241db88cfee1912a2769a052dba0d2d79f83d5
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell.