Memory bottleneck in chgres_cube #633
component winds. Fixes ufs-community#633
Fixes ufs-community#633.
I set up a test case using GFSv16 netcdf data as input for a C1152 L128 grid. My test script and config file are on Dell: Using 'develop' at 570ea39 required 8 nodes/6 tasks per node. Using the branch at f584c91 required only 4 nodes/6 tasks per node. Will try additional tests.
Tried C3072 L65 on Dell. Using 'develop' required 30 nodes/6 tasks per node. Using the branch required 20 nodes/6 tasks per node.
Here is the error I get from 'FieldRegrid', which is solved by doubling the number of nodes:
According to the ESMF group (@rsdunlapiv), this is the result of using 32-bit pointers in some ESMF routines.
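As a rough, back-of-the-envelope illustration of the sizes involved (approximate numbers of my own, not from the ESMF team): a C3072 cubed-sphere grid has 6 x 3072 x 3072, or about 5.7e7, horizontal points, so a single 128-level real(8) field holds roughly 7.2e9 values (about 58 GB), and packing the three wind components into one field triples that. Element counts of that size already exceed what a signed 32-bit integer can hold:

```math
6 \times 3072^{2} \times 128 \approx 7.2 \times 10^{9} \;>\; 2^{31}-1 \approx 2.1 \times 10^{9}
```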
The ESMF group recommends a switch to ESMF v8.3 to help fix this. I just tried v8.3 on Hera using develop at f658c1e and all chgres regression tests passed. Will open an issue to upgrade to v8.3.
The ESMF group provided a test branch that fixes this: https://github.com/esmf-org/esmf/tree/feature/large-messages. I cloned and compiled this on Hera here:
On Hera, I compiled 'develop' at 2a07b2c for use as the 'control'. For the 'test', I compiled 'develop' using the updated ESMF branch. This was done by modifying the build module as follows:
The test case was a C1152 grid using 128 vertical levels. All config files and scripts are here:
Running the 'control' with 7 nodes/6 tasks per node resulted in this error (see "log.fail.7nodes.develop"):
Rerunning with 8 nodes/6 tasks per node was successful. See "log.pass.8nodes.develop".
Running the 'test' (which used the updated ESMF branch) was successful using only 5 nodes/6 tasks per node. See "log.pass.5nodes.new.esmf.branch". So, using the new ESMF test branch eliminates the MPI error and reduces the resources needed to run large grids.
Update from the ESMF team (Gerhard):
ESMF v8.3.1 was officially released: https://github.com/esmf-org/esmf/releases/tag/v8.3.1
Anning Cheng was trying to create a C3072 L128 grid using the gdas_init utility on Cactus. The wind fields in the coldstart files were not correct. I was able to repeat the problem using develop at 711a4dc. I then upgraded to ESMF v8.4.0bs08, but the problem persisted. I ran with 8 nodes/18 tasks per node, and I requested 500 GB of memory. A plot of the problem is attached:
So, the way I create the ESMF fields for 3-d winds must have some other problems. As a test, I merged the latest updates from |
components. Fixes ufs-community#633.
Update unit tests for new specification of wind fields. Fixes ufs-community#633.
Users occasionally get out-of-memory issues when running chgres_cube for large domains. Almost always, this happens during the regridding of the 3-D winds to the edges of the grid box.
UFS_UTILS/sorc/chgres_cube.fd/atmosphere.F90, line 347 at 570ea39
I suspect this is because the ESMF field for winds is 4-dimensional (the x, y and z wind components are carried along the vertical/ungridded dimension of a single field).
UFS_UTILS/sorc/chgres_cube.fd/atmosphere.F90, line 632 at 570ea39
Interpolating each wind component separately or as a field bundle would likely save memory.
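Below is a minimal sketch of the per-component idea, assuming the standard ESMF Fortran API; it is not the actual chgres_cube code, and the program name, helper function, simple lat-lon grids, and grid sizes are illustrative placeholders. The point is to compute the regrid weights once and apply them to each wind component in turn, so only a 3-D field is in flight at any time instead of a single 4-D (x, y, level, component) field.

```fortran
program regrid_winds_per_component
! Minimal sketch only - not the actual chgres_cube code. It shows regridding
! the 3-D winds one component at a time with a single reused route handle,
! instead of packing levels and components into one 4-D field. The lat-lon
! grids, sizes, and names are illustrative assumptions; 'rc' checks omitted.
  use ESMF
  implicit none

  integer, parameter :: nlev = 128, ncomp = 3
  type(ESMF_Grid)             :: src_grid, dst_grid
  type(ESMF_Field)            :: src_wind(ncomp), dst_wind(ncomp)
  type(ESMF_RouteHandle)      :: rh
  real(ESMF_KIND_R8), pointer :: data3d(:,:,:)
  integer                     :: rc, n

  call ESMF_Initialize(defaultlogfilename="regrid_sketch.log", rc=rc)

  src_grid = create_latlon_grid(72, 36)     ! coarse "input" grid
  dst_grid = create_latlon_grid(144, 72)    ! finer "target" grid

  ! One field per wind component; only the vertical is an ungridded dimension.
  do n = 1, ncomp
    src_wind(n) = ESMF_FieldCreate(src_grid, typekind=ESMF_TYPEKIND_R8, &
                    staggerloc=ESMF_STAGGERLOC_CENTER,                  &
                    ungriddedLBound=(/1/), ungriddedUBound=(/nlev/), rc=rc)
    dst_wind(n) = ESMF_FieldCreate(dst_grid, typekind=ESMF_TYPEKIND_R8, &
                    staggerloc=ESMF_STAGGERLOC_CENTER,                  &
                    ungriddedLBound=(/1/), ungriddedUBound=(/nlev/), rc=rc)
    call ESMF_FieldGet(src_wind(n), farrayPtr=data3d, rc=rc)
    data3d = real(n, ESMF_KIND_R8)          ! placeholder wind data
  end do

  ! Compute the interpolation weights once ...
  call ESMF_FieldRegridStore(srcField=src_wind(1), dstField=dst_wind(1),  &
         regridmethod=ESMF_REGRIDMETHOD_BILINEAR,                         &
         unmappedaction=ESMF_UNMAPPEDACTION_IGNORE, routehandle=rh, rc=rc)

  ! ... then apply them to each component separately, so only one 3-D field
  ! is moved through the regrid at a time.
  do n = 1, ncomp
    call ESMF_FieldRegrid(src_wind(n), dst_wind(n), routehandle=rh, rc=rc)
  end do

  call ESMF_FieldRegridRelease(routehandle=rh, rc=rc)
  call ESMF_Finalize(rc=rc)

contains

  function create_latlon_grid(ni, nj) result(grid)
  ! Build a simple global lat-lon grid with cell-center coordinates in degrees.
    integer, intent(in) :: ni, nj
    type(ESMF_Grid)     :: grid
    real(ESMF_KIND_R8), pointer :: lonp(:,:), latp(:,:)
    integer :: rc, i, j

    grid = ESMF_GridCreate1PeriDim(maxIndex=(/ni, nj/),                      &
             coordSys=ESMF_COORDSYS_SPH_DEG, indexflag=ESMF_INDEX_GLOBAL, rc=rc)
    call ESMF_GridAddCoord(grid, staggerloc=ESMF_STAGGERLOC_CENTER, rc=rc)
    call ESMF_GridGetCoord(grid, coordDim=1, staggerloc=ESMF_STAGGERLOC_CENTER, &
           farrayPtr=lonp, rc=rc)
    call ESMF_GridGetCoord(grid, coordDim=2, staggerloc=ESMF_STAGGERLOC_CENTER, &
           farrayPtr=latp, rc=rc)
    do j = lbound(lonp, 2), ubound(lonp, 2)
      do i = lbound(lonp, 1), ubound(lonp, 1)
        lonp(i, j) = (i - 0.5d0) * 360.d0 / ni
        latp(i, j) = -90.d0 + (j - 0.5d0) * 180.d0 / nj
      end do
    end do
  end function create_latlon_grid

end program regrid_winds_per_component
```

A field bundle (ESMF_FieldBundleCreate / ESMF_FieldBundleRegridStore) would be the other option mentioned above; the appeal of the per-component loop is that the peak regrid buffer per task should stay closer to the size of a single 3-D field.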