load esmf/8.5.0 in gaea.intel.lua #1327

Merged
merged 5 commits into from
Nov 1, 2024

Conversation

RussTreadon-NOAA
Contributor

This PR adds load("esmf/8.5.0") to modulefiles/GDAS/gaea.intel.lua. This is required for the GDASApp build to successfully complete on Gaea C5.
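A quick way to confirm that a modulefile carries the new line is to search for it. The sketch below is an illustration only and is not part of the PR; the helper name `has_module_load` is hypothetical, and it matches only the double-quoted `load("name/version")` form quoted above.

```shell
# Hypothetical helper: succeed if FILE contains load("MODULE").
# Only the double-quoted form used in this PR is matched.
has_module_load() {
    file=$1
    module=$2
    # -F treats the pattern as a fixed string, so "." and "/" are literal.
    grep -qF "load(\"$module\")" "$file"
}

# Example (path as named in this PR):
# has_module_load modulefiles/GDAS/gaea.intel.lua esmf/8.5.0 && echo "esmf is loaded"
```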

Resolves #1326

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Oct 11, 2024
@RussTreadon-NOAA
Contributor Author

I don't think the automatic check failures are due to the changes in this PR. There have been several "GitHub Incident - Disruption with some GitHub services - 11 October 2024" notifications within the last hour.

@RussTreadon-NOAA
Contributor Author

Confirmed that adding load("esmf/8.5.0") to gaea.intel.lua is sufficient to allow build.sh to successfully run to completion on Gaea C5. Test build completed in /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/GDASApp/test.

@CoryMartin-NOAA
Contributor

@RussTreadon-NOAA sorry I didn't catch this sooner, but I see that C5 is having an OS upgrade next week... I fear this will make a lot of spack-stack things become broken, so perhaps it's best to pause this until after that maintenance/upgrade?

@RussTreadon-NOAA
Contributor Author

Oops, that's unfortunate. I'll flip this PR to draft since what works today will likely fail next week.

@RussTreadon-NOAA RussTreadon-NOAA marked this pull request as draft October 11, 2024 19:28
@CoryMartin-NOAA
Contributor

I just so happened to catch "Gaea C5 upgrade" on my calendar since I subscribe to the RDHPCS calendar. Email search says it is for an OS upgrade. If it is anything like the upgrade on other machines, it will be quite disruptive.

@RussTreadon-NOAA
Contributor Author

Agreed. This PR may be in draft mode for quite some time. If the wait gets too long, we can simply close this PR and open a new one once the OS upgrade dust settles.

@RussTreadon-NOAA
Contributor Author

Successfully built GDASApp on Gaea C5 using feature/gaea-c5 at a140bbd. Given this, I am marking this PR as ready for review.

@RussTreadon-NOAA RussTreadon-NOAA marked this pull request as ready for review October 24, 2024 16:59
@RussTreadon-NOAA
Contributor Author

GDASApp ctests

Unfortunately several GDASApp ctests fail on Gaea C5

Test project /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/GDASApp/test/build
      Start 1579: test_gdasapp_util_coding_norms
 1/23 Test #1579: test_gdasapp_util_coding_norms ..........   Passed    4.09 sec
      Start 1580: test_gdasapp_util_ioda_example
 2/23 Test #1580: test_gdasapp_util_ioda_example ..........   Passed    0.65 sec
      Start 1581: test_gdasapp_util_prepdata
 3/23 Test #1581: test_gdasapp_util_prepdata ..............***Failed    0.38 sec
      Start 1582: test_gdasapp_util_rads2ioda
 4/23 Test #1582: test_gdasapp_util_rads2ioda .............   Passed    0.36 sec
      Start 1583: test_gdasapp_util_ghrsst2ioda
 5/23 Test #1583: test_gdasapp_util_ghrsst2ioda ...........Subprocess aborted***Exception:   0.42 sec

...

      Start 1949: test_gdasapp_convert_bufr_adpsfc_snow
21/23 Test #1949: test_gdasapp_convert_bufr_adpsfc_snow ...   Passed    4.08 sec
      Start 1950: test_gdasapp_convert_bufr_adpsfc
22/23 Test #1950: test_gdasapp_convert_bufr_adpsfc ........   Passed    4.62 sec
      Start 1951: test_gdasapp_convert_gsi_satbias
23/23 Test #1951: test_gdasapp_convert_gsi_satbias ........   Passed   18.00 sec

39% tests passed, 14 tests failed out of 23

Label Time Summary:
gdas-utils    =   8.29 sec*proc (12 tests)
script        =   8.29 sec*proc (12 tests)

Total Test time (real) =  49.75 sec

The following tests FAILED:
        1581 - test_gdasapp_util_prepdata (Failed)
        1583 - test_gdasapp_util_ghrsst2ioda (Subprocess aborted)
        1584 - test_gdasapp_util_smap2ioda (Subprocess aborted)
        1585 - test_gdasapp_util_smos2ioda (Subprocess aborted)
        1586 - test_gdasapp_util_viirsaod2ioda (Subprocess aborted)
        1587 - test_gdasapp_util_icecabi2ioda (Subprocess aborted)
        1588 - test_gdasapp_util_icecamsr2ioda (Subprocess aborted)
        1589 - test_gdasapp_util_icecmirs2ioda (Subprocess aborted)
        1590 - test_gdasapp_util_icecjpssrr2ioda (Subprocess aborted)
        1944 - test_gdasapp_fv3jedi_fv3inc (Failed)
        1945 - test_gdasapp_snow_create_ens (Failed)
        1946 - test_gdasapp_snow_imsproc (Failed)
        1947 - test_gdasapp_snow_apply_jediincr (Failed)
        1948 - test_gdasapp_snow_letkfoi_snowda (Failed)
Errors while running CTest
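For long summaries like the one above, it can help to pull just the failed test names out of the ctest output so they can be rerun one at a time. The following is a minimal sketch; the helper name `failed_tests` is hypothetical, and it assumes the summary keeps the `NNNN - name (reason)` layout shown in this log.

```shell
# Hypothetical helper: read ctest output on stdin and print the names
# listed under "The following tests FAILED:", one per line.
failed_tests() {
    awk '
        /^The following tests FAILED:/ { in_list = 1; next }
        # Entries look like "   1581 - test_name (Failed)".
        in_list && $2 == "-" && $1 ~ /^[0-9]+$/ { print $3; next }
        # Any other line ends the list.
        in_list { in_list = 0 }
    '
}

# Example: ctest 2>&1 | failed_tests
```

Each name can then be passed to `ctest -R <name> -VV` for verbose output.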

@RussTreadon-NOAA
Contributor Author

Rerun test_gdasapp_util_ghrsst2ioda with -VV. Output contains

1583: Test command: /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/GDASApp/test/build/bin/gdas_obsprovider2ioda.x "../testinput/gdas_ghrsst2ioda.yaml"
1583: Environment variables:
1583:  OMP_NUM_THREADS=1
1583: Test timeout computed to be: 1500
1583: OOPS Starting 2024-10-24 13:15:13 (UTC-0400)
1583: [TestReference] Comparing to reference file: testref/ghrsst2ioda.test
1583: Relative float tolerance for tests : 1e-06
1583: Absolute float tolerance for tests : 0
1583: [TestReference] Saving Test output to: testoutput/ghrsst2ioda.test
1583: Configuration input file is: ../testinput/gdas_ghrsst2ioda.yaml
1583: Full configuration is:YAMLConfiguration[path=../testinput/gdas_ghrsst2ioda.yaml, root={provider => GHRSST , window begin => 2021-03-24T15:00:00Z , window end => 2021-03-24T21:00:00Z , binning => {stride => 2 , min number of obs => 1} , bounds => {min => -3 , max => 50} , output file => ghrsst_sst_ma_20210324.ioda.nc , input files => (ghrsst_sst_ma_202103241540.nc4,ghrsst_sst_ma_202103241550.nc4) , test => {reference filename => testref/ghrsst2ioda.test , test output filename => testoutput/ghrsst2ioda.test , float relative tolerance => 1e-06}}]
1583: OOPS_STATS ObjectCountHelper started.
1583: OOPS_STATS Run start                                - Runtime:      1.01 sec,  Memory: total:    60.92 Mb, per task: min =    60.92 Mb, max =    60.92 Mb
1583: Run: Starting gdasapp::ObsProvider2IodaApp
1583: --- Window begin: 2021-03-24T15:00:00Z
1583: --- Window end: 2021-03-24T21:00:00Z
1583: --- Input files: [ghrsst_sst_ma_202103241540.nc4,ghrsst_sst_ma_202103241550.nc4]
1583: --- Output files: ghrsst_sst_ma_20210324.ioda.nc
1583: Processing files provided by GHRSST
1583: Exception: No such file or directory
1583: file: ncFile.cpp  line:88
1583: Exception: gdasapp::ObsProvider2IodaApp terminating...
1583: Exception: level: 0
1583: No such file or directory
1583: file: ncFile.cpp  line:88
1583: terminate called after throwing an instance of 'oops::TestReferenceMissingTestLineError'
1583:   what():  TestReference: Missing test output line corresponding to reference file Line#:1
1583: Ref line: 'Reading: [ghrsst_sst_ma_202103241540.nc4,ghrsst_sst_ma_202103241550.nc4]'
1/1 Test #1583: test_gdasapp_util_ghrsst2ioda ....Subprocess aborted***Exception:   7.88 sec

0% tests passed, 1 tests failed out of 1
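The "No such file or directory" exception in the log above is consistent with the listed input files not being present in the test directory. A preflight check along the following lines could make that visible before the converter runs. This is a sketch only; the helper name `missing_inputs` is hypothetical and is not part of GDASApp.

```shell
# Hypothetical preflight: print each argument that is not an existing file.
# Returns non-zero if anything is missing.
missing_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}

# Example, using the file names from the log above:
# missing_inputs ghrsst_sst_ma_202103241540.nc4 ghrsst_sst_ma_202103241550.nc4
```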

@DavidNew-NOAA , is this failure an example of what you mentioned during today's (10/24) JEDI-T2O meeting ("develop: works for non-marine DA apps")? Do we have a tentative fix?

@CoryMartin-NOAA
Contributor

@RussTreadon-NOAA I thought that was in reference to the gdas.cd hash update PR that, while breaking aero and snow, fixes soca?

@RussTreadon-NOAA
Contributor Author

test_gdasapp_util_prepdata failure

test_gdasapp_util_prepdata passes on Cactus, Hera, Hercules, and Orion but fails on Gaea C5, where ncgen fails for select cdl files with the message

1580: Generating icec_abi_g16_1.nc4
1580: ncgen: NetCDF: Not a valid data type or _FillValue type mismatch
1580:   (genbin.c:genbin_netcdf:131)
1580: Generating icec_abi_g16_2.nc4
1580: ncgen: NetCDF: Not a valid data type or _FillValue type mismatch
1580:   (genbin.c:genbin_netcdf:131)

The above failures correspond to the executions

ncgen -o icec_abi_g16_1.nc4 /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/GDASApp/test/bundle/gdas-utils/test/testdata/icec_abi_g16_1.cdl

and

ncgen -o icec_abi_g16_2.nc4 /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/GDASApp/test/bundle/gdas-utils/test/testdata/icec_abi_g16_2.cdl

I am currently building GDASApp on Gaea C6 to see if the same error occurs on C6.
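To see which cdl files trigger the error on a given machine, the two ncgen commands above can be generalized to a loop over a directory. The sketch below is hedged: the helper name `find_bad_cdl` and the `NCGEN` override variable are hypothetical, and only the `ncgen -o <output> <input.cdl>` form quoted above is used.

```shell
# Hypothetical helper: run ncgen on every .cdl file in DIR, writing the
# .nc4 files into OUTDIR, and print the names of the .cdl files that fail.
# NCGEN can be overridden; it defaults to the ncgen found on PATH.
find_bad_cdl() {
    dir=$1
    outdir=$2
    for cdl in "$dir"/*.cdl; do
        [ -e "$cdl" ] || continue      # directory holds no .cdl files
        base=$(basename "$cdl" .cdl)
        if ! "${NCGEN:-ncgen}" -o "$outdir/$base.nc4" "$cdl" >/dev/null 2>&1; then
            echo "$base.cdl"
        fi
    done
}

# Example: find_bad_cdl path/to/testdata /tmp/ncgen_out
```

Running the same loop on a machine where the test passes would show whether the failing set is specific to the netCDF build on Gaea.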

Tagging @apchoiCMD for awareness.

@apchoiCMD
Collaborator

@RussTreadon-NOAA Thanks for letting me know. I have never used the Gaea machines. Yes, if prepdata fails, all the ioda converters fail because they have no inputs. Let me see what I can do.

@RussTreadon-NOAA
Contributor Author

Failed test_gdasapp tests

Rerun failed jobs

        1943 - test_gdasapp_fv3jedi_fv3inc (Failed)
        1946 - test_gdasapp_snow_apply_jediincr (Failed)
        1947 - test_gdasapp_snow_letkfoi_snowda (Failed)

Each of these jobs fails on Gaea C5 for the same reason

1943: srun: error: Unable to allocate resources: No partition specified or system default partition
16/23 Test #1943: test_gdasapp_fv3jedi_fv3inc .............***Failed    0.07 sec
srun: error: Unable to allocate resources: No partition specified or system default partition
1946: srun: error: Unable to allocate resources: No partition specified or system default partition
19/23 Test #1946: test_gdasapp_snow_apply_jediincr ........***Failed    0.69 sec
do_snowDA: calling apply snow increment
srun: error: Unable to allocate resources: No partition specified or system default partition
1947: do_snowDA: calling fv3-jedi
1947: + srun --export=ALL -n 6 /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/GDASApp/test/bundle/gdas/build/bin/gdas.x fv3jedi localensembleda letkf_snow.yaml
1947: srun: error: Unable to allocate resources: No partition specified or system default partition
20/23 Test #1947: test_gdasapp_snow_letkfoi_snowda ........***Failed    1.72 sec

As a test, I copied test/snow/apply_jedi_incr.sh to test.sh, added an #SBATCH preamble to test.sh, and submitted it with sbatch test.sh. The job submits and runs. I checked the environment to ensure the SLURM and SBATCH variables are set, but ctest -R test_gdasapp_snow_letkfoi_snowda still fails.

I have not yet found the proper combination of environment variables to set to get these three jobs to run as ctests.
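For reference, the batch-script experiment described above can be sketched as a small helper that prepends an `#SBATCH` preamble to an existing script. The helper name `add_sbatch_preamble` is hypothetical. The `--ntasks=6` value mirrors the `srun -n 6` call in the log, and the `--time` limit and the `#!/bin/bash` shebang are assumptions. The actual preamble used in the test is not shown in this thread.

```shell
# Hypothetical helper: write DST as SRC with an #SBATCH preamble on top.
# sbatch only honors #SBATCH lines that precede the first command.
add_sbatch_preamble() {
    src=$1
    dst=$2
    account=$3
    qos=$4
    {
        echo '#!/bin/bash'
        echo "#SBATCH --account=$account"
        echo "#SBATCH --qos=$qos"
        echo '#SBATCH --ntasks=6'        # matches "srun -n 6" in the log
        echo '#SBATCH --time=00:10:00'   # assumed limit, adjust as needed
        # Drop the original shebang, if any, so only one remains.
        sed '1{/^#!/d;}' "$src"
    } > "$dst"
}

# Example: add_sbatch_preamble test/snow/apply_jedi_incr.sh test.sh nggps_emc debug
#          sbatch test.sh
```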

Tagging @CoryMartin-NOAA , @danholdaway , @guillaumevernieres , and @DavidNew-NOAA for awareness

@RussTreadon-NOAA
Contributor Author

test_gdasapp_util_2ioda failures

Looking at the test_gdasapp_util_*2ioda tests, it is not clear whether the failures are due to

  1. issues with the individual jobs, or
  2. failure of test_gdasapp_util_prepdata

@apchoiCMD , do ctests

        1582 - test_gdasapp_util_ghrsst2ioda
        1583 - test_gdasapp_util_smap2ioda
        1584 - test_gdasapp_util_smos2ioda
        1585 - test_gdasapp_util_viirsaod2ioda
        1586 - test_gdasapp_util_icecabi2ioda
        1587 - test_gdasapp_util_icecamsr2ioda
        1588 - test_gdasapp_util_icecmirs2ioda
        1589 - test_gdasapp_util_icecjpssrr2ioda

depend on output from test_gdasapp_util_prepdata?

FYI, the Gaea C6 build is complete. The above jobs fail on both Gaea C5 and C6.

@apchoiCMD
Collaborator

depend on output from test_gdasapp_util_prepdata? -> Yes, that is the main issue.
I am now building GDASApp on Gaea C5. I will get it fixed and report back ASAP!

@RussTreadon-NOAA
Contributor Author

Install feature/gaea-c5 at ed79b92 on Gaea C5. Run ctests with following results

Test project /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/GDASApp/gaea-c5/build
      Start 1578: test_gdasapp_util_coding_norms
 1/23 Test #1578: test_gdasapp_util_coding_norms ..........   Passed    4.65 sec
      Start 1579: test_gdasapp_util_ioda_example
 2/23 Test #1579: test_gdasapp_util_ioda_example ..........   Passed    0.27 sec

...

      Start 1949: test_gdasapp_convert_bufr_adpsfc
22/23 Test #1949: test_gdasapp_convert_bufr_adpsfc ........   Passed    8.16 sec
      Start 1950: test_gdasapp_convert_gsi_satbias
23/23 Test #1950: test_gdasapp_convert_gsi_satbias ........   Passed   15.25 sec

48% tests passed, 12 tests failed out of 23

Label Time Summary:
gdas-utils    =   7.49 sec*proc (12 tests)
script        =   7.49 sec*proc (12 tests)

Total Test time (real) =  62.63 sec

The following tests FAILED:
        1580 - test_gdasapp_util_prepdata (Failed)
        1582 - test_gdasapp_util_ghrsst2ioda (Subprocess aborted)
        1583 - test_gdasapp_util_smap2ioda (Subprocess aborted)
        1584 - test_gdasapp_util_smos2ioda (Subprocess aborted)
        1585 - test_gdasapp_util_viirsaod2ioda (Subprocess aborted)
        1586 - test_gdasapp_util_icecabi2ioda (Subprocess aborted)
        1587 - test_gdasapp_util_icecamsr2ioda (Subprocess aborted)
        1588 - test_gdasapp_util_icecmirs2ioda (Subprocess aborted)
        1589 - test_gdasapp_util_icecjpssrr2ioda (Subprocess aborted)
        1943 - test_gdasapp_fv3jedi_fv3inc (Failed)
        1946 - test_gdasapp_snow_apply_jediincr (Failed)
        1947 - test_gdasapp_snow_letkfoi_snowda (Failed)
Errors while running CTest

The reasons for the failures are the same as reported earlier

test_gdasapp_util_prepdata fails with

      Start 1580: test_gdasapp_util_prepdata
 3/23 Test #1580: test_gdasapp_util_prepdata ..............***Failed    0.32 sec
Not running on hera, skipping anaconda module loading.
Generating rads_adt_3a_2021181.nc4
Generating rads_adt_3b_2021181.nc4
Generating icec_abi_g16_1.nc4
ncgen: NetCDF: Not a valid data type or _FillValue type mismatch
        (genbin.c:genbin_netcdf:131)

The test_gdasapp_util_*2ioda failures may be resolved when test_gdasapp_util_prepdata works.

test_gdasapp_fv3jedi_fv3inc, test_gdasapp_snow_apply_jediincr, and test_gdasapp_snow_letkfoi_snowda fail with

srun: error: Unable to allocate resources: No partition specified or system default partition

Each of these tests runs an MPI executable using srun -n 6. On Hera, Hercules, and Orion, we set the following environment variables before running ctests

export OMP_NUM_THREADS=1
export SLURM_ACCOUNT=da-cpu
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SLURM_QOS=debug

On Hercules and Orion we also specify ulimit -s unlimited.

For the Gaea C5 tests, I set

ulimit -s unlimited
export OMP_NUM_THREADS=1
export SLURM_ACCOUNT=nggps_emc
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SLURM_QOS=debug

Setting additional environment variables on Gaea C5 may allow test_gdasapp_fv3jedi_fv3inc, test_gdasapp_snow_apply_jediincr, and test_gdasapp_snow_letkfoi_snowda to successfully run.
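One candidate for such an additional variable is `SLURM_PARTITION`, which the srun documentation lists as the environment equivalent of `--partition`. Whether setting it resolves the error on Gaea C5 is untested here, and the correct partition name is not given in this thread. The sketch below only reports which of a set of variables are unset; the helper name `report_unset` is hypothetical.

```shell
# Hypothetical helper: report which of the named variables are unset or empty.
report_unset() {
    for name in "$@"; do
        # Indirect lookup via eval; names are expected to be plain identifiers.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
        fi
    done
}

# Example: report_unset SLURM_ACCOUNT SLURM_QOS SLURM_PARTITION
```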

@RussTreadon-NOAA
Contributor Author

@CoryMartin-NOAA , this PR is ready for review. The slurm / sbatch environment variable issue is outside the repo ... but I'm fine with keeping this PR open if we're concerned that we'll forget to follow up & resolve this after merging.

Contributor

@CoryMartin-NOAA CoryMartin-NOAA left a comment

Looks good @RussTreadon-NOAA, thanks also for the CRTM data fix.

I think we should merge. The slurm dependency is a larger issue and needs to be resolved for both WCOSS and Gaea.

@RussTreadon-NOAA
Contributor Author

Good point. We need to step away from the assumption that slurm is the workload manager.
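One way to avoid hard-coding srun is to pick the launcher at run time. The sketch below is an illustration under assumptions, not a proposal from this thread: the helper name `pick_launcher` is hypothetical, and the candidate list (srun, then mpiexec, then mpirun) is a guess at what non-Slurm systems such as WCOSS provide.

```shell
# Hypothetical helper: print the first MPI launcher found on PATH.
# Returns non-zero if none of the candidates is available.
pick_launcher() {
    for cand in srun mpiexec mpirun; do
        if command -v "$cand" >/dev/null 2>&1; then
            echo "$cand"
            return 0
        fi
    done
    return 1
}

# Example: launcher=$(pick_launcher) || { echo "no MPI launcher found"; exit 1; }
```

The task-count flag differs between launchers, so a real replacement would also need to map `-n 6` to the chosen launcher's syntax.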

@apchoiCMD
Collaborator

@RussTreadon-NOAA Even when test_gdasapp_util_prepdata runs, some failures remain because of the ncgen issue on Gaea

Test project /ncrc/home2/Mindo.Choi/GDASApp/build/gdas-utils
      Start  1: test_gdasapp_util_coding_norms
 1/12 Test  #1: test_gdasapp_util_coding_norms ......   Passed    4.45 sec
      Start  2: test_gdasapp_util_ioda_example
 2/12 Test  #2: test_gdasapp_util_ioda_example ......   Passed    0.52 sec
      Start  3: test_gdasapp_util_prepdata
 3/12 Test  #3: test_gdasapp_util_prepdata ..........***Failed    0.33 sec
      Start  4: test_gdasapp_util_rads2ioda
 4/12 Test  #4: test_gdasapp_util_rads2ioda .........   Passed    0.23 sec
      Start  5: test_gdasapp_util_ghrsst2ioda
 5/12 Test  #5: test_gdasapp_util_ghrsst2ioda .......***Exception: SegFault  0.24 sec
      Start  6: test_gdasapp_util_smap2ioda
 6/12 Test  #6: test_gdasapp_util_smap2ioda .........   Passed    0.24 sec
      Start  7: test_gdasapp_util_smos2ioda
 7/12 Test  #7: test_gdasapp_util_smos2ioda .........   Passed    0.24 sec
      Start  8: test_gdasapp_util_viirsaod2ioda
 8/12 Test  #8: test_gdasapp_util_viirsaod2ioda .....***Exception: SegFault  0.24 sec
      Start  9: test_gdasapp_util_icecabi2ioda
 9/12 Test  #9: test_gdasapp_util_icecabi2ioda ......Subprocess aborted***Exception:   4.41 sec
      Start 10: test_gdasapp_util_icecamsr2ioda
10/12 Test #10: test_gdasapp_util_icecamsr2ioda .....   Passed    0.23 sec
      Start 11: test_gdasapp_util_icecmirs2ioda
11/12 Test #11: test_gdasapp_util_icecmirs2ioda .....   Passed    0.23 sec
      Start 12: test_gdasapp_util_icecjpssrr2ioda
12/12 Test #12: test_gdasapp_util_icecjpssrr2ioda ...   Passed    0.22 sec

67% tests passed, 4 tests failed out of 12

Label Time Summary:
gdas-utils    =  11.58 sec*proc (12 tests)
script        =  11.58 sec*proc (12 tests)

Total Test time (real) =  11.64 sec

The following tests FAILED:
          3 - test_gdasapp_util_prepdata (Failed)
          5 - test_gdasapp_util_ghrsst2ioda (SEGFAULT)
          8 - test_gdasapp_util_viirsaod2ioda (SEGFAULT)
          9 - test_gdasapp_util_icecabi2ioda (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /ncrc/home2/Mindo.Choi/GDASApp/build/gdas-utils/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

I will look into this in more detail once my current priority is done. I do not yet know what is wrong with test_gdasapp_util_ghrsst2ioda or test_gdasapp_util_viirsaod2ioda.

@RussTreadon-NOAA
Contributor Author

Thank you @apchoiCMD . This is progress!

Development

Successfully merging this pull request may close these issues.

Add esmf to gaea.intel.lua
3 participants