[develop] Integrate smoke/dust capability of main_aqm into develop #1185

Merged
merged 20 commits into from
Jan 31, 2025

Conversation

chan-hoo
Collaborator

@chan-hoo chan-hoo commented Jan 27, 2025

DESCRIPTION OF CHANGES:

  • Integrate the smoke and dust capability of the main-aqm branch into the develop branch of the UFS SRW App.
  • This capability works on Hera, Orion, Hercules, and Gaea-C6.
  • A we2e test for smoke/dust is added:
./run_WE2E_tests.py -m [hera/orion/hercules/gaea-c6] -a [project/name] -t smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf
  • A sample configuration YAML file config.smoke_dust.yaml is added to the ush directory:
cd ush
cp config.smoke_dust.yaml config.yaml
vim config.yaml
(check MACHINE, ACCOUNT, EXTRN_MDL_SOURCE_BASEDIR_ICS, and EXTRN_MDL_SOURCE_BASEDIR_LBCS)
  • Update the UFS Weather Model hash to 3a5e52e, which includes the changes from the RRFS production branch.
  • Update the UPP hash to the release/srw-v3.0.0 branch of the NOAA-EPIC fork of UPP.
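
A minimal sketch of the "check MACHINE, ACCOUNT, ..." step above, assuming the settings appear as plain `KEY: value` lines somewhere in config.yaml. The helper name `find_unset_keys`, the section names, and the sample values are hypothetical; the scan ignores YAML nesting and is not part of the SRW App.

```python
# Keys the description above says to check before running the workflow.
REQUIRED_KEYS = (
    "MACHINE",
    "ACCOUNT",
    "EXTRN_MDL_SOURCE_BASEDIR_ICS",
    "EXTRN_MDL_SOURCE_BASEDIR_LBCS",
)


def find_unset_keys(yaml_text, required=REQUIRED_KEYS):
    """Return required keys that are missing or empty (flat line scan, not a YAML parser)."""
    values = {}
    for line in yaml_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in stripped:
            continue
        key, value = stripped.split(":", 1)
        values[key.strip()] = value.strip().strip("'\"")
    return [key for key in required if not values.get(key)]


# Hypothetical config contents, for illustration only.
sample = """
user:
  MACHINE: hera
  ACCOUNT: ""
task_get_extrn_ics:
  EXTRN_MDL_SOURCE_BASEDIR_ICS: /path/to/ics
"""
unset = find_unset_keys(sample)
print(unset)
```

Here ACCOUNT is reported because it is empty and EXTRN_MDL_SOURCE_BASEDIR_LBCS because it is absent.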

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • derecho.intel
  • gaea.intel
  • gaea-c6.intel
  • hera.gnu
  • hera.intel
  • hercules.intel
  • jet.intel
  • orion.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite (Hera)
  • comprehensive tests (specify which if a subset was used)

ISSUE:

Resolves the issue described in #1111

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS:

@BenKoziol-NOAA

@chan-hoo
Collaborator Author

chan-hoo commented Jan 29, 2025

@BenKoziol-NOAA, do you have any idea which Python package causes this error on Derecho?

The AQM WE2E test was run on Derecho. The tests failed in make_sfc_climo with:

FATAL ERROR: ERROR IN NF90_CREATE: Permission denied

@chan-hoo
Collaborator Author

@MichaelLueken, if so, I have an idea. What do you think of creating a separate conda environment file, such as sd_environment.yml, used only for the smoke_dust task?

@MichaelLueken
Collaborator

Sure, a separate sd_environment.yml file can be added to create the conda environment for smoke and dust capabilities. I'll continue checking to see if there are modifications that can be made to the current environment.yml that will allow the AQM WE2E test to run on Derecho and still allow the smoke and dust WE2E test to run on Hercules and Orion.

@benkozi
Collaborator

benkozi commented Jan 29, 2025

> FATAL ERROR: ERROR IN NF90_CREATE: Permission denied

@chan-hoo Do you have a traceback? It's not necessarily a Python package.

@chan-hoo
Collaborator Author

> FATAL ERROR: ERROR IN NF90_CREATE: Permission denied
>
> @chan-hoo Do you have a traceback? It's not necessarily a Python package.

@MichaelLueken, can you answer Ben's question?

@MichaelLueken
Collaborator

@chan-hoo @benkozi I don't have the old runs that were failing yesterday, so no, I don't have any logs currently with the tracebacks.

@MichaelLueken
Collaborator

@benkozi My current run has failed on Derecho. Please see the experiment:

/glade/derecho/scratch/mlueken/ufs-srweather-app/expt_dirs/aqm_grid_AQM_NA13km_suite_GFS_v16

@benkozi
Collaborator

benkozi commented Jan 29, 2025

> My current run has failed on Derecho. Please see the experiment:

@MichaelLueken I don't see anything in the log to indicate it's related to the Python environment. My understanding is that the conda environment is not used for this task. Am I interpreting the modules correctly?

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) ncarenv/23.09

Loading modules for task "make_sfc_climo" ...

Currently Loaded Modules:
  1) conda                        30) crtm-fix/2.4.0.1_emc
  2) python_srw                   31) git-lfs/3.3.0
  3) ncarenv/23.09           (S)  32) crtm/2.4.0.1
  4) intel-classic/2023.2.1       33) g2/3.5.1
  5) stack-intel/2021.10.0        34) g2tmpl/1.13.0
  6) craype/2.7.20                35) ip/4.3.0
  7) cray-mpich/8.1.25            36) sp/2.5.0
  8) libfabric/1.15.2.0           37) w3emc/2.10.0
  9) cray-pals/1.2.11             38) gftl/1.10.0
 10) stack-cray-mpich/8.1.25      39) gftl-shared/1.6.1
 11) nghttp2/1.57.0               40) fargparse/1.5.0
 12) curl/8.4.0                   41) tar/1.34
 13) cmake/3.23.1                 42) gettext/0.21.1
 14) libjpeg/2.1.0                43) libxcrypt/4.4.35
 15) jasper/2.0.32                44) sqlite/3.43.2
 16) zlib/1.2.13                  45) util-linux-uuid/2.38.1
 17) libpng/1.6.37                46) python/3.10.13
 18) snappy/1.1.10                47) mapl/2.40.3-esmf-8.6.0
 19) zstd/1.5.2                   48) nemsio/2.5.4
 20) c-blosc/1.21.5               49) sfcio/1.4.1
 21) pkg-config/0.29.2            50) sigio/2.3.2
 22) hdf5/1.14.0                  51) w3nco/2.4.1
 23) netcdf-c/4.9.2               52) wrf-io/1.2.0
 24) netcdf-fortran/4.6.1         53) gmake/4.2.1
 25) parallel-netcdf/1.12.2       54) wgrib2/2.0.8
 26) parallelio/2.5.10            55) srw_common
 27) esmf/8.6.0                   56) prod_util/2.1.1
 28) fms/2024.01.02               57) build_derecho_intel
 29) bacio/2.4.1

This looks like a run-of-the-mill permissions issue - nothing related to runtime linking, etc. The simple test is to swap out the environment file for the old one and see if the issue persists.

@MichaelLueken
Collaborator

@benkozi The old environment.yml file allows the test to pass without issue. The updated environment.yml in this PR results in the failure. In ush/load_modules_run_task.sh, the srw_app conda environment is activated as long as neither an AQM nor a plotting task is being prepared.

@benkozi
Collaborator

benkozi commented Jan 29, 2025

> The old environment.yml file allows the test to pass without issue. The updated environment.yml in this PR results in the failure. In ush/load_modules_run_task.sh, the srw_app conda environment is activated as long as neither an AQM nor a plotting task is being prepared.

@MichaelLueken - Okay, thanks. In that case, it's probably a conflict with the MPI that's being used in the new environment file. In my opinion, the smoke/dust environment file should be split out, as was suggested. Having multiple MPI and ESMF installations loaded will likely cause some chaos.
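
One way to look for the kind of conflict described above is to list which packages the updated environment file adds and flag the ones that look MPI- or ESMF-related. This is only a sketch: the function names, the keyword list, and both sample files are invented for illustration and do not reflect the actual contents of environment.yml in this PR.

```python
import re


def dependency_names(env_text):
    """Collect package names listed under 'dependencies:' in a conda env file (line scan only)."""
    names = set()
    section = None
    for line in env_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not stripped.startswith("-"):
            section = stripped.split(":", 1)[0]  # e.g. name, channels, dependencies
            continue
        if section != "dependencies":
            continue
        spec = stripped[1:].strip()
        # Drop version pins such as '=1.2', '>=1.2', '<2'.
        name = re.split(r"[=<>! ]", spec, maxsplit=1)[0].rstrip(":")
        if name:
            names.add(name.lower())
    return names


def added_runtime_suspects(old_text, new_text, keywords=("mpi", "esmf", "esmpy")):
    """Return newly added packages whose names contain one of the keywords."""
    added = dependency_names(new_text) - dependency_names(old_text)
    return sorted(n for n in added if any(k in n for k in keywords))


# Invented example files, not the real environment.yml contents.
old_env = "name: srw_app\nchannels:\n  - conda-forge\ndependencies:\n  - python=3.11\n  - pyyaml\n"
new_env = old_env + "  - esmpy\n  - openmpi\n"
suspects = added_runtime_suspects(old_env, new_env)
print(suspects)
```

A non-empty result would only be a hint about where to look; it does not prove that a package is the cause of the NF90_CREATE failure.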

@MichaelLueken
Collaborator

@benkozi @chan-hoo In that case, let's try adding an sd_environment.yml conda environment. This should hopefully allow all tasks to run on the various machines without overwriting MPI.
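
The per-task environment choice being discussed can be sketched roughly as follows. The environment names srw_app, srw_graphics, and srw_sd come from this thread; srw_aqm and the name-matching rules are assumptions, and all task names other than make_sfc_climo are hypothetical. The real logic lives in ush/load_modules_run_task.sh and may differ.

```python
def select_conda_env(task_name):
    """Pick a conda environment for a workflow task (illustrative matching rules only)."""
    name = task_name.lower()
    if "smoke_dust" in name:
        return "srw_sd"        # separate environment so its MPI/ESMF stay isolated
    if name.startswith("plot"):
        return "srw_graphics"
    if "aqm" in name:
        return "srw_aqm"       # assumed name, not stated in this thread
    return "srw_app"           # default for all other tasks


for task in ("make_sfc_climo", "smoke_dust", "plot_allvars", "aqm_ics"):
    print(task, "->", select_conda_env(task))
```

Keeping the smoke/dust packages out of srw_app means tasks such as make_sfc_climo keep running against the environment that was already known to work.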

@chan-hoo
Collaborator Author

@MichaelLueken, I've changed the conda env for smoke/dust. The change works well on Hera and Gaea-C6. Can you run the AQM we2e on Derecho again?

@MichaelLueken
Collaborator

Thanks, @chan-hoo! I'm running the AQM on Derecho and smoke and dust on Hercules. Everything is looking good so far.

@MichaelLueken
Collaborator

Following the addition of the sd_environment.yml conda environment (srw_sd) at c1ae41e, the AQM WE2E test is now passing on Derecho:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20250129131458                   COMPLETE            3381.64
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            3381.64

The smoke and dust WE2E test was also run on Hercules and passes as well:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf_20250129141539        COMPLETE            1805.63
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1805.63
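
For anyone who wants to check summaries like the two above programmatically, here is a small sketch that pulls the experiment rows out of the table text. The function name is hypothetical and the parsing assumes each data row has exactly three whitespace-separated fields, as in the output shown here.

```python
def parse_we2e_summary(text):
    """Return (experiment, status, core_hours) tuples, skipping header, rule, and Total lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] == "Total":
            continue
        name, status, hours = parts
        try:
            rows.append((name, status, float(hours)))
        except ValueError:
            continue  # not a data row
    return rows


summary = """
Experiment name                                                  | Status    | Core hours used
smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf_20250129141539        COMPLETE            1805.63
Total                                                              COMPLETE            1805.63
"""
rows = parse_we2e_summary(summary)
all_complete = all(status == "COMPLETE" for _, status, _ in rows)
print(rows, all_complete)
```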

Since the tests are passing and @chan-hoo has addressed the documentation updates, I will now approve these changes.

@chan-hoo
Collaborator Author

@MichaelLueken , thank you for your test and approval !!!

@chan-hoo
Collaborator Author

@benkozi, thank you for your approval!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jan 30, 2025
@MichaelLueken
Collaborator

On Gaea-C6, Hera, and Jet, the conda environment failed to build, causing the Functional WorkflowTaskTests and Functional UnitTests to fail. I will relaunch these jobs in Jenkins and hopefully they will behave this time.

@chan-hoo
Collaborator Author

chan-hoo commented Jan 30, 2025

@MichaelLueken, do you think it would be better to add a sleep command between the conda commands in devbuild.sh, like this?

mamba env create
sleep 5
mamba env create

@MichaelLueken
Collaborator

@chan-hoo Looking at the log files, it never made it that far. It looks like there was an error immediately after attempting to build miniforge3, which resulted in conda never being built, let alone srw_app, srw_graphics, and srw_sd. It seems to just be a hiccup while attempting to build on these three machines and hopefully rerunning the tests will allow the build to succeed.
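
For transient failures like this, whether in the miniforge3 step or in the later environment builds, a generic retry-with-delay wrapper is one option. The sketch below is not what devbuild.sh does; the function name, the attempt count, and the 5-second delay are assumptions, and the demonstration runs a harmless Python command rather than a real build step.

```python
import subprocess
import sys
import time


def run_with_retry(command, attempts=3, delay_seconds=5.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run command until it exits with status 0; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # pause before trying again
    raise RuntimeError(f"command failed after {attempts} attempts: {command}")


# Harmless stand-in command so the example runs anywhere.
attempts_used = run_with_retry([sys.executable, "-c", "pass"])
print(attempts_used)

# Hypothetical use with an environment build (how devbuild.sh invokes mamba is not shown here):
# run_with_retry(["mamba", "env", "create", "-f", "sd_environment.yml"])
```

A retry only helps when the failure is intermittent, such as a network or hosting outage; a deterministic build error would simply fail on every attempt.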

@MichaelLueken
Collaborator

The automated Jenkins tests have successfully passed on Derecho, Gaea, Hercules, and Orion.

The Gaea-C6, Hera GNU, Hera Intel, and Jet tests failed to properly clone miniforge3 and failed to build the necessary conda environments, leading to failure on these platforms.

The Gaea-C6, Hera, and Jet tests have been requeued in Jenkins.

@MichaelLueken
Collaborator

I suspect that the issues with conda that occurred on Gaea-C6, Hera, and Jet this morning were due to the issues with GitHub. The rerun on Gaea-C6 has successfully built the conda environments and the Functional WorkflowTaskTests are running without issue. I hope to be able to merge this work tomorrow morning, once all tests have completed.

@MichaelLueken
Collaborator

The reruns of the automated Jenkins tests on Gaea-C6, Hera GNU, Hera Intel, and Jet have all passed.

Moving forward with merging this PR now.

@MichaelLueken MichaelLueken merged commit ad926b4 into ufs-community:develop Jan 31, 2025
4 of 6 checks passed
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

Successfully merging this pull request may close these issues.

Port RRFS-SD features
3 participants