Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fms.parallel startup #1477

Merged
merged 5 commits into from
Mar 29, 2024
Merged

Fms.parallel startup #1477

merged 5 commits into from
Mar 29, 2024

Conversation

dkokron
Copy link
Contributor

@dkokron dkokron commented Mar 11, 2024

This PR is a replacement for #1405 which I closed accidentally.

NetCDF-4, using the HDF5 file layout, has the ability to do parallel I/O in two different modes. The two modes are referred to as “independent” while the second mode is referred to as “collective”. The collective mode has been tested with a few NOAA workloads and shown to provide substantial improvement in job startup time while reducing negative impact on the underlying Lustre file system.

This PR does not address parallel I/O via pNetCDF.

This PR adds an option to enable collective reads. The user controls that option via settings in input.nml. The default behavior is unchanged, the user has to activate collective reads using the settings in input.nml.

Fixes # 1322
#1322

How Has This Been Tested?
I have run a RRFS (regional) case on WCOSS2 with and without collective reads activated. The resulting binary restart files are zero-diff.
I have not yet run a regional HAFS case or the UFS model on a full cube.

The compile time environment used to compile FMS was:
Currently Loaded Modules:

  1. craype-x86-rome (H) 3) craype-network-ofi (H) 5) PrgEnv-intel/8.3.3 7) intel/19.1.3.304 9) cray-mpich/8.1.19 11) hpc-intel/19.1.3.304 13) hdf5/1.14.1
  2. libfabric/1.11.0.0. (H) 4) envvar/1.0 6) cmake/3.20.2 8) craype/2.7.17 10) hpc/1.2.0 12) hpc-cray-mpich/8.1.19 14) netcdf/4.9.2

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • New check tests, if applicable, are included
  • make distcheck passes

@dkokron
Copy link
Contributor Author

dkokron commented Mar 11, 2024

Further testing of this proposal depends on availability of Acorn which is scheduled to be in dedicated time for about 2 weeks starting today.

fms2_io/netcdf_io.F90 Outdated Show resolved Hide resolved
fms2_io/netcdf_io.F90 Outdated Show resolved Hide resolved
@dkokron
Copy link
Contributor Author

dkokron commented Mar 11, 2024 via email

@rem1776
Copy link
Contributor

rem1776 commented Mar 12, 2024

@dkokron Just wanted to give you a heads up, it looks like this failed the linter CI check for our style guidelines. This will just need a line length fixed to be less than 120 characters, looks like its in fms2_io/netcdf_io.F90.

fms2_io/netcdf_io.F90 Outdated Show resolved Hide resolved
fms2_io/netcdf_io.F90 Outdated Show resolved Hide resolved
fms2_io/netcdf_io.F90 Outdated Show resolved Hide resolved
@thomas-robinson
Copy link
Member

@dkokron are you still looking at a March 25 time frame for testing? I'm trying to plan out a testing tag schedule.

@dkokron
Copy link
Contributor Author

dkokron commented Mar 19, 2024

I'm not sure what you mean by testing. I've run the code (as of today) using all the UFS cases that I'm interested in testing.

@thomas-robinson
Copy link
Member

thomas-robinson commented Mar 21, 2024

Further testing of this proposal depends on availability of Acorn which is scheduled to be in dedicated time for about 2 weeks starting today.

@dkokron this is what I meant by testing. You indicated that Acorn would be available around March 25. If you are satisfied with this PR and the testing done on your side, we can complete our code reviews and schedule it for merging and regression testing on our side.

@dkokron
Copy link
Contributor Author

dkokron commented Mar 21, 2024

@thomas-robinson Acorn returned earlier than expected and I was able to get my testing completed yesterday.

Copy link
Contributor

@uramirez8707 uramirez8707 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try this update after removing the io_layout=1,1 'fix'? What is the difference in performance?

Comment on lines +152 to +156
logical :: use_collective = .false. !< Flag indicating if we should open the file for collective input
!! this should be set to .true. in the user application if they want
!! collective reads (put before open_file())
integer :: tile_comm=MPP_COMM_NULL !< MPI communicator used for collective reads.
!! To be replaced with a real communicator at user request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous testing with and without the io_layout=1,1 'fix' did not reveal anything. I should probably redo that testing.

As for how those are set, the user needs to set those in their application.

I have attached a 'patch' file for atmos_cubed_sphere/tools/external_ic.F90 as an example.
external.patch

I added similar changes to
FV3/atmos_cubed_sphere/tools/fv_io.F90 (fv_core.res,fv_tracer.res)
FV3/io/fv3atm_restart_io.F90 (Oro_restart, Sfc_restart, Phy_restart)

@dkokron
Copy link
Contributor Author

dkokron commented Mar 27, 2024

Is there anything more I need to do to move this forward?

Comment on lines +152 to +154
logical :: use_collective = .false. !< Flag indicating if we should open the file for collective input
!! this should be set to .true. in the user application if they want
!! collective reads (put before open_file())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uramirez8707 @thomas-robinson - does this comment/variable name make it clear this applies only to reads and not writes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says for collective input so I think that's clear

Comment on lines 674 to 676
err = nf90_get_att(fileobj%ncid, nf90_global, "_IsNetcdf4", IsNetcdf4)
err = nf90_close(fileobj%ncid)
if(IsNetcdf4 /= 1) then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is IsNetcdf4 not always set here? If it is going to be set here and then checked, why initialize it when it's declared?

If IsNetcdf4 is not always set by nf90_get_att, then if it equals 1 on one pass and is not set on the next, it will still be 1 because of the implied save by setting it when it's declared.

@bensonr bensonr removed the request for review from Scitech777 March 29, 2024 16:02
@rem1776 rem1776 merged commit f5d9892 into NOAA-GFDL:main Mar 29, 2024
17 of 19 checks passed
@MatthewPyle-NOAA
Copy link

@bensonr Has this change made it into any alpha type release? Trying to understand the path it will take to being deployed in an FMS release. Thanks!

@bensonr
Copy link
Contributor

bensonr commented Apr 18, 2024

@bensonr Has this change made it into any alpha type release? Trying to understand the path it will take to being deployed in an FMS release. Thanks!

It was initially released as part of the 2024.01 beta4 release. Let us know if you encounter any issues in using it.

@MatthewPyle-NOAA
Copy link

@bensonr Things are mostly looking good using this beta4 release (did have a regression test of the model fail with it, but don't believe FMS is responsible for that). What is the timeline for an official 2024.01 release? Thanks!

@thomas-robinson
Copy link
Member

@MatthewPyle-NOAA the release will (hopefully) be today or tomorrow pending our internal discussion at noon today. There will be a follow on patch in about a week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants