Fms.parallel startup #1477
Further testing of this proposal depends on the availability of Acorn, which is scheduled to be in dedicated time for about 2 weeks starting today.
Will do.
On Mon, Mar 11, 2024, 12:50 PM Rusty Benson wrote:
***@***.**** commented on this pull request.
------------------------------
In fms2_io/netcdf_io.F90
<#1477 (comment)>:
> + integer :: TileComm=MPI_COMM_NULL !< MPI communicator used for collective reads.
+ !! To be replaced with a real communicator at user request
You could define a variable MPP_COMM_NULL in mpp.F90, as was done for
MPP_INFO_NULL
<https://github.com/NOAA-GFDL/FMS/blob/main/mpp/mpp.F90#L1328-L1335> and
make it public
<https://github.com/NOAA-GFDL/FMS/blob/main/mpp/mpp.F90#L199>. This way
we can keep the MPI layer confined to mpp.
------------------------------
In fms2_io/netcdf_io.F90
<#1477 (comment)>:
> @@ -32,6 +32,7 @@ module netcdf_io_mod
use mpp_mod
use fms_io_utils_mod
use platform_mod
+use mpi, only: MPI_COMM_NULL
Remove pursuant to comment regarding MPI_COMM_NULL below.
@dkokron Just wanted to give you a heads up, it looks like this failed the linter CI check for our style guidelines. This will just need a line length fixed to be less than 120 characters, looks like it's in
@dkokron are you still looking at a March 25 time frame for testing? I'm trying to plan out a testing tag schedule.
I'm not sure what you mean by testing. I've run the code (as of today) using all the UFS cases that I'm interested in testing.
@dkokron this is what I meant by testing. You indicated that Acorn would be available around March 25. If you are satisfied with this PR and the testing done on your side, we can complete our code reviews and schedule it for merging and regression testing on our side.
@thomas-robinson Acorn returned earlier than expected and I was able to get my testing completed yesterday.
Did you try this update after removing the io_layout=1,1 'fix'? What is the difference in performance?
logical :: use_collective = .false. !< Flag indicating if we should open the file for collective input
                                    !! this should be set to .true. in the user application if they want
                                    !! collective reads (put before open_file())
integer :: tile_comm=MPP_COMM_NULL  !< MPI communicator used for collective reads.
                                    !! To be replaced with a real communicator at user request
How are these set?
Previous testing with and without the io_layout=1,1 'fix' did not reveal anything. I should probably redo that testing.
As for how those are set, the user needs to set those in their application.
I have attached a 'patch' file for atmos_cubed_sphere/tools/external_ic.F90 as an example.
external.patch
I added similar changes to
FV3/atmos_cubed_sphere/tools/fv_io.F90 (fv_core.res,fv_tracer.res)
FV3/io/fv3atm_restart_io.F90 (Oro_restart, Sfc_restart, Phy_restart)
Is there anything more I need to do to move this forward?
logical :: use_collective = .false. !< Flag indicating if we should open the file for collective input
                                    !! this should be set to .true. in the user application if they want
                                    !! collective reads (put before open_file())
@uramirez8707 @thomas-robinson - does this comment/variable name make it clear this applies only to reads and not writes?
It says for collective input so I think that's clear
fms2_io/netcdf_io.F90
Outdated
err = nf90_get_att(fileobj%ncid, nf90_global, "_IsNetcdf4", IsNetcdf4)
err = nf90_close(fileobj%ncid)
if(IsNetcdf4 /= 1) then
Is IsNetcdf4 not always set here? If it is going to be set here and then checked, why initialize it when it's declared?
If IsNetcdf4 is not always set by nf90_get_att, then if it equals 1 on one pass and is not set on the next, it will still be 1 because of the implied save that comes from setting it when it's declared.
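The implied-save concern can be reproduced without netCDF at all. Below is a minimal sketch, not the code from this PR: `probe` and its `attribute_found` argument are hypothetical stand-ins for the attribute lookup, and the only point is that a local variable initialized in its declaration keeps its value between calls.

```fortran
program implied_save_demo
  implicit none

  ! First call: the attribute is "found", so the flag becomes 1.
  print *, "call 1:", probe(.true.)
  ! Second call: the attribute is "missing", but the flag still holds
  ! the value left over from the first call.
  print *, "call 2:", probe(.false.)

contains

  !> Hypothetical stand-in for the check discussed above; it makes no netCDF calls.
  function probe(attribute_found) result(flag_out)
    logical, intent(in) :: attribute_found
    integer :: flag_out
    ! Initializing in the declaration gives the variable the SAVE attribute:
    ! it is set to 0 once at program start, not on every call.
    integer :: flag = 0

    ! Mimics a lookup that only writes the value when the attribute exists.
    if (attribute_found) flag = 1
    flag_out = flag
  end function probe

end program implied_save_demo
```

Both calls report 1. Declaring the variable without an initializer and assigning it as the first executable statement of the routine resets it on every call.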
@bensonr Has this change made it into any alpha type release? Trying to understand the path it will take to being deployed in an FMS release. Thanks!
It was initially released as part of the 2024.01 beta4 release. Let us know if you encounter any issues in using it.
@bensonr Things are mostly looking good using this beta4 release (did have a regression test of the model fail with it, but don't believe FMS is responsible for that). What is the timeline for an official 2024.01 release? Thanks!
@MatthewPyle-NOAA the release will (hopefully) be today or tomorrow pending our internal discussion at noon today. There will be a follow on patch in about a week.
This PR is a replacement for #1405 which I closed accidentally.
NetCDF-4, using the HDF5 file layout, has the ability to do parallel I/O in two different modes. The first mode is referred to as “independent” and the second as “collective”. The collective mode has been tested with a few NOAA workloads and shown to provide substantial improvement in job startup time while reducing negative impact on the underlying Lustre file system.
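For readers unfamiliar with the two modes, here is a minimal sketch of how collective access is selected through the netCDF Fortran 90 interface directly. It is not the FMS implementation from this PR. The file name `example.nc` and variable name `example_var` are hypothetical, and the sketch assumes a netCDF build with parallel HDF5 support and an MPI launcher.

```fortran
program collective_read_sketch
  use mpi
  use netcdf
  implicit none

  ! Hypothetical file and variable names, used for illustration only.
  character(len=*), parameter :: path    = "example.nc"
  character(len=*), parameter :: varname = "example_var"
  integer :: ierr, ncid, varid

  call MPI_Init(ierr)

  ! Passing comm and info asks netCDF to open the file for parallel access.
  call check(nf90_open(path, ior(nf90_nowrite, nf90_mpiio), ncid, &
                       comm=MPI_COMM_WORLD, info=MPI_INFO_NULL))
  call check(nf90_inq_varid(ncid, varname, varid))

  ! Access is independent unless changed; switch this variable to collective.
  ! After this call, every rank in the communicator must take part in each read.
  call check(nf90_var_par_access(ncid, varid, nf90_collective))

  ! nf90_get_var calls for this variable would go here, issued by all ranks.

  call check(nf90_close(ncid))
  call MPI_Finalize(ierr)

contains

  !> Abort all ranks with the netCDF error message if a call failed.
  subroutine check(status)
    integer, intent(in) :: status
    integer :: abort_err
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      call MPI_Abort(MPI_COMM_WORLD, 1, abort_err)
    end if
  end subroutine check

end program collective_read_sketch
```

The sketch uses `MPI_COMM_WORLD` for brevity. The `tile_comm` field discussed in the review suggests that this PR instead expects a communicator supplied by the user application.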
This PR does not address parallel I/O via pNetCDF.
This PR adds an option to enable collective reads, controlled via settings in input.nml. The default behavior is unchanged; collective reads must be activated explicitly.
Fixes #1322
How Has This Been Tested?
I have run an RRFS (regional) case on WCOSS2 with and without collective reads activated. The resulting binary restart files are zero-diff.
I have not yet run a regional HAFS case or the UFS model on a full cube.
The compile time environment used to compile FMS was:
Currently Loaded Modules:
Checklist:
make distcheck passes