Port ufs-weather-model to acorn (WCOSS2 TDS) #292

Closed
DusanJovic-NOAA opened this issue Nov 20, 2020 · 16 comments
Labels: enhancement (New feature or request)

Comments

@DusanJovic-NOAA
Collaborator

Description

We got access to acorn, a WCOSS2 TDS (Test Development System). UFS must be ported to WCOSS2.

Solution

Add WCOSS2 support to the build system and the regression tests so that the model can be built and tested on WCOSS2.

See PR #291

@DusanJovic-NOAA
Collaborator Author

A debug version of the ESMF library is not yet available, so modulefiles/wcoss2/fv3 and modulefiles/wcoss2/fv3_debug are identical.

@DusanJovic-NOAA
Collaborator Author

The fv3_ccpp_wrtGauss_netcdf_parallel test crashed on acorn with this error:

 in fcst run phase 2, na=           0
 in fcst run phase 2, na=           1
 line          391 NetCDF: HDF error
MPICH ERROR [Rank 148] [job id 83a195bd-3b1e-454a-9447-cc19004723da] [Fri Nov 20 18:12:19 2020] [unknown] [nid001005] - Abort(1) (rank 148 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 148

aborting job:
application called MPI_Abort(comm=0x84000001, 1) - process 148

This looks like a NetCDF/HDF library error.

@edwardhartnett
Contributor

What version of HDF5 and netCDF were used for this?

@DusanJovic-NOAA
Collaborator Author

hdf5/1.10.6
netcdf/4.7.4

@edwardhartnett
Contributor

Were the netCDF parallel I/O tests run when it was built?

@edwardhartnett
Contributor

Also what is line 391 from? Feel free to provide additional info. ;-)

@DusanJovic-NOAA
Collaborator Author

DusanJovic-NOAA commented Nov 23, 2020

I believe that message comes from the netCDF library. I see it here:
libdispatch/derror.c: return "NetCDF: HDF error";
But I do not know where the line number 391 comes from.

@edwardhartnett
Contributor

Was this built with hpc-stack?

This is a bit like 20 questions. ;-)

Can you please take the time to write up a proper issue, with all the relevant information? Please answer my questions about whether the netCDF tests were run.

If you believe that you have found a netCDF bug, we need a small program that does not depend on any other code (except netcdf), which demonstrates the problem. That is always going to be the first step in demonstrating that you have found an actual netCDF bug, and not just a bug in your netCDF code. ;-)

Once you have such a test program, you're going to need to post it as an issue in either netcdf-c or netcdf-fortran, and mention me in the issue so I can try to fix it.

This issue does not contain nearly enough information to debug an actual netCDF problem. You will have to provide more complete information if you believe you have found a netCDF bug.

@DusanJovic-NOAA
Collaborator Author

> Were the netCDF parallel I/O tests run when it was built?

I don't know. The hpc-stack was built by the NCEPLIBS group. @Hang-Lei-NOAA do you know if the parallel I/O tests were run on wcoss2?
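
Separately from whether make check was run, a quick first check is whether the installed netCDF reports parallel support at all. The following is a minimal Python sketch, not part of hpc-stack or ufs-weather-model. It assumes that nc-config from the loaded netcdf module is on PATH and that its --all output uses "--flag -> value" lines with a --has-parallel4 entry; builds that do not print that entry would need a different check. The helper names are hypothetical.

```python
import shutil
import subprocess


def parse_nc_config_all(text):
    """Parse `nc-config --all` lines of the form '--flag -> value' into a dict."""
    flags = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip the banner and blank lines
        key, _, value = line.partition("->")
        flags[key.strip()] = value.strip()
    return flags


def has_parallel_netcdf4(flags):
    """True if the build reports parallel netCDF-4 (HDF5-based) I/O support."""
    # Assumption: the build prints a --has-parallel4 entry.
    return flags.get("--has-parallel4", "").lower() == "yes"


if __name__ == "__main__":
    if shutil.which("nc-config") is None:
        print("nc-config not found on PATH; load the netcdf module first")
    else:
        out = subprocess.run(
            ["nc-config", "--all"], capture_output=True, text=True, check=True
        ).stdout
        flags = parse_nc_config_all(out)
        print("version:", flags.get("--version"))
        print("parallel netCDF-4:", has_parallel_netcdf4(flags))
```

A "yes" here only shows that the library was configured with parallel I/O; it does not show that the parallel tests were run or passed.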

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Nov 23, 2020 via email

@edwardhartnett
Contributor

I need to see the full output of configure and make check, including the config.log file.

I don't actually have access to this system yet, the paperwork is still working its way through...

@DusanJovic-NOAA
Collaborator Author

@Hang-Lei-NOAA Can you provide the netCDF log files from the hpc-stack installation on acorn? Thanks.

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Nov 23, 2020 via email

@climbfuji
Collaborator

Stupid question, because I was running into a similar problem on gaea with those two tests failing when creating a new baseline:

fv3_ccpp_wrtGauss_netcdf_parallel
fv3_ccpp_regional_quilt_netcdf_parallel

Did you enable the PARALLEL_NETCDF build option in the ufs-weather-model cmake config? For gaea, I added

set(PARALLEL_NETCDF ON  CACHE BOOL "Enable parallel NetCDF" FORCE)

to cmake/configure_gaea.intel.cmake. I am now rerunning to see whether those tests pass.

Update: in my case, these two tests now pass.
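
To audit whether a given platform config sets this option, something like the following Python sketch could be used. It is an illustration only: the function name is hypothetical, it looks only at literal set(PARALLEL_NETCDF ...) calls in the text it is given, and it does not account for values passed on the cmake command line or set in other files.

```python
import re

# Match set(PARALLEL_NETCDF <value> ...) at the start of a line.
# CMake command names are case-insensitive; variable names are not.
_SET_RE = re.compile(
    r"^\s*(?i:set)\s*\(\s*PARALLEL_NETCDF\s+(\S+?)[\s)]", re.MULTILINE
)

# Constants CMake treats as true in a boolean context.
_TRUE_VALUES = {"ON", "TRUE", "YES", "Y", "1"}


def parallel_netcdf_enabled(cmake_text):
    """Return True/False for the last set(PARALLEL_NETCDF ...) found, None if absent."""
    values = _SET_RE.findall(cmake_text)
    if not values:
        return None
    return values[-1].strip('"').upper() in _TRUE_VALUES


if __name__ == "__main__":
    line = 'set(PARALLEL_NETCDF ON  CACHE BOOL "Enable parallel NetCDF" FORCE)\n'
    print(parallel_netcdf_enabled(line))
```

Commented-out lines are ignored because the pattern requires set( to be the first non-blank text on the line.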

@DusanJovic-NOAA
Collaborator Author

Done in #295
