Model run-time failure in the SRW App (and maybe others) when using GFS v15 external model data #961

Closed
JeffBeck-NOAA opened this issue Dec 15, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@JeffBeck-NOAA
Contributor

JeffBeck-NOAA commented Dec 15, 2021

Description

When using GFS- or RRFS-based SDFs, the SRW App fails with a segmentation fault during model integration if the external model data come from GFS v15. Runs with external model data from GFS v16 complete normally, which suggests that a specific physics parameterization needs a model field that is available in GFS v16 data but possibly not in GFS v15 data.

To Reproduce:

  1. Build the SRW App (possibly the MRW App as well) with ufs-weather-model hash 5a461c1
  2. Initialize an SRW App simulation (possibly an MRW App simulation as well) on a pre-defined domain, using a GFS- or RRFS-based SDF (e.g., FV3_GFS_v15p2, RRFS_v1alpha, etc.)
  3. Model integration fails with a segmentation fault:

For FV3_GFS_v15p2:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ufs_model          00000000041E322C  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B94D4921630  Unknown               Unknown  Unknown
ufs_model          0000000003702F8A  gfdl_cloud_microp        3935  module_gfdl_cloud_microphys.F90
ufs_model          0000000003706882  gfdl_cloud_microp        2158  module_gfdl_cloud_microphys.F90
ufs_model          000000000370578F  gfdl_cloud_microp        2030  module_gfdl_cloud_microphys.F90
ufs_model          00000000036F7DFD  gfdl_cloud_microp         996  module_gfdl_cloud_microphys.F90
ufs_model          00000000036F15D9  gfdl_cloud_microp         495  module_gfdl_cloud_microphys.F90
ufs_model          00000000036324AD  gfdl_cloud_microp         238  gfdl_cloud_microphys.F90

For RRFS_v1alpha:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ufs_model          00000000041E32CB  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AD6008D0630  Unknown               Unknown  Unknown
ufs_model          0000000003939580  module_sf_mynn_mp        3628  module_sf_mynn.F90
ufs_model          0000000003939177  module_sf_mynn_mp        3366  module_sf_mynn.F90
ufs_model          0000000003937493  module_sf_mynn_mp        1430  module_sf_mynn.F90
ufs_model          0000000003933E66  module_sf_mynn_mp         431  module_sf_mynn.F90
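
When comparing failures across several SDFs, it can help to pull the innermost frame with a known source file out of each traceback. The following is only a minimal sketch, assuming the five-column layout (Image, PC, Routine, Line, Source) seen in the tracebacks above; the helper name `first_known_frame` is hypothetical and not part of the SRW App or the weather model.

```python
import re

# One frame of a forrtl traceback as printed above:
# image, 16-digit hex program counter, routine, line, source file.
FRAME = re.compile(
    r"^(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]{16})\s+(?P<routine>\S+)\s+"
    r"(?P<line>\d+|Unknown)\s+(?P<source>\S+)\s*$"
)

def first_known_frame(traceback_text):
    """Return (routine, line, source) for the first frame whose source
    file and line number are both known, or None if there is none."""
    for raw in traceback_text.splitlines():
        m = FRAME.match(raw.strip())
        if m and m.group("source") != "Unknown" and m.group("line").isdigit():
            return m.group("routine"), int(m.group("line")), m.group("source")
    return None

# Excerpt of the RRFS_v1alpha traceback from this report.
trace = """\
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
ufs_model 00000000041E32CB Unknown Unknown Unknown
libpthread-2.17.s 00002AD6008D0630 Unknown Unknown Unknown
ufs_model 0000000003939580 module_sf_mynn_mp 3628 module_sf_mynn.F90
ufs_model 0000000003939177 module_sf_mynn_mp 3366 module_sf_mynn.F90
"""

print(first_known_frame(trace))
# ('module_sf_mynn_mp', 3628, 'module_sf_mynn.F90')
```

Applied to the two tracebacks above, this points at module_gfdl_cloud_microphys.F90 line 3935 for FV3_GFS_v15p2 and module_sf_mynn.F90 line 3628 for RRFS_v1alpha, i.e., two different parameterizations.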

Additional context

Oddly, runs with older FV3GFS data succeed on a very coarse-resolution domain (~25 km) but fail at higher resolutions. Simulations using older FV3GFS data also succeed with the FV3_HRRR SDF, which suggests that the problematic parameterization is not part of that SDF.

@danrosen25, @gsketefian, @climbfuji, @LarissaReames-NOAA, @jwolff-ncar, @BenjaminBlake-NOAA, @mkavulich

@JeffBeck-NOAA JeffBeck-NOAA added the bug Something isn't working label Dec 15, 2021
@arunchawla-NOAA

Is this still a problem?

@JeffBeck-NOAA
Contributor Author

JeffBeck-NOAA commented Mar 9, 2022

@arunchawla-NOAA, this may no longer be a date-dependent issue, but several of the SRW App's 25-km WE2E tests (covering different grid, IC/LBC, and SDF configurations) are still failing:

grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp - FAILED
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional - FAILED
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp - FAILED
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 - FAILED

We have tested FV3GFS ICs/LBCs from 2019070100 and 2021070100 and GSMGFS ICs/LBCs from 2019052000.
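
To see what the failing tests have in common, the test names can be split into their components. This is a rough sketch that assumes the pattern `grid_<GRID>_ics_<ICS>_lbcs_<LBCS>_suite_<SUITE>`, which is inferred only from the four names listed above and may not hold for every WE2E test; `parse_test_name` is a hypothetical helper.

```python
import re

# Assumed naming pattern, inferred from the four failing tests above:
# grid_<GRID>_ics_<ICS>_lbcs_<LBCS>_suite_<SUITE>
NAME = re.compile(
    r"^grid_(?P<grid>.+?)_ics_(?P<ics>.+?)_lbcs_(?P<lbcs>.+?)_suite_(?P<suite>.+)$"
)

def parse_test_name(name):
    """Split a test name into grid, ICs, LBCs, and suite, or return None."""
    m = NAME.match(name)
    return m.groupdict() if m else None

failed = [
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp",
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional",
    "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16",
    "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp",
]

for name in failed:
    print(parse_test_name(name))
```

Under that assumption, all four failures share the RRFS_CONUS_25km grid, while the IC/LBC source and the suite vary.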

From the GFS_2017_gfdlmp test using 2021070100 data:

         netcdf Write Time is    0.16778 at Fcst   00:00
 total            Write Time is    0.57909 at Fcst   00:00
  aft fcst run output time=        3600 FBcount=           3 na=          90
 PASS: fcstRUN phase 1, na =            1  time is    46.8860931396484
 PASS: fcstRUN phase 2, na =            1  time is   0.247372865676880
 PASS: fcstRUN phase 1, na =            2  time is    46.8145129680634
 PASS: fcstRUN phase 2, na =            2  time is   0.237903833389282
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
ufs_model          000000000CBD0F5E  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AC33EDAB630  Unknown               Unknown  Unknown
ufs_model          0000000004F64C8A  Unknown               Unknown  Unknown
ufs_model          00000000037F37A9  nh_utils_mod_mp_s        1416  nh_utils.F90
ufs_model          0000000003762AD6  nh_utils_mod_mp_r         449  nh_utils.F90
libiomp5.so        00002AC33D4BBA43  __kmp_invoke_micr     Unknown  Unknown

From the GFS_v16 case using GSMGFS data:

 PASS: fcstRUN phase 2, na =           91  time is   0.132863998413086
 PASS: fcstRUN phase 1, na =           92  time is    1.85313796997070
 PASS: fcstRUN phase 2, na =           92  time is   0.132879018783569
 PASS: fcstRUN phase 1, na =           93  time is    1.85436296463013
 PASS: fcstRUN phase 2, na =           93  time is   0.132988929748535
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
srun: error: h15c17: tasks 3-4,9: Exited with exit code 174
srun: launch/slurm: _step_signal: Terminating StepId=29279403.0
slurmstepd: error: *** STEP 29279403.0 ON h15c17 CANCELLED AT 2022-03-08T01:36:43 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
ufs_model          000000000414BE7F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B4564134630  Unknown               Unknown  Unknown
libmpi.so.12       00002B45635CFF94  PMPIDI_CH3I_Progr     Unknown  Unknown
libmpi.so.12.0     00002B456380B8FD  Unknown               Unknown  Unknown
libmpi.so.12       00002B45638BCC42  PMPI_Probe            Unknown  Unknown
ufs_model          000000000082F38A  _ZN5ESMCI3VMK4rec        4485  ESMCI_VMKernel.C
ufs_model          0000000000F445F9  _ZN5ESMCI3XXE4exe        4085  ESMCI_DELayout.C
ufs_model          0000000000F42D68  _ZN5ESMCI3XXE4exe        5409  ESMCI_DELayout.C
ufs_model          0000000001361146  _ZN5ESMCI11ArrayB        1680  ESMCI_ArrayBundle.C
ufs_model          0000000000A3D132  c_esmc_arraybundl         717  ESMCI_ArrayBundle_F.C
ufs_model          00000000007554F2  esmf_arraybundlem        2945  ESMF_ArrayBundle.F90

This does not appear to be a model instability: no warnings are printed before the failure, and we're running with dt_atmos = 40 on a 25-km grid, which should be fine.
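
To put the two failures on a common time axis, the last completed step in each log can be converted to forecast time. This is a sketch under two assumptions that the logs do not state outright: that `na` counts atmosphere time steps, and that dt_atmos is in seconds. The line "output time= 3600 ... na= 90" in the first log is at least consistent with that reading, since 3600 / 40 = 90. The helper name `last_completed_step` is hypothetical.

```python
import re

DT_ATMOS = 40  # value quoted in the comment above; assumed to be in seconds

# Matches lines such as " PASS: fcstRUN phase 2, na =           93  time is ..."
PASS_LINE = re.compile(r"PASS: fcstRUN phase (\d+), na =\s*(\d+)")

def last_completed_step(log_text):
    """Return the largest na for which phase 2 was logged, or None."""
    steps = [
        int(m.group(2))
        for m in PASS_LINE.finditer(log_text)
        if m.group(1) == "2"
    ]
    return max(steps) if steps else None

# Tail of the GFS_v16 / GSMGFS log from this comment.
log = """\
 PASS: fcstRUN phase 1, na =           93  time is    1.85436296463013
 PASS: fcstRUN phase 2, na =           93  time is   0.132988929748535
forrtl: severe (174): SIGSEGV, segmentation fault occurred
"""

step = last_completed_step(log)
print(step, step * DT_ATMOS)  # 93 3720
```

If those assumptions hold, the GFS_v16 case fails a little over an hour into the forecast (about 3720 s), whereas the GFS_2017_gfdlmp case fails right after step 2.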

Any help troubleshooting this problem would be greatly appreciated!

@gsketefian

I'll just add that these tests were previously working fine, so the cause is probably either a namelist issue introduced by updates to the weather model or a software version issue. A lack of consistent testing in the SRW App allowed these failures to creep in between the last update of the ufs-weather-model hash (maybe in December?) and now. @mkavulich and the rest of the DTC team have a repo management plan that includes testing, and we'll discuss that at tomorrow's code management meeting. Still, we could use help from EMC's ufs-weather-model developers to debug these specific tests.

@junwang-noaa
Collaborator


@JeffBeck-NOAA May I ask if the issue is resolved? Thanks

@JeffBeck-NOAA
Contributor Author

Thanks for the reminder @junwang-noaa. Closing this issue now.
