Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dimension change in sfc restart files #344

Closed
junwang-noaa opened this issue Jul 15, 2021 · 18 comments · Fixed by #350
Closed

dimension change in sfc restart files #344

junwang-noaa opened this issue Jul 15, 2021 · 18 comments · Fixed by #350
Labels
bug Something isn't working

Comments

@junwang-noaa
Copy link
Collaborator

Description

This is reported by Bin Liu.

After syncing with the ufs-weather-model develop as of 07/12/2012. I notice there is a change in the model output sfc_data restart files.

It looks to me there was a change of the "Time" dimension size from 1 into 2, and also only the zorl variable has the Time dimension now.
double tg3 ( yaxis_1, xaxis_1 )
long_name : tg3
units : none
checksum : 90839D8A4C3CC0C1

  double zorl ( Time, yaxis_1, xaxis_1 )
     long_name :    zorl
     units :        none
     checksum :     471AD51A1F40C999,EF5F56B3B8D8CB1A

In contrast, previously, all the surface variables have the Time dimension and the Time dimension size is 1.
double tg3 ( Time, yaxis_1, xaxis_1 )
long_name : tg3
units : none
checksum : F9DF63348B3CECF2

  double zorl ( Time, yaxis_1, xaxis_1 )
     long_name :    zorl
     units :        none
     checksum :     7B1F22BE4A631212

Not sure if this is an expected change. If so, could you point us to the related background/context for this change?

Also, I checked that in the new sfc_data.nc restart file, for the zorl variable, the first dimension is actually not really Time. zorl(1,:,:) is actually zorlo (which has missing value on land), and zorl(2,:,:) has actual zorl value for both land and water points.

  double zorl ( Time, yaxis_1, xaxis_1 )
     long_name :    zorl
     units :        none
     checksum :     471AD51A1F40C999,EF5F56B3B8D8CB1A

So, this might be introduced improperly. A better solution would be adding the zorlo variable in the sfc_data.nc (since there are zorli, zorll variables already). Also, Time dimension should be restored properly as in other fv_core and phy_data files.

To Reproduce:

What compilers/machines are you seeing this with? on all the platforms

  1. run RT test and check the RESTART/sfc_data.tile*.nc file

Testing:

Dependent PRs:

@junwang-noaa junwang-noaa added the bug Something isn't working label Jul 15, 2021
@junwang-noaa
Copy link
Collaborator Author

junwang-noaa commented Jul 15, 2021

After looking at the restart files in the baseline, the changes come from the fv3 PR#285, committed on 5/27 @climbfuji

@BinLiu-NOAA
Copy link
Collaborator

Thanks, @junwang-noaa! This issue/change was found/identified when conduct HAFS DA testing after syncing with the support/HAFS branch the latest ufs-weather-model develop branch. Also, @ericaligo-NOAA noticed this change in their FV3CAM DA testing as well.

@climbfuji
Copy link
Collaborator

I think I know what happened. The variable zorl is already written to the restart files, now it is written twice. The I/O layer somehow thinks interprets this as two time indices (why?!?). It should have also thrown an error that this variable is already registered for restarts ...

I had to "add" the variable to get b4b identical results for restart runs. Since it was already in the restart files, I am wondering which of my other code changes (e.g. the block compute_tsfc_zorl_for_colstart in FV3GFS_io.F90) fixed the b4b mismatch for restart runs. Let me try to simply remove the redundant zorl variable and see what happens.

@junwang-noaa
Copy link
Collaborator Author

junwang-noaa commented Jul 15, 2021 via email

@climbfuji
Copy link
Collaborator

When you remove the redundant zorl variable, can you check if the time dimension for other fields will go back? Or could it be something else that messed up the time dimension for those fields.

Sure, I will check.

@climbfuji
Copy link
Collaborator

So, the problem is NOT writing redundant zorl. The problem is that the existing code - before my changes - is reading a netCDF variable called zorl into Sfcprop%zorlw from initial/restart files, and writing Sfcprop%zorlw to a netCDF variable called zorl !

I can of course change the name of that variable, but we need to check carefully for unintended side effects.

@junwang-noaa
Copy link
Collaborator Author

I don't get this part. It looks to me there is no issue about read/write a data array zorlw as zorl, that should not change the time dimension. Also previous code does not change time dimension. So again, we write out two zorl fields from two different data array, that is the cause messing up the time dimension, is this correct? So if we give two different field name for "zorlw" and "zorl", will that resolve the issue?

@climbfuji
Copy link
Collaborator

I don't get this part. It looks to me there is no issue about read/write a data array zorlw as zorl, that should not change the time dimension. Also previous code does not change time dimension. So again, we write out two zorl fields from two different data array, that is the cause messing up the time dimension, is this correct? So if we give two different field name for "zorlw" and "zorl", will that resolve the issue?

Yes, correct. we write out two zorl fields from two different data array. Sfcprop%zorlw should never have been written/read as zorl - very confusing. I am going to change that and keep zorl as in my PR and see what it does.

@BinLiu-NOAA
Copy link
Collaborator

@climbfuji and @junwang-noaa, we previously also noticed that it is very confusing about using the zorl variable to store the zorlw value in the sfc_data.nc file. It will be definitely better to correct this (say having the correct zorl and zorlw, together with zorll and zorli in the restart sfc_data.nc files).

@climbfuji
Copy link
Collaborator

@climbfuji and @junwang-noaa, we previously also noticed that it is very confusing about using the zorl variable to store the zorlw value in the sfc_data.nc file. It will be definitely better to correct this (say having the correct zorl and zorlw, together with zorll and zorli in the restart sfc_data.nc files).

Yep, that is what I am doing now. Ideally, they should be listed one after the other in FV3GFS_io.F90.

@SMoorthi-emc
Copy link
Contributor

SMoorthi-emc commented Jul 15, 2021 via email

@BinLiu-NOAA
Copy link
Collaborator

@SMoorthi-emc, Yes, zorl, zorll, and zorli were already there in sfc_data.nc restart files previously. Only the zorlw was missing and zorl used wrong values (zorlw). But, I think @climbfuji is trying to fix this. Also agree with @climbfuji that these four variables should be listed one after the other in FV3GFS_io.F90.

  double zorl ( Time, yaxis_1, xaxis_1 )
     long_name :    zorl
     units :        none
     checksum :     7B1F22BE4A631212

  double zorll ( Time, yaxis_1, xaxis_1 )
     long_name :    zorll
     units :        none
     checksum :     CD4121999996E7E2

  double zorli ( Time, yaxis_1, xaxis_1 )
     long_name :    zorli
     units :        none
     checksum :     BC00000000000000

@SMoorthi-emc
Copy link
Contributor

SMoorthi-emc commented Jul 15, 2021 via email

@junwang-noaa
Copy link
Collaborator Author

Dom is going to correct that

@climbfuji
Copy link
Collaborator

I's not that straightforward. If I simply require Sfcprop%zorlw to be zorlw without making it an optional variable, the code will fail with

FATAL from PE    69: fms_io(restore_state_all): unable to find mandatory variable zorlw in restart file sfc_data.tile3.nc

with existing initial conditions.

So what I really have to do is read zorl in the initial conditions as Sfcprop%zorl, make Sfcprop%zorlw => zorlw an optional variable, and change the current logic for computing Sfcprop%zorl from Sfcprop%zorlw for coldstart runs. Stay tuned :-)

@climbfuji
Copy link
Collaborator

Good news. If I make this change to read zorl in the initial conditions as Sfcprop%zorl and change the logic in the file to compute zorlw, I get "almost" the same results - they are b4b identical in checksums of state variables, but the RESTART surface files are different (because now there is zorlw and zorl, not a zorl variable with a time dimension of length 2).

This is for uncoupled/binary landmask runs, haven't tried coupled runs w/ fractional landmask yet yet.

baseline dir = /lustre/f2/pdata/ncep_shared/emc.nemspara/RT/NEMSfv3gfs/develop-20210712/INTEL/control
working dir  = /lustre/f2/scratch/Dom.Heinzeller/FV3_RT/rt_12499/control_2threads
Checking test 003 control_2threads results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf024.nc .........OK
 Comparing atmf000.nc .........OK
 Comparing atmf024.nc .........OK
 Comparing GFSFLX.GrbF00 .........OK
 Comparing GFSFLX.GrbF24 .........OK
 Comparing GFSPRS.GrbF00 .........OK
 Comparing GFSPRS.GrbF24 .........OK
 Comparing RESTART/coupler.res .........OK
 Comparing RESTART/fv_core.res.nc .........OK
 Comparing RESTART/fv_core.res.tile1.nc .........OK
 Comparing RESTART/fv_core.res.tile2.nc .........OK
 Comparing RESTART/fv_core.res.tile3.nc .........OK
 Comparing RESTART/fv_core.res.tile4.nc .........OK
 Comparing RESTART/fv_core.res.tile5.nc .........OK
 Comparing RESTART/fv_core.res.tile6.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile1.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile2.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile3.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile4.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile5.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile6.nc .........OK
 Comparing RESTART/fv_tracer.res.tile1.nc .........OK
 Comparing RESTART/fv_tracer.res.tile2.nc .........OK
 Comparing RESTART/fv_tracer.res.tile3.nc .........OK
 Comparing RESTART/fv_tracer.res.tile4.nc .........OK
 Comparing RESTART/fv_tracer.res.tile5.nc .........OK
 Comparing RESTART/fv_tracer.res.tile6.nc .........OK
 Comparing RESTART/phy_data.tile1.nc .........OK
 Comparing RESTART/phy_data.tile2.nc .........OK
 Comparing RESTART/phy_data.tile3.nc .........OK
 Comparing RESTART/phy_data.tile4.nc .........OK
 Comparing RESTART/phy_data.tile5.nc .........OK
 Comparing RESTART/phy_data.tile6.nc .........OK
 Comparing RESTART/sfc_data.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile2.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile3.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile4.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile5.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile6.nc ............ALT CHECK......NOT OK

AND:

netcdf sfc_data.tile6 {
dimensions:
        xaxis_1 = 96 ;
        yaxis_1 = 96 ;
        zaxis_1 = 2 ;
        zaxis_2 = 4 ;
        Time = UNLIMITED ; // (1 currently)
variables:
        float xaxis_1(xaxis_1) ;
                xaxis_1:long_name = "xaxis_1" ;
                xaxis_1:units = "none" ;
                xaxis_1:cartesian_axis = "X" ;
        float yaxis_1(yaxis_1) ;
                yaxis_1:long_name = "yaxis_1" ;
                yaxis_1:units = "none" ;
                yaxis_1:cartesian_axis = "Y" ;
        float zaxis_1(zaxis_1) ;
                zaxis_1:long_name = "zaxis_1" ;
                zaxis_1:units = "none" ;
                zaxis_1:cartesian_axis = "Z" ;
        float zaxis_2(zaxis_2) ;
                zaxis_2:long_name = "zaxis_2" ;
                zaxis_2:units = "none" ;
                zaxis_2:cartesian_axis = "Z" ;
        float Time(Time) ;
                Time:long_name = "Time" ;
                Time:units = "time level" ;
                Time:cartesian_axis = "T" ;
...
        double zorl(Time, yaxis_1, xaxis_1) ;
                zorl:long_name = "zorl" ;
                zorl:units = "none" ;
                zorl:checksum = "716992B89F88E2DD" ;

@junwang-noaa
Copy link
Collaborator Author

@climbfuji how about other fields besides zorl, do they have dimension ((Time, yaxis_1, xaxis_1) too? Did the restart test reproduce control?

@climbfuji
Copy link
Collaborator

@climbfuji how about other fields besides zorl, do they have dimension ((Time, yaxis_1, xaxis_1) too? Did the restart test reproduce control?

All variables have the same dimensions, yes. I created new baselines and verified against them on gaea.intel, all tests pass.

BinLiu-NOAA added a commit to hafs-community/HAFS that referenced this issue Jul 18, 2021
Merge HAFSv0.2 configuration into develop and sync HAFS submodules

This PR tries to:
*Bring the HAFSv0.2 configuration used in 2021 HFIP HAFS real-time parallel experiments back in the HAFS develop branch
*Sync HAFS submodules including hafs_forecast.fd (as of 07/12/2021), and hafs_utils.fd (as of 07/09/2021), hafs_gsi.fd (as of 07/09/2021), and hafs_post.fd (as of 07/09/2021) with their corresponding authoritative branches.

Notes:
*The HAFSv0.2 configuration were recently finalized for 2021 HFIP HAFSv0.2A real-time experiment based on the retrospective tests for 2019/2020 NATL storms. This HAFSv0.2A experiment also serves as the control experiment/configuration for the HAFSv0.2B/D/E real-time experiments. The development, testing and evaluation of the HAFSv0.2A configuration is one of the important HAFS development milestones.
*Thanks to @danrosen25, @uturuncoglu, and @ChunxiZhang-NOAA for helping solve the conflicts when syncing the support/HAFS branch with the latest ufs-weather-model develop branch.
*There is a known issue from the ufs-weather-model side: NOAA-EMC/fv3atm/issues/344, which changed the Time dimension and contents of the sfc_data.nc restart files. This broke the HAFS DA capability since DA/GSI need to read these sfc_data restart files. Fixing is ongoing from the ufs-weather-model side. And this HAFS DA capability will be restored next time when we sync ufs-weather-model develop branch after the fix. Currently, if you want to run the HAFS DA configurations/experiments, please keep using the feature/hafs_ensda branch instead.
BinLiu-NOAA added a commit to hafs-community/HAFS that referenced this issue Aug 13, 2021
Sync HAFS submodules with their corresponding authoritative branches:
- hafs_forecast.fd as of 08/05/2021
- hafs_gsi.fd as of 08/06/2021 plus the dual-resolution 3DEnVar bug fix
- hafs_post.fd as of 08/02/2021
- hafs_utils.fd as of  as of 07/23/2021
- hafs_graphics.fd/hrd_gplot as of 08/10/2021
- hafs_graphics.fd/emc_graphics as of 08/10/2021
Besides application level changes were made accordingly with the updated submodules.

This PR addresses issue #80.

Notes:
- The [bug of wrong Time dimension in FV3 restart sfc_data files](NOAA-EMC/fv3atm#344) has been fixed in ufs-weather-model through this ufs-weather-model [PR](ufs-community/ufs-weather-model#702). 
- The bug fix in for the dual-resolution EnVar analysis in GSI (hafs_gsi.fd) contributed from OU collaborators has also been included in this PR. With that the HAFS ENSDA configurations can now work properly.
- As for the hafs_forecast.fd submodule, support/HAFS branch is identical to the ufs-weather-model develop branch as of 08/05/2021. More information can be found through this PR (ufs-community/ufs-weather-model#715)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants