C/CD restarts #684

Closed
dabail10 opened this issue Feb 11, 2022 · 16 comments

@dabail10
Contributor

The atmospheric fields for the gx3 grid are not initialized properly when restarting for the C/CD grids.

@dabail10
Contributor Author

Ok. I just found something. The strocnx and strocny terms are written at the T points for the B grid. I am not writing these at the N and E points.

@dabail10
Contributor Author

The forcing is fine just after the scatter to Tair_data, etc. The problem appears to be in the call to interpolate_data. However, I do not understand how this could be impacted by the choice of grid_ice. Maybe we have an out of bounds issue?

This is the forcing debugging before writing the restart:

(JRA55_data)fdbg read recnum = 82 2
(ice_read_nc_xy) fid= 65536, lnrec = 82, varname = airtmp
(ice_read_nc_xy) min, max, sum = 232.65000915527344 315.08947753906250 3289199.3167572021 airtmp
(ice_read_nc_xy) fid= 65536, lnrec = 82, varname = wndewd
(ice_read_nc_xy) min, max, sum = -20.413463592529297 24.186750411987305 -10165.004966371693 wndewd
(ice_read_nc_xy) fid= 65536, lnrec = 82, varname = wndnwd
(ice_read_nc_xy) min, max, sum = -21.002643585205078 25.291442871093750 -3829.6174139123177 wndnwd
(ice_read_nc_xy) fid= 65536, lnrec = 82, varname = spchmd
(ice_read_nc_xy) min, max, sum = 1.9999999494757503E-005 2.2468701004981995E-002 106.14897644742996 spchmd
(ice_read_nc_xy) fid= 65536, lnrec = 82, varname = glbrad
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 1121.9163818359375 2021567.1505828160 glbrad
(ice_read_nc_xy) fid= 65536, lnrec = 82, varname = dlwsfc
(ice_read_nc_xy) min, max, sum = 108.15037536621094 464.18704223632812 3708957.9000167847 dlwsfc
(ice_read_nc_xy) fid= 65536, lnrec = 82, varname = ttlpcp
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 1.6827541403472424E-003 0.36126554293369600 ttlpcp
(file_year)fdbg start
(JRA55_data)fdbg c12intp = 1.0000000000000000 0.0000000000000000
(interpolate data)fdbg start
(interpolate data)fdbg start
(interpolate data)fdbg start
(interpolate data)fdbg start
(JRA55_data)fdbg JRA55_bulk_data
(JRA55_data)fdbg fsw 0.0000000000000000 1107.2705078125000
(JRA55_data)fdbg flw 134.40629577636719 445.21746826171875
(JRA55_data)fdbg fsnow 0.0000000000000000 1.3184784911572933E-003
(JRA55_data)fdbg Tair 238.51657104492188 305.27325439453125
(JRA55_data)fdbg uatm -19.403127670288086 24.026130676269531
(JRA55_data)fdbg vatm -19.080163955688477 22.923736572265625
(JRA55_data)fdbg Qa 1.2762486585415900E-004 2.0693773403763771E-002

This is after the restart:

(JRA55_data)fdbg read recnum = 41 1
(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = airtmp
(ice_read_nc_xy) min, max, sum = 235.16474914550781 311.37399291992188 3291994.7671508789 airtmp
(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = wndewd
(ice_read_nc_xy) min, max, sum = -22.082468032836914 23.189935684204102 -8725.5228494405746 wndewd
(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = wndnwd
(ice_read_nc_xy) min, max, sum = -21.994728088378906 23.148794174194336 -1427.1374772240233 wndnwd
(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = spchmd
(ice_read_nc_xy) min, max, sum = 1.9999999494757503E-005 2.6516553014516830E-002 106.02048647137053 spchmd
(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = glbrad
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 1163.1197509765625 1946284.5765726499 glbrad
(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = dlwsfc
(ice_read_nc_xy) min, max, sum = 108.89191436767578 445.44702148437500 3721171.8467712402 dlwsfc
(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = ttlpcp
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 1.6247323947027326E-003 0.36015906054723323 ttlpcp
(file_year)fdbg start
(JRA55_data)fdbg read recnum = 42 2
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = airtmp
(ice_read_nc_xy) min, max, sum = 235.34988403320312 310.20663452148438 3291270.4802856445 airtmp
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = wndewd
(ice_read_nc_xy) min, max, sum = -20.949886322021484 24.961160659790039 -9158.5232439686079 wndewd
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = wndnwd
(ice_read_nc_xy) min, max, sum = -20.784015655517578 22.704746246337891 -1422.4114163977902 wndnwd
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = spchmd
(ice_read_nc_xy) min, max, sum = 1.9999999494757503E-005 2.4642026051878929E-002 106.01425504431973 spchmd
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = glbrad
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 1144.6767578125000 1988415.8927561489 glbrad
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = dlwsfc
(ice_read_nc_xy) min, max, sum = 109.19615936279297 442.67092895507812 3717032.7060241699 dlwsfc
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = ttlpcp
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 2.0417976193130016E-003 0.36046172976443569 ttlpcp
(file_year)fdbg start
(JRA55_data)fdbg c12intp = 0.66666666666666674 0.33333333333333331
(JRA55_data)fdbg JRA55_bulk_data
(JRA55_data)fdbg fsw -0.36112385786755863 999.19781494140625
(JRA55_data)fdbg flw -0.24487592304761874 438.04678344726562
(JRA55_data)fdbg fsnow -0.38755234655475546 0.24119655894178182
(JRA55_data)fdbg Tair 0.0000000000000000 304.66586303710938
(JRA55_data)fdbg uatm -16.086044311523438 15.694542566935223
(JRA55_data)fdbg vatm -16.489130020141602 13.849607467651367
(JRA55_data)fdbg Qa -0.24937952887286091 0.45918422484554866
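
A weighted average with non-negative weights that sum to one cannot fall outside the range of its two inputs. So if the two numbers printed as c12intp are the weights on the first and second forcing records (an assumption; the interpolate_data source is not quoted in this thread), values such as Tair = 0 or a negative fsw point at the input arrays rather than at the interpolation arithmetic. A minimal standalone sketch of that check follows; the program and variable names are illustrative, not CICE code, and the two sample inputs are the airtmp minima of records 41 and 42 copied from the log above.

```fortran
program interp_bounds_check
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   ! Weights as printed in the c12intp log line (assumed order: record 1, record 2)
   real(dbl), parameter :: c1intp = 2.0_dbl/3.0_dbl
   real(dbl), parameter :: c2intp = 1.0_dbl/3.0_dbl
   real(dbl) :: rec1, rec2, val

   ! Two sample input values taken from the log (airtmp minima, records 41 and 42)
   rec1 = 235.16474914550781_dbl
   rec2 = 235.34988403320312_dbl

   ! Linear time interpolation between the two bracketing records
   val = c1intp*rec1 + c2intp*rec2

   print *, 'interpolated value:', val
   if (val < min(rec1, rec2) .or. val > max(rec1, rec2)) then
      print *, 'outside input range: interpolation is suspect'
   else
      print *, 'within input range: interpolation is consistent'
   end if
end program interp_bounds_check
```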

@dabail10
Contributor Author

Also, I had to add the following to icepack_shortwave.F90:

    -  exp_min = min(exp_argmax,ts/mu0n)
    +  exp_min = max(min(exp_argmax,ts/mu0n),puny)
       extins = exp(-exp_min)

This should have crashed previously in debug tests, but I guess we never had a case where exp_min went to zero. This is safer though.
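
A standalone sketch of what the clamp changes is given below. The values of exp_argmax, puny, ts and mu0n are placeholders chosen for illustration; the actual Icepack values are not quoted in this thread.

```fortran
program exp_min_clamp
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   ! Placeholder constants; the real Icepack values are not shown in this thread.
   real(dbl), parameter :: exp_argmax = 10.0_dbl
   real(dbl), parameter :: puny = 1.0e-11_dbl
   real(dbl) :: ts, mu0n, exp_min, extins

   ts   = 0.0_dbl    ! hypothetical input that drives the argument to zero
   mu0n = 0.5_dbl    ! hypothetical input

   ! Original form: exp_min can be exactly zero.
   exp_min = min(exp_argmax, ts/mu0n)
   print *, 'unclamped exp_min =', exp_min

   ! Modified form: exp_min is bounded below by puny.
   exp_min = max(min(exp_argmax, ts/mu0n), puny)
   extins  = exp(-exp_min)
   print *, 'clamped exp_min   =', exp_min, ' extins =', extins
end program exp_min_clamp
```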

@dabail10
Contributor Author

So, it is in ice_forcing.F90 where it looks like the _data fields are interpolated to the internal fields. For some reason, it looks like it is only done on the master task. I'm going to try to run with 1 block per processor.

Dave

@dabail10
Contributor Author

Could ice_read_nc be calling the wrong interface?

@dabail10
Contributor Author

No, it looks like it is calling ice_read_nc_xy as it should be.

@dabail10
Contributor Author

This is so weird. I can't explain what is going on. So, here is a summary so far.

  1. Before the restart, all of the _data arrays in JRA55_data are being read in and scattered correctly. So the last step before the restart for fsw_data and fsw we are getting:

fsw_data 652.99182128906250 1
fsw 652.99182128906250 1
fsw_data 0.0000000000000000 2
fsw 0.0000000000000000 2
fsw_data 551.21191406250000 3
fsw 551.21191406250000 3
fsw_data 103.21938323974609 0
fsw 103.21938323974609 0

I changed this test so it is one block per processor to simplify things. All blocks have valid values at 1,1,1. The second number is the task. I guess task/block 2 has a zero value.

  2. After the restart (only for the CD/C cases), it is no longer scattering correctly in ice_read_nc_xy. I see the global array min maxes are fine, but the scattered arrays are not right:

fsw_data 0.0000000000000000 2
fsw 0.0000000000000000 2
fsw_data 0.0000000000000000 3
fsw 0.0000000000000000 3
fsw_data 0.0000000000000000 1
fsw 0.0000000000000000 1
fsw_data 151.15623474121094 0
fsw 151.15623474121094 0

So, only the master task has valid data. Here is the debug from ice_read_nc_xy for the global array:

(ice_read_nc_xy) fid= 131072, lnrec = 41, varname = glbrad
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 1163.1197509765625 1946284.5765726499 glbrad
(ice_read_nc_xy) fid= 131072, lnrec = 42, varname = glbrad
(ice_read_nc_xy) min, max, sum = 0.0000000000000000 1144.6767578125000 1988415.8927561489 glbrad

Then the global min/max of the fsw array is:
(JRA55_data)fdbg fsw -0.36111095289487805 999.19781494140625

So, it is somehow getting negative shortwave. There is no interpolation of the shortwave here:

fsw(:,:,:) = fsw_data(:,:,1,:)

So, what is happening? It is related to the restart of the CD/C grid. So, are the arrays not allocated for the CD grid on a restart? The sequence for a restart is:

call input_data <- sets grid_ice
call alloc_state <- allocates uvelE, vvelE, uvelN, vvelN
call alloc_flux <- all of the remaining CD grid variables should be allocated
...
call init_forcing_ocn(dt) -> this calls alloc_forcing and so the fsw_data array should be allocated
...
call init_rest <- reads the restarts
...
if (trim(runtype) == 'continue' .or. restart) &
call init_shortwave ! initialize radiative transfer
...
call init_forcing_atm <- calls JRA55_data

So, everything should be allocated by the time init_forcing_atm is called. The only difference between the initial and restart is that the call to init_rest reads the initial file and init_shortwave is not called. So, why would that impact the reading of the JRA55 data? Also, why would it only fail for CD/C?

Dave

@dabail10
Contributor Author

Just for interest's sake, I removed the reads for all of the CD/C grid variables and the restart works fine.

@dabail10
Contributor Author

Are field_loc_Nface and field_loc_Eface reversed in ice_gather_scatter.F90?

    if (this_block%tripoleTFlag) then
       select case (field_loc)
       case (field_loc_center)    ! cell center location
          xoffset = 2
          yoffset = 0
       case (field_loc_NEcorner)  ! cell corner (velocity) location
          xoffset = 1
          yoffset = -1
       case (field_loc_Eface)     ! cell face location
          xoffset = 1
          yoffset = 0
       case (field_loc_Nface)     ! cell face location
          xoffset = 2
          yoffset = -1
       case (field_loc_noupdate)  ! ghost cells never used - use cell center
          xoffset = 1
          yoffset = 1
       end select
    else
       select case (field_loc)
       case (field_loc_center)    ! cell center location
          xoffset = 1
          yoffset = 1
       case (field_loc_NEcorner)  ! cell corner (velocity) location
          xoffset = 0
          yoffset = 0
       case (field_loc_Eface)     ! cell face location
          xoffset = 0
          yoffset = 1
       case (field_loc_Nface)     ! cell face location
          xoffset = 1
          yoffset = 0
       case (field_loc_noupdate)  ! ghost cells never used - use cell center
          xoffset = 1
          yoffset = 1
       end select
    endif

@dabail10
Contributor Author

I guess not. It's curious that the cell center is 1,1 in ice_gather_scatter.F90, but it is 0,0 in ice_boundary.F90.

@apcraig
Contributor

apcraig commented Feb 17, 2022

Just FYI, I have on my task list to set up a unit test driver to test the halo updates and gather/scatters. That would be comprehensive in the sense that I'd test various grids, all field_loc values, and all other options in all combinations as best as I could. Again, this is one of those situations where we have some confidence in the calls/combinations we use, but not so much in the calls/combinations we don't. It doesn't surprise me that something that hasn't been used before may not be working properly. @dabail10, have you found the problem in the current implementation or are you still looking? Thanks for your efforts!

@dabail10
Contributor Author

I am still looking. I removed the CD/C restart fields from the read and it runs fine. When I add just one back in, it aborts. So, I am suspicious of a scatter here. However, could it just be a memory issue? I'm completely puzzled.

@dabail10
Contributor Author

I just did a 64x1 case, so two nodes and 64 tasks. This still has the "same" problem:

(JRA55_data)fdbg JRA55_bulk_data
(JRA55_data)fdbg fsw 0.0000000000000000 804.36022949218750
(JRA55_data)fdbg flw 0.0000000000000000 1163.1197509765625
(JRA55_data)fdbg fsnow 0.0000000000000000 438.04678344726562
(JRA55_data)fdbg Tair 0.0000000000000000 302.18690999348962
(JRA55_data)fdbg uatm -12.503940264383953 303.66243489583337
(JRA55_data)fdbg vatm -20.713456471761070 23.780344009399414
(JRA55_data)fdbg Qa -21.591157277425133 23.000778198242188

While fsw is no longer negative, Qa is negative. Plus, the uatm component has exceptionally large values. So, I am back to thinking it is the scattering.

@dabail10
Contributor Author

dabail10 commented Feb 18, 2022

This is the temperature Tair field.

[Screenshot of the Tair field, 2022-02-18]

@apcraig
Contributor

apcraig commented Mar 14, 2022

Debugging now. There was a problem in the query_field implementation: only the master task was getting a valid value. I have no idea how the restart read didn't deadlock on the scatter, but it didn't, and the non-master values were unset. I have fixed that and am trying to do some further validation. I may not get exact restart working, but I am hoping to at least validate that fields are being read correctly. I will PR a fix to cgridDEV soon.
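
The general failure pattern can be illustrated without MPI by modelling each task as one array element. This is only a sketch under assumptions: the names has_field, ntasks, master and read_field are hypothetical, and the actual query_field and scatter code in CICE is not reproduced here. When only the master task knows whether a field exists in the file, the other tasks take a different path and are left with unset data. Copying the master's answer to every task first (the role a broadcast plays in an MPI code) makes all tasks behave consistently.

```fortran
program query_broadcast_sketch
   implicit none
   integer, parameter :: ntasks = 4, master = 1
   logical :: has_field(ntasks)     ! per-task result of the (hypothetical) query
   real    :: local_val(ntasks)     ! per-task value after the read

   ! Problem pattern: only the master task holds a valid query result.
   has_field = .false.
   has_field(master) = .true.
   call read_field(has_field, local_val)
   print *, 'master-only query:', local_val

   ! Fixed pattern: every task receives the master's answer before the read.
   has_field = has_field(master)
   call read_field(has_field, local_val)
   print *, 'after broadcast:  ', local_val

contains

   ! Each "task" fills its value only if it believes the field exists.
   subroutine read_field(present_flag, val)
      logical, intent(in)  :: present_flag(:)
      real,    intent(out) :: val(:)
      integer :: i
      val = 0.0                       ! stands in for an unset value
      do i = 1, size(val)
         if (present_flag(i)) val(i) = 100.0 + real(i)
      end do
   end subroutine read_field
end program query_broadcast_sketch
```

In the first call only the master element receives a non-zero value, which mirrors the fsw_data output earlier in this thread where only task 0 held valid data.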

@apcraig
Contributor

apcraig commented Mar 15, 2022

Fixed with apcraig#66. Restarts are now bit-for-bit for C/CD in current gridsys test suite. We may need to expand testing as we move forward. If new issues arise, new issues can be created.

@apcraig apcraig closed this as completed Mar 15, 2022