Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cam-fv fails to exclude rttov obs from upper levels when no_obs_assim_above_level > 0 #296

Closed
kdraeder opened this issue Oct 2, 2021 · 22 comments
Assignees
Labels
Bug Something isn't working CAM Community Atmosphere Model

Comments

@kdraeder
Copy link
Contributor

kdraeder commented Oct 2, 2021

Describe the bug

Offsetting errors in model_mod allowed assimilations using rttov radiances to mostly appear to work,
but obs that should have been excluded by the namelist variable no_obs_assim_above_level were not.

  1. List the steps someone needs to take to reproduce the bug.
    Set up and run a CAM6 assimilation
  • of rttov, channel 11 radiances using VERTISPRESSURE and height 25 hPa
  • with no_obs_assim_above_level = 5 (~35 hPa in CAM6)
  1. What was the expected outcome?
    No obs should be assimilated
  2. What actually happened?
    All obs were assimilated

Which model(s) are you working with?

cesm2_1:CAM6 in models/cam-fv

Version of DART

Which version of DART are you using? Manhattan (at CMCC)

Have you modified the DART code?

Yes, to fix the bugs, only in cam-fv/model_mod.f90:

  1. vert_value < no_assim_above_level should compare against no_obs_assim_above_level
  2. Fixing that revealed that the call to obs_too_high should not be in model_interpolate. It can abort the counting of model levels by rttov. It belongs in get_close_{obs,state}
  3. When called from get_close_X obs_too_high needs a section handling VERTISSCALEHEIGHT, which means that no_assim_above_scaleh must be a global variable.

If your code changes are available on GitHub, please provide the repository.
I'm waiting for test results of the fixed code from Gio at CMCC.
Then I'll push the changes to NCAR/DART:Manhattan.

Build information

Please describe:

  1. The machine you are running on: Cheyenne
  2. The compiler you are using: intel
@kdraeder kdraeder added Bug Something isn't working Minor Minor functionality, non-critical data; easy fix labels Oct 2, 2021
@kdraeder kdraeder self-assigned this Oct 2, 2021
@nancycollins
Copy link
Collaborator

sounds good, except you want to make a branch with the changes and make a pull request, right? the fixed code should go on the main branch along with release notes/change log.

@kdraeder
Copy link
Contributor Author

kdraeder commented Oct 4, 2021

Thanks for the git workflow reminder. I lose track of that when I'm in the midst of fixing things.
I still have work to do to join the 21st century ...

@hkershaw-brown
Copy link
Member

Couple of questions on this.

What is special about radiance observations, that they are not being excluded?
would a different type of observation with:

VERTISPRESSURE and height 25 hPa
with no_obs_assim_above_level = 5 (~35 hPa in CAM6)

be excluded correctly?

Can you get a test case from Gio that we can run on Cheyenne? Or share your test case so we have a reproducer for this bug. Gio is not running the latest version of DART v9.11.11, correct?

@hkershaw-brown hkershaw-brown removed the Minor Minor functionality, non-critical data; easy fix label Oct 15, 2021
@kdraeder
Copy link
Contributor Author

The radiance obs above no_obs_assim_above_level were assimilated because obs_def_rttov_mod.f90
calls interpolate with VERTISLEVEL. Model_interpolate passes this to obs_too_high, which compares it against
no_assim_above_level, which is set to -1 during initialization, but never updated to the user-provided value in
no_**obs_**assim_above_level. No levels are less than -1, so none are excluded.

This happens only for radiance obs. Conventional obs at 25 hPa were excluded in Gio's tests.
Gio's radiance obs are defined with VERTISPRESSURE and a height which depends on the satellite channel.

When the comparison in obs_too_high is changed from no_assim_above_level to no_obs_assim_above_level,
then rttov uses model_interpolate to count the levels in CAM, starting with level 1, which is the top level in CAM.
Obs_too_high then finds that 1 < no_obs_assim_above_level (5) and fails.

This is what prompted moving obs_too_high out of model_interpolate and into get_close_{state,obs},
which required more changes.

I'm highly confident that Gio is not running the latest DART.
I'll ask him for some data files and any code and script changes that are needed to use them.

@nancycollins
Copy link
Collaborator

kevin is right about how radiance obs work, but i'll clarify that the forward operator for any obs type that used any of the available vertical types and looped trying to query the model_mod for "all" values would have failed the same way. the most common use is for a forward operator to construct locations with VERTISLEVEL (since there's no interpolation required), and then query the model_mod for values at all levels and compute something from that column of values.

but any forward operators trying to call the code with multiple pressure levels or heights from a forward operator could fail in the same way. it just happened that radiance obs were the first obs with a forward operator that used this strategy to run with cam. (something like total precipitable water would do the same.) the lesson is that as kevin says, checking obs_too_high() in interpolate is a bad strategy, and instead moving that function to the get_close() routines is much better.

also - the intent was that the init code would assign: no_assim_above_level = no_obs_assim_above_level, to be consistent with the other no_assim_above_... variables and the check routine would use that. when the no_assim_above_scale_height variable get added, maybe this can be changed?

@kdraeder
Copy link
Contributor Author

Here are instructions from Gio about setting up a test case 2 different ways.

The first refers to scripts for running the ensemble of hindcasts in sequence (#53 )
and (apparently) assimilating using rttov, which are on github:
giovanniconti83/suite_assimilation-atmo-CMCC

  1. You need to set in /main_scripts/case_create.sh the variable RAD=TRUE.
  2. Some paths for the data must be changed also in the script /case_archive/case.templaterad.original
    That is basically the setup_{advanced,hybrid} script.

The second way is outlined here.

  1. the CAM initial file are the .i produced as in your setup script.
    The needed variables cannot be saved in the.i file and must be saved in a .h0 file.
    I added these lines to your setup script
echo " empty_htapes  = .true. "                        >> ${fname}
echo " fincl1        = 'PHIS:I', 'LANDFRAC:I','TREFHT:I', 'TS:I', 'ICEFRAC:I', 'OCNFRAC:I'"   >> ${fname}
echo " nhtfrq        = -6 "                      >> ${fname}
echo " mfilt         = 1 "                             >> ${fname}

These variables, or the derived ones, are computed in the assimilation script with NCO command
(in /main_scripts/cases_assimilate.sh that is a modified version of your script)
before calling DART,:
This cycle is in the Block 6 before the computation of the inflation

if ( $radmode == TRUE ) then
  set inst=1
  while ( $inst <= $nens )
     set inst_string=`printf _%04d $inst`
     set ftoass="$CASESRUNROOT/${case_name}$inst_string/run/${case_name}$inst_string.cam.i.${ATM_DATE_EXT}.nc"
     set ftoassh="$CASESRUNROOT/${case_name}$inst_string/run/${case_name}$inst_string.cam.h0.${ATM_DATE_EXT}.nc"
     echo "Add TS, TREFHT, LANDFRAC, QVMR to $ftoass"
     ncks -A -C -v TS $ftoassh $ftoass
     ncks -A -C -v TREFHT $ftoassh $ftoass
     ncap2 -A -s "QVMR=Q/(1-Q)" -v $ftoass $ftoass
     rm tmp.nc
     ncap2 -s 'where(LANDFRAC>OCNFRAC) LANDFRAC=0; elsewhere LANDFRAC=1' -v $ftoassh tmp.nc
     ncks -A -C -v ICEFRAC $ftoassh tmp.nc
     ncap2 -A -s 'where(ICEFRAC>0.5) LANDFRAC=2' -v  tmp.nc $ftoass
     rm tmp.nc

     @ inst++
  end
endif

Since these added variables are not updated by the filter we need to add by hand the inflation for these fields

if ( $radmode == TRUE ) then

    echo "add TS TREFHT LANDFRAC QVMR to the inflation files"
    ncks -C -v TS $CASESRUNROOT/${case_name}_0001/run/${case_name}_0001.cam.i.${ATM_DATE_EXT}.nc tmp_TS.nc
    ncap2 -O -s "TS=TS*0+1" tmp_TS.nc tmp_m_TS.nc
    ncap2 -O -s "TS=TS*0" tmp_TS.nc tmp_s_TS.nc
    ncks -A -v TS tmp_m_TS.nc input_priorinf_mean.nc
    ncks -A -v TS tmp_s_TS.nc input_priorinf_sd.nc
    ncks -C -v TREFHT $CASESRUNROOT/${case_name}_0001/run/${case_name}_0001.cam.i.${ATM_DATE_EXT}.nc tmp_TREFHT.nc
    ncap2 -O -s "TREFHT=TREFHT*0+1" tmp_TREFHT.nc tmp_m_TREFHT.nc
    ncap2 -O -s "TREFHT=TREFHT*0" tmp_TREFHT.nc tmp_s_TREFHT.nc
    ncks -A -v TREFHT tmp_m_TREFHT.nc input_priorinf_mean.nc
    ncks -A -v TREFHT tmp_s_TREFHT.nc input_priorinf_sd.nc
    ncks -C -v LANDFRAC $CASESRUNROOT/${case_name}_0001/run/${case_name}_0001.cam.i.${ATM_DATE_EXT}.nc tmp_LANDFRAC.nc
    ncap2 -O -s "LANDFRAC=LANDFRAC*0+1" tmp_LANDFRAC.nc tmp_m_LANDFRAC.nc
    ncap2 -O -s "LANDFRAC=LANDFRAC*0" tmp_LANDFRAC.nc tmp_s_LANDFRAC.nc
    ncks -A -v LANDFRAC tmp_m_LANDFRAC.nc input_priorinf_mean.nc
    ncks -A -v LANDFRAC tmp_s_LANDFRAC.nc input_priorinf_sd.nc
    ncks -C -v QVMR $CASESRUNROOT/${case_name}_0001/run/${case_name}_0001.cam.i.${ATM_DATE_EXT}.nc tmp_QVMR.nc
    ncap2 -O -s "QVMR=QVMR*0+1" tmp_QVMR.nc tmp_m_QVMR.nc
    ncap2 -O -s "QVMR=QVMR*0" tmp_QVMR.nc tmp_s_QVMR.nc
    ncks -A -v QVMR tmp_m_QVMR.nc input_priorinf_mean.nc
    ncks -A -v QVMR tmp_s_QVMR.nc input_priorinf_sd.nc
    rm tmp_*.nc

endif

The code above is always in block 6 just after

if ($USING_PRIOR_INFLATION == true) then
   if ($PRIOR_INFLATION_FROM_RESTART == false) then

      echo "inf_flavor(1) = $PRIOR_INF, using namelist values."

   else
      # Look for the output from the previous assimilation (or fill_inflation_restart)
      # If inflation files exists, use them as input for this assimilation
      (${LIST} -rt1 *.dart.rh.${scomp}_output_priorinf_mean* | tail -n 1 >! latestfile) > & /dev/null
      (${LIST} -rt1 *.dart.rh.${scomp}_output_priorinf_sd*   | tail -n 1 >> latestfile) > & /dev/null
      set nfiles = `cat latestfile | wc -l`

      echo "nfiles = $nfiles "

      if ( $nfiles > 0 ) then

         set latest_mean = `head -n 1 latestfile`
         set latest_sd   = `tail -n 1 latestfile`

         echo "latest_mean = $latest_mean"
         echo "latest_sd = $latest_sd"
         # Need to COPY instead of link because of short-term archiver and disk management.
         ${COPY} $latest_mean input_priorinf_mean.nc
         ${COPY} $latest_sd   input_priorinf_sd.nc
  1. a small file with Ch8,9,10,11 is attached. It is the usual test for 2017.01.15.21600.
    [reanrttovtestcaseatncar.zip]

  2. Here attached also the input.nml modified by the scripts in order to manage the 30 member ensemble.

  3. Remember that for the clearsky assimilation you also need to have
    inside the directory containing the DART filter (you can see this in the case_create.sh file):

if [[ "$RAD" == "TRUE" ]]; then
  echo " Copy RTTOV db file"
  ${COPY} $dartroot/observations/forward_operators/rttov_sensor_db.csv $tmpdir/.
  ${COPY} ${rttovdir}/rtcoef_rttov12/rttov7pred54L/rtcoef_eos_2_amsua.dat  $tmpdir/.
  ${COPY} ${rttovdir}/rtcoef_rttov12/mietable/mietable_eos_amsua.dat  $tmpdir/.
fi

Actually the mietable_eos_amsua.dat is not necessary for the test we are doing
(It is necessary if you set to false the clearsky option of rttov namelist.)

In order to create the obs_seq_ file with the vertical location I modified
/DART/observations/obs_converters/AIRS/amsua_netCDF_support_mod.f90
adding (here attached the complete file):

!which_vert = VERTISUNDEF
which_vert = 2 !VERTISPRESSURE

 ! column integrated value, so no vertical location (without vertical
         ! location it is impossible to use the vertical localization)
         ! Following Xie at all 2018
         select case (ichan)
           case (8)
              vloc = 18000.0_r8
           case (9)
              vloc = 9000.0_r8
           case (10)
              vloc = 5000.0_r8
           case (11)
              vloc = 2500.0_r8
           case (12)
              vloc = 1200.0_r8
           case (13)
              vloc = 500.0_r8
           case (14)
              vloc = 200.0_r8
           case default
              write(*,*) "ERROR: AMSU-A Channel for AQUA satellite not working!"
              cycle channel_loop
         end select

[ KR: This is added functionality, which I believe is not needed for setting up a test case ].
In order to check how stable is the peak of the weighting function (as suggested by jeff),
before starting to change something inside the model mode,
I'm modifying the obs_def_rtto_mod.f90 file adding

! GC
real :: wfmax_loc

subroutine get_rttov_wfmax_loc(runtime, channel, wfmax_loc)

     type(rttov_sensor_runtime_type), pointer  :: runtime
     integer, intent(in)                       :: channel
     real, intent(out)                         :: wfmax_loc
     real                                      :: wfmax
     real, allocatable, dimension(:)           :: wf, tau, p
     integer                                   :: nlev,ens_size, ii, imem, allo_stat, deallo_stat
     integer :: ii_loc(1)

     wfmax = 0.0

     ens_size=size(runtime%profiles(:))
     nlev=size(runtime%profiles(1)%p(:))

     ! allocate the weighting function array
     allocate(wf(1:nlev), tau(1:nlev), p(1:nlev), stat=allo_stat)
     if (allo_stat/=0) then
       print*, "Weighting function profile allocation is failed!"
       print*, "Error code: ", allo_stat
       stop
     end if

     tau = runtime%transmission%tau_levels(:,channel)
     do imem = 1,ens_size
        p   = runtime%profiles(imem)%p(:)

        ! compute the profile following Matricardi suggestions
        do ii = 1, nlev
           wf(ii) = -( (tau(ii)-tau(ii+1))/(p(ii)-p(ii+1)) )*( p(ii)+p(ii+1) )/2
        end do

        wfmax=maxval(wf)
        ii_loc = FINDLOC(wf, VALUE=wfmax)
        wfmax_loc = wfmax_loc + p(ii_loc(1))
     end do

     wfmax_loc = wfmax_loc / ens_size

     ! deallocate
     deallocate(wf, tau, p, stat=deallo_stat)
     if (deallo_stat/=0) then
       print*, "Weighting function profile deallocation is failed!"
       print*, "Error code: ", deallo_stat
       stop
     end if

end subroutine

And in fo_forward_model:

call get_rttov_wfmax_loc(runtime, channel, wfmax_loc)

print*, 'wfmax_loc = ', wfmax_loc

@hkershaw-brown
Copy link
Member

Is it possible to just get a tar file of the restarts, inflation files and obs_seq.out out files he is using?

@kdraeder
Copy link
Contributor Author

The obs_seq.out file is included in the zip file attached part way through my previous comment.

I could probably get the restarts, but we have some locally that should work:
casper: /gpfs/csfs1/cisl/dares/Reanalyses/f.e21.FHIST_BGC.f09_025.CAM6assim.011/rest/2017-01/f.e21.FHIST_BGC.f09_025.CAM6assim.011.0001.alltypes.2017-01-02-00000.tar
Either that can be replicated by links to form an ensemble, or there are more members in the same place;
0001 -> 0002, ...

Inflation files can be found in
/gpfs/csfs1/cisl/dares/Reanalyses/f.e21.FHIST_BGC.f09_025.CAM6assim.011/esp/hist/2017-01/f.e21.FHIST_BGC.f09_025.CAM6assim.011.dart.rh.cam_output_priorinf_mean.2017-01.tar
and ...sd.2017-01.tar

@hkershaw-brown
Copy link
Member

ok cool. I shall grab these and reproduce this with DART:main.

thanks

@hkershaw-brown hkershaw-brown added the CAM Community Atmosphere Model label Feb 3, 2022
@kdraeder
Copy link
Contributor Author

kdraeder commented Sep 10, 2022

Gio gave me access to a github fork of the code that generated the "curiosity", which appears in the obs_seq.final files. Obs which should be excluded by cam-fv/model_mod.nml:no_obs_assim_above_level end up with a DART QC = 0 (assimilated) and differences between the priors and posteriors. The problem arises because the get_close_xxx calls from filter_assim pass back num_close = 0, but do not pass QC information. The get_close routines don't even pass back a status variable, so there's no conventional way for model_mod to signal that an ob has been excluded because of DARTQC_NOT_IN_NAMELIST. This incorrect obs_seq.final data will show up in obs space diagnostics, which will show high observations as being assimilated, when they weren't.
An example is observation 100 in
/glade/scratch/raeder/Debug_highobs0/run/Check_Setting/Debug_highobs0.dart.e.cam_obs_seq_final.2017-10-01-21600:

OBS          100
  12.8051681637842
  12.7115463592079           prior_mean
  12.7165606318343           posterior_mean
 2.809426256590811E-002
 2.671526798046297E-002
  12.7416650986981
  12.7436166598793
  12.6860498568220
  12.6901996893429
  12.7069241221035
  12.7158655462808
 0.000000000000000E+000
 0.000000000000000E+000      DART qc  (incorrect)
         99         101          -1
obdef
loc3d
   0.5586342831883179        0.9539198770527996E-02     23000.00      3     
                                                        23 km is above the exclusion
kind
          4
gpsroref     100
 0.00+000  0.00+000  0.00+000  0.00+000  0.00+000  0.00+000 GPSREF
31226     152214
 2.412720521198785E-003

I've set up a small test case (@hkershaw-brown), which uses a single GPS profile (instead of rttov, following @nancycollins suggestion) and 3 members (on 9 nodes):

DART: /glade/u/home/raeder/DART/cmcc_kevin-dbg/models/cam-fv/shell_scripts/cesm2_1
case: /glade/work/raeder/Exp/Debug_highobs2
rundir: /glade/scratch/raeder/Debug_highobs2/run

which includes a potential (partial?) fix. For obs higher than no_obs_assim_above_level,
get_close_xxx are modified to return

num_close = -1 * DARTQC_NOT_IN_NAMELIST,

which is a hard coded parameter because trying to use the QC module leads to circular dependency. Assim_tools_mod:filter_assim is modified to:

  1. check num_close for a negative value,
  2. change the obs_ens_handle%copies(OBS_GLOBAL_QC_COPY, owners_index) to be DARTQC_NOT_IN_NAMELIST, and
  3. set num_close = 0, which causes the SEQUENTIAL_OBS loop to cycle.

Nancy warns that we can't count on this to work because the parallelism prevents(?) the variables from being reliably available. But it worked in my small test case; the obs that are higher than no_obs_assim_above_level have QC = DARTQC_NOT_IN_NAMELIST (unless they have some other non-0 QC).

The priors and posteriors are still different, but those obs will be excluded from obs space diagnostics, and for the right reason.
/glade/scratch/raeder/Debug_highobs2/run/Debug_highobs2.dart.e.cam_obs_seq_final.2017-10-01-21600:

OBS          100
  12.8051681637842
  12.7115808661031          prior_mean
  12.7165150633910          posterior_mean
 2.692235294416099E-002
 2.565008872577182E-002
  12.7404379427365
  12.7425537273919
  12.6871390898394
  12.6912720644110
  12.7071655657335
  12.7157193983701
 0.000000000000000E+000
  5.00000000000000          DART qc  (correct)
         99         101          -1
obdef
loc3d
   0.5586342831883179        0.9539198770527996E-02     23000.00000000000      3
kind
          4
gpsroref     100
 0.00+000  0.00+000  0.00+000  0.00+000  0.00+000  0.00+000 GPSREF
31226     152214
 2.412720521198785E-003

This case has inflation turned off, so that is not the cause of the increments. The differences in (estimated) priors and posteriors probably comes from other, accepted observations just below the no_obs_assim_above_level causing increments in this ob that's above that level. The no-inflation case has slightly different values than the inflation case.

I have not explored whether this problem exists in other models.

@nancycollins
Copy link
Collaborator

hi kevin,

that's good sleuthing so far. here's a suggestion for the next test.

make the input file only a single GPS obs, not a full GPS profile. make sure you select an obs that is above the threshold.

change both the get_close_state and get_close_obs code to return 0 numbers of close state/obs for obs above the threshold (so don't use the new -1 code). yes, that means this obs "is assimilated" but it should not have any impact on any of the state.

if you see differences in the state and/or differences in the prior/posterior then something else is going on and that needs to be tracked down. otherwise, it's working as intended. it would be good to establish that the code is working before trying to attribute changes to something else.

@nancycollins
Copy link
Collaborator

as a design point, even if you can update QCs in the filter loop and make it work under all data layouts (with transpose, without transpose, etc), you still have to deal with the fact that each task has a different list of obs and state variables it is responsible for. without communication between tasks, which is expensive and usually forces a barrier/sync point (bad for speed), it is possible for some tasks to return an error and some not. you can say if that happens it is a bug, but it is bad design to implement something where an error in user code cannot even be detected.

if your intent is to know which observations did not contribute to changing the results of the assimilation - for any reason - that might be worth discussing. perhaps the localization distance is too small, perhaps the obs type is explicitly excluded by the obs_impact tool, perhaps there is model specific code in the get_close routines to exclude them. in that case, it might be possible to add an optional call to the global sum routine to count up the overall number of state items impacted by this obs (at a performance hit). any obs with 0 impact could have a new QC value set which indicates that obs had no impact on the results of the assimilation. that could be interesting for understanding what observations do and don't matter to your results.

you would have to update the diagnostic routines to take that new value into account, but it's a more general solution.

@kdraeder
Copy link
Contributor Author

The single ob test @nancycollins suggested shows no increments in state space or obs space.
So the increments I saw in earlier tests do seem to be from lower obs that were supposed to be assimilated.

When I described the changes to filter_assim I neglected to include that the
"change the obs_ens_handle%copies(OBS_GLOBAL_QC_COPY, owners_index) to be DARTQC_NOT_IN_NAMELIST,"
has a conditional on it:
if(ens_handle%my_pe == owner)
which is the same conditional that's used to get an existing QC for this ob
and calculate the obs prior and increment, instead of getting them from a broadcast.
Does that prevent filter from generating inconsistent data on different tasks?
It seems like it should; if each ob is handled by only 1 task,
and the code says "don't assimilate it", then nothing is done to state variables or obs
on any tasks.

@nancycollins
Copy link
Collaborator

kevin, that code in filter does not solve the parallel problem. it only works because you wrote your model-specific get_close routines so that every task returns the same result. selecting any task would give you the same answer. if another non-owner task returned a different result this code would not catch it (and cannot without a global mpi operation).

you did make something work for your very specific use case but it is not a general solution for anyone else and shouldn't be added to the main filter code. one thing i've done in design is to think about how i would document a change and critically, how someone else might use a new feature for a different purpose with a different model. if i can't do that, then it is too narrow a solution.

i'd propose the cam model_mod simply return 0 from get_close for high obs. the obs then has no impact on the results. yes, it won't be marked with a different QC but other obs lower down also may not have an impact on the results and they aren't called out in the diagnostics. if that is important, then the more general solution is one i outlined above.

@hkershaw-brown
Copy link
Member

throwing this out there: wrf-dart's solution to this problem is to to have an obs preprocessor.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
! wrf_dart_obs_preprocess - WRF-DART utility program that at a
! minimum removes all observations outside
! of a WRF domain and add observations from
! supplimental obs sequences. The program
! assumes all data is from one observation
! time. In addition, this program allows
! the user to do the following functions:
!
! - remove observations near the lateral boundaries
! - increase observation error near lateral boundaries
! - remove observations above certain pressure/height levels
! - remove observations where the model and obs topography are large
! - remove significant level rawinsonde data
! - remove rawinsonde observations near TC core
! - superob aircraft and satellite wind data
!
! created Oct. 2007 Ryan Torn, NCAR/MMM
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

@kdraeder
Copy link
Contributor Author

I was coming to the same conclusion about a preprocessor, as unsatisfying as that is.
It would involve a fair amount of work to write and test the program, modify the scripts,
and propagate the changes to current users. It might slow down the cycling too,
since we'd need to check every ob for its vertical location, and write out a new obs_seq file,
when only a small fraction would be excluded.

But before that I want to make one more attempt to identify what I'm misunderstanding.
Then I'll go back to my day job and quit adventuring in filter code.

In filter_assim the SEQUENTIAL_OBS loop allegedly cycles over all observations:

! Loop through all the (global) observations sequentially
SEQUENTIAL_OBS: do i = 1, obs_ens_handle%num_vars

Each task owns a subset of those.
Those need to have their priors and increments calculated by the owner,
while all the other obs get the priors and increments from a broadcast.

(Ignoring caching) Then each task calls get_close_obs,
for both the obs it owns and the ones it doesn't.
! Everybody is doing this section, cycle if qc is bad
(This seems like a lot of redundant calculation, which makes me suspect that
SEQUENTIAL_OBS loops over a subset of obs.)
The cam-fv model_mod get_close_obs checks the vertical location of the ob,
and if it's too high it sends back num_close = 0.
So every task knows that num_close = 0 for that ob.
If a different value were returned, then every task would know that that ob had that value.

Is there, or could there be, a model_mod which knows whether an ob is owned by a given task,
so that different tasks could get different num_close values back from get_close?
So far I don't see how that information could be passed to model_mod through get_close_obs.

I understand that get_close returning num_obs = 0 leads to no impact on the assimilation,
but it leads to the appearance of an impact when looking at obs space diagnostics,
with no way to distinguish correct and incorrect QC=0 obs. The obs space software
sees QC = 0 and includes that ob in the statistics and pictures of prior or posterior
estimated obs, even if it wasn't assimilated.

I can see how the estimated obs for the too-high ob could be needlessly updated by other obs,
due to the lack of syncing, but if the QC in the obs_seq file correctly signals that the ob was not assimilated,
then those ob estimates won't be used for anything, both during the assimilation
and in the obs space processing.

I'm wondering whether another solution would be to set the distances very large,
in addition to num_close = 0. That might also prevent other obs from updating
the ob estimates. But that would distort the obs space in a different way.
Excluded obs would still look like they were assimilated, but end up identical to the actual ob,
so they would reduce the bias and RMSE estimates from what they should be.

(Full disclosure; I didn't write these cam-fv get_close variants or define their argument lists).

@nancycollins
Copy link
Collaborator

see my comments below:

I was coming to the same conclusion about a preprocessor, as unsatisfying as that is. It would involve a fair amount of work to write and test the program, modify the scripts, and propagate the changes to current users. It might slow down the cycling too, since we'd need to check every ob for its vertical location, and write out a new obs_seq file, when only a small fraction would be excluded.

if your goal is to replicate results of previous runs, then i think the preprocessing proposal is the only one that will work. i don't see it as unsatisfying. there are 2 existing preprocessing programs already in existance (wrf and mpas) but they do so much more than you need. you can start with the obs_loop program and add code similar to what's in the cam model_mod to exclude high obs. it would run fast and compared to the overall cycle time of cam/dart, it would be in the noise.

But before that I want to make one more attempt to identify what I'm misunderstanding. Then I'll go back to my day job and quit adventuring in filter code.

In filter_assim the SEQUENTIAL_OBS loop allegedly cycles over all observations:

! Loop through all the (global) observations sequentially
SEQUENTIAL_OBS: do i = 1, obs_ens_handle%num_vars

Each task owns a subset of those. Those need to have their priors and increments calculated by the owner, while all the other obs get the priors and increments from a broadcast.

(Ignoring caching) Then each task calls get_close_obs, for both the obs it owns and the ones it doesn't. ! Everybody is doing this section, cycle if qc is bad (This seems like a lot of redundant calculation, which makes me suspect that SEQUENTIAL_OBS loops over a subset of obs.)

this is not redundant. you're missing that each task has its own copy of the state and obs data. each task takes the current obs and then works independently to apply the increments. the SEQUENTIAL_OBS loop goes through every obs.

The cam-fv model_mod get_close_obs checks the vertical location of the ob, and if it's too high it sends back num_close = 0. So every task knows that num_close = 0 for that ob.

yes, because you added code to the part of get_close that is the same on each task. each task is doing exactly the same computation, independently so it works for your specific case. but only the observation is the same in the get_close call. the list of state and obs locations, types, etc are different on each task. so if you wanted to do something different with this code, where some tasks might say 0 and some might say another number, then this strategy doesn't work. so yes, you have a solution to a very narrow problem but it's in no way general.

If a different value were returned, then every task would know that that ob had that value.

Is there, or could there be, a model_mod which knows whether an ob is owned by a given task, so that different tasks could get different num_close values back from get_close? So far I don't see how that information could be passed to model_mod through get_close_obs.

in filter right now, different tasks do return different number of num_close values. that's where the speedup of parallelism comes in. each task has a different subset of the state. it gets passed the current ob and independently decides which parts of the subset of the state that is on its own task are close enough to be impacted by this obs. there is currently no need to know which tasks found close obs and which did not.

I understand that get_close returning num_obs = 0 leads to no impact on the assimilation, but it leads to the appearance of an impact when looking at obs space diagnostics, with no way to distinguish correct and incorrect QC=0 obs. The obs space software sees QC = 0 and includes that ob in the statistics and pictures of prior or posterior estimated obs, even if it wasn't assimilated.

this is slightly the wrong way to look at this, in my opinion. would you say "an obs was assimilated" if it had no impact on the results? there are lots of ways that could happen. the cutoff distance could prevent an obs from impacting the state.
the obs_impact_tool could prevent an obs from impacting the state. there could be many other obs which have 0 impact on the resulting output state, and you currently have no way to know that either. so there's nothing special about high obs that might not also be true for other obs. it's a false statement to say that marking high obs with a different QC identifies all obs which are not changing the output state.

I can see how the estimated obs for the too-high ob could be needlessly updated by other obs, due to the lack of syncing, but if the QC in the obs_seq file correctly signals that the ob was not assimilated, then those ob estimates won't be used for anything, both during the assimilation and in the obs space processing.

that's true but not relevant. other obs may also not contribute to the change in model state and they are marked with a QC of 0.

I'm wondering whether another solution would be to set the distances very large, in addition to num_close = 0. That might also prevent other obs from updating the ob estimates. But that would distort the obs space in a different way. Excluded obs would still look like they were assimilated, but end up identical to the actual ob, so they would reduce the bias and RMSE estimates from what they should be.

i'm not sure what you mean by 'identical to the actual ob'? the forward operator computes a prior and posterior estimate, and the obs itself has a value. the prior and posterior may be changed by nearby obs, or not. that's true for any number of other observations in other locations in the model.

(Full disclosure; I didn't write these cam-fv get_close variants or define their argument lists).

the get_close interfaces are called by filter and must have identical argument lists for all models that are part of dart. they cannot be changed unless all models change their model_mod code to match. this only happens when there is a compelling case to add a feature that cannot be accomplished any other way and is of high priority for multiple models.

if you want to schedule a video call, i'd be happy to hash out more about how the parallelism works and what can and cannot be done without communicating between tasks. but i think you shouldn't give this solution to gio because it's not standard. if sometime in the future someone installs a standard assim_tools_mod.f90 with a nonstandard cam model_mod.f90, or vice versa, bad things will happen. better to have a solution that's consistent with the main dart branch of the code.

@hkershaw-brown
Copy link
Member

Sequential obs do is around all observations

! Loop through all the (global) observations sequentially
SEQUENTIAL_OBS: do i = 1, obs_ens_handle%num_vars

This bit:

! Everybody is doing this section, cycle if qc is bad
   if(nint(obs_qc) /= 0) cycle SEQUENTIAL_OBS

Everybody (every processor) is doing the the calculations (get_close, increments, ...) for the same observation 'i', but on their part of the state and their forward operators. This is an example with 4 processors and 5 obs:

proc 1 proc 2 proc 3 proc 4
i=1 num_close for obs1 0 3 15 0
i=2 num_close for obs2 1 8 0 19
i=3 bad qc skip
i=4 num_close for obs3 0 0 0 0
i=5 num_close for obs4 7 7 3 0

There is pseudo code for assim_tools_mod.f90 here:
https://github.com/NCAR/DART-software-series/blob/main/pseudo_code/assim_tools_mod-pseudo.f90
Note this code will not compile, it is just to show the key parts of filter_assim. Might be helpful, might not.

For a non-prerocessor option:
Can you convert all the observations verticals before sequential obs do using convert_all_obs_verticals_first=.true.
and have convert_vertical_obs fail for the high obs?

subroutine convert_vertical_obs(ens_handle, num, locs, loc_qtys, loc_types, &
which_vert, my_status)

This gives you a istatus you can return to filter assim to trigger a 'bad obs' before sequential obs do. You could have convert_vertical_obs log the too high obs.

@kdraeder
Copy link
Contributor Author

Yes, if convert_all_obs_verticals_first=.true., then convert_vertical_obs already returns a my_status
which is already used to make a non-0 QC assignment (DARTQC_FAILED_VERT_CONVERT).
But the user-controlled conditional on the call means that the user might prevent that assignment
without being aware of it. This won't be fixed by a preprocessor, because failed vertical conversions
can happen for reasons which may not be known before filter runs. So the problem of QC=0 for obs
that have not been assimilated is broader than the cam namelist issue (no_obs_assim_above_level).

Comments earlier in assim_tools imply that allowing users this control is important for efficiency reasons:
logical :: convert_all_obs_verticals_first = .true.
! ... this depends on how many obs per task and whether the mean is distributed or replicated on each task.
so it may not be wise to take that control away. I can't judge that at this point.

By the way, I finally see why returning a QC code from get_close_obs won't work in the current code.
Each task loops through all obs in the same order, but if it doesn't own an ob, it gets an ob's info
from a different task . That task may not have have called get_close_obs and set the QC value yet.
It seems like this could be fixed by having each task handle its own obs first,
then loop through the obs is doesn't own, so that all the QCs have been set before any fetches.

@nancycollins
Copy link
Collaborator

Yes, if convert_all_obs_verticals_first=.true., then convert_vertical_obs already returns a my_status
which is already used to make a non-0 QC assignment (DARTQC_FAILED_VERT_CONVERT).
But the user-controlled conditional on the call means that the user might prevent that assignment
without being aware of it. This won't be fixed by a preprocessor, because failed vertical conversions
can happen for reasons which may not be known before filter runs. So the problem of QC=0 for obs
that have not been assimilated is broader than the cam namelist issue (no_obs_assim_above_level).

it seems like either a preprocessor or telling people they have to convert verts first if they are using rttov observations are your two best choices for fixes here. you'd need to put additional code in the vert convert routine to look for high obs because it doesn't do that now. a preprocessor would cleanly flag high obs and then anything that would fail the vert convert for other reasons would continue to work. but you can't use get_close for this.

By the way, I finally see why returning a QC code from get_close_obs won't work in the current code.
Each task loops through all obs in the same order, but if it doesn't own an ob, it gets an ob's info
from a different task . That task may not have have called get_close_obs and set the QC value yet.
It seems like this could be fixed by having each task handle its own obs first,
then loop through the obs is doesn't own, so that all the QCs have been set before any fetches.

good insight. the catch is that as each observation is assimilated, it computes an impact and updates not only the state but also the remaining unassimilated obs. this is because the forward operators were computed on the original state, and as each observation is assimilated it changes the state. there's a paper where jeff proves that if you apply the same update to the forward operator results before you assimilate subsequent obs you get the same answer as computing the forward operators inside the main assimilation loop as the state gets updated.

bottom line is the main loop must process each ob in order to maintain the correct results.

@nancycollins
Copy link
Collaborator

p.s. i like the preprocessor solution better. but if you go the vert convert route you have to be sure you only add code to the exact routine called by filter. other parts of the vert convert calling path can't return errors without putting you back into the original problem of failing to compute values for top-of-model locations needed by rttov.

@hkershaw-brown
Copy link
Member

closing:
pull requst #450 fixed vertical conversion qc8 #401 and QC values being lost if posterior FOs.

For more detail on excluding obs discussion see discussion #436

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working CAM Community Atmosphere Model
Projects
None yet
Development

No branches or pull requests

3 participants