-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: convert_vertical_obs DARTQC_FAILED_VERT_CONVERT #401
Comments
confirmed |
While trying to trace the evolution of QC values I believe that I found
|
I also see QC values of 5,6, and 7 in the obs_seq_finals, |
what fortran code are you making that pseudocode from? if it's the get_dart_qc() routine in quality_control_mod.f90, then there is lots of logic there to avoid overwriting existing QCs and it does different things for the prior and posterior. if the forward operator returns a non-zero istatus from interpolate() for any ensemble member, then it failed the forward operator. is that what you're questioning? |
QC values 2 and 3 are "failed the posterior FO" and "failed the posterior FO for eval-only obs". those are certainly possible. |
The only get_dart_qc that I see is in quality_control_mod. @nancycollins, if you're describing the logic in just get_dart_qc, then I'll point out
It appears to me that it doesn't matter why a forward operator failed, I haven't expected to see QC=2 or 3, so I wasn't looking for them. |
@kdraeder you're totally correct that it doesn't matter why the forward operator failed. a QC=4 says interpolate() returned an error for at least one ensemble member. that istatus array is returned from the interpolate routine, which is only called when computing the forward operators for prior or posterior, not from the call to vertical conversion in the assim_tools() routine. there may a bug there, but the forward operators are called first for all obs to compute the priors, then filter_assim() runs, then the forward operators are called to compute the posteriors. do you suspect the posterior forward operator QC code is overwriting the QC if it's already set to 8? or that a QC=8 isn't getting set in the main assimilation loop code? the filter_assim() code doesn't use get_dart_qc() or get_obs_ens_distrib_state(). could it be that where you added code to return a failed vertical conversion somehow is affecting the interpolate call? if the interpolate routine calls something that calls something that fails a vertical conversion, the interpolate routine will return an istatus /= 0 and it will set a QC=4. an obs that failed the forward operator won't be processed in the main assimilation loop, so the vert conversion won't be called there. sorry - i forget what you said you saw. did you get QC=4 when you tried making it fail the vertical conversion in filter_assim()? or did you get QC=0? |
@nancycollins I edited the bug report so you might have not got the edits by email. |
ah- it's getting a QC=7 which means it failed the outlier threshold test. so the code shouldn't be testing for an outlier threshold if the QC is 8. |
yup, & obs 3 has a qc of 0 in the obs_seq.final |
You two seem to have advanced beyond Nancy's questions, so I won't answer all of them, In one GPS profile in the case where I excluded all obs in the N. hemisphere, Is there a problem with forward_op for the prior using the ob's native vertical coordinate, |
the filter_assim() subroutine makes a copy of the obs location while it is doing the assimilation, so changing the vertical in that copy doesn't affect the original location which is used in both calls to the forward operator code. |
There is a reset of OBS_GLOBAL_QC_COPY to 0 in the forward_operator_mod. Possibly this is what is blasting way QC=8 DART/assimilation_code/modules/observations/forward_operator_mod.f90 Lines 170 to 171 in ba00b2c
|
I looked into the @HKershaw note about QC being reset to 0 My test case is set up to make all obs in the northern hemisphere There is no posterior output from my prints in forward_op, The QC in the obs_seq.final is 0. |
cool thanks for the update Kevin! @kdraeder |
I think I've identified a large part of the problem. There seems to be a missing connection between the obs_fwd_op_ens_handle Meanwhile, convert_all_obs_verticals_first leads to the assignment of QC=8 There's some QC consolidating in quality_control_mod:get_dart_qc, |
that's helpful, kevin. have you tried a test where you do run with posterior forward operators being computed? i suspect that any QC values changed in the assimilation loop were being written to the output seq file in the posterior obs diagnostics call. when we allowed skipping computing the posteriors, we lost those values and we had no tests to catch that. |
Writing the posteriors confirms your suspicion; |
it looks like the main part of the QC=8 bug has been found and can be fixed. i think there is a related bug here that also needs to be addressed. we allow the user to choose whether to convert all the obs and state verticals to the requested vertical localization coordinate first before going into the assimilation loop, or wait and have them converted on demand inside the get_close routines. the original code only had vertical conversions inside the get_close routines, so when we added a choice to do all the vert converts before the main loop, the 'on demand' option was to wait for get_close to be called and that continued to work. the idea of leaving in a choice was that if most of the state is not going to be affected by obs (e.g. the ocean where large parts are unobserved) then it's cheaper to only convert the state items which are actually going to be changed. same with the obs - if most of the obs already have a bad qc or aren't going to be assimilated because their type is being ignored, then it's cheaper to convert them on demand once it's sure they are going to be assimilated. however, it looks to me like we have no way to set QC=8 for failed vertical conversion if we wait and do the obs conversion on demand inside get_close. that leads to inconsistent results. it's possible that the fix for this is to have an explicit call to convert_vertical_obs() for the base obs inside the sequential_obs loop. get_close will never have to do the vert convert for the base obs being assimilated in that loop but that's ok. those routines will still need to convert close state items and other close obs but a failure there doesn't trigger a QC=8. the trick here is to avoid doing a possibly expensive vertical convert if this obs isn't going to be assimilated because of an existing bad QC, and do it only on 1 task and not all. if it's done inside the code block that starts: ! Following block is done only by the owner of this observation then that may work. it would be good if it could look at the QC first before calling vert convert. then there's a broadcast which i think takes care of the other tasks getting the updated vert coordinate. |
This is the idea Kevin getting at in #296. A get_close qc value which lets you know Does the inconsistency matter? (I don't know) For example, if you preprocessed high obs out your results would be inconsistent with failing in get_close. |
no, this is slightly different. i'm saying get_close cannot set a QC ever. if there are 0 close states or obs, so be it. no QC can or should indicate this. what i'm saying is if the vertical conversion for the base obs is deferred until get close and it fails, there is no way to consistently set a QC=8. if all the vertical conversions are done up front that obs will get a QC=8. this came up as a separate issue as i traced through the code based on kevin's issues, but it's distinct. i think it needs to be fixed by calling vert_convert even "on demand" and not letting the conversion of the base obs wait until get_close. |
p.s. fixing this has the side effect of allowing the cam model_mod to choose to fail the vertical conversion of the base obs in the convert_vert routine and consistently having all the high obs have a QC=8 whether the vert converts are done on demand or all up front. otherwise there's what i consider a bug if the obs_seq.final changes based on the setting of 'convert_all_obs_vertical_first'. i don't think using QC=8 for high obs is the best solution in cam since failing the vertical conversion for an obs inside the model domain seems confusing to me but it is a choice on how to implement it. the other solution is to just have get_close return 0 close items so that obs has no effect but is assimilated with a QC=0. removing high obs ahead of time bypasses this entire issue but if you don't the code should still do something consistent. |
I definitely prefer Nancy's idea of converting the base obs up front, regardless of the setting of convert_all_obs_vertical_first, over letting conversion failures (for whatever reason) end up with QC=0 |
If I understand Nancy's description of the parallelism in this context, |
this line: if(ens_handle%my_pe == owner) then says whether this task owns an obs or not. this is indeed known up front. |
just to jump in again, |
maybe the simpler fix here is to stop allowing 'convert_all_obs_verticals_first' to be set by the user. i'm looking at the case where it gets set to false in the namelist. maybe that shouldn't be supported and we should remove it from the namelist. that's pretty disruptive (requiring every input.nml in the system to remove it) so maybe add code to test it for .false. and fail there saying this is a deprecated item that will be removed in the future. |
But that line is inside the SEQ loop, so it looks like every task except one is trying to get |
yes, that's how it works. thus the name "sequential_obs loop". |
to be more helpful - only one mpi task has all the forward operator values for any given observation. so it alone is able to compute the increments. it then broadcasts them to all the other waiting mpi tasks, which then in parallel find close items and applies the regression to them. then the mpi task which has the forward operator values for the next obs computes the increments and broadcasts them to the other tasks. the obs have to be handled one by one for the numerics to work. the get_close routines almost always take much longer than computing increments, so the parallelism results in a net speedup in spite of having to synchronize after each ob. |
Ah, that's what I was missing; the parellelism is about the regressions, not about the increments. |
let me try to summarize what code changes i think are needed to address this issue:
does this seem right to everyone? |
I think only 1. for this bug fix.
|
fair enough - item 2 is a separate bug (i'll open one), and item 3 addresses the cam high obs issue. i can take an initial stab at what the new filter_mod routine would have to do if you want. (tell me if someone else is already working on this.) there is a transpose of the forward operator ensemble in that routine that i think still has to be done but i'm going to have to look at it for a while. |
i pushed a qc8fix branch to the repo with an updated filter_mod.f90 file. it needs more testing, but so far it seems to capture the QC=8 changes correctly. if someone else could take it and test it as well, that would be peachy. thanks. |
@mjs2369 and @kdraeder have tested the updated filter_mod.f90 and found that it solves the problem. @nancycollins I'm curious about whether you considered pulling the QC consolidation out of |
@kdraeder no, i didn't pull that out because "all_copies_to_all_vars()" and the "get_copy()" call involve MPI communication, and there's an allocation of a potentially large temp buffer. i didn't want to do that more than once in the case where you are computing the posteriors. also i think the replication is actually clearer in the typical case of computing posteriors, where it keeps all the consolidation of ensemble mean, sd, FO values and QCs together. edit: to add comment about allocating temp array, and add a comment about code clarity. |
Update: While testing, I found that although @nancycollins changes perfectly fix the problem of the QCs not being updated when compute_posterior = .false. , there is another bug in which the QC values of 8 are not ending up in the obs_seq.final under the specific conditions of compute_posterior = .true. and distributed_state = .false. I have been able to determine that the QC values are being updated incorrectly in the posterior obs_space_diagnostics when distributed_state = .false. :
The values of the obs_temp array, which are used in the call to replace_qc(), are also missing the 8 values, and therefore seq is being updated to these incorrect values:
When distributed_state = .false. When distributed_state = .true. The obs_temp array is assigned to the values of obs_fwd_op_ens_handle%vars(:, owners_index) in the call to get_copy() at line 1750 of filter_mod.f90
I've done some work with DDT that has shown that the values for obs_fwd_op_ens_handle%copies(OBS_GLOBAL_QC_COPY, owners_index) are being changed from 8's in the call to the posterior get_obs_ens_distributed_state(). So it is in this subroutine that the bug is located, and the issue is that some of the QC values are being changed from their current value to 0's. When distributed_state = .false. , get_obs_ens_distributed_state() uses the ALL_OBSERVATIONS DO loop as opposed to the MY_OBSERVATIONS DO loop that is used when distributed_state = .true.
So there is some discrepancy between the two DO loops that is causing the ALL_OBSERVATIONS DO loop to update some of the QC values stored in obs_fwd_op_ens_handle to 0. I am still trying to figure out what exactly this is, so if anyone has any input on this, please let me know. I am thinking it might be due to this line here:
|
this happens in the posterior FO calcs, right? then it shouldn't execute line 171 if "is_prior" is false, which should be the case for the posterior. in your tests, what is the setting in &assim_tools_nml :: do_all_obs_verticals_first ? if it's false, is there a chance it is related to bug #426? i can't remember exactly, but i think what happens there is if a vertical conversion fails it never sets the 8 to begin with. so if you're seeing 8's come in and 0's come out, then it's not related. |
Hi Nancy. You're right about line 171; it is in the posterior FO, so it isn't executing that line. Must be somewhere else in the ALL_OBSERVATIONS loop. I've been running all of my tests with convert_all_obs_verticals_first = .true. set in the namelist so that I wouldn't run into the bug in #426 . This is a completely separate issue in which the 8's are overwritten in obs_fwd_op_ens_handle during the posterior execution of get_obs_ens_distributed_state() . I have confirmed that the 8's are still present in obs_fwd_op_ens_handle up to call to the posterior get_obs_ens_distributed_state() on line 1024 but have been changed by the completion of this subroutine.
|
sounds like you've narrowed it down to the culprit subroutine, good. apologies if these are things you're already doing, but here is some general debugging philosophy that i had to remind myself of more than once:
normally i'd say the best test case is one you can run on your laptop. getting mpi running on your laptop isn't usually too bad, but i've had wretched luck getting a debugger running there. apple security requires a self-signed certificate and i've gone through the step-by-step instructions multiple times with no success. plus i'm not sure any laptop debuggers handle mpi very well. so cheyenne is probably best for this one if it's an mpi-only bug. |
note from discussion last week with Marlee: qc value is in %copies when entering
|
fixed by #450 |
🐛 🐳 🐼
User reported,
convert_vertical_obs
DARTQC_FAILED_VERT_CONVERT is not ending up in the obs_seq.final
will update with reproducer shortly.
Describe the bug
List the steps someone needs to take to reproduce the bug.
Run filter with cam-fv but force convert_vertical_obs to return vstatus /= 0
What was the expected outcome?
The obs_seq.final should have DARTQC=8 for every obs.
What actually happened?
No 8s
Error Message
No error messages
Which model(s) are you working with?
cam-fv but I think applicable to any model that returns a non-zero convert_vertical_obs status.
Screenshots
Here is obs_qc = 8 across all tasks at `cycle SEQUENTIAL_OBS
Here is task 0 writing the obs sequence
Here is obs_seq.final:
Version of DART
Which version of DART are you using?
v10.5.1-1-gf37ba3ccb
Have you modified the DART code?
Yes - to force the failure and also a bunch of print statements.
If your code changes are available on GitHub, please provide the repository:
Forcing vertical fail for cam-fv
https://github.com/NCAR/DART/tree/bug-vertical
Test code here:
/glade/scratch/hkershaw/DART/Bugs/bug_401/
note you will have to grab
/glade/scratch/hkershaw/DART/Bugs/bug_401/test_cam-fv.main
and
/glade/scratch/hkershaw/DART/Bugs/bug_401/in_files/
to run filter in test_cam-fv.main
Build information
Please describe:
The text was updated successfully, but these errors were encountered: