Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Results should not be sensitive to choice of budget decomposition #18

Open
BrodiePearson opened this issue Jun 8, 2021 · 34 comments
Open
Assignees

Comments

@BrodiePearson
Copy link
Collaborator

I think the results of the ADC closure are quite sensitive to how we decompose the tendency equations! (possibly due to rounding errors?)

I was playing around with the w2 budget, and found that simply combining two tendency terms led to significant differences in final simulated profiles. In this case I ran two simulations where I either left the return to isotropy and buoyancy terms separate:

w2tend3(k,iCell) = tauvVel(k,iCell)*(u2(i1,k,iCell) + v2(i1,k,iCell))/3.0_RKIND
w2tend4(k,iCell) = (2.0_RKIND - 4.0_RKIND/3.0_RKIND*C_b)*Mc(k,iCell)* &
(grav*alphaT(k,iCell)*tumd(k,iCell) - grav*betaS(k,iCell)*sumd(k,iCell))

or I combined them into one term (which theoretically should be an identical system of equations):

w2tend3(k,iCell) = tauvVel(k,iCell)*(u2(i1,k,iCell) + v2(i1,k,iCell))/3.0_RKIND + &
(2.0_RKIND - 4.0_RKIND/3.0_RKIND*C_b)*Mc(k,iCell)* &
(grav*alphaT(k,iCell)*tumd(k,iCell) - grav*betaS(k,iCell)*sumd(k,iCell))
w2tend4(k,iCell) = 0.0_RKIND*(2.0_RKIND - 4.0_RKIND/3.0_RKIND*C_b)*Mc(k,iCell)* &
(grav*alphaT(k,iCell)*tumd(k,iCell) - grav*betaS(k,iCell)*sumd(k,iCell))

This figure shows the final profiles of w2 (left) after 24 hours of the conv2 setup using either split (standard) terms or the combined tendency terms. It also shows the sum w2tend3 + w2tend4 (middle) and the individual terms (right).
potential_rounding_errors

This apparent sensitivity of w2 profiles to equation formulation is not good, so it would be great if someone could do a similar test and see if these differences are real/reproducible.

The maximum difference of total tendency terms (middle panel) between the two simulations grows from 1e-23 on the first time step to 4e-8 after 24 hours. Perhaps this is a truncation error that grows - although it is surprising that it produces such a different w2 profile given that the model doesn't have the chaos of a truly turbulent flow.

@vanroekel
Copy link
Owner

@BrodiePearson this is very odd indeed. can you post your namelist and the configuration parameters for the init step? It seems highly unlikely this is truncation error. If you can share the setup I'll try run this.

@BrodiePearson
Copy link
Collaborator Author

@vanroekel I've attached the folder that I'm using. It should be identical to the one that you originally sent me, with modifications to work for the new set of constants. I hadn't created any other examples yet, so I would be interested to know if these differences are (a) an error on my part, or (b) only specific to this case.
cvmix_test_conv2_new_constants.zip

@katsmith133
Copy link
Collaborator

katsmith133 commented Jun 8, 2021

@BrodiePearson and @vanroekel
For me, combining those two terms does not seem to cause the same issue, even after 82 hours...
Cooling2_PR18check_combined_terms
Note: The difference in the w2tend1 + w2tend_pressure term is to be expected as it involves w2ten3 and t2tend4.

...at least not for the default namelist parameters I am using. I am now going to test using the namelist parameters Brodie is using and let you know what I get...

@katsmith133
Copy link
Collaborator

Using Brodie's namelist, I get the following after 24 hours:
Cooling2_PR18check_combined_terms_brodie_params

It does appear that combining the two terms does change the w2 and w3 profiles slightly with Brodie's namelist parameter values (note: I used my init and forcing files, so it is not expected that I would get the same differences as Brodie).

In case it helps up determine why this change is happening, here are the differences in the namelist values between the previous comment's run (where no change is seen) and this comment's run with Brodie's parameters (where changes are seen):
parameter = Brodie's namelist param values -> Kat's namelist param values
config_dt = 0:01:00 -> 1:00:00
config_adc_Cmom = 0.0 -> 4.0
config_adc_Ctherm = 0.0 -> 4.0
config_adc_alpha_0 = 10.8 -> 0.8
config_adc_bc_const = 1.8 -> 1.0
config_adc_bc_const_wp2 = 1.8 -> 0.0
config_adc_up2_vp2_factor = 4.0 -> 1.0
config_adc_use_splat_parameterization = .true. -> .false.

@vanroekel
Copy link
Owner

Thanks @katsmith133 a few questions

  1. are you using the same code for both runs?
  2. Do you have any ideas what parameter is the key here? none of them really make sense to me as tripping this issue. The most likely to me would be Cmom, Ctherm or config_dt. I'll try test as well.

@katsmith133
Copy link
Collaborator

@vanroekel

  1. I assume we are in as far as I am using the most recent version of the code on Github and made only that small same change as Brodie showed before. But I am not using the same executable, no. I can try that.
  2. Agreed. I do not know which one it is, but would assume at first it is one of those you mentioned. I can change each one in turn and see which one(s) it is.

@vanroekel
Copy link
Owner

In my testing I see the same as @BrodiePearson even with config_dt = 00:00:01 so that is not the issue. If I reenable Cmom and Ctherm the diffs are much smaller (1E-17), but the model crashes 6 hours in. It makes no sense to me why disabling Cmom/Ctherm exposes this issue. I'll try look through the code some more.

@BrodiePearson
Copy link
Collaborator Author

BrodiePearson commented Jun 8, 2021

@vanroekel @katsmith133 Thanks for both taking a look. I followed Kat's lead and found that her setup gave me very small differences in the final w2 profiles (~1e-21; we might expect this from small truncation differences between combining the terms or calculating them separately right?). It looks to me like config_adc_bc_const_wp2 is probably the main issue (amplifies this error?). I can run with all my parameters that Kat listed above, except with config_adc_bc_const_wp2 = 0.0 and the difference is larger (1e-9) but arguably negligible (<0.1%). But switching to config_adc_bc_const_wp2 = 1.8 increases the difference to ~1e-5 or ~100%. Currently stumped why that is...

On Lukes point, certain combinations of these parameters seem to cause crashes (I think Kat ran into this issue previously), I tried different combinations and found that I could disable Cmom/Ctherm as long as I switched splatting on. In this case, with all other parameters using Kats settings, the differences are ~1e-9

@vanroekel
Copy link
Owner

Hey all, just an update. If I do config_adc_bc_const_wp2 = 0.0 and config_use_splat_parameterization = .false.. the two options listed here are BFB (diff is hard zero). This is starting to have the feel of a memory index issue to me. I'll keep looking.

@BrodiePearson
Copy link
Collaborator Author

Hmmm. I just looked at the w2tend6 (splat) term and noticed there is a typo in the allocation code:


and it is not deallocated:
w2tend5, wttend1, wttend2, wttend3, &

Could that be the problem?

@vanroekel
Copy link
Owner

Hah, just found that too! I'm testing if this is the problem. The deallocate won't matter, but the other might. I'm a bit doubtful though given it only influences splat, but I'll let you know.

@vanroekel
Copy link
Owner

unfortunately that wasn't it. Did not change my results. Important to fix though. Thanks for pointing that out.

@katsmith133
Copy link
Collaborator

@BrodiePearson and @vanroekel
I was curious and went back to commit 3ade082 (prior to splat being added) to see if this issue pre-dated that. If I just use the code where it is at for commit 3ade082, there is no difference between combining the terms or leaving them separate. However, when I add in a w2 BC similar to what was added in the splat commit:
Screen Shot 2021-06-08 at 5 09 45 PM

I get a substantial difference between keeping them separated and combining them (this is after 19 hours):
Screen Shot 2021-06-08 at 5 10 57 PM
So, I believe the issue pre-dates the splat parameterization commit... but perhaps one of you can confirm this...

@vanroekel
Copy link
Owner

I have been looking at this some more, and did find something interesting. The nonBFB nature seems to stem initially from the wumd calculation
https://github.com/vanroekel/MPAS-Model/blob/ocean/addADCMixing/src/core_ocean/shared/mpas_ocn_adcReconstruct.F#L386-L387
if I remove the sqrt calculation I get the BFB behavior back even with a splat-like BC in w2. Using python, with a few more digits it seems this is a rounding issue in a place just beyond double precision. It is disturbing to me this happens and small errors can grow in certain cases.

I'm not sure what the solution is, but perhaps we should do some rounding to the 8th or 10th decimal perhaps? It makes me wonder if there is a different bug hanging around.

@katsmith133
Copy link
Collaborator

@vanroekel when you say that you remove the sqrt calculation, do you mean you've replaced the sqrt function with ()**0.5 or just removed the square-rooting all together?

I found that if I replaced the sqrt function with ()**0.5, I still get the same issue... meaning it's not a rounding issue inherent to the sqrt function.
I also tried to just remove the square-rooting all together from that term, but the simulation crashes after 1 hour due to terms growing too large...

@vanroekel
Copy link
Owner

yes exactly right, remove the sqrt function. The diffs all go away. But It shouldn't work for long as the units are incorrect. I only ran for 1 minute, i'm not surprised it crashes quickly.

@vanroekel
Copy link
Owner

Okay, I "may" have a good solution for this issue. I'm still working on a few more tests. Once I do those and clean up I'll write more about what I did and post a fix. But in quick summary it is related to the loop around w2t and w2s assignment. My reproducer is fixed with my changes, just trying to understand what change fixed things.

@vanroekel
Copy link
Owner

Well I found the critical switch to "fix" this, but I don't like it... If you change the sign of this line

https://github.com/vanroekel/MPAS-Model/blob/ocean/addADCMixing/src/core_ocean/shared/mpas_ocn_adcReconstruct.F#L266

The problem is fixed. This makes no sense to me. Could one of you try change that sign too and see if the problem disappears? I'd appreciate any thoughts you have on possible causes!

@katsmith133
Copy link
Collaborator

@vanroekel I changed the sign of that term and the issue did not disappear for me. Not sure if that is because of different parameters or not, but I still see a difference. Not sure if this helps, but I also tried just setting that term to zero, as the LES for the cooling case implied w2t should be zero at the surface, and the problem still occurs.

@BrodiePearson
Copy link
Collaborator Author

BrodiePearson commented Jun 11, 2021

Nice find! I won't be able to test this until I'm back home this afternoon.

I think this change makes sense physically (but I don't understand why it caused this issue). If we assume a basic down-gradient formulation for this term:
$\overline{w'w'T'} = -\kappa \frac{\partial \overline{w'T'}}{\partial z}$
In scenarios where $\overline{w'T'}$ is positive at the surface, this flux will decrease with depth so $\frac{\partial \overline{w'T'}}{\partial z}$ is also positive. With this downgradient assumption, this means $\overline{w'w'T'}$ must be negative.

The code defines this third-order term as

w2t(1,iCell) = -0.3_RKIND*wstar * wtsfc(iCell)

while wt(k,1,iCell) = wtsfc(iCell) and wstar>0. This means that for a positive surface wT term, you need a negative sign here to make $\overline{w'w'T'}$ is negative. Same with $\overline{w'w'S'}$. Perhaps this is a residual from the vertically-flipped atmospheric systems that we may have taken these boundary conditions from, where a positive surface flux implies a negative gradient and resulting positive third-order term?

@vanroekel
Copy link
Owner

@katsmith133 interesting. I wonder what else I changed. It is definitely gone for me. But maybe I don't run long enough? How long until you check? I also have a tone of other changes I made to try fix this, perhaps it is the sum of all of them.

@BrodiePearson even if it makes physical sense to change the sign, I'm very concerned that in my code/test at least having w2t less than some critical value introduce weird sensitivities. It doesn't seem this should happen even if the sign is not technically right.

@katsmith133
Copy link
Collaborator

@vanroekel I checked at 12 hours. The differences at 1 hour are pretty small (1e-24), but by 12 hours its on the order of 1e-7 for u2, v2, and w2 and 5e-3 for temperature.

I'm going to just check again, just to make sure...

@vanroekel
Copy link
Owner

Okay, that could be it. But I will say with the sign as is the diff in w2 was that small you show (at 1 hour), but gone (hard zero) if I flip the sign. I'm going to keep on with these tests and hope that if I can get rid of that diff the longer term results will be the same too, we'll see!

@katsmith133
Copy link
Collaborator

katsmith133 commented Jun 11, 2021

@vanroekel Sounds good. FYI, I have the changes in #19 implemented in my code. I don't think that should influence this, as my tests showed those changes had no effect on this issue, but just so you know.

Also, I ran Valgrind on the code and there are lots of memory leaks throughout MPAS-O, mostly to do with not deallocating pools I think, but nothing specifically in the ADC code was throwing errors (besides not deallocating the pools associated with the ADC code, but I don't know how to do this in the code). TNote: this is after I implemented the changes from #19 (as that adds deallocation of any variables that were not and fixes an out of bounds indexing issue).

@vanroekel
Copy link
Owner

@katsmith133 thanks for the update and thanks for checking valgrind, that's very good to know. I've been continuing to play with this and unfortunately haven't made progress. I need to put this aside for a bit, but have pushed my test branch to a new branch -- https://github.com/vanroekel/MPAS-Model/tree/ocean/adcTestBranch if you wanted to take a look and keep playing, I'd appreciate it. I still think this does have something to do with the w2t terms, what that is I don't know though...

One other, slightly contradictory bit of evidence, when I moved w2 dissipation from implicit to explicit it changed the behavior of the model with regards to adding the two terms in w2tend or not.

@katsmith133
Copy link
Collaborator

@vanroekel Yup, I'll keep playing with it

@vanroekel
Copy link
Owner

@katsmith133 and @BrodiePearson an update: it turns out the closure is sensitive to combining terms in w3, w2, or wt, the pattern of difference is interestingly the same for w3 and wt. I've also tried setting things constant in the vertical to find the issue. The only thing that fixes it is setting w2 to a constant in time and space. Not clear what any of this means yet, but still poking at it

@katsmith133
Copy link
Collaborator

@vanroekel and @BrodiePearson I am attempting to go back to different commits to see when this issue began. I'm thinking that might give us an idea of what's causing it...?

@BrodiePearson
Copy link
Collaborator Author

@katsmith133 If you can figure out a commit that would be helpful! Although I am worried this issue may exist from the beginning/very early on (comparing different formulations of the budgets is not a standard test for the code)

@vanroekel
Copy link
Owner

@katsmith133 I fully agree with @BrodiePearson it would be wonderful to pin this to a commit, but worry greatly this issue has been around forever.

I never would have thought to test this to even add a test, it is so weird...

@katsmith133
Copy link
Collaborator

@vanroekel and @BrodiePearson Updates:

  • This issue exists in Qing's branch as well... here are differences between a run with w2tend3/4 separated and combined after 56 hours:
    Diff_QingsBranch_hour56
    I thought perhaps some of his clean up might have accidentally fixed this issue, but no such luck.

  • I've tested back to commit 939a1dd and still see the issue. Prior to this commit I have issues getting the code to run and I am not sure it is worth the time trying to get it to run...

  • I've tested Luke's new branch (with all of the small changes made) and still see the issue...

Not super helpful updates, but just crossing off things that have been tried so we don't overlap efforts.

@vanroekel
Copy link
Owner

Thanks for the update @katsmith133 it is indeed good to cross these off the list. I agree it is not beneficial to keep going backward in commits.

I'm still working on this too, don't have anything definitive yet, but think I am making progress.

@vanroekel
Copy link
Owner

@katsmith133 and @BrodiePearson I've added the rounding and it fixes the issue in my very short tests. Can you pick up the ocean/adcTestBranch and try it in your longer tests? A few notes

  1. I have helper flags to do the combining so you don't have to rebuild, combine_w2, combine_w3, and combine_wt will pull together two terms for testing.
  2. I've made a ton of other changes in the process of fixing this that I think we should keep, so please look them over and let me know if you have any questions.
  3. the rounding is controlled by config_adc_decimals_to_keep flag. If you set this to 1e16 it will truncate everything below 1e-16 as an example. This is the value I used for my tests

Hopefully this works. Thinking on this more, I actually think the idea from amrapalli is the best long term solution potentially as it could be a more elegant way forward by converting all velocity units to cm/s we could potentially push past the issue without having to guess where to truncate.

Either way let me know how it goes and especially if you find an issue.

@katsmith133
Copy link
Collaborator

@vanroekel and @BrodiePearson Ok, after some testing, I've found that this solution works so long as you set the rounding to the correct decimal. Using 1e16 saves too many digits and the garbage digits end up amplifying, making the problem worse than before. Using 1e12 instead produces very similar profiles to before the rounding, but also produces BFB matching when tendency terms are kept seperate or combined. This has only been tested for the 100W cooling case with Cmom = Ctherm = 4.0, config_adc_up2_vp2_factor = 1.0, config_adc_bc_const_wp2 = 0.0, with both splatting turned on and off, and config_adc_decimals_to_keep = 1.0E12. Most other values are their defaults. I am unsure yet as to whether the value of config_adc_decimals_to_keep = 1.0E12 will work for all cases or if different forcing scenarios might require a different number of digits to be rounded to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants