Fixes non-BFB issue with F-compsets when threading is used #1219
@wlin7: I have assigned you as an integrator for this PR. Please feel free to reassign. I have run tests to confirm that it is BFB using SMS_Ln5_P32x1.ne16_ne16.FC5AV1C-04P2. I have also run tests to make sure that this fixes the non-BFB issue due to threading.
I just realized that ideally we should fix this bug by branching off at the point where it was introduced. This will let other folks pick up this patch if they are working with an intermediate version of the code. My only worry is that the buggy code was in CIME2. @rljacob and @jgfouca: Do you think the buggy code being in CIME2 can complicate merging with the current master? I think it should not be an issue but I just want to make sure.
Since the fix is entirely in CAM code, it won't be impacted by cime2/cime5.
Thanks @singhbalwinder for working so hard on this. I will wait on merging this.
This PR addresses an issue that makes the model non-deterministic (i.e. non-BFB) when run with more than one thread. PR #1147 introduced a logical variable (cldfsnow_logic) that was declared and assigned at module level. A variable declared this way automatically gets the 'SAVE' attribute, which in turn makes it a shared variable (shared by all the threads). This PR removes this variable and retains the same functionality.
Fixes #1203
[BFB] - Bit-For-Bit
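The failure mode described above can be illustrated outside of CAM. The following is a minimal Python sketch, not the Fortran code changed by this PR: `shared_flag` is a hypothetical stand-in for a module-level variable such as cldfsnow_logic, and `run_shared`, `run_local`, and `run_pair` are illustrative names. A barrier forces both threads to write before either one reads, so the effect shows up on every run here instead of intermittently.

```python
import threading

# Hypothetical stand-in for a module-level flag: one copy, visible to all threads.
shared_flag = False


def run_shared(barrier, wanted, results, idx):
    """Store the per-thread value in the shared module-level flag."""
    global shared_flag
    shared_flag = wanted          # each thread overwrites the single shared copy
    barrier.wait()                # both writes are done before either read
    results[idx] = shared_flag    # reads whichever write happened last


def run_local(barrier, wanted, results, idx):
    """Store the per-thread value in a local variable instead."""
    local_flag = wanted           # private to this call, so private to this thread
    barrier.wait()
    results[idx] = local_flag     # always the value this thread set


def run_pair(worker):
    """Run two threads that want different flag values and collect what they saw."""
    barrier = threading.Barrier(2)
    results = [None, None]
    threads = [
        threading.Thread(target=worker, args=(barrier, wanted, results, idx))
        for idx, wanted in enumerate([True, False])
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


shared_results = run_pair(run_shared)
local_results = run_pair(run_local)
print("shared:", shared_results)
print("local: ", local_results)
```

With the shared flag, both threads end up reading the same value (whichever write landed last), so one of them works with a value it did not set, and which one depends on thread timing. With a local variable, each thread keeps its own value regardless of timing, which is analogous to removing the module-level variable.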
@wlin7: It is ready to go now. I have based it off of the point where the bug was introduced.
@singhbalwinder, can you remind me which branch should be used? As you described above, the bug was introduced in PR #1147, which was for branch singhbalwinder/atm/av1c-04p2. Did you mean that the fix in singhbalwinder/atm/fix-non-BFB-threading-runs is now based off singhbalwinder/atm/av1c-04p2? Thanks.
@singhbalwinder, after checking out singhbalwinder/atm/fix-non-BFB-threading-runs, I can see it is immediately after your merge of singhbalwinder/atm/av1c-04p2. Now I understand "the point" that you based the fix off.
This is the same as all pull requests: merge singhbalwinder/atm/fix-non-BFB-threading-runs into next. Then, if the tests pass on next, merge singhbalwinder/atm/fix-non-BFB-threading-runs into master.
Thanks @mt5555. It is merged into next.
Hang on. Isn't this a non-BFB change? I think threading is on by default for edison, so will this change answers for beta0 runs done on edison?
All our baseline testing is (unfortunately) done on machines that don't use threads.
True, but we are still in beta testing mode and master is BFB with beta0 right now. We shouldn't change that without @golaz's OK.
Agreed, except in this case the existing code (with threads turned on) is not even reproducible. Thus on Edison master is not BFB with beta0, since two beta0 runs won't even agree with each other.
I'm surprised the coupled group didn't notice that.
@golaz reported a non-BFB restart for a coupled run, but at the time did not suspect the atm of having an issue. There might still be other issues contributing to the non-BFB.
@rljacob: the low-res coupled simulation beta0 was run on Edison using either 173 or 375 nodes (layouts taken from https://acme-climate.atlassian.net/wiki/display/CH/PE+layouts+for+faster+throughput+with+low-res+v1+alpha+coupled). These layouts are non-threaded, which explains why we did not see the same problem. The non-BFB issue that @wlin7 mentions happened when I changed layouts from 173 to 375 without realizing that BFBFLAG was set to FALSE.
Thanks for clarifying.
@singhbalwinder, how can I tell whether this non-BFB fix has been tested on next? I can see in [CDash](http://my.cdash.org/index.php?project=ACME_Climate) that your more recent merge for hash 4268089 has been tested, but I cannot find whether the non-BFB threading fix (commit hash 6ce7561) has been tested.
Great. I am going to merge it to master right away.
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219)
They are not tested per commit. We allow multiple BFB commits in a day, but only one commit if it is non-BFB. If there is an issue with a commit, we have to go back and test each commit (but that happens rarely).
And as a reminder: if anyone commits a non-BFB commit, no other commits are allowed, and the integrator should let us all know by emailing the integrators email list.
Thank you for the reminder. This fix is now in master.
Great! Thanks @wlin7.
Additional include path needed for cmake check_function_exists
The standard cmake module check_function_exists was not being found when scripts_regression_tests.py was run from cron.
Test suite: scripts_regression_tests.py (also ran tests with pio2 as default)
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes cmake build issues introduced in PR #1202
User interface changes?:
Code review: