Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes non-BFB issue with F-compsets when threading is used #1219

Merged
merged 1 commit into from
Jan 19, 2017

Conversation

singhbalwinder
Copy link
Contributor

This PR addresses an issue which makes the model non-deterministic
(i.e. non-BFB) when run with more than one thread. PR #1147 introduced
a logical variable (cldfsnow_logic) which was declared and assigned at
module level. This kind of declaration automatically sets a variable
with 'SAVE' attribute which in turn makes the variable a shared variable
(to be shared by all the threads). This PR removes this variables and
retain the same functionality.

Fixes #1203

[BFB] - Bit-For-Bit

@singhbalwinder
Copy link
Contributor Author

@wlin7 : I have assigned you as an integrator for this PR. Please feel free to reassign.

I have conducted tests to assure myself that it is BFB using SMS_Ln5_P32x1.ne16_ne16.FC5AV1C-04P2.

I have also run tests to make sure that this fixes the non-BFB issue due to threading.

@singhbalwinder
Copy link
Contributor Author

I just realized that ideally we should fix this bug by branching off at the point where this bug was introduced. This will enable other folks to get this patch if they are working with an intermediate version of the code. My only worry is that the buggy code was in CIME2.

@rljacob and @jgfouca : Do you think the buggy code being in CIME2 can complicate merging with the current master? I think it should not be an issue but I just want to make sure.

@rljacob
Copy link
Member

rljacob commented Jan 17, 2017

Since the fix is entirely in CAM code, it won't be impacted by cime2/cime5

@singhbalwinder
Copy link
Contributor Author

Thanks @rljacob ! I will change this PR to start from the branch which introduced the bug.

@wlin7 : Please do not merge it until I fix this.

@wlin7
Copy link
Contributor

wlin7 commented Jan 17, 2017

Thanks @singhbalwinder for working so hard on this. I will wait on merging this.

This PR addresses an issue which makes the model non-deterministic
(i.e. non-BFB) when run with more than one thread. PR #1147 introduced
a logical variable (cldfsnow_logic) which was declared and assigned at
module level. This kind of declaration automatically sets a variable
with 'SAVE' attribute which in turn makes the variable a shared variable
(to be shared by all the threads). This PR removes this variables and
retain the same functionality.

Fixes #1203

[BFB] - Bit-For-Bit
@singhbalwinder singhbalwinder force-pushed the singhbalwinder/atm/fix-non-BFB-threading-runs branch from f2c1d42 to f781610 Compare January 17, 2017 22:19
@singhbalwinder
Copy link
Contributor Author

@wlin7 : It is ready to go now. I have based it off of the point where the bug was introduced.

@wlin7
Copy link
Contributor

wlin7 commented Jan 18, 2017

@singhbalwinder , can you remind me which branch should be used? As you described above, it is introduced in PR #1147, which was for branch singhbalwinder/atm/av1c-04p2. Did you mean the fix in singhbalwinder/atm/fix-non-BFB-threading-runs is now based off singhbalwinder/atm/av1c-04p2, Thanks,

@wlin7
Copy link
Contributor

wlin7 commented Jan 18, 2017

@singhbalwinder , after checking out singhbalwinder/atm/fix-non-BFB-threading-runs, I can see it is immediately after your merge of singhbalwinder/atm/av1c-04p2. Now I understand "the point" that you based this the fix off.
One more question: at the end of this page, it say " this branch has no conflicts with the base branch", instead of "no conflicts with the master". Is it safe to merge as is (to next then master)?

@mt5555
Copy link
Contributor

mt5555 commented Jan 18, 2017

this is the same as all pull requests: merge singhbalwinder/atm/fix-non-BFB-threading-runs into next. Then if the tests pass on next, merge singhbalwinder/atm/fix-non-BFB-threading-runs into master

@wlin7
Copy link
Contributor

wlin7 commented Jan 18, 2017

Thanks @mt5555 . It is merged into next.

@rljacob
Copy link
Member

rljacob commented Jan 18, 2017

Hang on. Isn't this a non-BFB change? I think threading is on by default for edison so will this change answers for beta0 runs done on edison?

@mt5555
Copy link
Contributor

mt5555 commented Jan 18, 2017

all our baseline testing is (unfortunately) done on machines that dont use threads.

@rljacob
Copy link
Member

rljacob commented Jan 18, 2017

True but we are still in beta testing mode and master is BFB with beta0 right now. We shouldn't change that without @golaz ok.

@mt5555
Copy link
Contributor

mt5555 commented Jan 18, 2017

Agreed - except in this case, the existing code (with threads turned on ) is not even reproducible. Thus on Edison master is not BFB with beta0, since two beta0 runs wont even agree with each other.

@rljacob
Copy link
Member

rljacob commented Jan 18, 2017

I'm surprised the coupled group didn't notice that.

@wlin7
Copy link
Contributor

wlin7 commented Jan 18, 2017

@golaz reported a non-BFB restart for coupled run, but at the time was not suspecting atm having issue. There might still be other issues contributing to the non-BFB.

@golaz
Copy link
Contributor

golaz commented Jan 18, 2017

@rljacob: the low-res coupled simulation beta0 was run on Edison using either 173 or 375 nodes (layouts taken from https://acme-climate.atlassian.net/wiki/display/CH/PE+layouts+for+faster+throughput+with+low-res+v1+alpha+coupled). These layouts are non-threaded, which explains why we did not see the same problem.

The non-BFB issue that @wlin7 mentions happened when I changed layouts from 173 to 375 without realizing that BFBFLAG was set to FALSE.

@rljacob
Copy link
Member

rljacob commented Jan 18, 2017

Thanks for clarifying.

@wlin7
Copy link
Contributor

wlin7 commented Jan 19, 2017

@singhbalwinder , how to know if this non-BFB fix has been tested on next? I can see in the [CDash ] (http://my.cdash.org/index.php?project=ACME_Climate) that your more recent merge for hash 4268089 has been tested. But no where to find if the non-BFB threading fix (commit hash 6ce7561) has been tested.

@singhbalwinder
Copy link
Contributor Author

My merge (4268089) to next is after your merge (6ce7561) to next, which means your merge was already in when I merged into next. Therefore both of our merges have been tested. Does this makes sense?

As there is no new failure, please go ahead and merge it to master.

@wlin7
Copy link
Contributor

wlin7 commented Jan 19, 2017

Great. I am going to merge it to master right away.
Are the testing of next always applied to all the commits on next up to the time of tests, instead of per commit basis? What if there are failed tests when multiple commits are tested at once?

@wlin7 wlin7 merged commit f781610 into master Jan 19, 2017
wlin7 added a commit that referenced this pull request Jan 19, 2017
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219)

This PR addresses an issue which makes the model non-deterministic
(i.e. non-BFB) when run with more than one thread. PR #1147 introduced
a logical variable (cldfsnow_logic) which was declared and assigned at
module level. This kind of declaration automatically sets a variable
with 'SAVE' attribute which in turn makes the variable a shared variable
(to be shared by all the threads). This PR removes this variables and
retain the same functionality.

Fixes #1203

[BFB] - Bit-For-Bit
@singhbalwinder
Copy link
Contributor Author

They are not tested per commit and we allow multiple BFB commits in a day but only one if it is a non-BFB commit. If there is an issue with a commit, we have to go back and test each commit (but that happens rarely).

@mt5555
Copy link
Contributor

mt5555 commented Jan 19, 2017

and as a reminder - if anyone commits a non-BFB commit, no other commits are allowed, and the integrator should let us all know by emailing the integrators email list.

@wlin7
Copy link
Contributor

wlin7 commented Jan 19, 2017

Thank you for the reminder. This fix is now in the master,

@singhbalwinder
Copy link
Contributor Author

Great! Thanks @wlin7 .

rljacob pushed a commit that referenced this pull request Feb 27, 2017
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219)

This PR addresses an issue which makes the model non-deterministic
(i.e. non-BFB) when run with more than one thread. PR #1147 introduced
a logical variable (cldfsnow_logic) which was declared and assigned at
module level. This kind of declaration automatically sets a variable
with 'SAVE' attribute which in turn makes the variable a shared variable
(to be shared by all the threads). This PR removes this variables and
retain the same functionality.

Fixes #1203

[BFB] - Bit-For-Bit
agsalin pushed a commit that referenced this pull request Apr 13, 2017
additional include path needed for cmake check_function_exists

standard cmake module check_function_exists was not being found when scripts_regression_tests.py was run from cron.

Test suite: scripts_regression_tests.py (also ran tests with pio2 as default)
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes cmake build issues introduced in PR #1202

User interface changes?:

Code review:
@singhbalwinder singhbalwinder deleted the singhbalwinder/atm/fix-non-BFB-threading-runs branch July 25, 2017 17:19
rljacob pushed a commit that referenced this pull request Apr 16, 2021
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219)

This PR addresses an issue which makes the model non-deterministic
(i.e. non-BFB) when run with more than one thread. PR #1147 introduced
a logical variable (cldfsnow_logic) which was declared and assigned at
module level. This kind of declaration automatically sets a variable
with 'SAVE' attribute which in turn makes the variable a shared variable
(to be shared by all the threads). This PR removes this variables and
retain the same functionality.

Fixes #1203

[BFB] - Bit-For-Bit
rljacob pushed a commit that referenced this pull request May 6, 2021
Merge branch 'singhbalwinder/atm/fix-non-BFB-threading-runs' (PR #1219)

This PR addresses an issue which makes the model non-deterministic
(i.e. non-BFB) when run with more than one thread. PR #1147 introduced
a logical variable (cldfsnow_logic) which was declared and assigned at
module level. This kind of declaration automatically sets a variable
with 'SAVE' attribute which in turn makes the variable a shared variable
(to be shared by all the threads). This PR removes this variables and
retain the same functionality.

Fixes #1203

[BFB] - Bit-For-Bit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants