Bug fixed 1d evp #568
Conversation
…r task to avoid gathering constant arrays when running the EVP 1D implementation. This solves an issue with departure errors resulting from erroneous initialization of full land blocks in EVP 1D gathering. Correct initialization of remaining variables for full land blocks in EVP 1D gathering.
…ge of file. Machine 'freya' was duplicated in the file.
I can run a test suite on this with the 1d evp turned on before we merge. Let me know when you think it's ready. It seems there are at least a couple of things still to do.
I updated with the output from the full test suite. It shows (as expected) that binary identical results should not be expected.
Note that all tests (except one) fail when compared to
This looks pretty straightforward. Are there also documentation changes needed? Suggestions for code modifications below. I'm a bit confused about the state of the tests. Are you saying that they should all not be BFB, and there's only one test that doesn't complete and still needs a closer look? It would be a good idea to run the QC tests comparing standard 2D evp with the 1D version (or did we already do that?). Thanks for figuring out the issues here.
! Initialize global primary grid lengths array with ghost cells from
! global primary grid lengths array

subroutine primary_grid_lengths_global_ext(ARRAY_O, ARRAY_I)
It makes more sense to me for this routine to be put in the boundary modules, since it fills the ghost cells. This is all done on the master task so MPI doesn't matter, and yet it would need to be put in both the mpi and serial directories... That's the downside of how the code is currently structured, but I think we should stick with it for this, otherwise things can get really confusing.
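For illustration, a minimal sketch of what such a ghost-cell fill can look like (this is not the actual CICE routine; it assumes one ghost cell, cyclic east-west boundaries, and simple copies at the north/south edges, ignoring the tripole treatment):

subroutine primary_grid_lengths_global_ext_sketch(ARRAY_O, ARRAY_I)
   ! hypothetical sketch only: fill a global extended array (one ghost
   ! cell on each side) from the global interior array on the master task
   use ice_kinds_mod, only: dbl_kind, int_kind
   real (kind=dbl_kind), dimension(:,:), intent(in)  :: ARRAY_I   ! (nx_global,   ny_global)
   real (kind=dbl_kind), dimension(:,:), intent(out) :: ARRAY_O   ! (nx_global+2, ny_global+2)
   integer (kind=int_kind) :: nx, ny
   nx = size(ARRAY_I,1)
   ny = size(ARRAY_I,2)
   ARRAY_O(2:nx+1, 2:ny+1) = ARRAY_I(:,:)       ! interior copy
   ARRAY_O(1,      2:ny+1) = ARRAY_I(nx,:)      ! cyclic west ghost column
   ARRAY_O(nx+2,   2:ny+1) = ARRAY_I(1, :)      ! cyclic east ghost column
   ARRAY_O(:,1)    = ARRAY_O(:,2)               ! closed south ghost row (copy)
   ARRAY_O(:,ny+2) = ARRAY_O(:,ny+1)            ! closed north ghost row (copy)
end subroutine primary_grid_lengths_global_ext_sketch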
call gather_global_ext(G_stress12_2, I_stress12_2, master_task, distrb_info)
call gather_global_ext(G_stress12_3, I_stress12_3, master_task, distrb_info)
call gather_global_ext(G_stress12_4, I_stress12_4, master_task, distrb_info)
call gather_global_ext(G_icetmask, I_icetmask, master_task, distrb_info, 0)
It's curious that icetmask has 0 at the end of the argument list and iceumask has .false.! I see that is how they are defined. No need to fix it here...
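Presumably the trailing argument is a type-matched fill value for masked/land blocks; a hypothetical sketch of how such a generic interface is typically declared (the specific procedure names here are illustrative, not necessarily the actual CICE ones):

interface gather_global_ext
   ! each specific routine can take a fill value of the matching type,
   ! hence 0 for the integer icetmask and .false. for the logical iceumask
   module procedure gather_global_ext_dbl    ! real(dbl_kind) arrays, fill value e.g. c0
   module procedure gather_global_ext_int    ! integer arrays,        fill value e.g. 0
   module procedure gather_global_ext_log    ! logical arrays,        fill value e.g. .false.
end interface gather_global_ext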
@@ -314,6 +315,7 @@ subroutine input_data
      ndtd = 1              ! dynamic time steps per thermodynamic time step
      ndte = 120            ! subcycles per dynamics timestep: ndte=dt_dyn/dte
      kevp_kernel = 0       ! EVP kernel (0 = 2D, >0: 1D. Only ver. 2 is implemented yet)
      pgl_global_ext = .false. ! if true, init primary grid lebgths (global ext.)
spell lengths
@@ -314,6 +315,7 @@ subroutine input_data
      ndtd = 1              ! dynamic time steps per thermodynamic time step
      ndte = 120            ! subcycles per dynamics timestep: ndte=dt_dyn/dte
      kevp_kernel = 0       ! EVP kernel (0 = 2D, >0: 1D. Only ver. 2 is implemented yet)
The comment here is kind of confusing. Does kevp_kernel=102 mean 1D, version 02? Do we have a version 1? Should we simplify the kevp_kernel choices? I remember that we chose 102 because it wasn't really ready, so this was a way to 'hide' it from users, or at least make it less likely they'd set it up.
Originally several versions of the 1d evp were created. Version 1 included most of the refactoring. Version 2 moved the derived grid parameters HTE and HTN into the 1d solver as scalars.
Version 3 changed some of the variables from real*8 to real*4. The influence of this was a limited change, especially when compared to the accuracy of the iteration. In the end only v2 was implemented. I think that kevp_kernel could be included in kdyn as option 4. I have not thought this through.
Right.
102 = "Our version 2"
Version 1 and version 2 gave identical results (maybe except for really aggressive flags, I do not recall exactly), but v2 only takes about half the memory and is a bit faster. For conservative flags we were also able to produce BFB results.
That's not the case for v3, where many internal variables were calculated as real*4. But it was again faster and also takes up less memory. As I recall, the uncertainties calculated using uvel, vvel were less than or comparable to the uncertainties obtained across different computational architectures.
v3 is an interesting exercise, I think, and worth considering for special cases to reduce memory load (and gain speed as well).
Sorry about the confusion. I may have confused myself a bit as well. This test suite was used:
./cice.setup --suite first_suite,base_suite,travis_suite,decomp_suite,reprosum_suite,quick_suite --mach freya --env intel --testid test8-kevp102-0 --bgen baseline8
The following results have been achieved for EVP-2d and EVP-1d. Here we have 7 unsuccessful tests. They all complete their runs but fail the comparison with other runs, e.g.
FAIL freya_intel_restart_gbox128_4x2_short test
All 7 failures are gone if the compile flag "-no-vec" is turned on.
Debug:
1/ spacecurves. This was found in #560 as well.
2/ The second is a call to tt = mod(ftime/secday,dayyr). This is in the initialization, thus I expect dayyr to be 0 (I have not checked). This would result in a division by 0:
forrtl: error (65): floating invalid
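A hedged sketch of the suspected failure mode and a possible guard (the real fix is presumably to make sure dayyr is set before this call; the guard below is illustrative only):

! if dayyr is still zero during initialization, mod() divides by zero
! and a debug build traps it as a floating-invalid error
if (dayyr > c0) then
   tt = mod(ftime/secday, dayyr)
else
   tt = c0     ! hypothetical fallback; the proper fix is to initialize dayyr first
endif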
…568 (comment)). @mhrib questions about default value settings removed.
See comments in #575 re kevp_kernel namelist value
… to MPI and serial implementations of ice_boundary modules (CICE-Consortium#568 (comment)). Please note the duplication of the subroutine.
Thanks for the recent updates. I think the next things to do are to get rid of "102" and switch to kevp_kernel=1 for the 1d implementation (0 = standard 2d), as suggested in #575. Actually, what I propose is that we change kevp_kernel to a string called evp_algorithm (or something like it) and have the valid values be "standard_2d" and "shared_memory_1d" (or something like that). We want to move away from numbers for namelist entries when we can. kevp_kernel is not really in use yet, so I think it's safe to do that. Then we'll want to add a test to turn on the 1d solver. We can do those things as a separate PR and I can take it on if you like. I think we want to do it before the release though.
I can implement the evp_algorithm part with the suggested changes. One question, not directly related, but while I am at it: is the revised evp included in the eap solver?
Changes requested by @eclare108213 in CICE-Consortium#568
I think it depends whether revised evp is its own standalone flag relative to other flags. Could we have evp with/without revised evp, with/without the 1d solver? If so, it might be nice to keep them as separate flags, otherwise we have to deal with multiple combinations of things being set by one flag. It might be nice to create a tree of the possible dynamics solver options in the documentation, something like the outline below (this is proto code only, not necessarily complete/correct). This assumes it adds to the understanding of what things go together in terms of setting up the dynamics. Just an idea.
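Along those lines, a hypothetical sketch of what such an outline could look like (namelist names and values taken from this discussion; not complete or necessarily correct):

select case (kdyn)
case (1)                              ! EVP
   ! revised_evp = .true./.false.     ! revised EVP subcycling, possibly an independent flag
   select case (trim(evp_algorithm))
   case ('standard_2d')               ! original 2d solver
   case ('shared_mem_1d')             ! 1d OpenMP solver
   end select
case (2)                              ! EAP
   ! interaction with revised_evp untested (see the discussion below)
end select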
If I remember correctly, revp was set up so that it could be used with eap, but I don't think that was ever actually implemented and tested. So maybe it wouldn't work. At any rate, revp seems like a fundamentally different approach from the 1D kernel - it's more about how the subcycling iteration is handled, while kevp is more about the vectorization, right? I'd keep them separate. I like @apcraig's outline of dynamics options, for the documentation. This outline should be essentially what's coded in the verbose diagnostics.
Clean up. Changed namelist parameter kevp to evp_algorithm
I will do a QC test of 1d vs 2d with standard optimization and report back. Looking back at the PR, these modifications primarily fix issues in the 1d evp that prevented identical results with 2d (at reduced optimization). That includes at least a fix to the tiny area computation, addition of grid_lengths_global_ext, and addition of skiptcell logic. In addition, the namelist was changed so we now have strings instead of integers to set evp_algorithm, and a bunch of other formatting (e.g. indentation) was cleaned up. There may have been a few other changes, maybe @srethmeier, @mhrib, or @TillRasmussen can comment. The issues related to 1x1 and 2x2 blocks, calendar, and spacecurve were addressed elsewhere. I agree there is still an outstanding issue about what parts of the dynamics work with each other (i.e. revp, eap plans, etc). I think it should also include an overall discussion of control, namelist, and implementation of the dynamics. I have created a new issue, #619, that can serve as a general placeholder for those issues. This is something that was raised recently in our monthly meetings as well.
The qc test with the 1d evp fails at the end of the 4th year with zap snow temperature errors.
The baseline 2d runs fine. This suggests there still may be some issues with the 1d implementation. I actually ran it twice, once with 9x4 pes (threaded) and another time with 36x1 (not threaded), and both failed at the same time in the same way. I guess that means it's a robust problem. I still suggest we merge this but then create a new issue noting the new problem discovered.
@apcraig, I believe that actually summarizes all major things covered in this PR. We mainly fixed three issues: the departure errors caused by the initialization of full land blocks in the EVP 1D gathering, the tinyarea computation order that broke bit-for-bit comparison with EVP 2D at debug optimization, and the assumption that icetmask is a subset of iceumask (solved by adding a skiptcell logical array).
Then we did some modifications to the OpenMP parts of the EVP 1D implementation. The core part of the implementation (stress and stepu) was implemented to be NUMA-aware for performance. This was done differently for the interfacing part of the implementation, but is now aligned throughout the EVP 1D implementation so that all of it is NUMA-aware. As for namelist changes, the only change that has been made is the renaming of kevp_kernel to evp_algorithm. Finally, we did a full cleanup of ice_dyn_evp_1d.F90 and some cleanup of ice_dyn_evp.F90.
Maybe @TillRasmussen can update the PR checklist with this, so that it is at the top of the PR? I can't modify it.
Okay, looks like there still is something we need to look at here. I'll try and dig into it next week.
Sounds good to me. Whatever you prefer and find to be the best approach.
Not completely sure what this entails? Is it that the combination of Revised EVP and EVP 1D has not been tested?
Revised evp should work
Yes, if rEVP and 1D can be run together, then test to make sure it works. Otherwise put an abort in the code in case they're both turned on. Thanks for summarizing all the changes in the PR. I'd like to better understand which, if any, change the answers when 1D EVP is turned off. @apcraig's testing showed regression failures for alt04, 1x1 and 2x2 blocks, and unit tests, plus some that timed out due to the low optimization level. Please confirm that you're confident that all of these failures are expected, and that none of them affect the 2D EVP results. My preference would be to find and fix the problem causing snow temperatures to explode, but I'm fine with merging this PR and then fixing that problem. Thanks everyone -- 1D EVP has been a huge effort!
Regarding the testing and failed tests. All of the 2d tests are bit-for-bit with this PR. This PR has no impact on the standard 2d results, which are most of our test suite. The failed regression tests in the 1d (with this PR) vs 2d (with master) with debug on are also all expected. It's also only the regression (comparison with baseline) part of the tests that fail, the tests themselves pass. The alt04 test fails because it has the 1d evp on by default and we have changed answers. The 1x1 and 2x2 tests fail because my sandboxes did not include the recent fixes. The unit tests fail because there continue to be a couple of nagging (but still bit-for-bit) issues in the regression testing on unit tests. The tests that timed out could be rerun with longer submission times and I will try to do some of that today to see. The aborted QC test case is unfortunate. It could be that the roundoff errors introduced by the 1d evp have changed the state of the solution such that an abort is created. I don't think this is the most likely case because then we'd be encountering this as well with 2d cases. It could be that there are still some differences in 1d and 2d that would appear if we ran both cases with debug on for 5 years. We really only tested bit-for-bit with debug on for days or months. However, if all we are doing is introducing a roundoff difference, why is the model aborting? It should be more robust than that. It's surprising to me with all the testing that we've done that the QC evp 1d run aborted.
I submitted most of the 1d tests that timed out for intel and gnu with more time and they all pass, including regression vs 2d. So I think all the 1d vs 2d debug results are accounted for now with no outstanding issues. The only known issue is the QC failure after 4 years (out of a 5 year run).
@apcraig Thank you for implementing this. I agree that a skiptmask is needed.
Done
I can confirm that some tests fail due to wall time on the DMI system as well. All were fixed by increasing wall time.
Does anyone object to merging this at this point? We should create a new issue as part of the merge to track outstanding issues.
I'm okay with merging this, although the QC failure is a nagging issue. Does it mean that the code may abort for any user trying to run with 1D evp, or is that just a failure in the QC software?
The QC failure is an abort in the ice model, nothing related to the QC test as far as I can tell. It took 4 years to hit it. If we believe the 2d implementation is robust, then this suggests there may still be a bug in the 1d implementation that's possibly triggered only on timescales of months to years. All our bit-for-bit testing with debug flags has been shorter than a year. We could run QC with debug flags on with 1d and 2d and see what happens. My concern is the time to completion in that case with reduced optimization, running 1 degree for 5 years. But it's probably worth doing. I propose we add that idea to the issue that we'll create.
I guess you can't restart the QC from shortly before the crash with debug flags on, and expect it to crash again. Would that be worth a try?
@eclare108213, we could try something like that but no guarantees it'll help. There are probably a number of steps to try to isolate the problem. The first thing I might try is to turn debug on with 1d and 2d and run 5 years, 1 year at a time with restarts, just to see if the models are bit-for-bit throughout and to see if the 1d with debug also fails. Then we might try running evp1d with optimization but threading off. This might provide insight about an OpenMP issue. Depending what we learn there, we might create a case with a restart just before the failure and then start debugging the actual abort.
That's quite a debugging project. It needs to be done, but I don't think it needs to be done before this particular PR is merged.
I will try to rerun the qc test and output more diagnostics.
@dabail10 Is this the type of error you get for CFL violations in CESM?
I'm wondering if the 1D evp QC test just happens to be hitting one of these, and the 2D case barely misses it. (The "incremental" part of incremental remap assumes that ice moves no farther than 1 grid cell in 1 time step.) If restarting the 1D case with a reduced timestep runs through this point, CFL could be the culprit -- there might not be anything wrong with the 1D evp implementation at all, just unlucky. @apcraig Here's a suggestion for the warning system. Is it possible to have it call print_state for the failing grid cell, just for the aborts that are from solutions becoming physically unreasonable, like this one?
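For illustration, a hedged sketch of what that suggestion could look like at an abort site (print_state is the existing diagnostics routine; the flag name and the exact abort message here are hypothetical):

if (unphysical_state) then                          ! hypothetical flag for this kind of failure
   call print_state('state at abort', i, j, iblk)   ! dump the failing grid cell
   call abort_ice('ice: physically unreasonable solution')
endif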
It could be a CFL issue, but then I think if it were a matter of being "unlucky", we'd be seeing it more often in the 2d. I think it's unlikely that the 2d is quite robust for lots of configurations but the 1d just gets unlucky on the first long run we do. @eclare108213, I like your idea of trying to leverage print_state more. I have created an issue, #622.
See #623 for follow-up discussion.
For detailed information about submitting Pull Requests (PRs) to the CICE-Consortium,
please refer to: https://github.com/CICE-Consortium/About-Us/wiki/Resource-Index#information-for-developers
PR checklist
Removes all bad departure point bugs
Copied from @srethmeier
The issue with departure errors resulted from the default initialization of full land blocks in the gathering part of the EVP 1D implementation. To solve this, and to avoid gathering and scattering constant arrays, the HTE and HTN arrays (including ghost cells) are now, for the EVP 1D implementation, allocated and given their values during initialization.
The issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization resulted from the order of computations for tinyarea not being identical between the EVP 2D implementation and the EVP 1D implementation. This was solved by making the EVP 1D computation order identical to the EVP 2D computation order.
A further issue with tests not being bit-for-bit comparable when compiled with debug/-O0 optimization resulted from icetmask not being a subset of iceumask, as initially assumed in the EVP 1D implementation. This was solved by adding a skiptcell logical array for skipping stress calculations in the EVP 1D implementation (see the sketch below).
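A minimal sketch of the skiptcell idea under the assumptions above (the loop bound and everything except skiptcell are illustrative names, not the exact ones in the code):

! a T-cell can be flagged in icetmask without being covered by the
! U-point mask, so the 1d stress loop carries its own skip flag
do iw = 1, na                     ! na: number of active 1d points (illustrative)
   if (skiptcell(iw)) cycle       ! skip the stress calculation for this T-point
   ! ... stress computation on the 1d vectors ...
enddo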
Then we did some modifications to the OpenMP parts of the EVP 1D implementation. The core part of the implementation (stress and stepu) was implemented to be NUMA-aware for performance. This was done differently for the interfacing part of the implementation, but is now aligned throughout the EVP 1D implementation so that all of it is NUMA-aware.
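A hedged illustration of the NUMA-aware pattern (first-touch initialization inside the same OpenMP schedule that later operates on the data; the array and extent names are illustrative, not the exact ones in ice_dyn_evp_1d.F90):

!$OMP PARALLEL DO SCHEDULE(static) PRIVATE(iw)
do iw = 1, navel                  ! navel: length of the 1d vectors (illustrative)
   uvel_1d(iw) = c0               ! first touch places each page on the
   vvel_1d(iw) = c0               ! initializing thread's local NUMA node
enddo
!$OMP END PARALLEL DO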
As for namelist changes, the only change that has been made is the renaming of kevp_kernel to evp_algorithm and changing it from an integer to a string. EVP 2D is now enabled by setting evp_algorithm = 'standard_2d' instead of kevp_kernel = 0, and EVP 1D by setting evp_algorithm = 'shared_mem_1d' instead of kevp_kernel = 102. In connection with this, the option set_nml.evp1d was also added. Documentation has also been updated to reflect this modification.
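For reference, a hedged example of the new setting as it would appear in the ice_in namelist (the namelist group name is assumed here):

&dynamics_nml
    evp_algorithm = 'standard_2d'      ! 2d EVP, replaces kevp_kernel = 0
!   evp_algorithm = 'shared_mem_1d'    ! 1d EVP, replaces kevp_kernel = 102
/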
Finally, we did a full cleanup of ice_dyn_evp_1d.F90 and some cleanup of ice_dyn_evp.F90. This has mainly included:
This implementation should work for revised evp and "traditional" evp. The former should be tested.
@TillRasmussen @srethmeier (Stefan Rethmeier, DMI)
Removes bad departure point bug.
Failed tests:
restart gbox128 4x2: this test runs but fails to restart exactly.
/results.csh | grep FAIL
FAIL freya_intel_restart_gbox128_4x2_kevp102_short test
FAIL freya_intel_logbfb_gx3_1x1x50x58x4_diag1_droundrobin_kevp102_maskhalo_reprosum_thread bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL freya_intel_logbfb_gx3_4x1x25x116x1_diag1_dslenderX1_kevp102_maskhalo_reprosum_thread bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL freya_intel_logbfb_gx3_1x20x5x29x80_diag1_dsectrobin_kevp102_reprosum_short bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL freya_intel_logbfb_gx3_8x2x8x10x20_diag1_droundrobin_kevp102_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL freya_intel_logbfb_gx3_6x2x50x58x1_diag1_droundrobin_kevp102_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL freya_intel_logbfb_gx3_6x2x4x29x18_diag1_dspacecurve_kevp102_maskhalo_reprosum bfbcomp freya_intel_logbfb_gx3_4x2x25x29x4_diag1_dslenderX2_reprosum different-data
FAIL - test failed
7 of 390 tests FAILED
This is a bug fix. The 1d evp solver speeds up the evp solver substantially but, as expected, is not bit-for-bit reproducible. If it is chosen, the solver will use OMP only. The bug was in the gather method when masked blocks were neighbours of an ocean point: when derivatives were found, the point got the special value, which by default is a very big number. Velocities and stresses are set to zero, whereas HTE and HTN store a global 2d array.
There are alternative choices to fix the bug.
a/ fill the halo zones in the gather routine with values of neighbouring blocks.
b/ merge the conversion from 2d to 1d and the gather routine.
A bit of cleaning should be done.
There should be a test including 1d evp.