-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
+More consistent treatment of input_filename = 'F' #1370
+More consistent treatment of input_filename = 'F' #1370
Conversation
Restructured the code slightly so that the output files that are generated when input_filename = 'F' is exactly the same as if it is 'n' if there are no restart files in the restart_input_dir, or as if it is 'r' if the restart files are there. Previously, the solutions with 'F' worked this way, but no ocean_geometry.nc or Vertical_grid.nc files were written when WRITE_GEOM=1, regardless of the presence or absence of the restart files, and the MOM_parameter_doc.all files differed slightly between the 'n' and 'F' or 'r' cases. As a part of these changes, the determination of whether this is a new run is moved earlier in the algorithm, and now sits outside of MOM_initialize_state. All solutions are bitwise identical, but there are changes in the position of the PARALLEL_RESTARTFILES and REFERENCE_HEIGHT entries in most MOM_parameter_doc.all files.
There is an answer change in tc2.a when using restarts with GNU on Gaea. Neither GitHub Actions nor my home machine caught this issue, which both use newer compiler versions (at least 9.2) whereas Gaea uses 7.3.0 so it may only happen in older compilers. This may have something to do with expected default namelist values. |
I am able to reproduce the gnu restart problems with TC2 (and I know why I missed this problem). I will be debugging this PR and correcting it tomorrow, hopefully. |
If this is a real issue then we should try to modify our tests so that the newer compilers can catch it. |
Changed the declarations of the vardesc and fields arrays to allocatable in write_ocean_geometry_files, primarily to get one of the TC test cases to run properly with the gcc compiler by shifting the memory for these arrays from stack to heap. The reason why this change works is not clear. Some comments describing these variables were also added. All answers are bitwise identical.
After further investigation, the resstart reproducability appears to be tickling some sort of memory corruption issue in TC2 that I do not understand. This failure of TC2 is triggered if the call to ocean_write_geometry file occurs after the call to tracer_registry_init. In particular if only 18 variables' metadata is stored in the 23 member array of vardesc types in ocean_write_geometry, every thing is fine, but if the metadata for the 19th is stored, the TC2 restart tests start to fail with gcc 7.3.0, even if a return call is used to keep this array from actually being used after it is set. (Similar problems seem to arise with gcc 8.3.0 or 9.2.0 as well, although I haven't put all of the time into narrowing down exactly when in the code this occurs with those compiler versions.) All of the memory in question was being set automatically from stack, and not via an allocate call, so this seems to be some sort of memory management issue with the particular version of the gnu compiler that is being used on Gaea when working with arrays of structures of a sufficient size. The fact that this issue only arises in one of the TCs, for which we have all diagnostics enabled, and not in any of our regression tests suggests to me that we might be approaching memory limits associated with the increasing number of diagnostics in MOM6. Perhaps a broad survey of the memory associated with each diagnostic (e.g., fixed character string lengths) or other use of static memory related to I/O would be in order. Changing ocean_write_geometry to explicitly allocate and deallocate its memory from the heap avoids this problem, but I find it to be profoundly disconcerting that I do not understand how this actually arose, or what we could do to avoid anything similar in the future. I am adding a modification to this PR to MOM_shared_initialization.F90 that addresses this problem. |
It does seem a bit odd, maybe We could hit it with valgrind, but I think let's take this as a win and move on. |
https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/12461 ✔️ 🟡 This requires a parameter update. |
I tested this on Gaea with GCC 8.3 and it did not raise the error, so I am going to speculate that the issue was a compiler bug, and the change here is sufficient to avert it. Maybe it's something more severe but I think let's leave this for now and just take a note in the future. |
Restructured the code slightly so that the output files that are generated
when input_filename = 'F' is exactly the same as if it is 'n' if there are no
restart files in the restart_input_dir, or as if it is 'r' if the restart files
are there. Previously, the solutions with 'F' worked this way, but no
ocean_geometry.nc or Vertical_grid.nc files were written when WRITE_GEOM=1,
regardless of the presence or absence of the restart files, and the
MOM_parameter_doc.all files differed slightly between the 'n' and 'F' or 'r'
cases. As a part of these changes, the determination of whether this is a new
run is moved earlier in the algorithm, and now sits outside of MOM_initialize_state.
All solutions are bitwise identical, but there are changes in the position of the
PARALLEL_RESTARTFILES and REFERENCE_HEIGHT entries in most
MOM_parameter_doc.all files.
Despite appearances to the contrary, this is actually a relatively simple
single-commit PR with just two files modified, that builds on MOM6 PR #1369.