Possible memory leak in filter_topo program #105
A small test script with data was created to run the filter_topo program stand-alone (on Venus): |
I tested the script on Venus using 'develop' at 0f5f3cb. A baseline 'oro' file was created using the standard compile. I then recompiled with the 'debug' option on:
I was hoping it would crash with an out-of-bounds array access or a divide by zero. But it ran to completion. The output did not match the baseline:
Some differences are large. All differences were confined to the top and bottom three rows. |
Ported the test script to Hera - |
Tested the script on Hera. Compiled 'develop' at 0f5f3cb using the GNU compiler with the debug option (-DCMAKE_BUILD_TYPE=Debug). I was hoping it would find a memory leak or an out-of-bounds array access. It did crash with memory leaks. To fix the leaks, I updated the source code as follows:
The test then ran to completion. |
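For readers unfamiliar with why a 'debug' build crashes where a 'release' build runs to completion, here is a hypothetical example (not code from filter_topo, and not the fix applied above): with GNU run-time checks such as -fcheck=bounds, the stray write below aborts immediately, while an optimized build silently scribbles on neighboring memory and appears to work.

```fortran
! Hypothetical example only -- not code from filter_topo.
program bounds_demo
  implicit none
  real, allocatable :: work(:)
  integer :: i

  allocate(work(10))
  do i = 1, 11        ! loop runs one element past the allocation
     work(i) = real(i)
  end do
  print *, sum(work)
end program bounds_demo
```

Built with gfortran -g -fcheck=bounds, this stops with a bounds error at i=11; built optimized with no checks, it usually runs to completion with whatever side effects the stray write happens to cause.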
Ran the test on Hera four times:
The output files from each test were compared. The two Intel tests showed large differences indicative of a problem:
The two GNU tests showed very small differences:
The GNU debug and Intel debug showed very small differences:
The Intel debug and GNU release tests produced no differences:
What does this say? That the Intel compiler optimizations are not playing nice with the filter code? |
running the GNU debug option. Fixes ufs-community#105
Compiled with the Intel 'release' build type: Comparing 0f5f3cb and acba0fb shows no differences.
Comparing acba0fb and dbe2ba5 shows large differences.
With the Intel 'debug' build type: Comparing 0f5f3cb and acba0fb shows no differences.
Comparing acba0fb and dbe2ba5 shows large differences.
|
Repeated the tests using GNU. GNU 'release' build: Comparing 0f5f3cb and acba0fb shows no differences.
Comparing acba0fb and dbe2ba5 shows no differences.
GNU 'debug' build: Comparing acba0fb and dbe2ba5 shows no differences.
Recall, the 0f5f3cb tag would not run in 'debug' mode because of memory leaks. Comparing the 'debug' and 'release' builds at dbe2ba5 shows very small differences, as expected:
|
There are several arrays being allocated in routine "read_grid_file". Some are initialized with a constant value, while others are not initialized at all. I wonder if the compilers are filling these arrays with random garbage that should later be overwritten (but is not) by valid data. So I updated the branch to initialize these arrays. The values (some were set to a large number, and some to zero) were chosen so the test would match the baseline. If the code is working properly, then I expect I could initialize these arrays to any value and the result should not change. Using the GNU 'debug' option, which is giving me a consistent (but not necessarily correct) answer, I ran ccc6692 on Hera, then compared to my baseline:
This is a very small difference. Next, I will check each array. |
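A minimal sketch of the initialization strategy described in the previous comment; the array names, bounds, and chosen values below are placeholders, not the actual read_grid_file code.

```fortran
! Sketch only: give every allocated array a deliberate starting value so that
! any element never overwritten by valid data produces a consistent (and
! obviously wrong) result instead of a compiler-dependent one.
program init_demo
  implicit none
  integer, parameter :: isd = 0, ied = 97, jsd = 0, jed = 97, ntiles = 1
  real, parameter    :: big = 9.99e20
  real, allocatable  :: geolon_c(:,:,:), geolat_c(:,:,:), dx(:,:,:)

  allocate(geolon_c(isd:ied+1, jsd:jed+1, ntiles))
  allocate(geolat_c(isd:ied+1, jsd:jed+1, ntiles))
  allocate(dx(isd:ied, jsd:jed, ntiles))

  geolon_c = big   ! large flag value: should be overwritten by the 'grid' file read
  geolat_c = big
  dx       = 0.0   ! zero where a neutral starting value is wanted
end program init_demo
```

If the program only ever touches valid data, the final 'oro' file should be identical no matter which starting values are used here.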
Something is not right about the setting of 'geolon_c' and 'geolat_c' in 'read_grid_file'. These variables are allocated with a halo:
For my test, the indices are:
Later the 'grid' file is read and the arrays (not including the halo) are initialized:
For my test, the indices are:
At this point the halo contains who knows what. Later, the routine tries to use these halo values:
Interestingly, there is a routine to fill in halo values. But since this is a regional grid (regional=.true.), it is never called:
Will ask the regional group for their take on this. |
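A minimal, self-contained sketch of the situation described in the comment above; the bounds, the read, and the routine name are stand-ins, not the actual filter_topo code.

```fortran
program halo_demo
  implicit none
  integer, parameter :: is = 1, ie = 4, js = 1, je = 4, halo = 1
  logical, parameter :: regional = .true.
  real, allocatable  :: geolon_c(:,:)
  integer :: i, j

  ! Allocated with a one-point halo on every side...
  allocate(geolon_c(is-halo:ie+1+halo, js-halo:je+1+halo))

  ! ...but the 'grid' file read fills only the interior (no halo).
  do j = js, je + 1
     do i = is, ie + 1
        geolon_c(i,j) = 10.0*i + j      ! stand-in for the netCDF read
     end do
  end do

  ! For a global grid the halo would be filled from neighboring tiles here;
  ! for a regional grid (regional = .true.) this step is skipped entirely.
  if (.not. regional) then
     ! call fill_cubic_grid_halo(...)   ! never reached in the regional case
  end if

  ! Any later calculation (such as dxc/dyc) that reaches into the halo
  ! therefore reads whatever happened to be in that memory.
  print *, 'halo value used by the dxc/dyc calculation: ', geolon_c(is-1, js)
end program halo_demo
```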
I got to the same line. Since the computation of dxc and dyc requires geolat_c's and geolon_c's halo points, I'll try to fill in the halo points like in the global case. |
I created a routine which extrapolates the geolat_c and geolon_c halo points. These points are used for the dxc and dyc calculation. This should fix the problem. |
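A minimal sketch of the idea, assuming a simple linear extrapolation of a one-point halo from the two nearest interior points on each edge; the actual routine in the branch may use a different formula.

```fortran
! Sketch only: linearly extrapolate a one-point halo from the two nearest
! interior points along every edge (corners are picked up by the second loop).
subroutine extrapolate_halo(field, isd, ied, jsd, jed)
  implicit none
  integer, intent(in)    :: isd, ied, jsd, jed          ! bounds WITHOUT the halo
  real,    intent(inout) :: field(isd-1:ied+1, jsd-1:jed+1)
  integer :: i, j

  do j = jsd, jed                                        ! west and east edges
     field(isd-1,j) = 2.0*field(isd,j) - field(isd+1,j)
     field(ied+1,j) = 2.0*field(ied,j) - field(ied-1,j)
  end do
  do i = isd-1, ied+1                                    ! south and north edges, including corners
     field(i,jsd-1) = 2.0*field(i,jsd) - field(i,jsd+1)
     field(i,jed+1) = 2.0*field(i,jed) - field(i,jed-1)
  end do
end subroutine extrapolate_halo
```

Applied to geolon_c and geolat_c before the dxc/dyc calculation, this gives every halo point a deterministic value, so the result no longer depends on what happens to be in memory.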
For regional grids, we create several 'grid' files with and without halos. Instead of extrapolating, should one of the 'halo' 'grid' files be used? We only care about filtering on the halo0 file, correct? |
I guess we should be OK since it is done in a similar way for the global and regional nest cases:
|
@RatkoVasic-NOAA Are you working from my branch? I don't see your fork. |
I used a hard copy: |
Since the coding changes are non-trivial, I would create a fork and branch: https://github.com/NOAA-EMC/UFS_UTILS/wiki/3.-Checking-out-code-and-making-changes |
I'll do it. |
to allow for using it in unit tests. Create simple unit test for that routine. Fixes ufs-community#105
Tested Ratko's latest version of his branch (852405d) on Hera using my C424 test case: Intel/Release, Intel/Debug, GNU/Release, GNU/Debug all gave bit identical answers. Yeah! |
Seeing odd behavior when running in regional mode. My comment from a previous issue (#91):
Discovered when running the grid_gen regression test from develop (270f9dc). It happens with the 'regional' regression test. If I fix the rank of phis in routine FV3_zs_filter (it should be 3, not 4) and print out the value of phis(1,1), I get this difference in file C96_oro_data.tile7.halo4.nc when compared to the regression test baseline:
If I add another print statement for phis(1,2), I get this difference:
So there is likely some kind of memory leak going on with the 'regional' option.
Originally posted by @GeorgeGayno-NOAA in #91 (comment)
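As a hypothetical illustration of the kind of rank mismatch described in that comment (the names, dimensions, and call below are invented, not the actual FV3_zs_filter interface): when a dummy argument is declared with an extra dimension, references to that phantom dimension index memory the actual array does not own, and the printed values then depend on whatever sits there.

```fortran
! Hypothetical example only -- not the actual filter_topo/FV3_zs_filter code.
program rank_demo
  implicit none
  real :: phis3(4,4,1)                 ! the caller's array: rank 3, 16 elements
  phis3 = 1.0
  ! Passing the first element associates the dummy with the caller's storage
  ! (sequence association), so the wrong rank-4 declaration is accepted...
  call zs_filter_like(phis3(1,1,1), 4, 4, 1)
  print *, phis3(1,1,1)
contains
  subroutine zs_filter_like(phis, nx, ny, nt)
    integer, intent(in) :: nx, ny, nt
    real, intent(inout) :: phis(nx,ny,nt,2)   ! wrong: rank should be 3, not 4
    ! ...but touching the phantom fourth dimension walks past the 16 elements
    ! that actually exist, reading or clobbering unrelated memory.
    phis(1,1,1,2) = 99.0
  end subroutine zs_filter_like
end program rank_demo
```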