Possible memory leak in filter_topo program #105

Closed · GeorgeGayno-NOAA opened this issue May 18, 2020 · 18 comments
Labels: bug (Something isn't working)

@GeorgeGayno-NOAA commented May 18, 2020

Seeing odd behavior when running in regional mode. My comment from a previous issue (#91):

Discovered when running the grid_gen regression test from develop (270f9dc). It happens with the 'regional' regression test. If I fix the rank of phis in routine FV3_zs_filter (it should be 3, not 4) and print out the value of phis(1,1), I get this difference in file C96_oro_data.tile7.halo4.nc when compared to the regression test baseline:

Variable  Group Count     Sum  AbsSum      Min     Max   Range     Mean  StdDev
orog_filt /       396 103.033 318.697 -25.4385 26.2191 51.6576 0.260184 2.97099

If I add another print statement for phis(1,2), I get this difference:

Variable  Group Count       Sum AbsSum      Min     Max   Range        Mean   StdDev
orog_filt /       370 -0.375389 68.313 -2.69452 3.75537 6.44989 -0.00101457 0.484862

So there is likely some kind of memory leak going on with the 'regional' option.

Originally posted by @GeorgeGayno-NOAA in #91 (comment)
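
For background, a minimal hypothetical sketch (not the actual UFS_UTILS code) of why a wrong dummy-argument rank produces symptoms like this: when caller and callee are compiled separately, no interface checking occurs, so a rank-4 dummy can receive a rank-3 actual argument and index past the storage that was really passed.

! rank_mismatch_sketch.f90 - illustrative only; names and shapes are
! hypothetical, not taken from filter_topo.F90. (Modern gfortran flags
! the mismatch if both units are compiled together; separate
! compilation goes unchecked.)
subroutine filt(p)
  implicit none
  real :: p(4,4,3,4)     ! rank-4 dummy: claims 192 elements
  print *, p(1,1,1,4)    ! element 145 of the 48 actually passed:
                         ! reads whatever happens to sit beyond them
end subroutine filt

program rank_mismatch_sketch
  implicit none
  real :: phis(4,4,3)    ! rank-3 actual argument: 48 elements
  phis = 1.0
  call filt(phis)        ! implicit interface: no rank/size checking
end program rank_mismatch_sketch

Reads like that depend on whatever happens to be adjacent in memory, which is why adding or removing a print statement can shift the answer.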

@GeorgeGayno-NOAA

A small test script with data was created to run the filter_topo program stand-alone (on Venus):
/gpfs/dell2/emc/modeling/noscrub/George.Gayno/filter_topo_bug

@GeorgeGayno-NOAA

I tested the script on Venus using 'develop' at 0f5f3cb. A baseline 'oro' file was created using the standard compile. I then recompiled with the 'debug' option on:

diff --git a/build_all.sh b/build_all.sh
index fe457c9c..70936332 100755
--- a/build_all.sh
+++ b/build_all.sh
@@ -26,7 +26,7 @@ else
  set -x
 fi

-CMAKE_FLAGS="-DCMAKE_INSTALL_PREFIX=../ -DEMC_EXEC_DIR=ON"
+CMAKE_FLAGS="-DCMAKE_INSTALL_PREFIX=../ -DEMC_EXEC_DIR=ON -DCMAKE_BUILD_TYPE=Debug"

I was hoping it would crash with an out-of-bounds array reference or a divide-by-zero. But it ran to completion. The output did not match the baseline:

+ nccmp -dmfqS oro.C424.tile7.nc oro.C424.tile7.nc.1pass
Variable  Group Count     Sum  AbsSum      Min     Max   Range       Mean  StdDev
orog_filt /       415 2.26251 2429.39 -231.994 232.415 464.409 0.00545183 24.9542

Some differences are large. All differences were confined to the top and bottom three rows.

GeorgeGayno-NOAA self-assigned this Aug 3, 2021
@GeorgeGayno-NOAA

> A small test script with data was created to run the filter_topo program stand-alone (on Venus):
> /gpfs/dell2/emc/modeling/noscrub/George.Gayno/filter_topo_bug

Ported the test script to Hera - /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug

@GeorgeGayno-NOAA

Tested the script on Hera. Compiled 'develop' at 0f5f3cb using the GNU compiler with the debug option (-DCMAKE_BUILD_TYPE=Debug). I was hoping it would find a memory leak or an out-of-bounds array reference, and it did crash with memory leaks. To fix the leaks, I updated the source code as follows:

--- a/sorc/grid_tools.fd/filter_topo.fd/filter_topo.F90
+++ b/sorc/grid_tools.fd/filter_topo.fd/filter_topo.F90
@@ -81,9 +81,13 @@ program filter_topo
   call FV3_zs_filter(is,ie,js,je,isd,ied,jsd,jed,npx,npy,npx,ntiles,grid_type, &
                      stretch_fac, nested, area, dxa, dya, dx, dy, dxc, dyc, sin_sg, oro, regional )

+  deallocate(sin_sg, dx, dy, mask, dxc, dyc, area, dxa, dya)
+
   !--- write out the data
   call write_topo_file(is,ie,js,je,ntiles,oro(is:ie,js:je,:),regional )

+  deallocate (oro)
+
 contains

The test then ran to completion.
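
As a side note, the Fortran runtime releases allocatable arrays at program termination anyway, so (as I understand it) these explicit deallocations mainly quiet the debug-mode leak checker and make ownership explicit. A minimal sketch of the guarded cleanup pattern, with bounds borrowed from this test case and everything else illustrative:

! dealloc_sketch.f90 - illustrative cleanup pattern only.
program dealloc_sketch
  implicit none
  real, allocatable :: oro(:,:,:)
  integer :: isd, ied, jsd, jed, ntiles

  isd = -2; ied = 225; jsd = -2; jed = 207; ntiles = 1
  allocate(oro(isd:ied, jsd:jed, ntiles))
  oro = 0.0

  ! ... filter the topography and write it out ...

  ! Guarding with allocated() makes the cleanup safe even if the
  ! array was never allocated (or was already released).
  if (allocated(oro)) deallocate(oro)
end program dealloc_sketch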

@GeorgeGayno-NOAA

Ran the test on Hera four times:

  • Intel with 'release' build type.
  • Intel with 'debug' build type.
  • GNU with 'release' build type.
  • GNU with 'debug' build type.

The output files from each test were compared.

The two Intel tests showed large differences indicative of a problem:

$ nccmp -dmfqS oro.C424.tile7.nc.intel.O3 oro.C424.tile7.nc.intel.debug
Variable  Group Count      Sum  AbsSum      Min     Max   Range        Mean  StdDev
orog_filt /       415 -2.26251 2429.39 -232.415 231.994 464.409 -0.00545183 24.9542

The two GNU tests showed very small differences:

$ nccmp -dmfqS oro.C424.tile7.nc.gnu.O3 oro.C424.tile7.nc.gnu.debug
Variable  Group Count          Sum      AbsSum          Min          Max Range         Mean StdDev
orog_filt /         1 -4.13247e-07 4.13247e-07 -4.13247e-07 -4.13247e-07     0 -4.13247e-07      0

The GNU debug and Intel debug showed very small differences:

$  nccmp -dmfqS oro.C424.tile7.nc.gnu.debug oro.C424.tile7.nc.intel.debug
Variable  Group Count         Sum      AbsSum         Min         Max Range        Mean StdDev
orog_filt /         1 4.13247e-07 4.13247e-07 4.13247e-07 4.13247e-07     0 4.13247e-07      0

The Intel debug and GNU release tests produced no differences:

$ nccmp -dmfqS oro.C424.tile7.nc.gnu.O3 oro.C424.tile7.nc.intel.debug
/scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug/data $

What does this say? That the Intel compiler optimizations are not playing nice with the filter code?

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Aug 4, 2021
GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Aug 5, 2021
@GeorgeGayno-NOAA commented Aug 5, 2021

Compiled my bugfix/filter_topo branch on Hera at 0f5f3cb, acba0fb, and dbe2ba5. Reran the test and compared the output files. None of the changes I made should change results.

Intel 'release' build type:

Comparing 0f5f3cb and acba0fb shows no differences.

$ nccmp -dmfqS oro.C424.tile7.nc.intel.O3.0f5f3cb oro.C424.tile7.nc.intel.O3.acba0fb
/scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug/data $

Comparing acba0fb and dbe2ba5 shows large differences.

$ nccmp -dmfqS oro.C424.tile7.nc.intel.O3.acba0fb oro.C424.tile7.nc.intel.O3.dbe2ba
Variable  Group Count       Sum  AbsSum      Min     Max   Range         Mean  StdDev
orog_filt /       534 -0.222672 812.112 -64.0984 66.2278 130.326 -0.000416988 5.64295

Intel 'debug' build type:

Comparing 0f5f3cb and acba0fb shows no differences.

$ nccmp -dmfqS oro.C424.tile7.nc.intel.debug.0f5f3cb oro.C424.tile7.nc.intel.debug.acba0fb
/scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug/data $

Comparing acba0fb and dbe2ba5 shows large differences.

$ nccmp -dmfqS oro.C424.tile7.nc.intel.debug.acba0fb oro.C424.tile7.nc.intel.debug.dbe2ba
Variable  Group Count      Sum  AbsSum      Min    Max  Range      Mean  StdDev
orog_filt /        46 0.157118 107.935 -11.7211 32.257 43.978 0.0034156 6.00324

@GeorgeGayno-NOAA commented Aug 5, 2021

Repeated the tests using GNU.

GNU 'release' build:

Comparing 0f5f3cb and acba0fb shows no differences.

$ nccmp -dmfqS oro.C424.tile7.nc.gnu.O3.0f5f3cb oro.C424.tile7.nc.gnu.O3.acba0fb
/scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug/data $

Comparing acba0fb and dbe2ba5 shows no differences.

$ nccmp -dmfqS oro.C424.tile7.nc.gnu.O3.acba0fb oro.C424.tile7.nc.gnu.O3.dbe2ba
/scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug/data $

GNU 'debug' build:

Comparing acba0fb and dbe2ba5 shows no differences.

$ nccmp -dmfqS oro.C424.tile7.nc.gnu.debug.acba0fb oro.C424.tile7.nc.gnu.debug.dbe2ba
/scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug/data $

Recall, the 0f5f3cb tag would not run in 'debug' mode because of memory leaks.

Comparing the 'debug' and 'release' builds at dbe2ba5 shows very small differences, as expected:

$ nccmp -dmfqS oro.C424.tile7.nc.gnu.O3.dbe2ba oro.C424.tile7.nc.gnu.debug.dbe2ba
Variable  Group Count          Sum      AbsSum          Min          Max Range         Mean StdDev
orog_filt /         1 -4.13247e-07 4.13247e-07 -4.13247e-07 -4.13247e-07     0 -4.13247e-07      0

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Aug 6, 2021
@GeorgeGayno-NOAA
There are several arrays being allocated in routine "read_grid_file". Some are initialized with a constant value, while others are not initialized at all. I wonder if these arrays hold garbage that should later be overwritten by valid data but is not. So I updated the branch to initialize these arrays. The values I chose (some a large number, some zero) were picked so the test would match the baseline. If the code were working properly, I could initialize these arrays to any value and the result should not change.

Using the GNU 'debug' option, which is giving me a consistent (but not necessarily correct) answer, I ran ccc6692 on Hera, then compared to my baseline:

$ nccmp -dmfqS oro.C424.tile7.nc oro.C424.tile7.nc.gnu.debug.dbe2ba
Variable  Group Count          Sum      AbsSum          Min          Max Range         Mean StdDev
orog_filt /         1 -4.13247e-07 4.13247e-07 -4.13247e-07 -4.13247e-07     0 -4.13247e-07      0

This is a very small difference. Next, I will check each array.
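
A minimal sketch of the sentinel-initialization test described above, assuming illustrative bounds (the array name is from read_grid_file; everything else is hypothetical): poison each array with a value valid data could never contain, then check whether any of it survives.

! init_sketch.f90 - sentinel initialization to expose points that are
! read before they are ever written; illustrative only.
program init_sketch
  implicit none
  real, parameter :: SENTINEL = 1.0e30
  real, allocatable :: geolon_c(:,:)

  allocate(geolon_c(-2:226, -2:208))   ! allocated with the halo
  geolon_c = SENTINEL                  ! poison everything up front

  ! ... read the 'grid' file into the interior (1:223, 1:205) ...

  ! Any surviving sentinel marks a point that never received valid
  ! data; if the program's answer changes with the sentinel value,
  ! something downstream is consuming uninitialized points.
  if (any(geolon_c == SENTINEL)) print *, 'uninitialized points remain'
end program init_sketch

If every point that gets read were also written, the choice of initial value would be irrelevant, which is exactly the property being tested.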

@GeorgeGayno-NOAA

Something is not right about the setting of 'geolon_c' and 'geolat_c' in 'read_grid_file'. These variables are allocated with a halo:

allocate(geolon_c(isd:ied+1,jsd:jed+1,ntiles))          
allocate(geolat_c(isd:ied+1,jsd:jed+1,ntiles))

For my test, the indices are:

isd,ied,jsd,jed -2 225 -2 207

Later the 'grid' file is read and the arrays (not including the halo) are initialized:

geolon_c(1:npx,1:npy,t) = tmpvar(1:ni+1:2,1:nj+1:2)*PI/180.
geolat_c(1:npx,1:npy,t) = tmpvar(1:ni+1:2,1:nj+1:2)*PI/180.

For my test, the indices are:

npx,npy 223 205

At this point the halo points (i = -2:0 and 224:226, j = -2:0 and 206:208 in this test) contain whatever happened to be in memory.

Later, the routine tries to use these halo values:

    do j=jsd,jed
       do i=isd+1,ied
          g1(1) = geolon_c(i,j,t);   g1(2) = geolat_c(i,j,t)
          g2(1) = geolon_c(i-1,j,t); g2(2) = geolat_c(i-1,j,t)

Interestingly, there is a routine to fill in halo values. But since this is a regional grid (regional=.true.), it is never called:

   if( .not. regional ) then
      call fill_cubic_grid_halo(geolon_c, geolon_c, ng, 1, 1, 1, 1)
      call fill_cubic_grid_halo(geolat_c, geolat_c, ng, 1, 1, 1, 1)
      if(.not. nested) call fill_bgrid_scalar_corners(geolon_c, ng, npx, npy, isd, jsd, XDir)
      if(.not. nested) call fill_bgrid_scalar_corners(geolat_c, ng, npx, npy, isd, jsd, YDir)
    endif

Will ask the regional group for their take on this.

@RatkoVasic-NOAA

I got to the same line. Since the computation of dxc and dyc requires the halo points of geolat_c and geolon_c, I'll try to fill in the halo points as in the global case.

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Aug 17, 2021
@RatkoVasic-NOAA

I created a routine which extrapolates the geolat_c and geolon_c halo points. These points are used for the dxc and dyc calculation. This should fix the problem.

@GeorgeGayno-NOAA

> I created a routine which extrapolates the geolat_c and geolon_c halo points. These points are used for the dxc and dyc calculation. This should fix the problem.

For regional grids, we create several 'grid' files with and without halos. Instead of extrapolating, should one of the 'halo' 'grid' files be used? We only care about filtering on the halo0 file, correct?

@RatkoVasic-NOAA

I guess we should be OK, since it is done in a similar way for the global and nested cases:

   if( .not. regional ) then
      call fill_cubic_grid_halo(geolon_c, geolon_c, ng, 1, 1, 1, 1)
      call fill_cubic_grid_halo(geolat_c, geolat_c, ng, 1, 1, 1, 1)
      if(.not. nested) call fill_bgrid_scalar_corners(geolon_c, ng, npx, npy, isd, jsd, XDir)
      if(.not. nested) call fill_bgrid_scalar_corners(geolat_c, ng, npx, npy, isd, jsd, YDir)
    else
      call fill_regional_halo(geolon_c, ng)
      call fill_regional_halo(geolat_c, ng)
    endif
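
The body of fill_regional_halo is not shown in this thread; as a rough sketch of the kind of extrapolation described above (a hypothetical 2D version; the real arrays also carry a tile dimension), each halo ring can be filled by linear extrapolation from the two nearest points already set:

! fill_halo_sketch.f90 - hypothetical linear-extrapolation halo fill;
! not the actual fill_regional_halo from the fix.
subroutine fill_halo_sketch(fld, ng, is, ie, js, je)
  implicit none
  integer, intent(in) :: ng, is, ie, js, je
  real, intent(inout) :: fld(is-ng:ie+ng, js-ng:je+ng)
  integer :: i, j, n

  do n = 1, ng
    do j = js-(n-1), je+(n-1)          ! west/east edges of ring n
      fld(is-n,j) = 2.0*fld(is-n+1,j) - fld(is-n+2,j)
      fld(ie+n,j) = 2.0*fld(ie+n-1,j) - fld(ie+n-2,j)
    end do
    do i = is-n, ie+n                  ! south/north edges, incl. corners
      fld(i,js-n) = 2.0*fld(i,js-n+1) - fld(i,js-n+2)
      fld(i,je+n) = 2.0*fld(i,je+n-1) - fld(i,je+n-2)
    end do
  end do
end subroutine fill_halo_sketch

Ring n reads only the interior and ring n-1, so the loop order matters.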

@GeorgeGayno-NOAA

@RatkoVasic-NOAA Are you working from my branch? I don't see your fork.

@RatkoVasic-NOAA

I used a hard copy:
Hera:/scratch1/NCEPDEV/nems/Ratko.Vasic/filter_topo/UFS_UTILS/sorc/grid_tools.fd/filter_topo.fd

@GeorgeGayno-NOAA

> I used a hard copy:
> Hera:/scratch1/NCEPDEV/nems/Ratko.Vasic/filter_topo/UFS_UTILS/sorc/grid_tools.fd/filter_topo.fd

Since the coding changes are non-trivial, I would create a fork and branch: https://github.com/NOAA-EMC/UFS_UTILS/wiki/3.-Checking-out-code-and-making-changes

@RatkoVasic-NOAA

I'll do it.

RatkoVasic-NOAA added a commit to RatkoVasic-NOAA/UFS_UTILS that referenced this issue Aug 20, 2021
GeorgeGayno-NOAA added a commit to RatkoVasic-NOAA/UFS_UTILS that referenced this issue Aug 26, 2021:

> …to allow for using it in unit tests. Create simple unit
> test for that routine.
>
> Fixes ufs-community#105

GeorgeGayno-NOAA added a commit to RatkoVasic-NOAA/UFS_UTILS that referenced this issue Aug 26, 2021
@GeorgeGayno-NOAA

Tested Ratko's latest version of his branch (852405d) on Hera using my C424 test case: /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/filter_topo_bug

Intel/Release, Intel/Debug, GNU/Release, and GNU/Debug all gave bit-identical answers. Yeah!
