
Possible memory leak #585

Closed
duncanwp opened this issue Sep 16, 2016 · 18 comments · Fixed by #595

@duncanwp

duncanwp commented Sep 16, 2016

I'm getting consistent seg-faults somewhere out of netCDF-python/netCDF-c/HDF and I'm struggling to track it down.

The only way I can reliably reproduce it is on OS X with a specific set of input data files and a specific output filename... It also occurs on Windows, but less reliably. Predictably, I've been unable to reproduce it on Linux with a debug-enabled build.

I've attached the stack trace from OS X when it occurs, and also some valgrind output from the Linux test (which doesn't segfault but does report potential memory leaks). The most interesting lines in the valgrind output are the "Uninitialised value was created by a heap allocation" warnings starting around line 560, then the "Invalid write of size 1" at 660, and "Address 0x600e530 is 0 bytes after a block of size 16 alloc'd" at 673.

It's also possible that these last two are related to #506.

I'm using HDF5 1.8.17, netCDF 4.4.0 and netcdf4-python 1.2.4.

Apologies that I've not given you much to go on; if it helps, I can share a recipe for the environment I'm using on OS X and the input data files (about 50 MB).

@WardF
Member

WardF commented Sep 16, 2016

I will take a look at this to see if I can track it down or duplicate it in the C library; environment info and data files would be helpful if they are easy to provide. Thanks :)

@duncanwp
Author

Great, thanks!

Here's the environment, although it can probably be created more easily by just running: conda create -c conda-forge -n cis_env cis

The data is here, and a script to reproduce the crash here.

Unfortunately you'll also have to download this plugin and point to it by setting CIS_PLUGIN_HOME to wherever you download it to... I did say it was hard to reproduce!!

@duncanwp
Author

Hi @WardF, have you had a chance to look into this? I'm happy to help if I can.

@duncanwp
Author

I can seemingly fix the problem if I remove the call to Dataset.filepath(). This is also where one of the valgrind memory errors was coming from, so a good place to look would probably be around line 13596 in _netCDF4.c and the call to nc_inq_path, but I can't see anything obviously wrong with my rusty C...

@WardF
Member

WardF commented Sep 21, 2016

I haven't yet, no, but I will investigate as soon as I can. I'll also take a look around nc_inq_path to see if I can identify an issue in the C library. Python isn't my day-to-day, so I will need to find the time to set up the Python environment, but that will be my next step if nothing jumps out in the C code.

@WardF
Member

WardF commented Sep 21, 2016

Off the top of my head, if the pointer being passed to nc_inq_path isn't being allocated properly, that would cause some potential issues. I need to look at how that pointer is set up to see if that might be what's happening, though.

@WardF
Member

WardF commented Sep 21, 2016

OK, we'll probably need one of the Python devs/experts to weigh in. Looking at the valgrind output you provided, I see the following:

WARNING:root:Identified 401 point(s) which were missing values for some or all coordinates - these points have been removed from the data.
==2756== Invalid write of size 1
==2756==    at 0x4C3106F: strcpy (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2756==    by 0x98426BB: nc_inq_path (in /usr/lib/x86_64-linux-gnu/libnetcdf.so.11.0.0)
==2756==    by 0x8C48491: __pyx_pf_7netCDF4_10_netCDF4_d_7Dataset_8filepath (_netCDF4_d.c:13596)
==2756==    by 0x8C48034: __pyx_pw_7netCDF4_10_netCDF4_d_7Dataset_9filepath (_netCDF4_d.c:13413)
==2756==    by 0x531286: call_function (ceval.c:4334)
==2756==    by 0x52BF12: PyEval_EvalFrameEx (ceval.c:2987)
==2756==    by 0x52EDE8: PyEval_EvalCodeEx (ceval.c:3582)
==2756==    by 0x531C80: fast_function (ceval.c:4445)
==2756==    by 0x531879: call_function (ceval.c:4370)
==2756==    by 0x52BF12: PyEval_EvalFrameEx (ceval.c:2987)
==2756==    by 0x531B6E: fast_function (ceval.c:4435)
==2756==    by 0x531879: call_function (ceval.c:4370)

First, I should probably replace the call to strcpy with a call to strncpy, but I don't think that is the issue here.

The way I'm reading this is that nc_inq_path() is being passed a pointer to memory that hasn't been allocated, which is then passed to strcpy(), resulting in the error you're seeing and reported by Valgrind. I'm not sure how to go about determining if this is the case in the netcdf4-python code, though; perhaps @jswhit or @dopplershift would be able to determine this in short order?

I do see some other Valgrind issues reported from libnetcdf, although not directly related to nc_inq_path(); I'll continue investigating on my end in the C library and see if I can identify a cause.

@jswhit
Collaborator

jswhit commented Sep 21, 2016

Here's the relevant Cython code from netcdf4-python:

            with nogil:
                ierr = nc_inq_path(self._grpid, &pathlen, NULL)
            if ierr != NC_NOERR:
                raise RuntimeError((<char *>nc_strerror(ierr)).decode('ascii'))
            c_path = <char *>malloc(sizeof(char) * pathlen)
            if not c_path:
                raise MemoryError()
            try:
                with nogil:
                    ierr = nc_inq_path(self._grpid, &pathlen, c_path)
                if ierr != NC_NOERR:
                    raise RuntimeError((<char *>nc_strerror(ierr)).decode('ascii'))
                py_path = c_path[:pathlen] # makes a copy of pathlen bytes from c_string
            finally:
                free(c_path)
            return py_path.decode('ascii')

nc_inq_path is being called with NULL as the third argument to get the length of the string. The C string is then allocated, and another call to nc_inq_path retrieves it. The C string is then copied to a Python object by slicing it, and the C string is freed in the 'finally' block.
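In plain C, the same two-call pattern looks roughly like the sketch below (this is not the netcdf4-python code itself; ncid is assumed to be an id returned by nc_open). Note the sketch allocates pathlen + 1 to leave room for a possible terminating NUL; whether that extra byte is actually needed is exactly what the rest of this thread establishes.

#include <stdlib.h>
#include <netcdf.h>

/* Sketch only: query the path length, allocate, then fetch the path.
 * pathlen + 1 leaves room for a terminating NUL written by the library. */
static char *get_dataset_path(int ncid)
{
    size_t pathlen = 0;
    char *path;

    if (nc_inq_path(ncid, &pathlen, NULL) != NC_NOERR)
        return NULL;

    path = malloc(pathlen + 1);
    if (path == NULL)
        return NULL;

    if (nc_inq_path(ncid, &pathlen, path) != NC_NOERR) {
        free(path);
        return NULL;
    }
    return path;   /* caller must free() */
}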

@WardF
Member

WardF commented Sep 21, 2016

Thanks @jswhit. @duncanwp, I'm looking at the stack.txt file you attached above and I'm seeing that the crash is somewhere down in libhdf5, in H5O_msg_read_oh; this doesn't necessarily mean anything, since if the crash is a memory issue (as it appears to be) it can be coming from anywhere. I'll keep investigating.

I notice you're using Python 2.7; is it possible to test this with Python 3.x?

@WardF
Member

WardF commented Sep 21, 2016

I'm trying to recreate this on OS X as described above, but I'm seeing the following. The output is the same regardless of whether I'm using miniconda2 or miniconda3. I've set CIS_PLUGIN_HOME, downloaded the plugin package from git, etc.

Traceback (most recent call last):
  File "crash.py", line 16, in <module>
    ["N3", "N10", "N15_NONVOL", "N15_VOLATILE", "N3_10"], product='GASSP')
  File "/Users/wfisher/miniconda2/envs/cis_env/lib/python2.7/site-packages/cis/__init__.py", line 74, in read_data_list
    return DataReader().read_data_list(expand_filelist(filenames), variables, product, aliases)
  File "/Users/wfisher/miniconda2/envs/cis_env/lib/python2.7/site-packages/cis/data_io/data_reader.py", line 90, in read_data_list
    var_data = self._get_data_func(filenames, variable, product)
  File "/Users/wfisher/miniconda2/envs/cis_env/lib/python2.7/site-packages/cis/data_io/products/AProduct.py", line 205, in get_data
    % (product_cls.__name__, type(e).__name__, e.args[0]), e)
cis.data_io.products.AProduct.ProductPluginException: An error occurred retrieving data using the product GASSP. Check that this is the correct product plugin for your chosen data. Exception was TypeError: get_metadata() takes exactly 1 argument (2 given).
(cis_env) wfisher@mort:~/Desktop/ghp585$ 

@WardF
Member

WardF commented Sep 21, 2016

Also, the data lives in $(pwd)/ACE1

@duncanwp
Author

duncanwp commented Sep 22, 2016

> I notice you're using Python 2.7; is it possible to test this with Python 3.x?

I've tried it with Python 3.5 and it has exactly the same issue. Feel free to test either.

> I'm trying to recreate this on OS X as described above, but I'm seeing the following

Ah, sorry about that; I was hacking around some of our code to see if the deferred loading of data files might be causing the problem (which it doesn't seem to be) and this got committed by mistake. If you take the plugin file again, it should be fixed now.

Looking through the output again and comparing it with the Cython code @jswhit provided, it looks like the allocation of c_path (c_path = <char *>malloc(sizeof(char) * pathlen)) raises the first warning:

==2756==  Address 0x600e530 is 0 bytes after a block of size 16 alloc'd
==2756==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2756==    by 0x8C48422: __pyx_pf_7netCDF4_10_netCDF4_d_7Dataset_8filepath (_netCDF4_d.c:13536)
==2756==    by 0x8C48034: __pyx_pw_7netCDF4_10_netCDF4_d_7Dataset_9filepath (_netCDF4_d.c:13413)
...

and then the second call to nc_inq_path (ierr = nc_inq_path(self._grpid, &pathlen, c_path)) raises the next one:

==2756== Invalid write of size 1
==2756==    at 0x4C3106F: strcpy (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2756==    by 0x98426BB: nc_inq_path (in /usr/lib/x86_64-linux-gnu/libnetcdf.so.11.0.0)
==2756==    by 0x8C48491: __pyx_pf_7netCDF4_10_netCDF4_d_7Dataset_8filepath (_netCDF4_d.c:13596)
==2756==    by 0x8C48034: __pyx_pw_7netCDF4_10_netCDF4_d_7Dataset_9filepath (_netCDF4_d.c:13413)
...

@jjhelmus
Contributor

jjhelmus commented Oct 20, 2016

I'm getting a similar segmentation fault. The following minimal script results in a segmentation fault on my system. Note that ONLY an 88-character filename results in a segmentation fault.

import netCDF4

dset = netCDF4.Dataset('./filename_of_length_88_filler_filler_filler_filler_filler_filler_filler_filler_filler_.nc', 'w')
dset.close()

infile = 'filename_of_length_88_filler_filler_filler_filler_filler_filler_filler_filler_filler_.nc'
dataset = netCDF4.Dataset(infile)
print(dataset.filepath())
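A plausible (unverified, glibc-specific) explanation for the magic 88: nc_inq_path writes one NUL byte past the pathlen bytes that were malloc'd, and on glibc an 88-byte request returns a block with exactly 88 usable bytes, so that extra byte lands on the next heap chunk's header; most other lengths overflow harmlessly into alignment padding. A quick C sketch that prints the usable sizes:

#include <malloc.h>   /* malloc_usable_size (glibc-specific) */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Print how many bytes glibc actually hands back for a few request
     * sizes.  Where the usable size equals the request (88 here), a
     * one-byte overflow clobbers the next chunk's size field, which is
     * plausibly what free() later reports as "invalid next size". */
    size_t sizes[] = {80, 85, 88, 90, 104};
    size_t i;
    for (i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        char *p = malloc(sizes[i]);
        printf("request %3zu -> usable %3zu\n", sizes[i], malloc_usable_size(p));
        free(p);
    }
    return 0;
}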

@jjhelmus
Contributor

I was not able to get cygdb to work, but I was able to build netcdf4-python from the current master (97f1515) with debugging symbols against a debug build of Python 3.5.2 and get a backtrace of the segmentation fault in the above script:

Starting program: /usr/bin/python3-dbg seg_fault.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
*** Error in `/usr/bin/python3-dbg': free(): invalid next size (fast): 0x00000000010220a0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x77725)[0x7ffff6d15725]
/lib/x86_64-linux-gnu/libc.so.6(+0x7ff4a)[0x7ffff6d1df4a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7ffff6d21abc]
/home/jhelmus/dev/netcdf4-python/netCDF4/_netCDF4.cpython-35dm-x86_64-linux-gnu.so(+0x86287)[0x7ffff5f28287]
/home/jhelmus/dev/netcdf4-python/netCDF4/_netCDF4.cpython-35dm-x86_64-linux-gnu.so(+0x8675c)[0x7ffff5f2875c]
/usr/bin/python3-dbg[0x54a625]
/usr/bin/python3-dbg(PyEval_EvalFrameEx+0x8949)[0x547bc3]
/usr/bin/python3-dbg[0x54a1a0]
/usr/bin/python3-dbg(PyEval_EvalCodeEx+0x23)[0x54a290]
/usr/bin/python3-dbg(PyEval_EvalCode+0x24)[0x54a2b9]
/usr/bin/python3-dbg[0x424ee8]
/usr/bin/python3-dbg(PyRun_FileExFlags+0x9a)[0x42797a]
/usr/bin/python3-dbg(PyRun_SimpleFileExFlags+0x31f)[0x427d17]
/usr/bin/python3-dbg(PyRun_AnyFileExFlags+0x61)[0x427e7a]
/usr/bin/python3-dbg[0x4370dc]
/usr/bin/python3-dbg(Py_Main+0x9bd)[0x437c0b]
/usr/bin/python3-dbg(main+0x163)[0x41f319]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7ffff6cbe830]
/usr/bin/python3-dbg(_start+0x29)[0x41f0e9]
(gdb) where
#0  0x00007ffff6cd3418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff6cd501a in __GI_abort () at abort.c:89
#2  0x00007ffff6d1572a in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffff6e2e6b0 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff6d1df4a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7ffff6e2e728 "free(): invalid next size (fast)", action=3)
    at malloc.c:5007
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3868
#5  0x00007ffff6d21abc in __GI___libc_free (mem=<optimized out>) at malloc.c:2969
#6  0x00007ffff5f28287 in __pyx_pf_7netCDF4_8_netCDF4_7Dataset_8filepath (__pyx_v_self=<optimized out>) at netCDF4/_netCDF4.c:13715
#7  0x00007ffff5f2875c in __pyx_pw_7netCDF4_8_netCDF4_7Dataset_9filepath (__pyx_v_self=<optimized out>, unused=<optimized out>) at netCDF4/_netCDF4.c:13447
#8  0x000000000054a625 in call_function (pp_stack=pp_stack@entry=0x7fffffffd8a0, oparg=oparg@entry=0) at ../Python/ceval.c:4684
#9  0x0000000000547bc3 in PyEval_EvalFrameEx (f=f@entry=Frame 0xaf1598, for file seg_fault.py, line 8, in <module> (), throwflag=throwflag@entry=0)
    at ../Python/ceval.c:3236
#10 0x000000000054a1a0 in _PyEval_EvalCodeWithName (_co=_co@entry=<code at remote 0x7ffff7e47dc0>,

Line 13715 of _netCDF4.c is the free(__pyx_v_c_path); line:

  /* "netCDF4/_netCDF4.pyx":1889
 *                 py_path = c_path[:pathlen] # makes a copy of pathlen bytes from c_string
 *             finally:
 *                 free(c_path)             # <<<<<<<<<<<<<<
 *             return py_path.decode('ascii')
 *         ELSE:
 */
  /*finally:*/ {
    /*normal exit:*/{
      free(__pyx_v_c_path);
      goto __pyx_L10;
    }

Is it possible that the line:

py_path = c_path[:pathlen] # makes a copy of pathlen bytes from c_string

makes a reference rather than a copy, and that Python is freeing this memory prior to the Cython free statement?

@jjhelmus
Contributor

Does nc_inq_path happen to copy pathlen+1 characters (a null-terminated path) into the memory location pointed to by the path parameter? If so, then the malloc needs to be one byte larger than pathlen to accommodate the terminating NULL character.

Specifically, the following change to _netCDF4.pyx removes the segmentation fault in my test case.

@@ -1876,7 +1876,7 @@ open/create the Dataset. Requires netcdf >= 4.1.2"""
                 ierr = nc_inq_path(self._grpid, &pathlen, NULL)
             if ierr != NC_NOERR:
                 raise RuntimeError((<char *>nc_strerror(ierr)).decode('ascii'))
-            c_path = <char *>malloc(sizeof(char) * pathlen)
+            c_path = <char *>malloc(sizeof(char) * (pathlen + 1))
             if not c_path:
                 raise MemoryError()
             try:

@jjhelmus
Contributor

I can confirm that the nc_inq_path function does indeed copy pathlen+1 characters into the c_path array regardless of its size. Changing the filepath method as follows:

        cdef int ierr
        cdef size_t pathlen
        cdef char *c_path
        IF HAS_NC_INQ_PATH:
            with nogil:
                ierr = nc_inq_path(self._grpid, &pathlen, NULL)
            if ierr != NC_NOERR:
                raise RuntimeError((<char *>nc_strerror(ierr)).decode('ascii'))
            c_path = <char *>malloc(sizeof(char) * (pathlen + 2))
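            # plant sentinel values in the last three bytes of the over-sized
            # buffer so we can see how many bytes nc_inq_path() overwrites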
            c_path[pathlen-1] = 42
            c_path[pathlen] = 42
            c_path[pathlen+1] = 42
            if not c_path:
                raise MemoryError()
            try:
                with nogil:
                    ierr = nc_inq_path(self._grpid, &pathlen, c_path)
                if ierr != NC_NOERR:
                    raise RuntimeError((<char *>nc_strerror(ierr)).decode('ascii'))
                py_path_0 = c_path[pathlen-1] # last character of the path
                py_path_1 = c_path[pathlen]   # byte just past the reported length
                py_path_2 = c_path[pathlen+1] # final sentinel byte
            finally:
                free(c_path)
            return py_path_0, py_path_1, py_path_2

This returns (99, 0, 42), which is 'c', NULL, and the untouched sentinel value, confirming that nc_inq_path writes a terminating NULL one byte past pathlen. I'll submit a PR to increase the size of c_path by one shortly.
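For anyone who wants to check libnetcdf directly, here is a rough plain-C version of the same sentinel experiment (a sketch only; "test.nc" stands in for any existing netCDF file):

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

int main(void)
{
    int ncid;
    size_t pathlen = 0;
    char *buf;

    /* "test.nc" is just a placeholder for an existing netCDF file. */
    if (nc_open("test.nc", NC_NOWRITE, &ncid) != NC_NOERR)
        return 1;
    if (nc_inq_path(ncid, &pathlen, NULL) != NC_NOERR)
        return 1;

    /* Over-allocate by two bytes and plant sentinels so we can see how
     * far past pathlen nc_inq_path() writes. */
    buf = malloc(pathlen + 2);
    if (buf == NULL)
        return 1;
    buf[pathlen - 1] = 42;
    buf[pathlen]     = 42;
    buf[pathlen + 1] = 42;

    if (nc_inq_path(ncid, &pathlen, buf) != NC_NOERR)
        return 1;

    /* Expect: last path character, then a NUL written by nc_inq_path(),
     * then the untouched sentinel (42). */
    printf("%d %d %d\n", buf[pathlen - 1], buf[pathlen], buf[pathlen + 1]);

    free(buf);
    nc_close(ncid);
    return 0;
}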

jjhelmus added a commit to jjhelmus/netcdf4-python that referenced this issue Oct 21, 2016
Increase the buffer used to store the filepath by one character to accommodate
the NULL character copied by the nc_inq_path function.

closes Unidata#585
@duncanwp
Author

Great work tracking it down, @jjhelmus; hopefully this can be merged in time for the next release.

@jswhit
Copy link
Collaborator

jswhit commented Oct 24, 2016

Yes, kudos @jjhelmus! Before I merge, I'd like to hear from the Unidata devs (@WardF and/or @DennisHeimbigner) as to whether nc_inq_path indeed returns the null terminator, and if so, whether this is the expected behavior or a bug.
