New netCDF4=1.6.1 (most probably) causing a fairly large number of Segmentation Faults #141

Closed
valeriupredoi opened this issue Sep 15, 2022 · 30 comments

Comments

@valeriupredoi

Heads up guys, we are seeing some very frequent segfaults in our CI when we have the new, hours-old netCDF4=1.6.1 in our environment. It's most probably due to it, since HDF5 has been at 1.12.2 for a while now - more than a month - and with netCDF4=1.6.0 all works fine (with the other packages staying at the same version and hash point). Apologies if this turns out to be due to a different package, but better safe than sorry in terms of a forewarning. I can test more if you guys need me to be a third-party testbed. Cheers muchly 🍺

@ocefpaf
Member

ocefpaf commented Sep 20, 2022

xref: Unidata/netcdf4-python#1192 (comment)

@xylar and @hmaarrfk are you experiencing issues with v1.6.1? I wonder if we should pull the packages.

@valeriupredoi
Author

Testing a bit more with 1.6.1/1.6.0 from conda-forge (on an Ubuntu machine), it looks like the installation may be responsible for the odd behaviour we're seeing - conda compilers perhaps (just a flip of the coin)? Unidata/netcdf4-python#1192 (comment)

@hmaarrfk
Contributor

I'm sorry, I haven't been experiencing issues. Maybe we can run the tests on our conda-forge CIs?

@hmaarrfk
Contributor

It may be that some of our constraints prevent the upgrade.

@hmaarrfk
Contributor

netcdf4                   1.6.0           nompi_py39h6ced12a_102    conda-forge

I used to have this installed. I updated my dev environment to 1.6.1; I'll see what happens in a bit.

@xylar
Contributor

xylar commented Sep 21, 2022

I've been using netcdf4 1.6.1 on Linux (both on my Ubuntu laptop and on HPC) without incident so far. I think it might be worth doing some more investigation of the segfaults that @valeriupredoi is seeing, or finding out if others report them, before pulling the package.

@xylar
Contributor

xylar commented Sep 21, 2022

I see, it looks like other folks are having trouble and reporting it upstream on Unidata/netcdf4-python#1192?

@TomDufall

I'm using Amazon Linux 2 and getting a combination of 'NetCDF: Not a valid ID' errors and segfaults periodically. I'm installing iris-grib via conda-forge, which will pull in iris and then this package.
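(The traceback below was captured with Python's built-in faulthandler; a minimal sketch of enabling it, for anyone who wants a similar report from their own script:)

```python
# Enable the standard-library fault handler so that a segfault inside an
# extension module still dumps Python-level tracebacks for all threads.
import faulthandler

faulthandler.enable()
# ... run the netCDF/iris workload as usual ...
```

The same can be done without code changes by running `python -X faulthandler script.py`.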

Sample segfault with Python faulthandler:

```
Fatal Python error: Segmentation fault

Current thread 0x00007febd77fe700 (most recent call first):
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/fileformats/netcdf.py", line 477 in __getitem__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/array/core.py", line 122 in getter
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/utils.py", line 71 in apply
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 149 in get
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/optimization.py", line 990 in __call__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in <genexpr>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/core.py", line 119 in _execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/local.py", line 224 in execute_task
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/local.py", line 238 in <listcomp>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/dask/local.py", line 238 in batch_execute_tasks
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/concurrent/futures/thread.py", line 57 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/concurrent/futures/thread.py", line 80 in _worker
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 870 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007febef7bd700 (most recent call first):
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/concurrent/futures/thread.py", line 78 in _worker
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 870 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fec14037740 (most recent call first):
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/common/mixin.py", line 127 in name
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2424 in <lambda>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2422 in dim_coords
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 1790 in coords
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 102 in _summary_coord_extra
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 146 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 204 in <listcomp>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 203 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 344 in add_vector_section
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_summary.py", line 348 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/_representation/cube_printout.py", line 165 in __init__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2483 in summary
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/iris/cube.py", line 2491 in __repr__
  File "<string>", line 3 in __repr__
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/dataclasses.py", line 368 in wrapper
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 373 in getMessage
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 668 in format
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 929 in format
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1085 in emit
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1187 in emit
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 954 in handle
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1661 in callHandlers
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1599 in handle
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1589 in _log
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/logging/__init__.py", line 1475 in error
  File "/home/ec2-user/environment/<internal package - path removed>", line 57 in compare_cubes
  File "/home/ec2-user/environment/<internal package - path removed>", line 178 in <listcomp>
  File "/home/ec2-user/environment/<internal package - path removed>", line 178 in _main
  File "/home/ec2-user/environment/<internal package - path removed>", line 252 in main
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 610 in invoke
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 1066 in invoke
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 782 in main
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/site-packages/click/core.py", line 829 in __call__
  File "/home/ec2-user/environment/<internal package - path removed>", line 277 in <module>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/bdb.py", line 580 in run
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/pdb.py", line 1547 in _runmodule
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/pdb.py", line 1703 in main
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/pdb.py", line 1732 in <module>
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/runpy.py", line 87 in _run_code
  File "/home/ec2-user/miniconda/envs/val4/lib/python3.8/runpy.py", line 194 in _run_module_as_main
Segmentation fault
```

@valeriupredoi
Author

I have narrowed down the issue to segfaults and stray HDF errors at file close, happening - for us, with 1.6.1 - only when dealing with cached netCDF files, see ESMValGroup/ESMValCore#1727 (comment). I ran hundreds of tests simply reading files directly off disk and did not see any issue; we're investigating what that caching has to do with the problem. @TomDufall @trexfeathers and others reporting issues: are you doing simple reads off disk, or using some sort of caching? BTW, the issue from the cached-data test does not hit us with 1.6.0 (tested properly with that too).

@jswhit

jswhit commented Sep 21, 2022

@valeriupredoi do you have a self-contained example I can run to reproduce the segfault to see if a particular code change in netcdf4-python 1.6.1 is the culprit?

@TomDufall

We're saving NetCDF files to disk with boto3 and loading them with iris.load_raw([path.as_uri() for path in filepaths]). The only caching we're explicitly doing is reusing files if they're already downloaded. I'm not sure if Iris does any caching under the hood when loading files.
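Roughly, the pattern is the following - just a sketch, with made-up bucket, keys and paths (the real code lives in an internal package):

```python
# Sketch of the download-then-load pattern described above (hypothetical
# bucket/keys/paths): files already on disk are reused, everything else is
# fetched from S3, then all files are handed to iris as file:// URIs.
from pathlib import Path

import boto3
import iris

s3 = boto3.client("s3")
bucket = "example-bucket"
filepaths = [Path("/tmp/nc_cache/forecast_000.nc"),
             Path("/tmp/nc_cache/forecast_001.nc")]

for path in filepaths:
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():                  # the only explicit caching: reuse downloads
        s3.download_file(bucket, path.name, str(path))

cubes = iris.load_raw([path.as_uri() for path in filepaths])
```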

@valeriupredoi
Author

valeriupredoi commented Sep 22, 2022

@jswhit I do, here - run this as a test with e.g. pytest -n 2 test_this.py:

```python
import iris
import numpy as np
import pickle
import platform
import pytest

TEST_REVISION = 1

def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data."""
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    cubes = pickle.loads(data.encode('latin1'))

    return cubes


def test_io(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    for i, c in enumerate(cubes):
        try:
            c.data
        except RuntimeError as exc:
            print("Crap!", i, c, exc)
            raise
```

You will need the sample data we are using, from https://github.com/ESMValGroup/ESMValCore/tree/main/tests/sample_data/multimodel_statistics - that's what's being loaded from the cache. Let me know if you have issues running the mini test! 🍺

@valeriupredoi
Author

OK guys, we're zeroing in on our test failures (related, closely or distantly, to 1.6.1) - it turns out that the tests pass fine if one clears the pytest cache before running, see ESMValGroup/ESMValCore#1727, bottom(ish) comments. I am also running a monitoring PR where I run our tests every hour to determine if there are any other failures from other tests: ESMValGroup/ESMValCore#1730

@TomDufall

TomDufall commented Sep 22, 2022

It's not as compact as I'd like, but here's an example that reliably produces an invalid ID error or segfault for me after a few iterations. The file is nothing special, but I can supply it if needed / find a publicly accessible example file tomorrow. Removing one of the max diff / value range calls significantly reduces the failure rate. Normally this code would compare two different cubes - one NetCDF and one GRIB - but I cut the code down to keep it minimal.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple

from dask import array, compute
from dask.array.core import Array
import iris
from iris.cube import Cube

iris.FUTURE.datum_support = True  # silence annoying warning

filepath = Path("ukv_20220818T0300Z-PT0000H00M-temperature_at_surface.nc").resolve()

def max_diff(
    lazy_data_a: Array, lazy_data_b: Array) -> Array:
    """
    Calculate the maximum absolute difference between the data values of two
    lazy arrays. Nan values are ignored.
    """
    diff_array = lazy_data_a - lazy_data_b
    absolute_differences = abs(diff_array)
    return array.nanmax(absolute_differences)


def value_range(lazy_data: Array) -> Array:
    """
    Calculate the range of an array. Specifically, the difference between
    the maximum and minimum values within the array. Nan values are ignored.
    """
    return array.nanmax(lazy_data) - array.nanmin(lazy_data)

def compare_cubes(cube_a: Cube, cube_b: Cube) -> Tuple:
    """
    Compare the data values of two cubes. Return a tuple containing the
    maximum difference between the two cubes, the value range of cube_a
    and the value range of cube_b. The cubes must have the same shape.
    If not, then an error is raised.
    """
    lazy_a = cube_a.lazy_data()
    lazy_b = cube_b.lazy_data()
    mod = cube_a.units.modulus
    return compute(
        max_diff(lazy_a, lazy_b),
        value_range(lazy_a),
        value_range(lazy_b),
    )

def compare():
    cube_a = iris.load(filepath)[0]
    cube_b = iris.load(filepath)[0]
    compare_cubes(cube_a, cube_b)

i = 0
while True:
    i += 1
    print(i)
    compare()
```

@hmaarrfk
Contributor

I'm somewhat worried that you are using multi-threading with dask on this file.

I think some parts of the netCDF library are serial; that could be the issue.

@jswhit

jswhit commented Sep 22, 2022

The fact that all C lib calls now release the GIL - including close - in version 1.6.1 might also be the issue. @TomDufall if you can post your sample file, I will try with and without releasing the GIL for Dataset.close().

@dopplershift
Member

netCDF is not threadsafe (IIRC even for reading).

@jswhit

jswhit commented Sep 22, 2022

Using the example from @TomDufall (with my own test data) I am able to reproduce the segfault with version 1.6.1. No segfault with 1.6.0. I suspect it has to do with the releasing of the GIL introduced in 1.6.1 and the non-thread-safety of netcdf-c.

Clearly some netcdf-c calls should not be called without the GIL. Question now is: which ones?

UPDATE: keeping the GIL on nc_open (i.e. for the C-lib calls in Dataset.__init__) fixes the segfaults for @TomDufall's test.

@dopplershift
Member

IMO, the GIL is for locking the interpreter. That it solves issues with netcdf in a multi-threaded context is incidental. It's on users to manage locking for non-threadsafe resources like netCDF.
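For what it's worth, a minimal sketch of what user-managed locking could look like at the netCDF4 level (file and variable names are hypothetical):

```python
# Minimal sketch of user-side locking: every call into libnetcdf happens behind
# one lock, so even with the GIL released inside the C calls only one thread is
# ever inside the non-thread-safe library.
import threading

import netCDF4

nc_lock = threading.Lock()

def read_variable(path, varname):
    with nc_lock:
        with netCDF4.Dataset(path) as ds:
            return ds.variables[varname][:]
```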

@TomDufall

Weirdly, in my test, setting the dask scheduler to threaded, even with a high or low worker count (e.g. scheduler="threads", workers=20 or 2 on my 2-core instance), stops the errors. I've tried setting compute to use single-threaded, threads, and processes (the only options I'm aware of) and all three work - I'm not sure what default it's using that is so much more error-prone.

Please could someone elaborate on the scenarios in which netCDF isn't thread safe? I only use it through Iris, so I've come up with a few potentially problematic access patterns - I hope they're clear enough (a rough netCDF4-level sketch of case 1 follows the list). In these examples the cube is using lazy data, so Iris will only call into netCDF to load the data when the access is made. I assume 1 is unsafe and 3 is safe? Is that correct, and what about case 2?

  1. Construct a cube from a file. Access the same cube from two threads at once.
  2. Construct two cubes from the same file. Access both cubes from two threads at once.
  3. Construct two cubes from two different files. Access both cubes from two threads at once.
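Here is the rough netCDF4-level sketch of case 1 (file and variable names are made up); as far as I understand it, this is exactly the pattern that becomes risky once the library releases the GIL around its C calls:

```python
# Case 1: one shared Dataset handle read from two threads with no locking.
# Since netcdf-c is not thread safe, this can corrupt internal state or segfault.
import threading

import netCDF4

ds = netCDF4.Dataset("sample.nc")        # single shared handle

def read_slice():
    _ = ds.variables["temperature"][:]   # concurrent reads of the same handle

threads = [threading.Thread(target=read_slice) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
ds.close()
```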

Sample file from earlier now attached if anyone still wants it (had to zip to upload)
ukv_20220818T0300Z-PT0000H00M-temperature_at_surface.zip

@jswhit

jswhit commented Sep 22, 2022

@TomDufall I'd be curious to know if Unidata/netcdf4-python#1193 stops the segfaults in all the scenarios you mention.

@dopplershift
Member

I believe even read operations have the opportunity to modify in-memory netCDF data structures, so strictly speaking, none of those 3 are safe IIRC. @WardF can you shed any light on what operations (if any) are threadsafe for netCDF?

@WardF

WardF commented Sep 22, 2022

Certainly; libnetcdf is not thread safe. Because libhdf5 is historically not thread safe, it would have been a very heavy lift for us to wrap our operations around it in a thread safe way. As it stands, there is an experimental branch @DennisHeimbigner is working on that is looking to offer thread safety, but it is not going to be merged in any time soon.

So, to answer the question above, netCDF is not thread safe, full stop, unless thread safety is handled by the software/program making use of libnetcdf.

@dopplershift
Member

And to be clear, there is still value in having the GIL released during API calls without the need for netCDF to be threadsafe.

@DennisHeimbigner

I believe Ward is correct and that there is no scenario in which accessing netcdf-c with multiple threads is safe. Even just reading data can cause problems because of the use of internal cache data structures that are potentially modified when doing simple data reads.

@trexfeathers

> I have narrowed down the issue to segfaults and stray HDF errors at file close, happening - for us, with 1.6.1 - only when dealing with cached netCDF files, see ESMValGroup/ESMValCore#1727 (comment). I ran hundreds of tests simply reading files directly off disk and did not see any issue; we're investigating what that caching has to do with the problem. @TomDufall @trexfeathers and others reporting issues: are you doing simple reads off disk, or using some sort of caching? BTW, the issue from the cached-data test does not hit us with 1.6.0 (tested properly with that too).

One of the places we are seeing errors is running Python scripts in Read the Docs builds (e.g. 18171552), which include a loading step. I am almost certain these files are being loaded straight from disk. (I assume that is what @valeriupredoi means by "caching", rather than anything to do with NetCDF's chunk_cache).

@jswhit

jswhit commented Sep 26, 2022

@trexfeathers are the Read the Docs builds using threads? (And reading the same file from different threads?) I ask because the working hypothesis now is that all of the segfaults people are seeing with 1.6.1 are related to the fact that netcdf-c is not thread safe, and that releasing the GIL when calling the C lib in 1.6.1 has increased the possibility that segfaults will occur when reading from or writing to the same netCDF4.Dataset object from different threads.

@trexfeathers

> @trexfeathers are the Read the Docs builds using threads?

Honestly, I don't know. This isn't information that is published in RTD's docs, and the question is quite difficult to research since so much documentation uses RTD.

Sorry, I know this isn't helpful. This topic is beyond my expertise.

@xylar xylar mentioned this issue Sep 30, 2022
@ocefpaf
Member

ocefpaf commented Oct 4, 2022

Folks, after some investigation, this is not a problem with the conda-forge netcdf4 package - maybe not even a problem upstream. For those using iris, you can work around it with:

```python
import dask
dask.config.set(scheduler="single-threaded")
```
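If setting the scheduler globally is too heavy-handed, the same workaround can, as far as I know, also be scoped with dask's config context manager (file name below is hypothetical):

```python
import dask
import iris

# Limit the single-threaded scheduler to one block instead of the whole process.
with dask.config.set(scheduler="single-threaded"):
    cube = iris.load("sample.nc")[0]
    data = cube.data   # realises the lazy data without dask's threaded scheduler
```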

I'm closing this for now.

@ocefpaf ocefpaf closed this as completed Oct 4, 2022
@valeriupredoi
Author

Cheers @ocefpaf - I'm keeping track of this at our end, and so are the iris folk 👍
