New netCDF4=1.6.1 (most probably) causing a fairly large number of Segmentation Faults #141
xref: Unidata/netcdf4-python#1192 (comment). @xylar and @hmaarrfk, are you experiencing issues with v1.6.1? I wonder if we should pull the packages.
testing a bit more with 1.6.1/1.6.0 from conda-forge (on an Ubuntu machine), it looks like the installation may be responsible for the odd behaviour we're seeing - conda compilers perhaps (just a flip of the coin)? Unidata/netcdf4-python#1192 (comment)
I'm sorry, I haven't been experiencing issues. Maybe we can run the tests on our conda-forge CIs?
It may be that some of our constraints prevent our upgrade.
I used to have this installed. I updated my dev environment to 1.6.1; I'll see what happens in a few.
I've been using netcdf4 1.6.1 on Linux (both on my Ubuntu laptop and on HPC) without incident so far. I think it might be worth doing some more investigation of the segfaults that @valeriupredoi is seeing, or finding out if others report them, before pulling the package.
I see, it looks like other folks are having trouble and reporting it upstream on Unidata/netcdf4-python#1192?
I'm using Amazon Linux 2 and getting a combination of 'NetCDF: Not a valid ID' errors and segfaults periodically. I'm installing iris-grib via conda-forge, which will pull in iris and then this package. Sample segfault with Python faulthandler:
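For reference, a trace like that comes from enabling `faulthandler` before the crash; a minimal sketch (the file path and loading code here are placeholders, not the reporter's actual setup):

```python
import faulthandler

# Dump a Python traceback to stderr on SIGSEGV, SIGFPE, SIGABRT, etc.
faulthandler.enable()

import iris

cubes = iris.load("sample.nc")  # placeholder path
for cube in cubes:
    cube.data  # realise lazy data; the segfault surfaces here
```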
I have narrowed down the issue: for us, with 1.6.1, segfaults and stray HDF errors at file-close time happen only when dealing with cached netCDF files, see ESMValGroup/ESMValCore#1727 (comment). I ran hundreds of tests simply reading files directly off disk and did not see any issue; we're investigating what the caching has to do with it. @TomDufall, @trexfeathers, and others reporting issues: are you doing simple reads off disk, or using some sort of caching? BTW, the issue from the cached-data test does not hit us with 1.6.0 (tested properly with that too).
@valeriupredoi do you have a self-contained example I can run to reproduce the segfault, to see if a particular code change in netcdf4-python 1.6.1 is the culprit?
We're saving NetCDF files to disk with boto3 and loading them with
@jswhit I do, here - run this as a test with e.g. `pytest`:

```python
import pickle
import platform

import iris
import numpy as np
import pytest

TEST_REVISION = 1


def get_cache_key(value):
    """Get a cache key that is hopefully unique enough for unpickling.

    If this doesn't avoid problems with unpickling the cached data,
    manually clean the pytest cache with the command `pytest --cache-clear`.
    """
    py_version = platform.python_version()
    return (f'{value}_iris-{iris.__version__}_'
            f'numpy-{np.__version__}_python-{py_version}'
            f'rev-{TEST_REVISION}')


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Load representative timeseries data from the pytest cache."""
    cache_key = get_cache_key("sample_data/monthly")
    data = request.config.cache.get(cache_key, None)
    if data is None:
        pytest.skip("pytest cache is empty; populate it first")
    # The cubes are stored in the cache as a latin1-encoded pickle string.
    cubes = pickle.loads(data.encode('latin1'))
    return cubes


def test_io(timeseries_cubes_month):
    cubes = timeseries_cubes_month
    for i, c in enumerate(cubes):
        try:
            c.data  # realising the lazy data is where the segfault happens
        except RuntimeError as exc:
            print("Crap!", i, c, exc)
            raise
```

You will need the sample data we are using from here: https://github.com/ESMValGroup/ESMValCore/tree/main/tests/sample_data/multimodel_statistics - that's being loaded from cache. Let me know if you have issues running the mini test! 🍺
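For anyone trying to reproduce this from scratch: the fixture above assumes an earlier run has already pickled the cubes into the pytest cache. A minimal sketch of a variant that populates the cache on first use (the sample-data path is a hypothetical stand-in, and `get_cache_key()` is reused from the snippet above):

```python
import pickle

import iris
import pytest


@pytest.fixture(scope="module")
def timeseries_cubes_month(request):
    """Return sample cubes, pickling them into the pytest cache on first use."""
    cache_key = get_cache_key("sample_data/monthly")  # from the snippet above
    data = request.config.cache.get(cache_key, None)
    if data is None:
        # Hypothetical path: point this at the sample data linked above.
        cubes = list(iris.load("tests/sample_data/multimodel_statistics/*.nc"))
        # pytest's cache only stores JSON, hence the bytes -> latin1 round-trip
        # (latin1 maps every byte value, so the pickle survives intact).
        request.config.cache.set(cache_key, pickle.dumps(cubes).decode('latin1'))
    else:
        cubes = pickle.loads(data.encode('latin1'))
    return cubes
```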
OK guys, we're zeroing in on our test failures (related, closely or distantly, to 1.6.1) - it turns out that tests pass fine if one clears the pytest cache before running, see ESMValGroup/ESMValCore#1727 bottom(ish) comments. I am also running a monitoring PR where I run our tests every hour to determine if there are any other failures from other tests: ESMValGroup/ESMValCore#1730
It's not as compact as I'd like, but here's an example that reliably produces an invalid-ID error or a segfault for me after a few iterations. The file is nothing special, but I can supply it if needed/find a publicly-accessible example file tomorrow. Removing one of the max diff/value range calls significantly reduces the failure rate. Normally, this code would have two different cubes - one NetCDF and one GRIB - but I cut down the code to keep it minimal.
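A rough sketch of the shape of the repro being described (the file name, iteration count, and exact reductions are guesses, not the original code):

```python
import dask.array as da
import iris

FILENAME = "example.nc"  # placeholder; any reasonably sized NetCDF file

for i in range(100):
    cube = iris.load_cube(FILENAME)
    lazy = cube.core_data()
    # Two lazy reductions computed together: with dask's default threaded
    # scheduler these enter the netCDF C library from several threads.
    vmax, vmin = da.compute(lazy.max(), lazy.min())
    print(i, "value range:", vmax - vmin)
```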
I'm somewhat worried that you are using multi-threading with dask and this file. I believe some parts of the netCDF library are serial; that could be the issue.
the fact that all C lib calls now release the GIL - including
netCDF is not threadsafe (IIRC even for reading).
Using the example from @TomDufall (with my own test data) I am able to reproduce the segfault with version 1.6.1. No segfault with 1.6.0. I suspect it has to do with the release of the GIL introduced in 1.6.1 and the non-threadsafe-ness of netcdf-c. Clearly some netcdf-c calls should not be called without the GIL. Question now is: which ones? UPDATE: keeping the GIL on
IMO, the GIL is for locking the interpreter. That it solves issues with netCDF in a multi-threaded context is incidental. It's on users to manage locking for non-threadsafe resources like netCDF.
Weirdly, on my test, setting the dask scheduler to threaded, even with a high or low worker count (e.g.

Please could someone elaborate on the scenarios in which netCDF isn't thread-safe? I only use it through Iris, so I've come up with a few potentially problematic access patterns - I hope they're clear enough (see the sketch below). In these examples, the cube is using lazy data, so Iris will attempt to use netCDF to load the data only when the access attempt is made. I assume 1 is unsafe and 3 is safe? Is that correct, and what about case 2?
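To make the question concrete, here is a hypothetical sketch of the two extremes (my illustration, not the original enumerated list): concurrent reads of the same dataset from several threads versus reads serialised behind a single lock:

```python
import threading

import netCDF4

ds = netCDF4.Dataset("example.nc")  # placeholder file
var = ds.variables["tas"]           # placeholder variable name

# Risky: several threads read the same Dataset concurrently. With 1.6.1
# the GIL is released around the C calls, so the (non-threadsafe)
# netCDF C library really is entered from multiple threads at once.
risky = [threading.Thread(target=lambda: var[:]) for _ in range(4)]

# Safe: a single lock ensures only one thread is ever inside libnetcdf.
lock = threading.Lock()

def read():
    with lock:
        return var[:]

safe = [threading.Thread(target=read) for _ in range(4)]
for t in safe:
    t.start()
for t in safe:
    t.join()
# Starting the `risky` threads instead is the kind of pattern that
# reproduces the segfaults discussed in this thread.
```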
Sample file from earlier now attached if anyone still wants it (had to zip to upload) |
@TomDufall I'd be curious to know if Unidata/netcdf4-python#1193 stops the segfaults in all the scenarios you mention.
I believe even read operations have the opportunity to modify in-memory netCDF data structures, so strictly speaking, none of those 3 are safe IIRC. @WardF can you shed any light on what operations (if any) are threadsafe for netCDF?
Certainly. So, to answer the question above: netCDF is not thread safe, full stop, unless thread safety is handled by the software/program making use of it.
And to be clear, there is still value in having the GIL released during API calls without the need for netCDF to be threadsafe.
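To illustrate that point (my sketch, assuming an application-level lock around all netCDF calls): while one thread is blocked inside a long C call, the released GIL lets unrelated pure-Python threads keep running:

```python
import threading

import netCDF4

nc_lock = threading.Lock()      # application-level lock guarding libnetcdf
ds = netCDF4.Dataset("big.nc")  # placeholder file

def read_variable(name):
    # Only one thread at a time may enter the netCDF C library...
    with nc_lock:
        return ds.variables[name][:]

def other_work():
    # ...but because 1.6.1 releases the GIL around the C call, this
    # pure-Python thread makes progress while read_variable() is
    # blocked inside libnetcdf.
    return sum(range(10_000_000))

t1 = threading.Thread(target=read_variable, args=("tas",))  # "tas": placeholder name
t2 = threading.Thread(target=other_work)
t1.start(); t2.start()
t1.join(); t2.join()
```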
I believe Ward is correct and that there is no scenario
One of the places we are seeing errors is running Python scripts in Read the Docs builds (e.g. |
@trexfeathers are the Read the Docs builds using threads (and reading the same file from different threads)? I ask because the working hypothesis now is that all of the segfaults people are seeing with 1.6.1 are related to the fact that netcdf-c is not thread safe, and that releasing the GIL when calling the C lib in 1.6.1 has increased the possibility that segfaults will occur when reading from or writing to the same file.
Honestly, I don't know. This isn't information that is published in RTD's docs, and the question is quite difficult to search for, since so much documentation uses RTD. Sorry, I know this isn't helpful. This topic is beyond my expertise.
Folks, after some investigation this is not a problem with the conda-forge netcdf4 package. Maybe not even a problem upstream. For those using dask, you can work around it by forcing a single-threaded scheduler:

```python
import dask

dask.config.set(scheduler="single-threaded")
```

I'm closing this for now.
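If turning off threading globally is too blunt, the same setting can be scoped to a single computation with dask's context-manager form:

```python
import dask
import dask.array as da

arr = da.ones((1000, 1000), chunks=(100, 100))

# Only this block runs single-threaded; elsewhere dask keeps its
# default threaded scheduler.
with dask.config.set(scheduler="single-threaded"):
    total = arr.sum().compute()
```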
cheers @ocefpaf - am keeping track of this at our end, so are the iris folk 👍 |
Heads up guys, we are seeing some very frequent segfaults in our CI when we have the new, hours-old `netCDF4=1.6.1` in our environment. It's most probably due to it, since HDF5 has been at 1.12.2 for a while now - more than a month - and with `netCDF4=1.6.0` all works fine (other packages staying at the same version and hash point). Apologies if this proves to be due to a different package, but better safe than sorry in terms of a forewarning. I can test more if you guys need me to be a third-party testbed. Cheers muchly 🍺