Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault when running tests on CircleCI #644

Closed
bouweandela opened this issue May 19, 2020 · 43 comments · Fixed by #1064
Closed

Segmentation fault when running tests on CircleCI #644

bouweandela opened this issue May 19, 2020 · 43 comments · Fixed by #1064
Labels
automated-testing-workshop automated-testing-workshop bug Something isn't working testing

Comments

@bouweandela
Copy link
Member

bouweandela commented May 19, 2020

The unit tests are failing regularly due to a segmentation fault.

If this happens to your pull request, please go to CircleCI by clicking 'Details' behind the CircleCI 'commit' check below your pull request, and clicking on the failed test job. Once you're on CircleCI, click the 'Rerun' button in the top right corner and choose 'Rerun Workflow from Failed'. Usually, the test will succeed the second time, if not, try again. If you do not see the 'Rerun' button, you need to log in to CircleCI with your GitHub account first.

The problem seems to originate from the netcdf4 library.

Example crashed run: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/2482/workflows/f8e73729-c4cf-408c-bdae-beec24238ac1/jobs/10300/steps

Example stack trace:

tests/integration/cmor/_fixes/cmip6/test_cnrm_cm6_1.py ..Fatal Python error: Segmentation fault
 
Thread 0x00007fc098ff9700 (most recent call first):
  File "/opt/conda/envs/esmvaltool/lib/python3.8/multiprocessing/pool.py", line 576 in _handle_results
  File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 870 in run
  File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 890 in _bootstrap
 ... 

Thread 0x00007fc0b97fa700 (most recent call first):
 
  File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/iris/fileformats/netcdf.py", line 391 in __getitem__
  File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/array/core.py", line 98 in getter
  File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/core.py", line 121 in _execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/core.py", line 121 in <genexpr>
  File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/core.py", line 121 in _execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 222 in execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.8/multiprocessing/pool.py", line 125 in worker
  File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 870 in run
  File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 890 in _bootstrap

...

Current thread 0x00007fc0fa7fc700 (most recent call first):
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/iris/fileformats/netcdf.py", line 391 in __getitem__
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/array/core.py", line 98 in getter
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/core.py", line 121 in _execute_task
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/core.py", line 121 in <genexpr>
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/core.py", line 121 in _execute_task
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 222 in execute_task
   File "/opt/conda/envs/esmvaltool/lib/python3.8/multiprocessing/pool.py", line 125 in worker
   File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 870 in run
   File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 932 in _bootstrap_inner
   File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 890 in _bootstrap
  
...

 Thread 0x00007fc19193e740 (most recent call first):
   File "/opt/conda/envs/esmvaltool/lib/python3.8/threading.py", line 302 in wait
   File "/opt/conda/envs/esmvaltool/lib/python3.8/queue.py", line 170 in get
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 133 in queue_get
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 475 in get_async
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/threaded.py", line 76 in get
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 444 in compute
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 226 in _co_realise_lazy_arrays
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 262 in as_concrete_data
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/iris/_data_manager.py", line 216 in data
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/iris/coords.py", line 640 in points
   File "/develop/tests/integration/cmor/_fixes/cmip6/test_cnrm_cm6_1.py", line 134 in test_cl_fix_metadata
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/python.py", line 182 in pytest_pyfunc_call
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/python.py", line 1477 in runtest
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/runner.py", line 217 in <lambda>
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/runner.py", line 244 in from_call
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_runtest_hook
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/runner.py", line 186 in call_and_report
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/runner.py", line 100 in runtestprotocol
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/main.py", line 247 in _main
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/main.py", line 191 in wrap_session
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/_pytest/config/__init__.py", line 124 in main
   File "/develop/.eggs/pytest_runner-5.2-py3.8.egg/ptr.py", line 220 in run_tests
   File "/develop/.eggs/pytest_runner-5.2-py3.8.egg/ptr.py", line 209 in run
   File "/opt/conda/envs/esmvaltool/lib/python3.8/distutils/dist.py", line 985 in run_command
   File "/opt/conda/envs/esmvaltool/lib/python3.8/distutils/dist.py", line 966 in run_commands
   File "/opt/conda/envs/esmvaltool/lib/python3.8/distutils/core.py", line 148 in setup
   File "/opt/conda/envs/esmvaltool/lib/python3.8/site-packages/setuptools/__init__.py", line 144 in setup
   File "setup.py", line 159 in <module>
  
Received "segmentation fault" signal 
@bouweandela
Copy link
Member Author

bouweandela commented May 19, 2020

This is could be because pytest is running the tests in parallel and the netCDF4 library is not thread safe Unidata/netcdf4-python#844, but I'm not sure if pytest actually does this.

@bouweandela
Copy link
Member Author

Actually the threads are created by dask, we're not running pytest in parallel.

@valeriupredoi
Copy link
Contributor

yeah, seen this, albeit it seems to be transitory. I am smelling hdf5 so I pinned it to catch the latest 1.10.6 in #649 - the one I believe the CI is compiling netcdf4 against is 1.10.5 which is more than 7 months old

@valeriupredoi
Copy link
Contributor

gah! just looked at the Artifacts and it is 1.10.6 the CI is installing, bleh

@valeriupredoi

This comment has been minimized.

@valeriupredoi

This comment has been minimized.

@valeriupredoi
Copy link
Contributor

aha! I found the bugger: indeed the final test environment is using hdf5=1.10.5 because the env reverts back to it when installing ncl:

hdf5                          1.10.6-mpi_mpich_ha7d0aea_0 --> 1.10.5-mpi_mpich_ha7d0aea_1004

the problem is pinning it to >=1.10.6 causes conda to blow its pipes and refuse to solve the environment, dumb conda. The good news is I ran multiple test instances with (at least 4, on CI machine) with 1.10.6 and still to see the segfault, so now am pretty sure it's the hdf5 version that's causing it

@valeriupredoi
Copy link
Contributor

conda is a good boi, it's ncl that is not allowing 1.10.6

@bouweandela
Copy link
Member Author

Instead of using very recent versions, maybe we could try using some older versions of hdf5 or netcdf? Because the problem didn't occur earlier?

@valeriupredoi
Copy link
Contributor

OK I had a closer look at this today - some prelim conclusions:

@valeriupredoi
Copy link
Contributor

I also think that the Segfaults we're seeing now are coz of the pytest+pytest-xdist combo since we were not seeing these for Core (where hdf5 was and is indeed at a recent 1.10.6 version)

@bouweandela
Copy link
Member Author

I also think that the Segfaults we're seeing now are coz of the pytest+pytest-xdist combo

Could you attach a stack trace to back that up?

@bouweandela bouweandela mentioned this issue Oct 21, 2020
9 tasks
@stefsmeets stefsmeets changed the title Segmentation fault when running tests Segmentation fault when running tests on CircleCI Dec 3, 2020
@stefsmeets
Copy link
Contributor

Do we have a list of tests that are affected? This issue is somewhat annoying, because it happens all the time.
Can we mark the failing tests as xfail until this is resolved?

tests/integration/cmor/_fixes/cmip6/test_gfdl_cm4.py::test_cl_fix_metadata
tests/integration/cmor/_fixes/test_common.py::test_cl_hybrid_pressure_coord_fix_metadata_with_a
...

@bouweandela
Copy link
Member Author

Do we have a list of tests that are affected?

That would probably be all files that use iris to load a netcdf file

Can we mark the failing tests as xfail until this is resolved?

I'm not sure that would work, because the entire worker running the test crashes. I'm also not convinced it's a good idea because you might miss that you broke a test and merge a pull request that breaks the codebase because all tests pass.

@stefsmeets
Copy link
Contributor

Can we mark the failing tests as xfail until this is resolved?

I'm not sure that would work, because the entire worker running the test crashes. I'm also not convinced it's a good idea because you might miss that you broke a test and merge a pull request that breaks the codebase because all tests pass.

I was thinking they still show up in the summary, but they don't slow down PRs. At the moment, every time I make a commit to a PR I have to go into CircleCI and re-run the failing test (and make sure the fail is because of a segmentation error) and wait another 20 minutes for the result. It is not a good experience.

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Jan 12, 2021

OK this bugs me badly so I had a systematic look at one of the tests failing on a regular basis, namely tests/integration/cmor/_fixes/test_common.py. On my machine I ran sets of 20 repeated tests (2 gworkers per run):

No of fails in 20 runs of the test
run1: 3/20
run2: 5/20
run3: 1/20
run4: 4/20
-----------
avg: 3.25/20 = 16%

with each of the fails counted from the gworker's crash due to Segmentation Fault. So pretty bad on my laptop - a ~20% crash rate is nasty, I noticed worse rates (by eye) on the CI machine.

Now, here's the funny thing - I even found a fix that makes the crash rate drop significantly: encasing the Dataset import in two print statements eg:

print("DATbefore")
from netCDF4 import Dataset
print("DATafter")

makes the thing run with a much lower SegFault rate (I ran the test 160 times and it crashed only once, so a crash rate of <1%);

Another experiment was to run on with a single worker: running with a single gworker (pytest -n 1) reduces the fail rate too to 3/120<3%.

I think there is some time race condition that fails badly when using netCDF4.Dataset - we have not seen these fail anywhere else but the cmor fixes. @bouweandela has been around the block e.g. SciTools/iris#2063 and the related netCDF4-python Issue - so it looks to me as something related to, if not identical. Also hdf5 and netCDF4 are not thread safe. I am not sure if we can ever fix this issue but rather maybe try reduce the segfaulting rate?

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Jan 12, 2021

the only tests that are at risk from this issue are the ones using netCDF4.Dataset:

tests/integration/cmor/_fixes/cmip6/test_cesm2_waccm.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_noresm2_lm.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_cnrm_cm6_1.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_gfdl_esm4.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_gfdl_cm4.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/cmip6/test_cesm2.py:from netCDF4 import Dataset
tests/integration/cmor/_fixes/test_common.py:from netCDF4 import Dataset

I suggest we change these and create a dataset in a different way - iris cube saved to netCDF or simply pass the cube, bypassing the load step. Note that using cdo or such will mean building the dataset with netCDF4.Dataset so we shouldn't do that - what you guys think @jvegasbsc @stefsmeets @bouweandela

@schlunma
Copy link
Contributor

schlunma commented Jan 12, 2021

Would it help to remove the dataset creation from the files and instead store the netcdf files in the repository? Creating the files with iris does not work unfortunately since we need to set invalid attributes.

@stefsmeets
Copy link
Contributor

Would it help to remove the dataset creation the files and instead store the netcdf files in the repository? Creating the files with iris does not work unfortunately since we need to set invalid attributes.

I think this is a good solution. It also simplifies the tests and reduces the load on CircleCI.

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Jan 14, 2021

OK - so here's a new development - the SegFaults are Poisson-distributed - using again test_common.py as benchmark and running equal numbers of runs for: the last test only, the first two tests only, and all of the tests (the whole test script as we have it), I get the following rates using @schlunma 's #938

1 tests (last one): 2/80 fails
2 tests (first two): 3/80 fails
3 tests (all tests): 5/80 fails

yielding the beautiful Poisson distribution equation:

P((a+b)x) = P(ax) + P(bx)

basically we can estimate the SegFault rate for each test that realizes data to 1/80=1.3% (this is of course machine-dependent) and all these happen because of the realization of data at iris-dask interface:

Thread 0x00007fc8880d1740 (most recent call first):
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/threading.py", line 302 in wait
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/queue.py", line 170 in get
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 133 in queue_get
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/local.py", line 475 in get_async
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/threaded.py", line 76 in get
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/dask/base.py", line 567 in compute
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 226 in _co_realise_lazy_arrays
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_lazy_data.py", line 262 in as_concrete_data
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/_data_manager.py", line 216 in data
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/iris/coords.py", line 640 in points
  File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 78 in hybrid_pressure_coord_fix_metadata
  File "/home/valeriu/ESMValCore/tests/integration/cmor/_fixes/test_common.py", line 101 in test_cl_hybrid_pressure_coord_fix_metadata_with_ap
  File "/home/valeriu/miniconda3/envs/esmvaltool/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call

so not using netCDF4.Dataset improves the rate of SegFaults but doesn't eliminate the risk; still, a 10% decrease in rate is pretty good. Shrinking the tests themselves has no effect, one needs to understand the problem that the system encounters between iris and dask, as I yapped above, but beats me if I know how to debug that. I need a break from doing this stuff since I finally hit a deadend - @bjlittle have you guyes seen these sort of SegFaults before?

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Jan 14, 2021

another piece of information we can infer from the study of these tests and their distribution is that we can get a probability of at least one of them happening and screwing up the test run: they are roughly 15 of these tests in the fixes tests so the probability is 1.3 x 15 = 20% so with Manuel's fix we will see an improvement in the number of SegFaults but still, expect 1 in 5 test runs to be plagued by SegFaults 👍

@sadielbartholomew
Copy link

Hi, I'm not involved in the ESMValGroup projects at all but I work on other libraries that use netcdf4-python, namely cf-python and cfdm, and thought it might be relevant that we have also been seeing numerous segmentation faults across our test suite, both locally and on GitHub Actions, since the Autumn, where previously we had never seen any seg faults that weren't isolated and related to another C library we use.

I'm trying to follow the conclusions you have reached here to get a feel for whether we might be encountering the same underlying problem(s). It is notable that these libraries I work on use neither Iris nor Dask (and we test via unittest rather than pytest), but we do use netCDF4. So if we are encountering similar issues, they will most likely be from netCDF4. Indeed from what I think I have deduced with the seg faults we see with cf-python and cfdm that the seg faults happen in certain cases of opening netCDF files via netCDF4.Dataset(filename, "r") (I've put an example traceback collected using the faulthandler library below in a collapsed section in case you'd like to get a feel for what we are seeing on our side).

So that seems in line with the seg fault issues you are encountering with ESMValCore, at least your tentative conclusion:

the only tests that are at risk from this issue are the ones using netCDF4.Dataset:

@valeriupredoi do you think it might be useful to have a discussion about this in case we might both be running into similar issues? We might be able to help each other out. And I'd really love to get to the bottom of the seg faults in cf-python and cfdm, they are a real hindrance to development work!

Our netCDF seg faults: an example from cfdm

Run date: 2021-01-06 18:17:19.499144
Platform: Linux-4.15.0-54-generic-x86_64-with-glibc2.10 
HDF5 library: 1.10.6 
netcdf library: 4.7.4 
Python: 3.8.5 /home/sadie/anaconda3/envs/cf-env/bin/python
netCDF4: 1.5.4 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/netCDF4/__init__.py
numpy: 1.19.4 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/numpy/__init__.py
cfdm.core: 1.8.8.0 /home/sadie/cfdm/cfdm/core/__init__.py
cftime: 1.3.0 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/cftime/__init__.py
netcdf_flattener: 1.2.0 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/netcdf_flattener/__init__.py
cfdm: 1.8.8.0 /home/sadie/cfdm/cfdm/__init__.py

test_read_CDL (__main__.read_writeTest) ... ok
test_read_field (__main__.read_writeTest) ... ok
test_read_mask (__main__.read_writeTest) ... ok
test_read_write_Conventions (__main__.read_writeTest) ... Fatal Python error: Segmentation fault

Current thread 0x00007f4ba80c8740 (most recent call first):
  File "/home/sadie/cfdm/cfdm/data/netcdfarray.py", line 484 in open
  File "/home/sadie/cfdm/cfdm/data/netcdfarray.py", line 133 in __getitem__
  File "/home/sadie/cfdm/cfdm/data/data.py", line 264 in __getitem__
  File "/home/sadie/cfdm/cfdm/data/data.py", line 542 in _item
  File "/home/sadie/cfdm/cfdm/data/data.py", line 2491 in last_element
  File "/home/sadie/cfdm/cfdm/data/data.py", line 455 in __str__
  File "/home/sadie/cfdm/cfdm/data/data.py", line 212 in __repr__
  File "/home/sadie/cfdm/cfdm/read_write/netcdf/netcdfread.py", line 2949 in _create_field
  File "/home/sadie/cfdm/cfdm/read_write/netcdf/netcdfread.py", line 1355 in read
  File "/home/sadie/cfdm/cfdm/decorators.py", line 189 in verbose_override_wrapper
  File "/home/sadie/cfdm/cfdm/read_write/read.py", line 295 in read
  File "test_read_write.py", line 371 in test_read_write_Conventions
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 633 in _callTestMethod
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 676 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 736 in __call__
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 122 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 122 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 84 in __call__
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/runner.py", line 176 in run
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/main.py", line 271 in runTests
  File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/main.py", line 101 in __init__
  File "test_read_write.py", line 479 in <module>
Segmentation fault (core dumped)

@valeriupredoi
Copy link
Contributor

@sadielbartholomew thanks very much for dropping by and telling us about your issues caused by what it looks to be pretty much the same culprit. We just got rid of using netCDF4.Dataset in our tests in this PR #938 and from my local tests I noticed a drop in SegFault rate by 10% which is good coz that's half of what we had, roughly. But the problem is still there. Yes, let's chat about this on a call, cheers for the suggestion! 👍

@sadielbartholomew
Copy link

sadielbartholomew commented Jan 19, 2021

Thanks @valeriupredoi, that's useful to know. I'll arrange a chat with you via the NCAS Slack.

@bouweandela
Copy link
Member Author

@valeriupredoi To try out if setting the number of threads that pytest uses to run the tests to 1 helps, could you open a pull request that replaces pytest -n 2 with pytest everywhere (in .github/workflows/action-test.yml, package/meta.yaml, .circleci/config.yml) and remove the pytest-xdist dependency from setup.py and package/meta.yaml?

@valeriupredoi
Copy link
Contributor

hey guys, I think I have finally tracked a dominant cause of these segfaults - not what it produces the Segmentation Fault, but rather what exacerbates it statistically i.e. what makes the thing segfault more often than other times. Whereas I think the Segfault itself is inside something that calls the C library of either hdf5 or netCDF4, running the tests with xdist makes the thing much worse: I ran sets of 44 runs of test_common.py from the cmor fixes tests, the one that fails very often on CI, and did that in three ways: calling pytest as usual (with -n 2), ran just the function of one of the tests that kept failing (with python like a peasant) and disabling xdist with -n 0 (which runs the test in purely serial mode) and I got these numbers of Segfaults:

regular pytest (-n 2) fails: 7/44 5/44 6/44 -> 18/132
simple run func fails: 0/132
pytest -n 0 fails: 1/44 0/44 1/44 -> 2/132

18/132 = 14% segfaults is perfecly in line with my estimate here which is still OK after we've removed the calls to Dataset building after the PR by @schlunma but as I see it now, it's clear that running these tests without xdist will improve our Segfault rates by a lot. I am going to open a PR with this in 👍

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Mar 30, 2021

I declare SegFaults extinct (about bloody time, that is) 😁 Fix in #1064

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Mar 31, 2021

this is now closed - but, if we do see SegFaults creeping up, we need to see how often they creep and if it's becoming a problem again, we should mark those sequential too, but we should be careful not to take a hit at speed, that's why I left out the many tests in _fixes since we've not seen those segfault often and I'm afraid if I marked those all we'd take quite a bit of a hit at test runtimes. Also note that the conda build tests have not been changed so we might expect segafults there. Cheers for all the comments and suggestions on this one, folks!

@bouweandela
Copy link
Member Author

Re-opening this issue, since I just saw another case of this. One of the tests failed even when running them with pytest -n 1. Running certain tests sequentially certainly reduced the number of times this problem occurs, but it did not solve it.

@bouweandela bouweandela reopened this Jun 14, 2021
@valeriupredoi valeriupredoi added the automated-testing-workshop automated-testing-workshop label Aug 24, 2021
@valeriupredoi
Copy link
Contributor

closing this since we've not seen these in a while (well, apart from when netcdf4 is at 1.6.1 but that's a different kettle of fish 🐟 )

@bouweandela
Copy link
Member Author

Does that mean we can stop running some tests sequentially now @valeriupredoi?

@valeriupredoi
Copy link
Contributor

ooohhh ahhh I would be too chicken to do that now 🐔 Can try with a PR that we let marinate for a bit longer to see if we get any SegFaults?

@schlunma
Copy link
Contributor

You shouldn't have closed this 😂

The recent test run on CircleCI: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/7862/workflows/ceea1cc3-e590-40d5-8402-5b3e5879679f/jobs/36002

tests/integration/cmor/_fixes/test_common.py .FFatal Python error: Segmentation fault

Current thread 0x00007f740168e700 (most recent call first):
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/fileformats/netcdf.py", line 416 in __getitem__
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/array/core.py", line 120 in getter
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/utils.py", line 71 in apply
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 149 in get
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/optimization.py", line 990 in __call__
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in <genexpr>
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in <genexpr>
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/core.py", line 119 in _execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 224 in execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in <listcomp>
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in batch_execute_tasks
  File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 58 in run
  File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 83 in _worker
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f7401e8f700 (most recent call first):
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1438 in current_thread
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/threaded.py", line 23 in _thread_get_id
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 231 in execute_task
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in <listcomp>
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 238 in batch_execute_tasks
  File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 64 in run
  File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 81 in _worker
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f7402790700 (most recent call first):
  File "/opt/conda/envs/esmvaltool/lib/python3.10/concurrent/futures/thread.py", line 81 in _worker
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 953 in run
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f75482c2740 (most recent call first):
  File "/opt/conda/envs/esmvaltool/lib/python3.10/threading.py", line 320 in wait
  File "/opt/conda/envs/esmvaltool/lib/python3.10/queue.py", line 171 in get
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 137 in queue_get
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/local.py", line 500 in get_async
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/threaded.py", line 89 in get
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/dask/base.py", line 600 in compute
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_lazy_data.py", line 215 in _co_realise_lazy_arrays
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_lazy_data.py", line 252 in as_concrete_data
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/_data_manager.py", line 206 in data
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/iris/coords.py", line 1702 in bounds
  File "/root/project/tests/integration/cmor/_fixes/test_common.py", line 140 in hybrid_height_coord_fix_metadata
  File "/root/project/tests/integration/cmor/_fixes/test_common.py", line 170 in test_cl_hybrid_height_coord_fix_metadata
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/python.py", line 1789 in runtest
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 339 in from_call
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 324 in _main
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/opt/conda/envs/esmvaltool/lib/python3.10/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/opt/conda/envs/esmvaltool/bin/pytest", line 10 in <module>

Extension modules: mypy, mypy.api, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, yaml._yaml, cytoolz.utils, cytoolz.itertoolz, cytoolz.functoolz, cytoolz.dicttoolz, cytoolz.recipes, xxhash._xxhash, psutil._psutil_linux, psutil._psutil_posix, markupsafe._speedups, lz4._version, lz4.frame._frame, scipy._lib._ccallback_c, scipy.sparse._sparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.fftpack.convolve, cftime._cftime, cf_units._udunits2, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, numpy.linalg.lapack_lite, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, beta_ufunc, scipy.stats._boost.beta_ufunc, binom_ufunc, scipy.stats._boost.binom_ufunc, nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._hypotests_pythran, scipy.stats._statlib, scipy.stats._mvn, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._unuran.unuran_wrapper, shapely.speedups._speedups, pyproj._compat, pyproj._datadir, pyproj._network, pyproj._geod, pyproj.list, pyproj._crs, pyproj._transformer, pyproj.database, pyproj._sync, cartopy.trace, fiona._err, fiona._geometry, fiona._shim, fiona._env, fiona.schema, fiona.ogrext, fiona._crs, netCDF4._netCDF4, PIL._imaging, _cffi_backend, unicodedata2, stratify._conservative, stratify._vinterp, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, scipy.signal._sigtools, scipy.signal._max_len_seq_inner, scipy.signal._upfirdn_apply, scipy.signal._spline, scipy.signal._sosfilt, scipy.signal._spectral, scipy.signal._peak_finding_utils, shapely.vectorized._vectorized, msgpack._cmsgpack (total: 195)
/bin/bash: line 4:  6828 Segmentation fault      (core dumped) pytest -n 0 -m "sequential" --cov-append --junitxml=report-2.xml

Exited with code exit status 139

@schlunma schlunma reopened this Nov 17, 2022
@valeriupredoi
Copy link
Contributor

Manu, that's netCDF4=1.6.2, also reported by @remi-kazeroni in Tool, we need to pin from !=1.6.1 to restrictive <1.6.1, and that's iris' handling (or mis-handling) of threaded netCDF4 with Dask, for which we have #1727 and ESMValGroup/ESMValTool#2907 which I will edit to warn users not to use any netCDF4>1.6.0. Iris are debating on it in SciTools/iris#5016

These SegFaults reported here were very rare, and we stopped seeing them after netCDF4 ~ 1.5 something. I am not 100% sure they were connected to iris, they remain a mystery. The new ones are more frequent than a new Tory prime minister (this year).

@bouweandela
Copy link
Member Author

bouweandela commented Nov 17, 2022

Can try with a PR that we let marinate for a bit longer to see if we get any SegFaults?

That would be great, it would really speed up the test runs if we can avoid running some tests sequentially.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automated-testing-workshop automated-testing-workshop bug Something isn't working testing
Projects
None yet
5 participants