Segmentation fault when running tests on CircleCI #644
Comments
This could be because
Actually, the threads are created by dask; we're not running pytest in parallel.
yeah, I've seen this, although it seems to be transitory. I am smelling
gah! I just looked at the Artifacts and it is
aha! I found the bugger: indeed, the final test environment is using hdf5 1.10.6-mpi_mpich_ha7d0aea_0 --> 1.10.5-mpi_mpich_ha7d0aea_1004; the problem is pinning it to
conda is a good boi, it's
Instead of using very recent versions, maybe we could try some older versions of hdf5 or netcdf, since the problem didn't occur earlier?
OK I had a closer look at this today - some prelim conclusions:
I also think that the SegFaults we're seeing now are because of the pytest+pytest-xdist combo, since we were not seeing these for Core (where hdf5 was and is indeed at a recent 1.10.6 version)
Could you attach a stack trace to back that up?
Do we have a list of tests that are affected? This issue is somewhat annoying, because it happens all the time.
That would probably be all files that use iris to load a netcdf file
I'm not sure that would work, because the entire worker running the test crashes. I'm also not convinced it's a good idea: you might miss that you broke a test and merge a pull request that breaks the codebase, because all tests pass.
I was thinking they would still show up in the summary, but they wouldn't slow down PRs. At the moment, every time I make a commit to a PR I have to go into CircleCI, re-run the failing test (and make sure the failure is due to a segmentation fault) and wait another 20 minutes for the result. It is not a good experience.
OK this bugs me badly so I had a systematic look at one of the tests failing on a regular basis, namely
with each of the fails counted from the gworker crashing due to a Segmentation Fault. So, pretty bad on my laptop - a ~20% crash rate is nasty, and I noticed worse rates (by eye) on the CI machine. Now, here's the funny thing - I even found a fix that makes the crash rate drop significantly: encasing the
makes the thing run with a much lower SegFault rate (I ran the test 160 times and it crashed only once, so a crash rate of <1%). Another experiment was to run with a single worker: running with a single gworker (I think there is some race condition that fails badly when using
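A minimal sketch of the kind of workaround described above, assuming the fix was to serialize the data-realization step behind a lock so that only one thread at a time goes through the netCDF4/HDF5 C layer; the lock, the function name and the `cube.data` call are illustrative, not the actual patch:

```python
# Illustrative only: serialize realization of lazy (dask) data behind a lock,
# on the assumption that the SegFault comes from several threads hitting the
# netCDF4/HDF5 C libraries at once.
import threading

_NETCDF_LOCK = threading.Lock()  # hypothetical module-level lock


def realise_data(cube):
    """Force an iris cube's lazy data while holding the lock."""
    with _NETCDF_LOCK:
        return cube.data  # triggers the dask compute -> netCDF4/HDF5 read
```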
the only tests that are at risk from this issue are the ones using
I suggest we change these and create the dataset in a different way - an iris cube saved to netCDF, or simply pass the cube, bypassing the load step. Note that using
Would it help to remove the dataset creation from the files and instead store the netcdf files in the repository? Creating the files with
I think this is a good solution. It also simplifies the tests and reduces the load on CircleCI.
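As a rough sketch of what "simply pass the cube" could look like in practice, assuming a pytest fixture that builds a small iris cube in memory so the test never touches iris.load or a netCDF file at all (all names here are illustrative):

```python
# Illustrative sketch: build a tiny cube in memory instead of writing a netCDF
# file to disk and loading it back with iris (the step where the SegFaults hit).
import numpy as np
import pytest
from iris.coords import DimCoord
from iris.cube import Cube


@pytest.fixture
def small_cube():
    """A minimal 2x2 air_temperature cube, created without any file I/O."""
    lat = DimCoord(np.array([0.0, 1.0]), standard_name="latitude", units="degrees")
    lon = DimCoord(np.array([0.0, 1.0]), standard_name="longitude", units="degrees")
    return Cube(
        np.zeros((2, 2), dtype=np.float32),
        standard_name="air_temperature",
        units="K",
        dim_coords_and_dims=[(lat, 0), (lon, 1)],
    )


def test_fix_metadata(small_cube):
    # The code under test receives the cube directly, bypassing the load step.
    assert small_cube.shape == (2, 2)
```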
OK - so here's a new development - the SegFaults are Poisson-distributed - using again
yielding the beautiful Poisson distribution equation:
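For reference, the standard form of the Poisson probability mass function, presumably the one meant here, with λ the expected number of SegFaults per batch of test runs and k the number actually observed:

$$P(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$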
basically, we can estimate the SegFault rate for each test that realizes data at 1/80 = 1.3% (this is of course machine-dependent), and all of these happen because of the realization of data at the iris-dask interface:
so not using
another piece of information we can infer from the study of these tests and their distribution is the probability of at least one of them happening and screwing up the test run: there are roughly 15 of these tests in the fixes tests, so the probability is 1.3% x 15 = 20%. So with Manuel's fix we will see an improvement in the number of SegFaults, but still expect 1 in 5 test runs to be plagued by SegFaults 👍
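For what it's worth, 1.3% x 15 is the linear (rare-event) approximation; treating the ~15 affected tests as independent, the exact "at least one SegFault per run" probability comes out essentially the same:

$$P(\geq 1) = 1 - (1 - 0.013)^{15} \approx 0.18$$

which is consistent with the roughly-1-in-5 estimate above.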
Hi, I'm not involved in the ESMValGroup projects at all, but I work on other libraries that use netcdf4-python, namely cf-python and cfdm, and thought it might be relevant that we have also been seeing numerous segmentation faults across our test suites, both locally and on GitHub Actions, since the autumn, where previously we had never seen any seg faults that weren't isolated and related to another C library we use. I'm trying to follow the conclusions you have reached here to get a feel for whether we might be encountering the same underlying problem(s). It is notable that these libraries I work on use neither Iris nor Dask (and we test via

So that seems in line with the seg fault issues you are encountering with
@valeriupredoi do you think it might be useful to have a discussion about this in case we might both be running into similar issues? We might be able to help each other out. And I'd really love to get to the bottom of the seg faults in cf-python and cfdm, they are a real hindrance to development work!

Our netCDF seg faults: an example from cfdm

Run date: 2021-01-06 18:17:19.499144
Platform: Linux-4.15.0-54-generic-x86_64-with-glibc2.10
HDF5 library: 1.10.6
netcdf library: 4.7.4
Python: 3.8.5 /home/sadie/anaconda3/envs/cf-env/bin/python
netCDF4: 1.5.4 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/netCDF4/__init__.py
numpy: 1.19.4 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/numpy/__init__.py
cfdm.core: 1.8.8.0 /home/sadie/cfdm/cfdm/core/__init__.py
cftime: 1.3.0 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/cftime/__init__.py
netcdf_flattener: 1.2.0 /home/sadie/anaconda3/envs/cf-env/lib/python3.8/site-packages/netcdf_flattener/__init__.py
cfdm: 1.8.8.0 /home/sadie/cfdm/cfdm/__init__.py
test_read_CDL (__main__.read_writeTest) ... ok
test_read_field (__main__.read_writeTest) ... ok
test_read_mask (__main__.read_writeTest) ... ok
test_read_write_Conventions (__main__.read_writeTest) ... Fatal Python error: Segmentation fault
Current thread 0x00007f4ba80c8740 (most recent call first):
File "/home/sadie/cfdm/cfdm/data/netcdfarray.py", line 484 in open
File "/home/sadie/cfdm/cfdm/data/netcdfarray.py", line 133 in __getitem__
File "/home/sadie/cfdm/cfdm/data/data.py", line 264 in __getitem__
File "/home/sadie/cfdm/cfdm/data/data.py", line 542 in _item
File "/home/sadie/cfdm/cfdm/data/data.py", line 2491 in last_element
File "/home/sadie/cfdm/cfdm/data/data.py", line 455 in __str__
File "/home/sadie/cfdm/cfdm/data/data.py", line 212 in __repr__
File "/home/sadie/cfdm/cfdm/read_write/netcdf/netcdfread.py", line 2949 in _create_field
File "/home/sadie/cfdm/cfdm/read_write/netcdf/netcdfread.py", line 1355 in read
File "/home/sadie/cfdm/cfdm/decorators.py", line 189 in verbose_override_wrapper
File "/home/sadie/cfdm/cfdm/read_write/read.py", line 295 in read
File "test_read_write.py", line 371 in test_read_write_Conventions
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 633 in _callTestMethod
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 676 in run
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/case.py", line 736 in __call__
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 122 in run
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 84 in __call__
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 122 in run
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/suite.py", line 84 in __call__
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/runner.py", line 176 in run
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/main.py", line 271 in runTests
File "/home/sadie/anaconda3/envs/cf-env/lib/python3.8/unittest/main.py", line 101 in __init__
File "test_read_write.py", line 479 in <module>
Segmentation fault (core dumped)
@sadielbartholomew thanks very much for dropping by and telling us about your issues, caused by what looks to be pretty much the same culprit. We just got rid of using
Thanks @valeriupredoi, that's useful to know. I'll arrange a chat with you via the NCAS Slack.
@valeriupredoi To find out whether setting the number of threads that pytest uses to run the tests to 1 helps, could you open a pull request that replaces
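For reference, a minimal sketch of what disabling pytest-xdist parallelism could look like, assuming pytest-xdist is installed and that the worker count is the knob being changed; the "-n 0" choice and the tests/ path are illustrative, not necessarily what the actual PR does:

```python
# Illustrative only: run the whole suite in a single process by telling
# pytest-xdist to spawn zero workers; equivalent to "pytest -n 0 tests/".
import pytest

if __name__ == "__main__":
    raise SystemExit(pytest.main(["-n", "0", "tests/"]))
```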
hey guys, I think I have finally tracked down a dominant cause of these segfaults - not what produces the Segmentation Fault, but rather what exacerbates it statistically, i.e. what makes the thing segfault more often than at other times. While I think the SegFault itself is inside something that calls the C library of either hdf5 or netCDF4, running the tests with
18/132 = 14% segfaults is perfectly in line with my estimate here, which is still OK after we've removed the calls to
I declare SegFaults extinct (about bloody time, that is) 😁 Fix in #1064
this is now closed - but, if we do see SegFaults creeping up, we need to see how often they creep up and, if it's becoming a problem again, we should mark those
Re-opening this issue, since I just saw another case of this. One of the tests failed even when running them with
closing this since we've not seen these in a while (well, apart from when netcdf4 is at 1.6.1, but that's a different kettle of fish 🐟 )
Does that mean we can stop running some tests sequentially now, @valeriupredoi?
ooohhh ahhh I would be too chicken to do that now 🐔 Can we try with a PR that we let marinate for a bit longer, to see if we get any SegFaults?
You shouldn't have closed this 😂 The recent test run on CircleCI: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/7862/workflows/ceea1cc3-e590-40d5-8402-5b3e5879679f/jobs/36002
Manu, that's netCDF4=1.6.2, also reported by @remi-kazeroni in the Tool; we need to change the pin from !=1.6.1 to a restrictive <1.6.1. That's iris' handling (or mis-handling) of threaded netCDF4 with Dask, for which we have #1727 and ESMValGroup/ESMValTool#2907, which I will edit to warn users not to use any netCDF4>1.6.0. Iris are debating it in SciTools/iris#5016. The SegFaults reported here were very rare, and we stopped seeing them after netCDF4 ~ 1.5 something. I am not 100% sure they were connected to iris; they remain a mystery. The new ones are more frequent than a new Tory prime minister (this year).
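As an aside, the fix discussed here is a dependency pin (netCDF4 <1.6.1 instead of !=1.6.1) in the package metadata; a hypothetical runtime guard along these lines could additionally warn users who already have an affected version installed (purely illustrative, not something the Tool actually does):

```python
# Hypothetical runtime check, not part of ESMValCore/ESMValTool: warn if an
# affected netCDF4 release is installed, since netCDF4 > 1.6.0 is suspected of
# segfaulting under threaded (dask) reads.
import warnings

import netCDF4
from packaging.version import Version

if Version(netCDF4.__version__) > Version("1.6.0"):
    warnings.warn(
        "netCDF4 > 1.6.0 may cause segmentation faults with threaded reads; "
        "consider downgrading to netCDF4 < 1.6.1.",
        stacklevel=2,
    )
```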
That would be great, it would really speed up the test runs if we could avoid running some tests sequentially.
The unit tests are failing regularly due to a segmentation fault.
If this happens to your pull request, please go to CircleCI by clicking 'Details' behind the CircleCI 'commit' check below your pull request and clicking on the failed test job. Once you're on CircleCI, click the 'Rerun' button in the top right corner and choose 'Rerun Workflow from Failed'. Usually the test will succeed the second time; if not, try again. If you do not see the 'Rerun' button, you need to log in to CircleCI with your GitHub account first.
The problem seems to originate from the netcdf4 library.
Example crashed run: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/2482/workflows/f8e73729-c4cf-408c-bdae-beec24238ac1/jobs/10300/steps
Example stack trace: