Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate segfaults with NetCDF4 >1.6.0 #5016

Closed
trexfeathers opened this issue Oct 7, 2022 · 17 comments · Fixed by #5095
Closed

Investigate segfaults with NetCDF4 >1.6.0 #5016

trexfeathers opened this issue Oct 7, 2022 · 17 comments · Fixed by #5095

Comments

@trexfeathers
Copy link
Contributor

📰 Custom Issue

A bad interaction between Iris, Dask and NetCDF4 is causing segfaults when loading. This is apparently fixed if Dask is set to single-threaded mode - see Unidata/netcdf4-python#1192 (comment)

This workaround removes a large part of the benefit of using Dask in Iris, so we need to find out if we can instead fix via changes to the Iris loading code.

Before the next release (v3.4), 1 of the below needs to happen:

  • Permanent fix - change Iris' loading code to avoid the segfaults
  • Temporary workaround - pin NetCDF4 <=1.6.0
  • Temporary workaround, with big performance regressions - set Dask to single-threaded within Iris
@trexfeathers
Copy link
Contributor Author

@bjlittle has shared how xarray handles NetCDF4 not being thread-safe:

https://github.com/pydata/xarray/blob/main/xarray/backends/file_manager.py#L52

@trexfeathers trexfeathers self-assigned this Oct 14, 2022
@trexfeathers trexfeathers moved this from 📋 Backlog to 🔖 Ready in Iris v3.4.0 Oct 14, 2022
@bjlittle
Copy link
Member

bjlittle commented Oct 20, 2022

Ping @valeriupredoi 👍

@valeriupredoi
Copy link

awesome, cheers @bjlittle - also, let me know if I can help 🍺

@trexfeathers
Copy link
Contributor Author

Current thinking

Fix will be a lot of work. We should get started as soon as we can. But we don't need to coincide with the planned release schedule.

Rationale

single-threaded? ❌

Have confirmed the suggestion that the single-threaded Dask scheduler avoids the problem. However much of our user-base would not cope with the slowdowns this would bring, especially if they are depending on large compute clusters to get fast enough execution times.

File locking? ⏳

File locking (like Xarray) is not going to be a quick fix. Iris' load and save code uses file paths all over the place, which makes it difficult to protect access to a file. @bjlittle and I made a few unsuccessful attempts.

We probably need the common currency in Iris' internals to be a specialised file-handling object, instead of file paths. This new object could control file locking and also pickling, which is important for Dask. This will be a large piece of work, and difficult to estimate since it touches so much of Iris.

Pinning? 😕

Pinning is awkward. It makes environment resolution less flexible, and can cause all sorts of other problems (see conversation on conda-forge/iris-feedstock#86 for some examples). So we want to pin for the minimum amount of time, but missing v3.4 doesn't mean waiting for v3.5 (disaster) - the fix will probably be too big for a bugfix but we can make an impromptu minor release whenever is appropriate.

@trexfeathers
Copy link
Contributor Author

@valeriupredoi will Iris v3.4 still relying on a pin be a big problem for ESMValTool?

@pp-mo
Copy link
Member

pp-mo commented Nov 8, 2022

@trexfeathers this was recently mentioned ESMValGroup/ESMValCore#1776 (comment)

@trexfeathers trexfeathers moved this from 🔖 Ready to 🏗 In progress in Iris v3.4.0 Nov 8, 2022
@pp-mo
Copy link
Member

pp-mo commented Nov 9, 2022

Note : IIUC experiment is showing that the problem still occurs with a process-based scheduler,
so it quite probably isn't a true "thread-safety" issue, but some more general problem with parallelisation ?

See also #5031 (lazy saving trial code) : We see that the mutex problems there are solved by xarray with file-specific locks, which is probably what we need here too.

@pp-mo
Copy link
Member

pp-mo commented Nov 9, 2022

Fix will be a lot of work. We should get started as soon as we can.

IMHO code changes for this should probably be well localised -- just the netcdf load+save modules. So, not a huge rewrite (and not much to document for users, either).
But it isn't so clear whether it will be simple to get to a robust solution -- there could be a lot of research and experimentation hiding in here.

Aside : note also the approach of Xarray, as mentioned above : Its locking solution also has a (netcdf) file caching scheme to improve efficiency : It's probably not a requirement, but it would be interesting to see if this approach would be worth copying too. After all, the current NetcdfDataProxy objects are opening+closing the file for every chunk of data read, which at some point has to be pretty inefficient (!)

@pp-mo
Copy link
Member

pp-mo commented Nov 9, 2022

On the urgency,

@valeriupredoi has recently said "pinning the package is not a long term solution"

@trexfeathers trexfeathers removed their assignment Nov 9, 2022
@valeriupredoi
Copy link

@valeriupredoi will Iris v3.4 still relying on a pin be a big problem for ESMValTool?

sorry I forgot to get meself here, guys! No, not a problem at all - as @zklaus pointed out, moving to Python=3.11 gonna be a bit lengthy anyway since we'd have to wait for all our dependency ducks to get in line with the new Python. But in the long run (ie a few months) it'd be good to have it. Let me know if I can help you in any way BTW 🍺

@valeriupredoi
Copy link

hi guys, heads up and hope you don't mind I opened a repodata patch PR to get the pin in for previous iris versions since it's a bit of a nightmare for us to pin/repodata patch, plus, it's better for the other packages that use iris (initially we wanted to repodata patch in ESMVal-suite, but @bouweandela had the good idea to patch iris instead, and got this through a vote with the other ESMVal tech leads) - please have a look when you have time conda-forge/conda-forge-repodata-patches-feedstock#358

@valeriupredoi
Copy link

OK that PR got merged so iris should now be repo-patched to use only netCDF4 <1.6.1 (after nearly destroying your conda forge package by replacing all the deps with netCDF4 <1.6.1 instead of appending it to them 🤣 )

@trexfeathers
Copy link
Contributor Author

@valeriupredoi are any of you in ESMValTool able to test against a branch on my fork?
#5095, trexfeathers:netcdf_segs

@valeriupredoi
Copy link

@valeriupredoi are any of you in ESMValTool able to test against a branch on my fork? #5095, trexfeathers:netcdf_segs

Martin, sure, I'll take it for a spin a short bit! @bouweandela @zklaus may want to look at it as well, Cheers for working on this issue, mate 🍺

@valeriupredoi
Copy link

hi @trexfeathers I have installed locally your fork but I am getting a lot of iris-related issues a la:

ERROR tests/unit/preprocessor/_weighting/test_weighting_landsea_fraction.py - AttributeError: 'Cube' object has no attribute '_as_defn'
FAILED tests/unit/preprocessor/_derive/test_siextent.py::test_siextent_calculation_siconca - AttributeError: 'CubeList' object has no attribute 'extract_cube'
FAILED tests/sample_data/multimodel_statistics/test_multimodel.py::test_multimodel_regression_day_standard[overlap] - AttributeError: 'DimCoord' object has no attribute 'metadata'

in fact, most of ESMValCore's 308 failed tests (out of 3038 total tests) are due to these (the FAIL, not the ERROR - that occurred once and I had to remove our test so I can run the test suite). Also running pytest in your dev dir:

ERROR lib/iris/tests/test_netcdf.py
ERROR lib/iris/tests/integration/fast_load/test_fast_load.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_build_auxiliary_coordinate.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_build_cube_metadata.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_build_dimension_coordinate.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_build_geostationary_coordinate_system.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_build_mercator_coordinate_system.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_build_stereographic_coordinate_system.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_build_verticalp_coordinate_system.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_get_attr_units.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_get_cf_bounds_var.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_get_names.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_has_supported_mercator_parameters.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_has_supported_stereographic_parameters.py
ERROR lib/iris/tests/unit/fileformats/pyke_rules/compiled_krb/fc_rules_cf_fc/test_reorder_bounds_data.py
ERROR lib/iris/tests/unit/fileformats/um/fast_load/test_FieldCollation.py

Happily no SegFault from running our tests (once, will run more) but am not 100% sure the missing/broken functionality in the dev iris may not mask the SegFaults

@valeriupredoi
Copy link

BTW we can take the discussion to your fork, can open issues there, if you want me to (just so we don't heavy load the discussion on this thread)? 🍺

@trexfeathers
Copy link
Contributor Author

BTW we can take the discussion to your fork, can open issues there, if you want me to (just so we don't heavy load the discussion on this thread)? 🍺

Sounds good. We can talk on #5095.

@trexfeathers trexfeathers moved this to 🆕 New in ESMValTool Feb 20, 2023
@trexfeathers trexfeathers linked a pull request Feb 20, 2023 that will close this issue
3 tasks
@trexfeathers trexfeathers moved this from 🆕 New to 👀 In review in ESMValTool Feb 20, 2023
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in ESMValTool Feb 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants