Netcdf thread safety #5061
Conversation
Tests have been running for 30 mins and counting...
It is still deadlocking in some operations, to be investigated + fixed. I told you it was WIP!
Notes on existing errors
I have been focussing on the set of tests in
N.B. although these are described as errors, something nevertheless "appears to work"? Could be failing to read data, but not checking actual values?
Update Mon 2022-11-14
Discussed with @bjlittle.
We're currently hoping this can be a bugfix for
Close in favour of #5095
* Unpin netcdf4.
* Temporarily enable GHA on this branch.
* Temporarily enable GHA on this branch.
* Temporarily enable GHA on this branch.
* Experiment to disable wheel CI on forks.
* Disable segfaulting routines.
* More temporary changes to get CI passing.
* More temporary changes to get CI passing.
* Finessed segfault skipping.
* Bring in changed from #5061.
* Re-instate test_load_laea_grid.
* Adaptations to get the tests passing.
* Use typing.Mapping instead.
* Get doctests passing.
* CF only resolve non-url filenames.
* Confirm thread safety fixes.
* Remove dummy assignment.
* Restored plot_nemo What's New entry.
* _add_aux_factories temporarily release global lock.
* Remove per-file locking.
* Remove remaining test workarounds.
* Remove remaining comments.
* Correct use of CFReader context manager.
* Refactor for easier future maintenance.
* Rename netcdf _thread_safe, add header.
* Full use of ThreadSafeAggregators.
* Full use of ThreadSafeAggregators.
* Remove remaining imports of NetCDF4.
* Test to ensure netCDF4 is via _thread_safe module.
* More refined netcdf._thread_safe classes.
* _thread_safe docstrings.
* Restore original NetCDF code where possible.
* Revert changes to 2.3.rst.
* Update lockfiles.
* Additions to _thread_safe.py
* Remove temporary CI shims.
* New locking stategy for NetCDFDataProxy.
* NetCDFDataProxy simpler use of netCDF4 lock.
* Update lock files.
* Go back to using a Threading Lock.
* Remove superfluous pass commands in test_cf.py.
* Rename _thread_safe to _thread_safe_nc.
* Rename thread safe classes to be 'Wrappers'.
* Better contained getattr and setattr pattern.
* Explicitly name netCDF4 module in _thread_safe_nc docstring.
* Better docstring for _ThreadSafeWrapper.
* Better comment about THREAD_SAFE_FLAG.
* list() wrapping within _GLOBAL_NETCDF4_LOCK, to account for generators.
* More accurate thread_safe docstrings in netcdf.saver.
* Split netcdf integration tests into multiple modules.
* Tests for non-thread-safe NetCDF behaviour.
* Docstring accuracy.
* Correct use of dask config set (context manager).
* Update dependencies.
* Review - don't need the first-class import of iris.tests.
* Better name for the loading test.
* Better selection of data to load.
* What's New entry.
* Improve tests.
* Update lock files.
* Increase chunking on test_save.

---------

Co-authored-by: Patrick Peglar <patrick.peglar@metoffice.gov.uk>
Ultimately, to address #5016
For now, trial code for adding thread-safety and file-locking.
Both in saver and loader.
This trial code is rather rough, and not very thought-through.
At present, all of the tests are locking up + not completing.
See notes below.
learning so far
netcdf library is not considered thread-safe
Actually, it strictly never was thread-safe, but this got worse with netCDF4 1.6.1, which now releases the GIL during file operations,
according to at least one authoritative comment elsewhere.
So, in addition to any per-file locking that might be desirable (e.g. to control access to a file when doing lazy arithmetic between different netcdf variables), we need a global lock controlling all of our access to the netcdf library, if any of that access can be multi-threaded.
... and with Dask, presumably, access can always be multi-threaded, since data chunks are processed by separate tasks --> different worker threads.
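As a rough illustration of that idea (a sketch only, not the wrapper layer this PR actually adds; the lock name here is made up), every call into netCDF4 gets funnelled through one process-wide lock:

```python
import threading

import netCDF4

# One lock shared by every netCDF4 call in the process (illustrative name only).
_GLOBAL_NETCDF4_LOCK = threading.Lock()


def read_whole_variable(path, var_name):
    # Any code path that may run on a dask worker thread must take the lock
    # around *all* netCDF4 operations, not just the data read itself.
    with _GLOBAL_NETCDF4_LOCK:
        with netCDF4.Dataset(path) as ds:
            return ds.variables[var_name][:]
```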
process-based scheduling
like this
Rather to my surprise, lots of our existing code breaks if you simply switch to process-based Dask workers (as are used by default for Bag computations).
I had rather thought that this would "just work" ™️. But it's now clear that a simple da.store operation is not serialisable unless you give it a serialisable lock, just as we already saw in the lazy saving investigation.
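A minimal demonstration of the serialisability point (these are stock dask/stdlib utilities, nothing from this PR): a plain threading.Lock cannot be pickled out to worker processes, whereas dask's SerializableLock can, so the latter is the kind of thing you'd pass to da.store(..., lock=...).

```python
import pickle
import threading

from dask.utils import SerializableLock

try:
    pickle.dumps(threading.Lock())
except TypeError as err:
    # A plain lock cannot be shipped to worker processes.
    print("cannot pickle plain lock:", err)

# SerializableLock round-trips through pickle, so it can be handed to
# da.store(source, target, lock=SerializableLock()) under process scheduling.
restored = pickle.loads(pickle.dumps(SerializableLock()))
print(type(restored))
```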
single-threaded
Like this
Solves the above problems, and probably all others, but it's not a practical solution.
It has already proved useful in tracking down places where code uses locks wrongly, i.e. failing to release before re-acquiring, which simply hangs when single-threaded.
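For reference, the single-threaded scheduler can be selected per-compute or for a whole block of code (a usage sketch, not code from this PR):

```python
import dask
import dask.array as da

data = da.ones((4, 4), chunks=(2, 2))

# Per-call:
data.sum().compute(scheduler="single-threaded")

# Or for a whole block of code:
with dask.config.set(scheduler="single-threaded"):
    data.sum().compute()
```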
opening/closing of netcdf Dataset objects
This is somewhat simpler than I had feared.
For the reading process, a netCDF4.Dataset is opened when a CFReader is created, and is held open while cubes are extracted from it: in the new code, the netcdf library's global access lock is held along with it.
During that process, the load code and 'actions' refer to the dataset and its Variables, but once all the cubes have been produced the CFReader is discarded: the dataset is then closed, and the only remaining references are in the NetCDFDataProxy objects used to build lazy arrays, which contain just a filepath + variable name and so don't refer to the 'original' Dataset object.
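A much-simplified sketch of that data-proxy idea (illustrative names, not the exact iris classes): each deferred read re-opens the file under the global lock and closes it again afterwards.

```python
import threading

import netCDF4

_GLOBAL_NETCDF4_LOCK = threading.Lock()  # illustrative stand-in for the real global lock


class SimpleDataProxy:
    """Deferred access to one variable: holds only a filepath and a variable name."""

    def __init__(self, path, variable_name):
        self.path = path
        self.variable_name = variable_name

    def __getitem__(self, keys):
        # Re-open the file for every read, keeping all netCDF4 calls inside
        # the global lock so that dask worker threads cannot collide.
        with _GLOBAL_NETCDF4_LOCK:
            dataset = netCDF4.Dataset(self.path)
            try:
                return dataset.variables[self.variable_name][keys]
            finally:
                dataset.close()
```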
The only tricky part was that the existing code relies on `CFReader.__del__` to close the dataset. That is in fact unreliable, and may not happen when you expect or want it: effectively it depends on how the garbage collector behaves, which is not a constant. I also found that this is not just a theoretical detail, but really does happen.
Really simple example:
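The original collapsed snippet isn't reproduced here; the following minimal sketch makes the same point, that `__del__` can be delayed until the cyclic garbage collector happens to run:

```python
import gc


class FakeReader:
    def __del__(self):
        print("closing dataset")


reader = FakeReader()
reader.cycle = reader  # any reference cycle prevents immediate collection
del reader             # nothing printed yet: __del__ has not run
print("reader deleted, dataset still open")
gc.collect()           # only now does "closing dataset" appear
```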
So this is actually unnecessary, and I reckon we would be better off using the CFReader as a context manager, so that it always closes automatically after the operation. I am now doing that here.
BTW this is exactly how the netcdf.saver.Saver is already used.
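In other words, the loading code ends up shaped roughly like this (a sketch of the intended pattern, assuming CFReader gains context-manager support; the cube-building helper is hypothetical):

```python
from iris.fileformats.cf import CFReader


def load_from(filename):
    # The dataset is guaranteed to close when the 'with' block exits,
    # rather than whenever CFReader.__del__ eventually fires.
    with CFReader(filename) as cf:
        cubes = list(_build_cubes(cf))  # hypothetical cube-building step
    return cubes
```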
nature of errors
So far, various operations which succeed nevertheless produce a stream of HDF error messages.
It's not clear how that can be (if it succeeds), and being multi-threaded, it's naturally hard to debug.
Maybe it's a fault in my code design, or for some reason something else needs guarding.
lazy streaming netcdf-to-netcdf in save operations
There were some very obvious things going wrong with save-load-save roundtrips, like this one.
This highlights that, when streaming from lazy netcdf data to save results, the saving process must release the global netcdf-access lock while the lazy input is computed, since otherwise that compute will lock up.
This means the lazy-write and the lazy-read operations must each independently ensure that the lock is taken around their own reading/writing.
N.B. with synthetic data, like in these tests, you only see this if you save-load-save: the problem is unique to cases where the saved source data is lazy and in a netcdf file. It's a particularly specific problem.
This makes the changes in the Saver code much more tricky + extensive than those in CFReader.
The load_cubes/CFReader action is quite simple, since it involves no "compute" at all, only straight-line code to snapshot the file + create data-proxies. At the end of which, the file is simply closed.
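To make the locking requirement concrete, here is a rough sketch of the shape of the problem (stand-in names, not this PR's Saver code): the saver must not hold the global lock across the whole compute, but instead let da.store take it around each chunk write, because each lazy read of the netcdf source needs the same lock.

```python
import threading

import dask.array as da

NC_LOCK = threading.Lock()  # stand-in for the global netCDF4 lock


def stream_to_variable(lazy_data, nc_variable):
    # Wrong: holding the lock for the whole save deadlocks, because computing
    # 'lazy_data' chunks needs the same lock to read the source netcdf file.
    #
    #     with NC_LOCK:
    #         da.store(lazy_data, nc_variable)
    #
    # Right: da.store acquires the lock only around each individual chunk
    # write, so the lazy reads and the writes can interleave safely.
    da.store(lazy_data, nc_variable, lock=NC_LOCK)
```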
Scope
Not considering process-parallel execution yet (dask config "scheduler='processes'", or distributed), which still does not work.
However, I have provided per-file locking on read, and an eventual solution may need to adopt many of the ideas from #5130.