[Bug]: `RuntimeError: NetCDF: Not a valid ID` appears randomly when running operations on the same datasets repeatedly #561
Comments
I don't think this is necessarily an xCDAT issue based on the random occurrence and the related Xarray issues linked in the issue's description. As Jason mentioned, it might have to do with the I/O on the Climate Program filesystem closing files that are open in Xarray.
Related to:
It seems like some filesystems do not like parallel access to files. The workaround seems to be to set …
This commit modifies the instructions in the deployment docs for the persistent dask cluster on `salish` that the `make_averaged_dataset` worker uses. It changes the number of threads for each worker from 4 to 1, the memory limit from automatic to 64G, and worker files will now be stored on the /tmp/ file system instead of /dev/shm shared memory file system. The change to 1 thread per worker is a consequence of reading xCDAT/xcdat#561 (comment). Changing the memory limit and worker file storage are a result of research and testing on 1mar24 (see worklog for details).
* Fix typo in skookum deployment docs page title
* Update dask commands from hyphenated to sub-commands: dask.distributed 2022.10.0 deprecated the `dask-scheduler` and `dask-worker` CLI commands in favour of `dask scheduler` and `dask worker`.
* Update dask worker settings in deployment docs: This commit modifies the instructions in the deployment docs for the persistent dask cluster on `salish` that the `make_averaged_dataset` worker uses. It changes the number of threads for each worker from 4 to 1, the memory limit from automatic to 64G, and worker files will now be stored on the /tmp/ file system instead of the /dev/shm shared memory file system. The change to 1 thread per worker is a consequence of reading xCDAT/xcdat#561 (comment). Changing the memory limit and worker file storage are a result of research and testing on 1mar24 (see worklog for details).
* Drop autodoc mocks for nemo_nowcast workers: The NEMO_Nowcast package is now installed in the readthedocs build environment, so autodoc mocks for its imports are no longer required. Removing those mocks silences warnings about mocked objects and missing attributes for the `clear_checklist` and `rotate_logs` worker docs.
* Add 'reshapr' to autodoc mocks list: This resolves an import error that prevents generation of docs for the `make_averaged_dataset` worker on readthedocs.
* Replace inappropriate kbd directives in skookum docs: Changed all the kbd directives in the skookum deployment documentation into inline code. This helps to ensure the correct semantic representation of technical terms and commands. re: issue #126
* Improve semantic markup of tmux in skookum docs
* Replace inappropriate kbd directives in docs: Changed all the kbd directives in the documentation into inline code. This helps to ensure the correct semantic representation of technical terms and commands. re: issue #126
* Remove unused imports in wave_height_period.py: Two unused Python library imports, pathlib and requests, were removed from wave_height_period.py. This cleanup improves code readability and efficiency.
* Update fig dev docs re: `black` for code formatting: The documentation has been updated to reflect the change in the automatic code formatting tool used by the `salishsea_site` and `SalishSeaNowcast` packages. Previously we were using `yapf`, but switched to `black`.
* Add SSH keys and config section to skookum docs: The documentation now contains a guide for generating passphrase-less RSA and ED25519 key pairs for the remote hosts that SalishSeaCast uses. It includes commands to install the public keys and to edit the ssh configuration file on 'skookum'. re: issue #244
This is done to increase the reliability of Reshapr extractions. With threads > 1 we see random occurrences of errors like `RuntimeError: NetCDF: Not a valid ID`. The root cause appears to be that the `netcdf-c` library is not thread-safe, and a change introduced in `netcdf4-python=1.6.1` removed a work-around for the lack of thread-safety. See the discussion in xCDAT/xcdat#561 and in other discussions linked there.
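For illustration, here is a minimal sketch of the single-threaded-worker mitigation described in the commit above, expressed as a Python `LocalCluster` rather than the `dask worker` CLI that the deployment actually uses. The worker count, chunk sizes, and file path are placeholders, not values from either repository.

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

# One thread per worker so netCDF reads are never issued concurrently from
# multiple threads within a process; parallelism comes from processes instead.
# Worker count and memory limit are illustrative assumptions.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="64GB")
client = Client(cluster)

# Placeholder path; any lazily opened multi-file netCDF dataset will do.
ds = xr.open_mfdataset("/path/to/files/*.nc", chunks={"time": 12})
```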
What happened?
For some reason, `RuntimeError: NetCDF: Not a valid ID` is being thrown at random times when running operations on the same dataset. This has happened when I call the 1) spatial averaging and 2) temporal averaging APIs.
1) Spatial Averaging API
Example:
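The original code example did not survive extraction. Below is a rough reconstruction of the kind of reproducer described (the same spatial averaging call repeated on copies of one multi-file dataset); the file glob and the `ts` variable name are assumptions.

```python
# Rough reconstruction of the reported usage pattern, not the original MVCE.
import xcdat

ds = xcdat.open_mfdataset("/path/to/ts_Amon_*.nc")  # placeholder path

# Repeating the same spatial averaging call on copies of one dataset
# intermittently raises "RuntimeError: NetCDF: Not a valid ID".
for _ in range(10):
    ds_copy = ds.copy()
    result = ds_copy.spatial.average("ts")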
2) Temporal Averaging API
For this API, the `RuntimeError` stacktrace points to the logic in `TemporalAccessor._get_weights()`. Temporal averaging should work fine across multiple copies of the same dataset object.
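A hedged sketch of the equivalent temporal-averaging pattern is below; it mirrors the spatial sketch above, and again the path and variable name are placeholders.

```python
# Illustrative only: temporal averaging weights are derived from the time
# bounds inside TemporalAccessor._get_weights(), where the error surfaces.
import xcdat

ds = xcdat.open_mfdataset("/path/to/ts_Amon_*.nc")  # placeholder path

for _ in range(10):
    ds_copy = ds.copy()
    result = ds_copy.temporal.group_average("ts", freq="year")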
Possible answers: In `_get_weights()`, make a copy of the `time_lengths` variable and load that into memory instead of the original `time_lengths` variable (which might be closed after the first API call, resulting in the `RuntimeError: NetCDF` error). There is a line of code that calculates `time_lengths` using time bounds. The `time_lengths` are then loaded into memory.

xcdat/xcdat/temporal.py, lines 1209 to 1216 in c83f46e
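One way the suggestion could be written down is sketched below. This is not the actual body of `TemporalAccessor._get_weights()` at the referenced lines, and the function name is hypothetical.

```python
# Sketch of the suggested change only, not the actual xcdat source.
import xarray as xr


def _compute_time_lengths(time_bounds: xr.DataArray) -> xr.DataArray:
    # Length of each time interval, from its lower and upper bounds.
    time_lengths = time_bounds[:, 1] - time_bounds[:, 0]

    # Per the suggestion above: load a deep copy into memory rather than
    # loading the original lazy variable in place.
    return time_lengths.copy(deep=True).load()
```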
What did you expect to happen? Are there any possible answers you came across?
APIs should work consistently across multiple copies of the same dataset.
@jasonb5 suggests that it could be the LLNL climate filesystem not behaving well with Xarray dataset I/O.
Minimal Complete Verifiable Example (MVCE)
Provided above.
Relevant log output
Provided above.
Anything else we need to know?
Related GitHub issues:
Environment