[BUG] Random netCDF errors #226

Closed
bnubald opened this issue Mar 10, 2024 · 5 comments · Fixed by #247 or icenet-ai/icenet-pipeline#40

@bnubald
Collaborator

bnubald commented Mar 10, 2024

  • IceNet version: v0.2.7
  • Python version: 3.9/3.11.7
  • Operating System: Linux x64

Description

Running icenet_output on an ensemble run more often than not causes a random netCDF output error.

What I Did

$ ./run_predict_ensemble.sh -b 1 -f 0.6 -p bashpc.sh demo_pipeline_south_ensemble demo_pipeline_south example_small_south_ensemble_forecast testdates.csv

This runs the following command from the icenet library, which causes the issue:

$ icenet_output -m --nan -r ../.. -o ../../results/predict example_small_south_ensemble_forecast demo_pipeline_south predict_dates.csv

The forecast dates for which this issue occurs are random.

The error:

[27-02-24 13:03:30 :DEBUG   ] - Adding 1 to tas_anom channel
[27-02-24 13:03:30 :DEBUG   ] - Adding 1 to zg250_anom channel
[27-02-24 13:03:30 :DEBUG   ] - Adding 1 to zg500_anom channel
[27-02-24 13:03:30 :DEBUG   ] - Adding 1 to siconca_linear_trend channel
[27-02-24 13:03:30 :DEBUG   ] - Adding 1 to cos channel
[27-02-24 13:03:30 :DEBUG   ] - Adding 1 to land channel
[27-02-24 13:03:30 :DEBUG   ] - Adding 1 to sin channel
[27-02-24 13:03:30 :DEBUG   ] - Channel quantities deduced:
{'cos': 1,
 'land': 1,
 'siconca_abs': 3,
 'siconca_linear_trend': 93,
 'sin': 1,
 'tas_anom': 3,
 'uas_abs': 3,
 'vas_abs': 3,
 'zg250_anom': 3,
 'zg500_anom': 3}

Total channels: 114
[27-02-24 13:03:30 :DEBUG   ] - ./data/masks already exists
[27-02-24 13:03:30 :INFO    ] - Loading model from ./results/networks/small_demo_pipeline_south_ensemble/small_demo_pipeline_south_ensemble.network_small_demo_pipeline_south.46.h5...
[27-02-24 13:03:31 :DEBUG   ] - Creating converter from 3 to 5
[27-02-24 13:03:31 :INFO    ] - Generating forecast inputs from processed/ files
[27-02-24 13:03:34 :DEBUG   ] - Could not import boto3, continuing with reduced functionality.
[27-02-24 13:03:42 :INFO    ] - Running prediction 2022-02-15
[27-02-24 13:03:45 :INFO    ] - Saving 2022-02-15 - forecast output (1, 432, 432, 93)
[27-02-24 13:03:51 :INFO    ] - Running prediction 2022-02-16
[27-02-24 13:03:53 :WARNING ] - ./results/predict/example_small_south_ensemble_forecast/small_demo_pipeline_south_ensemble.46 output already exists
[27-02-24 13:03:53 :INFO    ] - Saving 2022-02-16 - forecast output (1, 432, 432, 93)
[27-02-24 13:04:00 :INFO    ] - Running prediction 2022-02-17
[27-02-24 13:04:02 :WARNING ] - ./results/predict/example_small_south_ensemble_forecast/small_demo_pipeline_south_ensemble.46 output already exists
[27-02-24 13:04:02 :INFO    ] - Saving 2022-02-17 - forecast output (1, 432, 432, 93)
[27-02-24 13:04:08 :INFO    ] - Running prediction 2022-02-18
[27-02-24 13:04:10 :WARNING ] - ./results/predict/example_small_south_ensemble_forecast/small_demo_pipeline_south_ensemble.46 output already exists
[27-02-24 13:04:10 :INFO    ] - Saving 2022-02-18 - forecast output (1, 432, 432, 93)
HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 1:
  #000: H5A.c line 1891 in H5Aiterate2(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5VLint.c line 1742 in H5VL_vol_object(): invalid identifier type to function
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 2:
  #000: H5A.c line 679 in H5Aopen_by_name(): unable to synchronously open attribute
    major: Attribute
    minor: Can't open object
  #001: H5A.c line 633 in H5A__open_by_name_api_common(): can't set object access arguments
    major: Attribute
    minor: Can't set value
  #002: H5VLint.c line 2678 in H5VL_setup_name_args(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #003: H5VLint.c line 1734 in H5VL_vol_object(): invalid identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 2:
  #000: H5A.c line 1117 in H5Aget_space(): not an attribute
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 2:
  #000: H5A.c line 1162 in H5Aget_type(): not an attribute
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 2:
  #000: H5T.c line 2060 in H5Tget_class(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 2:
  #000: H5A.c line 1891 in H5Aiterate2(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5VLint.c line 1742 in H5VL_vol_object(): invalid identifier type to function
    major: Invalid arguments to routine
    minor: Inappropriate type
Traceback (most recent call last):
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/file_manager.py", line 211, in _acquire_with_cache_info
    file = self._cache[self._key]
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/lru_cache.py", line 56, in __getitem__
    value = self._cache[key]
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/data/hpcdata/users/username/git/icenet/green/ensemble/example_small_south_ensemble_forecast/example_small_south_ensemble_forecast-1/processed/small_demo_pipeline_south/era5/south/zg500/zg500_anom.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), 'e7ced475-bb21-45e7-bef5-7751cbbb311f']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/bin/icenet_predict", line 33, in <module>
    sys.exit(load_entry_point('icenet', 'console_scripts', 'icenet_predict')())
  File "/data/hpcdata/users/username/git/icenet/icenet/icenet/model/predict.py", line 198, in main
    predict_forecast(
  File "/data/hpcdata/users/username/git/icenet/icenet/icenet/model/predict.py", line 72, in predict_forecast
    data_sample = dl.generate_sample(date, prediction=True)
  File "/data/hpcdata/users/username/git/icenet/icenet/icenet/data/loaders/dask.py", line 264, in generate_sample
    var_ds = xr.open_mfdataset([
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/api.py", line 1035, in open_mfdataset
    datasets, closers = dask.compute(datasets, closers)
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/api.py", line 573, in open_dataset
    backend_ds = backend.open_dataset(
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/netCDF4_.py", line 603, in open_dataset
    store = NetCDF4DataStore.open(
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/netCDF4_.py", line 400, in open
    return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/netCDF4_.py", line 347, in __init__
    self.format = self.ds.data_model
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/netCDF4_.py", line 409, in ds
    return self._acquire()
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/site-packages/xarray/backends/netCDF4_.py", line 403, in _acquire
    with self._manager.acquire_context(needs_lock) as root:
  File "/data/hpcdata/users/username/miniconda3/envs/icenet/lib/python3.9/contextlib.py", line 119, in __enter__
@bnubald bnubald added the bug Something isn't working label Mar 10, 2024
@bnubald
Collaborator Author

bnubald commented Mar 10, 2024

This relates to thread safety and netCDF.

Following issues/discussions are relevant:

pydata/xarray#7079 (comment)

Unidata/netcdf4-python#1192

https://forum.access-hive.org.au/t/netcdf-not-a-valid-id-errors/389/22

@bnubald
Collaborator Author

bnubald commented Mar 10, 2024

The installed version of netCDF4 was 1.6.5.

Based on the above linked discussions, downgrading it to 1.6.0 was found to work. No more errors!

pip install -U netCDF4==1.6.0

Will have to revisit the thread-safety of the current implementation, since newer versions are not as lenient.
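
For reference, a quick way to confirm which netCDF4 build is active in an environment (a minimal check, nothing IceNet-specific):

# Minimal check of the netCDF4 (and underlying HDF5) versions in the active environment
import netCDF4

print(netCDF4.__version__)          # expect 1.6.0 after the downgrade
print(netCDF4.__hdf5libversion__)   # underlying HDF5 library version, e.g. 1.14.2 in the log above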

@bnubald
Collaborator Author

bnubald commented Mar 21, 2024

This likely relates to #183.

Pinpointing the actual issue is no fun.

The above linked discussions mention the removal of some workarounds that previously got past threading issues, so any threading/multiprocessing issues in the code are likely surfacing now that those workarounds are gone. At some point will need to try and debug the root cause.
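
One possible interim mitigation to experiment with (a sketch only: it assumes the failing call is the xr.open_mfdataset in icenet/data/loaders/dask.py shown in the traceback, and the dask.compute step in that traceback suggests the files are currently opened through dask's threaded scheduler) is to open the per-variable files serially:

# Sketch only: serialise the multi-file open so netCDF4/HDF5 is not entered
# from several threads at once. The glob pattern is hypothetical; the real
# loader builds its own file list.
import glob
import xarray as xr

var_files = sorted(glob.glob("processed/*/era5/south/zg500/zg500_anom.nc"))

var_ds = xr.open_mfdataset(
    var_files,
    parallel=False,  # open each file sequentially rather than via dask.delayed
)

This doesn't address the root cause, but it may help narrow down whether the crash is confined to concurrent opens.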

@bnubald bnubald self-assigned this Apr 10, 2024
@bnubald bnubald added this to the v0.2.8 milestone Apr 10, 2024
@bnubald
Collaborator Author

bnubald commented Apr 10, 2024

Looks like the above error is different from what I'd initially thought. This error occurs when writing to a netCDF file.

Whereas Ryan's #183 happens when reading. I do get the same error as Ryan when reading with newer versions of netCDF4 too.

For this issue, the xarray docs mention the following:

When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable HDF5_USE_FILE_LOCKING=FALSE to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend.
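
A minimal sketch of applying that suggestion (untested against the pipeline; the variable has to be set before the HDF5 library is first initialised, i.e. before netCDF4/xarray are imported, or else exported in the shell that launches icenet_output):

# Sketch: disable HDF5 file locking before any HDF5-backed library is imported
import os
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import xarray as xr  # imported only after the variable is set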

@JimCircadian
Member

> Looks like the above error is different from what I'd initially thought. This error occurs when writing to a netCDF file.
>
> Whereas Ryan's #183 happens when reading. I do get the same error as Ryan when reading with newer versions of netCDF4 too.
>
> For this issue, the xarray docs mention the following:
>
> When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable HDF5_USE_FILE_LOCKING=FALSE to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend.

Good to see the investigation, but not worth considering this a priority at present. More than happy to prioritise it for a more general-purpose release: it might be that we have to bootstrap some test installations with different system setups to pinpoint the issues, as I suspect it's a compatibility needle in a haystack. Defer to 0.3.0 for the moment 😉
