"RuntimeError: NetCDF: Not a valid ID" error when generating samples from dataloader #183

Closed · rchan26 opened this issue Sep 12, 2023 · 2 comments · Fixed by #247
Labels: bug (Something isn't working)


rchan26 commented Sep 12, 2023

I'm currently working on the PyTorch example implementation here, but I've come across an error when generating samples from the dataloader (via DaskMultiWorkerLoader.generate_sample).

I've created an IceNet dataset which inherits from the torch.utils.data.Dataset class here. When iterating through the dataset, I come across the following error:

Traceback (most recent call last):
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/backends/api.py", line 1026, in open_mfdataset
    combined = combine_by_coords(
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/core/combine.py", line 982, in combine_by_coords
    concatenated = _combine_single_variable_hypercube(
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/core/combine.py", line 629, in _combine_single_variable_hypercube
    combined_ids, concat_dims = _infer_concat_order_from_coords(list(datasets))
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/core/combine.py", line 149, in _infer_concat_order_from_coords
    raise ValueError(
ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pytorch_example.py", line 60, in <module>
    lit_unet_module, unet_model = train_icenet_unet(
  File "/data/hpcdata/users/rychan/notebooks/icenet-notebooks/pytorch_example/train_icenet_unet.py", line 84, in train_icenet_unet
    trainer.fit(lit_module, train_dataloader, val_dataloader)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 529, in fit
    call._call_and_handle_interrupt(
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 973, in _run
    results = self._run_stage()
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 1016, in _run_stage
    self.fit_loop.run()
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 189, in advance
    batch = next(data_fetcher)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/loops/fetchers.py", line 136, in __next__
    self._fetch_next_batch(self.dataloader_iter)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/loops/fetchers.py", line 150, in _fetch_next_batch
    batch = next(iterator)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/utilities/combined_loader.py", line 284, in __next__
    out = next(self._iterator)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/lightning/pytorch/utilities/combined_loader.py", line 65, in __next__
    out[i] = next(self.iterators[i])
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/hpcdata/users/rychan/notebooks/icenet-notebooks/pytorch_example/icenet_pytorch_dataset.py", line 26, in __getitem__
    return self._dl.generate_sample(date=pd.Timestamp(self._dates[idx]))
  File "/data/hpcdata/users/rychan/icenet/icenet_fork/icenet/data/loaders/dask.py", line 258, in generate_sample
    var_ds = xr.open_mfdataset(
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/backends/api.py", line 1041, in open_mfdataset
    ds.close()
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/core/common.py", line 1155, in close
    self._close()
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/backends/netCDF4_.py", line 513, in close
    self._manager.close(**kwargs)
  File "/data/hpcdata/users/rychan/miniconda3/envs/icenet_pytorch/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 232, in close
    file.close()
  File "src/netCDF4/_netCDF4.pyx", line 2622, in netCDF4._netCDF4.Dataset.close
  File "src/netCDF4/_netCDF4.pyx", line 2585, in netCDF4._netCDF4.Dataset._close
  File "src/netCDF4/_netCDF4.pyx", line 2029, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID

The point at which this error occurs is quite volatile. It shows up during training, when we're obtaining samples at each epoch.

Training fails at a different point on each run, so it's been difficult to really nail down the issue, but we suspect it comes from the multiprocessing.
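
For reference, the wrapper behaves roughly like the sketch below. The class name and the DataLoader settings are assumptions; the _dl/_dates attributes and the generate_sample call are taken from the traceback above.

import pandas as pd
from torch.utils.data import DataLoader, Dataset


class IceNetPyTorchDataset(Dataset):
    """Thin wrapper that defers sample generation to the IceNet dask loader."""

    def __init__(self, loader, dates):
        # loader: an object exposing generate_sample(), e.g. DaskMultiWorkerLoader
        # dates: iterable of dates to draw samples for
        self._dl = loader
        self._dates = list(dates)

    def __len__(self):
        return len(self._dates)

    def __getitem__(self, idx):
        # Each call opens the per-variable NetCDF files via xr.open_mfdataset()
        # inside generate_sample(); this is where "NetCDF: Not a valid ID" surfaces.
        return self._dl.generate_sample(date=pd.Timestamp(self._dates[idx]))


# With num_workers > 0 each worker runs __getitem__ concurrently, which is
# where the multiprocessing suspicion above comes from. Hypothetical usage:
# train_dataloader = DataLoader(IceNetPyTorchDataset(dl, dates), batch_size=4, num_workers=4)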


bnubald commented Apr 11, 2024

Getting the same error when running via the icenet-pipeline:

$ ./run_predict_ensemble.sh -f 0.6 -p bashpc.sh tutorial_south_ensemble tutorial_pipeline_south tutorial_south_ensemble_forecast testdates.csv

WARNING:root:./results/predict/tutorial_south_ensemble_forecast/tutorial_south_ensemble.42 output already exists
WARNING:root:./results/predict/tutorial_south_ensemble_forecast/tutorial_south_ensemble.42 output already exists
Traceback (most recent call last):
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/backends/api.py", line 1026, in open_mfdataset
    combined = combine_by_coords(
               ^^^^^^^^^^^^^^^^^^
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/core/combine.py", line 982, in combine_by_coords
    concatenated = _combine_single_variable_hypercube(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/core/combine.py", line 629, in _combine_single_variable_hypercube
    combined_ids, concat_dims = _infer_concat_order_from_coords(list(datasets))
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/core/combine.py", line 149, in _infer_concat_order_from_coords
    raise ValueError(
ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/bin/icenet_predict", line 33, in <module>
    sys.exit(load_entry_point('icenet', 'console_scripts', 'icenet_predict')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/hpcdata/users/username/git/icenet/icenet/icenet/model/predict.py", line 203, in main
    predict_forecast(
  File "/data/hpcdata/users/username/git/icenet/icenet/icenet/model/predict.py", line 75, in predict_forecast
    data_sample = dl.generate_sample(date, prediction=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/hpcdata/users/username/git/icenet/icenet/icenet/data/loaders/dask.py", line 263, in generate_sample
    var_ds = xr.open_mfdataset([
             ^^^^^^^^^^^^^^^^^^^
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/backends/api.py", line 1041, in open_mfdataset
    ds.close()
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/core/common.py", line 1155, in close
    self._close()
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 513, in close
    self._manager.close(**kwargs)
  File "/data/hpcdata/users/username/miniconda3/envs/icenet0.2.8/lib/python3.11/site-packages/xarray/backends/file_manager.py", line 232, in close
    file.close()
  File "src/netCDF4/_netCDF4.pyx", line 2627, in netCDF4._netCDF4.Dataset.close
  File "src/netCDF4/_netCDF4.pyx", line 2590, in netCDF4._netCDF4.Dataset._close
  File "src/netCDF4/_netCDF4.pyx", line 2034, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Not a valid ID

Relates to pydata/xarray#7079

Resolution

Some of the fixes mentioned there involve:

  • Running in single-threaded mode.
  • Setting parallel=False (the default) when reading NetCDF datasets via xarray.open_mfdataset() (see the sketch below).
  • Pinning netCDF4<1.6.1.
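
For illustration, a minimal sketch of the first two workarounds applied around xarray.open_mfdataset; the file-glob pattern and the combine argument are assumptions rather than the exact icenet code:

import glob

import dask
import xarray as xr

# Option 1: run dask single-threaded while reading.
dask.config.set(scheduler="single-threaded")

# Placeholder file list; the real loader builds this per variable.
var_files = sorted(glob.glob("processed/**/*.nc", recursive=True))

# Option 2: keep parallel=False (the xarray default), so files are opened and
# closed sequentially in the calling process instead of via dask tasks.
var_ds = xr.open_mfdataset(
    var_files,
    combine="by_coords",  # mirrors the combine_by_coords call in the tracebacks
    parallel=False,
)

# Option 3: pin the backend library in the environment, e.g.
#   pip install "netCDF4<1.6.1"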

bnubald self-assigned this Apr 11, 2024
bnubald added the bug (Something isn't working) label Apr 11, 2024
bnubald added this to the v0.2.8 milestone Apr 11, 2024
JimCircadian commented:

This is one of two things, but my memory is failing me (still drinking morning coffee): usually this means there is some gunky data, usually in SIC. It's worth checking that you have all the data you need to generate the complete set, or trying a different date, to ensure the issue is not library-based.
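
If it is data rather than the libraries, one way to narrow it down is to open each candidate file individually and see which one (if any) fails; a rough sketch, with the glob pattern as a placeholder:

import glob

import xarray as xr

# Open each NetCDF file on its own: a corrupt ("gunky") file, e.g. in the SIC
# inputs, will usually raise here, pointing at bad data rather than the
# parallel-read behaviour of open_mfdataset.
for path in sorted(glob.glob("processed/**/*.nc", recursive=True)):  # placeholder pattern
    try:
        with xr.open_dataset(path) as ds:
            ds.load()  # force a full read to catch truncated files
    except Exception as exc:
        print(f"Problem reading {path}: {exc}")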
