Error while saving an altered dataset to NetCDF when loaded from a file #8694

Open
tarik opened this issue Feb 2, 2024 · 4 comments

tarik commented Feb 2, 2024

What happened?

When saving an altered xarray Dataset to a NetCDF file with the to_netcdf method, an error occurs if the original dataset was loaded from a file. The error does not occur when the same dataset is created directly in memory; it appears only after a round trip through a file.

What did you expect to happen?

The altered dataset should be written to a NetCDF file by the to_netcdf method without an error.

Minimal Complete Verifiable Example

import xarray as xr

# Build a small dataset with string coordinates of different maximum lengths.
ds = xr.Dataset(
    data_vars=dict(
        win_1=("attempt", [True, False, True, False, False, True]),
        win_2=("attempt", [False, True, False, True, False, False]),
    ),
    coords=dict(
        attempt=[1, 2, 3, 4, 5, 6],
        player_1=("attempt", ["paper", "paper", "scissors", "scissors", "paper", "paper"]),
        player_2=("attempt", ["rock", "scissors", "paper", "rock", "paper", "rock"]),
    )
)
ds.to_netcdf("dataset.nc")

# Round-trip through the file; the loaded variables now carry encoding information.
ds_from_file = xr.load_dataset("dataset.nc")

# Filtering shortens the longest string in "player_1" but keeps the old encoding.
ds_altered = ds_from_file.where(ds_from_file["player_1"] == "paper", drop=True)
ds_altered.to_netcdf("dataset_altered.nc")  # raises ValueError (see traceback below)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Traceback (most recent call last):
  File "example.py", line 20, in <module>
    ds_altered.to_netcdf("dataset_altered.nc")
  File ".../python3.9/site-packages/xarray/core/dataset.py", line 2303, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1315, in to_netcdf
    dump_to_store(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1362, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 356, in store
    self.set_variables(
  File ".../python3.9/site-packages/xarray/backends/common.py", line 398, in set_variables
    writer.add(source, target)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 243, in add
    target[...] = source
  File ".../python3.9/site-packages/xarray/backends/scipy_.py", line 78, in __setitem__
    data[key] = value
  File ".../python3.9/site-packages/scipy/io/_netcdf.py", line 1019, in __setitem__
    self.data[index] = data
ValueError: could not broadcast input array from shape (4,5) into shape (4,8)

Anything else we need to know?

Findings:

The issue is that the encoding information of the dataset becomes stale after filtering with the where method: to_netcdf uses the stored encoding instead of the actual shape of the data.

In the example above, the maximum length of the strings stored in "player_1" and "player_2" is originally 8 characters. After filtering with where, the longest string in "player_1" is only 5 characters, while "player_2" still contains an 8-character string. However, the encoding of both variables still records a length of 8, in particular through the char_dim_name entry.
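
For illustration, the stale encoding can be inspected on the round-tripped variables (a sketch continuing the example above; the exact keys present may vary with the backend used):

# Encoding carried over from "dataset.nc"; the character-dimension entry
# (char_dim_name) still refers to the original 8-character string dimension.
print(ds_from_file["player_1"].encoding)

# The filtered dataset keeps that encoding, even though the longest
# remaining string in "player_1" is now only 5 characters.
print(ds_altered["player_1"].encoding)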

Workaround:

A workaround is to call the drop_encoding method on the dataset before saving it with to_netcdf. This removes the stale encoding information, so to_netcdf derives the output shapes from the actual data and the broadcasting error does not occur.
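
A minimal sketch of the workaround, continuing the example above:

# Drop the stale encoding before writing, so to_netcdf derives the
# on-disk shapes from the data itself.
ds_altered.drop_encoding().to_netcdf("dataset_altered.nc")  # no broadcasting error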

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.14 (main, Aug 24 2023, 14:01:46)
[GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.3.1-060301-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.1.1
pandas: 2.2.0
numpy: 1.26.3
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

@tarik added the bug and needs triage labels on Feb 2, 2024
welcome bot commented Feb 2, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@tarik changed the title from "Error while saving an altered Xarray dataset to NetCDF when loaded from a file" to "Error while saving an altered dataset to NetCDF when loaded from a file" on Feb 2, 2024
@kmuehlbauer removed the needs triage label on Feb 5, 2024
kmuehlbauer (Contributor) commented
Thanks for raising this. Please see #6323 for discussion on handling encoding in the future.

As you already found, drop_encoding is the way to go for your use case. This is also covered in the docs here: https://docs.xarray.dev/en/stable/user-guide/io.html#reading-encoded-data. Please let us know if and how the documentation could be improved to make this clearer.

tarik (Author) commented Feb 7, 2024

Thank you for pointing out the resources.

I like the idea of the keep_encoding parameter mentioned in #6323.

The documentation could benefit from being more explicit about the issue of invalid encoding and its consequences. Specifically, the section Writing encoded data could mention that some operations may lead to invalid encoding information that can cause errors when writing to a file, for example with to_netcdf(). It could be noted that in these instances, the encoding information can be corrected or removed entirely using drop_encoding(). Additionally, the documentation for functions related to reading and writing files (open_dataset(), to_netcdf(), etc.) could point to the Reading and writing files webpage.
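
For example, a minimal sketch of correcting the encoding by hand, assuming that removing the stale char_dim_name entries is sufficient in this case (calling drop_encoding() remains the simpler, general option):

# Sketch: remove only the stale character-dimension entry from each string
# variable, so the writer falls back to dimensions sized from the actual data.
for name in ("player_1", "player_2"):
    ds_altered[name].encoding.pop("char_dim_name", None)
ds_altered.to_netcdf("dataset_altered.nc")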

kmuehlbauer (Contributor) commented
@tarik Thanks for taking the time to share your thoughts.

Regarding the documentation changes, we always welcome contributions.
