Error while saving an altered dataset to NetCDF when loaded from a file #8694

Open
tarik opened this issue Feb 2, 2024 · 4 comments

tarik commented Feb 2, 2024

What happened?

When saving an altered xarray Dataset to a NetCDF file with the to_netcdf method, an error occurs if the original dataset was loaded from a file. The error does not occur when the same dataset is created directly in memory; it appears only after a round trip through a file.

What did you expect to happen?

The altered dataset should be written to a NetCDF file by the to_netcdf method without an error.

Minimal Complete Verifiable Example

import xarray as xr

# Build a small dataset with string coordinates of different maximum lengths.
ds = xr.Dataset(
    data_vars=dict(
        win_1=("attempt", [True, False, True, False, False, True]),
        win_2=("attempt", [False, True, False, True, False, False]),
    ),
    coords=dict(
        attempt=[1, 2, 3, 4, 5, 6],
        player_1=("attempt", ["paper", "paper", "scissors", "scissors", "paper", "paper"]),
        player_2=("attempt", ["rock", "scissors", "paper", "rock", "paper", "rock"]),
    )
)
ds.to_netcdf("dataset.nc")

# Round-trip through the file; the loaded variables now carry encoding information.
ds_from_file = xr.load_dataset("dataset.nc")

# Filtering shortens the longest string in "player_1" but keeps the old encoding.
ds_altered = ds_from_file.where(ds_from_file["player_1"] == "paper", drop=True)
ds_altered.to_netcdf("dataset_altered.nc")  # raises ValueError (see traceback below)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Traceback (most recent call last):
  File "example.py", line 20, in <module>
    ds_altered.to_netcdf("dataset_altered.nc")
  File ".../python3.9/site-packages/xarray/core/dataset.py", line 2303, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1315, in to_netcdf
    dump_to_store(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1362, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 356, in store
    self.set_variables(
  File ".../python3.9/site-packages/xarray/backends/common.py", line 398, in set_variables
    writer.add(source, target)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 243, in add
    target[...] = source
  File ".../python3.9/site-packages/xarray/backends/scipy_.py", line 78, in __setitem__
    data[key] = value
  File ".../python3.9/site-packages/scipy/io/_netcdf.py", line 1019, in __setitem__
    self.data[index] = data
ValueError: could not broadcast input array from shape (4,5) into shape (4,8)

Anything else we need to know?

Findings:

The issue is that the encoding information of the dataset becomes stale after filtering with the where method: to_netcdf uses the stored encoding instead of the actual shape of the data.

In the example above, the maximum length of the strings stored in "player_1" and "player_2" is originally 8 characters. After filtering with where, the longest string in "player_1" is only 5 characters, while "player_2" still contains an 8-character string. However, the encoding of both variables still records a length of 8, in particular through the char_dim_name entry.
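
For illustration, the stale encoding can be inspected on the round-tripped variables (a sketch continuing the example above; the exact keys present may vary with the backend used):

# Encoding carried over from "dataset.nc"; the character-dimension entry
# (char_dim_name) still refers to the original 8-character string dimension.
print(ds_from_file["player_1"].encoding)

# The filtered dataset keeps that encoding, even though the longest
# remaining string in "player_1" is now only 5 characters.
print(ds_altered["player_1"].encoding)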

Workaround:

A workaround is to call the drop_encoding method on the dataset before saving it with to_netcdf. This removes the stale encoding information, so to_netcdf derives the output shapes from the actual data and the broadcasting error does not occur.
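
A minimal sketch of the workaround, continuing the example above:

# Drop the stale encoding before writing, so to_netcdf derives the
# on-disk shapes from the data itself.
ds_altered.drop_encoding().to_netcdf("dataset_altered.nc")  # no broadcasting error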

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.14 (main, Aug 24 2023, 14:01:46)
[GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.3.1-060301-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.1.1
pandas: 2.2.0
numpy: 1.26.3
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

@tarik added the bug and needs triage labels on Feb 2, 2024
welcome bot commented Feb 2, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@tarik changed the title from "Error while saving an altered Xarray dataset to NetCDF when loaded from a file" to "Error while saving an altered dataset to NetCDF when loaded from a file" on Feb 2, 2024
@kmuehlbauer removed the needs triage label on Feb 5, 2024
kmuehlbauer (Contributor) commented
Thanks for raising this. Please see #6323 for discussion on handling encoding in the future.

As you already found, drop_encoding is the way to go for your use case. This is also covered in the docs here: https://docs.xarray.dev/en/stable/user-guide/io.html#reading-encoded-data. Please let us know if and how the documentation could be improved to make this clearer.

tarik (Author) commented Feb 7, 2024

Thank you for pointing out the resources.

I like the idea of the keep_encoding parameter mentioned in #6323.

The documentation could benefit from being more explicit about the issue of invalid encoding and its consequences. Specifically, the section Writing encoded data could mention that some operations may lead to invalid encoding information that can cause errors when writing to a file, for example with to_netcdf(). It could be noted that in these instances, the encoding information can be corrected or removed entirely using drop_encoding(). Additionally, the documentation for functions related to reading and writing files (open_dataset(), to_netcdf(), etc.) could point to the Reading and writing files webpage.
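
For example, a minimal sketch of correcting the encoding by hand, assuming that removing the stale char_dim_name entries is sufficient in this case (calling drop_encoding() remains the simpler, general option):

# Sketch: remove only the stale character-dimension entry from each string
# variable, so the writer falls back to dimensions sized from the actual data.
for name in ("player_1", "player_2"):
    ds_altered[name].encoding.pop("char_dim_name", None)
ds_altered.to_netcdf("dataset_altered.nc")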

kmuehlbauer (Contributor) commented
@tarik Thanks for taking the time to share your thoughts.

Regarding the documentation changes, we always welcome contributions.
