Description
What happened:
Calling DataArray.unstack() spends time allocating an object-dtype NumPy array from values of the pandas MultiIndex.
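To illustrate the allocation described above, here is a small sketch (much smaller than the real case, and using plain pandas/NumPy rather than xarray) of what converting a MultiIndex to a NumPy array produces. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# A tiny MultiIndex with the same level names as in the report.
small_idx = pd.MultiIndex.from_product(
    [pd.date_range("2000", periods=2), np.arange(3), np.arange(3)],
    names=["time", "y", "x"],
)

# Converting to a NumPy array builds one Python tuple per row and
# stores them in an object-dtype array.
values = np.asarray(small_idx)
print(values.dtype)
print(values.shape)
print(type(values[0]))
```

With 2 x 1000 x 1000 rows, as in the example in this report, that means two million tuples, which is where the time and memory would go.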
What you expected to happen:
Faster unstack.
Minimal Complete Verifiable Example:
import pandas as pd
import numpy as np
import xarray as xr
t = pd.date_range("2000", periods=2)
x = np.arange(1000)
y = np.arange(1000)
component = np.arange(4)
idx = pd.MultiIndex.from_product([t, y, x], names=["time", "y", "x"])
data = np.random.uniform(size=(len(idx), len(component)))
arr = xr.DataArray(
    data,
    coords={
        "pixel": xr.DataArray(idx, name="pixel", dims="pixel"),
        "component": xr.DataArray(component, name="component", dims="component"),
    },
    dims=("pixel", "component"),
)
%time _ = arr.unstack()
CPU times: user 6.33 s, sys: 295 ms, total: 6.62 s
Wall time: 6.62 s
Anything else we need to know?:
For this example, >99% of the time is spent on this line (line 4162 in df76461), in the access to v.data for the pixel array, which is a pandas MultiIndex.
Just going by the comments, it does seem like accessing v.data is necessary to perform the check. I wonder if we could make is_duck_dask_array a bit smarter, to avoid unnecessarily allocating data?
Alternatively, if that's too difficult, perhaps we could add a flag to unstack to disable those checks and just take the "slow" path. In my actual use case, the slow _unstack_full_reindex is necessary since I have large Dask arrays. But even then, the unstack completes in less than 3 s, while I was getting OOM-killed on the v.data checks.
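As a rough sketch of the "smarter check" idea: the question "is this a Dask array?" can in principle be answered from the type of the underlying object, before anything is converted to NumPy. The helper below is hypothetical (`might_be_dask` is not an xarray function, and this is not how is_duck_dask_array is implemented). It assumes the check can be handed the wrapped object, here the pandas index itself, rather than the materialized values.

```python
import numpy as np
import pandas as pd


def might_be_dask(raw):
    """Hypothetical helper: decide from the type alone whether `raw`
    could be a Dask array, without converting it to a NumPy array."""
    # pandas indexes and NumPy arrays are never Dask arrays, so there
    # is no need to look at their values.
    if isinstance(raw, (pd.Index, np.ndarray)):
        return False
    # Assumption for this sketch: Dask array types live in the "dask" package.
    return type(raw).__module__.split(".")[0] == "dask"


big_idx = pd.MultiIndex.from_product(
    [np.arange(1000), np.arange(1000)], names=["y", "x"]
)

# Neither call materializes the one million index tuples.
print(might_be_dask(big_idx))
print(might_be_dask(np.zeros(3)))
```

Something along these lines would let the MultiIndex-backed pixel variable be skipped cheaply. Whether it fits how xarray stores variable data internally is a question for the maintainers.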
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.12 | packaged by conda-forge | (default, Sep 29 2021, 19:52:28)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-1040-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.19.0
pandas: 1.3.3
numpy: 1.20.0
scipy: 1.7.1
netCDF4: 1.5.7
pydap: installed
h5netcdf: 0.11.0
h5py: 3.4.0
Nio: None
zarr: 2.10.1
cftime: 1.5.1
nc_time_axis: 1.3.1
PseudoNetCDF: None
rasterio: 1.2.9
cfgrib: 0.9.9.0
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.3
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
pint: 0.17
setuptools: 58.0.4
pip: 20.3.4
conda: None
pytest: None
IPython: 7.28.0
sphinx: None