Slow performance of DataArray.unstack() from checking variable.data #5902

Closed
@TomAugspurger

What happened:

Calling DataArray.unstack() spends time allocating an object-dtype NumPy array from values of the pandas MultiIndex.

What you expected to happen:

Faster unstack.

Minimal Complete Verifiable Example:

import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range("2000", periods=2)
x = np.arange(1000)
y = np.arange(1000)
component = np.arange(4)

idx = pd.MultiIndex.from_product([t, y, x], names=["time", "y", "x"])

data = np.random.uniform(size=(len(idx), len(component)))
arr = xr.DataArray(
    data,
    coords={"pixel": xr.DataArray(idx, name="pixel", dims="pixel"),
            "component": xr.DataArray(component, name="component", dims="component")},
    dims=("pixel", "component")
)

%time _ = arr.unstack()
CPU times: user 6.33 s, sys: 295 ms, total: 6.62 s
Wall time: 6.62 s

Anything else we need to know?:

For this example, more than 99% of the time is spent on this line:

any(is_duck_dask_array(v.data) for v in self.variables.values())

Specifically, the time goes to the call to v.data for the pixel array, which is a pandas MultiIndex.
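
To illustrate where the cost comes from, here is a minimal sketch (not taken from xarray's code) of what materializing a MultiIndex as a NumPy array produces. The sizes are deliberately much smaller than in the example above so that it runs quickly, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A much smaller index than in the report, so the sketch runs quickly.
idx = pd.MultiIndex.from_product(
    [np.arange(2), np.arange(50), np.arange(50)], names=["time", "y", "x"]
)

# Converting the MultiIndex to an array builds one Python tuple per row
# and stores them in an object-dtype array.
values = np.asarray(idx)
print(values.dtype, values.shape, type(values[0]))

# For comparison, the index itself stores one integer code array per level.
codes_bytes = sum(code.nbytes for code in idx.codes)
print("bytes held by codes:", codes_bytes)
print("bytes held by object array (pointers only):", values.nbytes)
```

The object array holds only pointers, so its nbytes does not include the tuples themselves; the real footprint is larger, which is consistent with the memory pressure described here.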

Just going by the comments, it does seem like accessing v.data is necessary to perform the check. I wonder if we could make is_duck_dask_array a bit smarter, so that it avoids unnecessarily allocating data?
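
One possible shape for such a check is sketched below. It is only an illustration under stated assumptions, not xarray's implementation: `any_dask_backed` and `_is_dask_array` are hypothetical helpers, and the sketch assumes that index variables (xr.IndexVariable) always wrap an in-memory pandas index and so can be skipped without touching their .data.

```python
import numpy as np
import xarray as xr


def _is_dask_array(obj):
    """Hypothetical helper: True if obj is a dask array (False if dask is absent)."""
    try:
        import dask.array as da
    except ImportError:
        return False
    return isinstance(obj, da.Array)


def any_dask_backed(ds):
    """Hypothetical check that never reads .data on index variables.

    Assumption: an IndexVariable is backed by a pandas index held in memory,
    so it cannot be a dask array and does not need to be inspected.
    """
    return any(
        _is_dask_array(v.data)
        for v in ds.variables.values()
        if not isinstance(v, xr.IndexVariable)
    )


# Small usage example with a NumPy-backed variable and one index coordinate.
ds = xr.Dataset({"a": ("x", np.arange(3))}, coords={"x": [10, 20, 30]})
print(any_dask_backed(ds))
```

The design choice here is to filter by variable type before touching .data, so the expensive conversion of the index is never triggered by the check itself.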

Alternatively, if that's too difficult, perhaps we could add a flag to unstack to disable those checks and just take the "slow" path. In my actual use-case, the slow _unstack_full_reindex is necessary since I have large Dask Arrays. But even then, the unstack completes in less than 3s, while I was getting OOM killed on the v.data checks.

Environment:

Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.12 | packaged by conda-forge | (default, Sep 29 2021, 19:52:28) 
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-1040-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.19.0
pandas: 1.3.3
numpy: 1.20.0
scipy: 1.7.1
netCDF4: 1.5.7
pydap: installed
h5netcdf: 0.11.0
h5py: 3.4.0
Nio: None
zarr: 2.10.1
cftime: 1.5.1
nc_time_axis: 1.3.1
PseudoNetCDF: None
rasterio: 1.2.9
cfgrib: 0.9.9.0
iris: None
bottleneck: 1.3.2
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.4.3
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
pint: 0.17
setuptools: 58.0.4
pip: 20.3.4
conda: None
pytest: None
IPython: 7.28.0
sphinx: None
