Stack + to_array before to_xarray is much faster than a simple to_xarray #2459
Here are the top entries I see when profiling. There seems to be a suspiciously large amount of effort applying a function to individual datetime objects.

When I stepped through, it was by and large all taken up by https://github.com/pydata/xarray/blob/master/xarray/core/dataset.py#L3121. That's where the boxing & unboxing of the datetimes comes from. I haven't yet discovered how the alternative path avoids this work. If anyone has priors please lmk!
It's 3x faster to unstack & stack all-but-one level, vs reindexing over a filled-out index (and, I think, it always produces the same result). Our current code takes the slow path. I could make that change, but it strongly feels like I don't understand the root cause. I haven't spent much time with the reshaping code - lmk if anyone has ideas.

```python
idx = cropped.index
full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
reindexed = cropped.reindex(full_idx)

%timeit reindexed = cropped.reindex(full_idx)
# 1 loop, best of 3: 278 ms per loop

%%timeit
stack_unstack = (
    cropped
    .unstack(list('yz'))
    .stack(list('yz'), dropna=False)
)
# 10 loops, best of 3: 80.8 ms per loop

stack_unstack.equals(reindexed)
# True
```
My working hypothesis is that pandas has a set of fast routines in C, such that it can stack without reindexing to the full index. Those routines only work in 1-2 dimensions, so without some hackery (i.e. converting multi-dimensional arrays to pandas' size and back), the current implementation is reasonable*. The next step would be to write our own routines that can operate on multiple dimensions (numbagg!). Is that consistent with others' views, particularly those who know this area well?

* one small fix that would improve performance of
The vast majority of the time in xarray's current implementation seems to be spent in a single call. See these results from line-profiler:
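The line-profiler output itself wasn't preserved in this copy of the thread. For reproducing this kind of measurement, a minimal sketch using the `line_profiler` IPython extension; the choice of `Dataset.from_dataframe` as the function to profile is an assumption (it is the routine `to_xarray()` conversions go through), and `cropped` is the series from the earlier comments:

```python
# In IPython/Jupyter, with line_profiler installed (pip install line_profiler)
import xarray as xr

%load_ext line_profiler

# Time each line of from_dataframe while running the slow conversion
%lprun -f xr.Dataset.from_dataframe cropped.to_xarray()
```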
@max-sixty nevermind, you seem to have already discovered that :)
I've run into this twice. This time I'm seeing a difference of very roughly 100x or more just using a transpose -- I can't test or time it properly right now, but this is what it looks like:
@tqfjo unrelated. You're comparing the creation of a dataset with 2 variables with the creation of one with 3000. Unsurprisingly, the latter takes ~1500x as long. If your dataset doesn't functionally contain 3000 variables but just a single two-dimensional variable, use
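The reference at the end of that comment was stripped during extraction. One plausible reading, sketched here as an assumption rather than the commenter's exact suggestion, is to build a single 2-D `DataArray` straight from the frame:

```python
import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame(np.random.randn(200, 3000))

# One 2-D variable (dims taken from the index and columns),
# instead of a Dataset holding 3000 separate 1-D variables.
da = xr.DataArray(df)
```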
@crusaderky Thanks for the pointer. That said, if it helps anyone to know, I did just want a
I know this is not a recent thread, but I found no resolution, and we ran into the same issue recently. In our case we had a pandas series of roughly 15 million entries, with a 3-level MultiIndex, which had to be converted to an xarray.DataArray. The .to_xarray() took almost 2 minutes. Unstack + to_array took it down to roughly 3 seconds, provided the last level of the MultiIndex was unstacked. However, a much faster solution was to go through a numpy array. The code below is based on the idea of Igor Raush (in this case df is a dataframe with a single column, or a series).
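The code block itself didn't survive in this copy of the thread. A minimal sketch of the numpy-based approach described (the function name is hypothetical; it assumes a series with a unique MultiIndex, and scatters values into a dense NaN-filled array via the index codes):

```python
import numpy as np
import pandas as pd
import xarray as xr

def to_dataarray_via_numpy(s: pd.Series) -> xr.DataArray:
    idx = s.index  # a pandas MultiIndex
    # One axis per index level; combinations absent from the index stay NaN.
    arr = np.full(tuple(len(level) for level in idx.levels), np.nan)
    # idx.codes holds, per level, the integer position of each row within
    # that level; used as a tuple of fancy indices, it scatters the values
    # into place in a single vectorized assignment.
    arr[tuple(idx.codes)] = s.to_numpy()
    return xr.DataArray(
        arr,
        coords={name: level.to_numpy() for name, level in zip(idx.names, idx.levels)},
        dims=idx.names,
    )
```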
Hi All. I stumbled across the same issue trying to convert a 5000-column dataframe to xarray (it was never going to happen...).

```python
import xarray as xr
import pandas as pd
import numpy as np

xr.__version__
# '0.15.1'
pd.__version__
# '1.0.5'

df = pd.DataFrame(np.random.randn(200, 500))

%%time
one = df.to_xarray()
# CPU times: user 29.6 s, sys: 60.4 ms, total: 29.6 s
# Wall time: 29.7 s

%%time
dic = {}
for name in df.columns:
    dic.update({name: (['index'], df[name].values)})
two = xr.Dataset(dic, coords={'index': ('index', df.index.values)})
# CPU times: user 17.6 ms, sys: 158 µs, total: 17.8 ms
# Wall time: 17.8 ms

one.equals(two)
# True
```
Fixes GH-2459

Before:

```
pandas.MultiIndexSeries.time_to_xarray

======= ========== ==========
--             subset
------- ---------------------
 dtype     True      False
======= ========== ==========
  int    505±0ms    37.1±0ms
 float   485±0ms    38.3±0ms
======= ========== ==========
```

After:

```
pandas.MultiIndexSeries.time_to_xarray

======= ========== ==========
--             subset
------- ---------------------
 dtype     True      False
======= ========== ==========
  int    11.5±0ms   39.2±0ms
 float   12.5±0ms   26.6±0ms
======= ========== ==========
```

There are still some cases where we have to fall back to the existing slow implementation, but hopefully they should now be relatively rare.
Thanks for sharing! This is a great tip indeed. I've reimplemented
Very good news!
* Add MultiIndexSeries.time_to_xarray() benchmark
* Improve the speed of from_dataframe with a MultiIndex (fixes GH-2459; benchmark results as above). There are still some cases where we have to fall back to the existing slow implementation, but hopefully they should now be relatively rare.
* remove unused import
* Simplify converting MultiIndex dataframes
* remove comments
* remove types with NA
* more multiindex dataframe tests
* add whats new note
* Preserve order of MultiIndex levels in from_dataframe
* Add todo note
* Rewrite from_dataframe to avoid passing around a dataframe
* Require that MultiIndexes are unique even with sparse=True
* clarify comment
I was seeing some slow performance around `to_xarray()` on MultiIndexed series, and found that unstacking one of the dimensions before running `to_xarray()`, and then restacking with `to_array()`, was ~30x faster. This time difference is consistent with larger data sizes.

To reproduce:

Create a series with a MultiIndex, ensuring the MultiIndex isn't a simple product:
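The original setup snippet wasn't preserved in this copy; here is a minimal sketch that matches the level names (`x`, `y`, `z`) and the `cropped` variable used later in the thread. The sizes and the method of thinning the index are assumptions:

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    'x': np.random.randint(0, 100, n),
    'y': np.random.randint(0, 100, n),
    'z': np.random.randint(0, 100, n),
    'value': np.random.randn(n),
})
# Deduplicate so the MultiIndex is unique; since most (x, y, z)
# combinations never occur, the index is not a full product of its levels.
cropped = (
    df.drop_duplicates(['x', 'y', 'z'])
    .set_index(['x', 'y', 'z'])['value']
)
```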
Two approaches for getting this into xarray:

1 - Simple `.to_xarray()`: this takes 536 ms.

2 - Unstack in pandas first, and then use `to_array` to do the equivalent of a restack: this takes 17.3 ms.

To confirm these are identical:
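The snippets for the two approaches and the equality check were also lost; a sketch under the assumption of the `cropped` series above, unstacking the `z` level (the transpose is needed because `to_array` puts the new dimension first):

```python
import xarray as xr

# 1 - simple conversion; reindexes over the full index product internally
direct = cropped.to_xarray()

# 2 - unstack one level in pandas, convert, then restack with to_array
via_unstack = (
    cropped.unstack('z')       # DataFrame: (x, y) rows, one column per z
    .to_xarray()               # Dataset with one variable per z value
    .to_array(dim='z')         # restack the variables into a 'z' dimension
    .transpose('x', 'y', 'z')  # match the dim order of the direct result
)

print(direct.equals(via_unstack))  # True
```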
Problem description
A default operation is much slower than a (potentially) equivalent operation that's not the default.
I need to look more at what's causing the issues. I think it's to do with the `.reindex(full_idx)`, but I'm unclear why it's so much faster via the alternative route, and whether there's a fix we can make to make the default path fast.

Output of `xr.show_versions()`:
xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.1
IPython: 5.8.0
sphinx: None