In pandas, we limit directly calling .values from outside Index, instead accessing Index functions through a smaller API.
But in xarray, I think there are a fair few functions that call .values or implicitly call .__array__() by passing the index into numpy.
As a result, there is a severe slowdown when using PeriodIndex. As an example:
In [51]: indexes = [pd.PeriodIndex(start=str(1776 + i), freq='A', periods=300) for i in range(50)]
In [53]: das = [xr.DataArray(range(300), coords=[index]) for index in indexes]
In [54]: %timeit xr.concat(das)
# 1 loop, best of 3: 1.38 s per loop
vs DTI:
In [55]: indexes_dt = [pd.DatetimeIndex(start=str(1776 + i), freq='A', periods=300) for i in range(50)]
In [56]: das_dt = [xr.DataArray(range(300), coords=[index]) for index in indexes_dt]
In [57]: %timeit xr.concat(das_dt)
# 10 loops, best of 3: 69.2 ms per loop
...a 20x slowdown, on fairly short indexes
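To isolate the conversion cost from the rest of `xr.concat`, here is a rough pandas-only sketch that times `np.asarray` on each index type. It is not the benchmark above: it assumes a recent pandas where `pd.period_range` and `pd.date_range` replace the `start=`/`periods=` constructors, it uses daily frequency only because that alias is stable across versions, and the helper name `time_asarray` is made up for illustration. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: daily frequency stands in for the annual frequency used above.
periods = pd.period_range("2000-01-01", periods=300, freq="D")
datetimes = pd.date_range("2000-01-01", periods=300, freq="D")


def time_asarray(index, number=200):
    """Hypothetical helper: seconds spent on `number` np.asarray(index) calls."""
    return timeit.timeit(lambda: np.asarray(index), number=number)


period_seconds = time_asarray(periods)
datetime_seconds = time_asarray(datetimes)

# No expected timings are given here; only the relative gap is of interest.
print(f"PeriodIndex:   {period_seconds:.4f} s")
print(f"DatetimeIndex: {datetime_seconds:.4f} s")
```

If the conversion is the bottleneck, the PeriodIndex figure should be noticeably larger, since every element is boxed into a `Period` object.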
@shoyer do you have any ideas of how to resolve this? Is it feasible to not pass Indexes directly into numpy? I haven't gone through this in enough depth to have a view, so I was hoping you could cut through the options. Thank you.
So indeed, each index is getting converted into a NumPy array. Profiling suggests Variable.equals is a likely culprit (0.863/1.159 seconds) and indeed we call .data there.
The good news is that pandas already has its own vectorized .equals method for indexes. So we should either add a special case to the generic Variable.equals method for when ._data is a PandasIndexWrapper, or perhaps better yet add a subclass method IndexVariable.equals that always calls .equals on the indexes instead of calling ops.array_equiv on .data.
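A minimal sketch of that idea, outside xarray: when both sides are pandas indexes, defer to `Index.equals`; otherwise fall back to comparing NumPy arrays. The function name `index_aware_equals` is hypothetical, and `np.array_equal` is only a stand-in for `ops.array_equiv` (it does not reproduce any NaN handling that function may have). This is not xarray's actual implementation.

```python
import numpy as np
import pandas as pd


def index_aware_equals(left, right):
    """Hypothetical helper: compare two 1-D coordinate-like objects.

    pandas indexes are compared with the vectorized Index.equals, which
    avoids converting a PeriodIndex into an object array. Anything else
    is compared as NumPy arrays.
    """
    if isinstance(left, pd.Index) and isinstance(right, pd.Index):
        return bool(left.equals(right))
    # Stand-in for the generic array comparison path.
    return bool(np.array_equal(np.asarray(left), np.asarray(right)))


a = pd.period_range("2000-01-01", periods=300, freq="D")
b = pd.period_range("2000-01-01", periods=300, freq="D")
c = pd.period_range("2000-01-02", periods=300, freq="D")

print(index_aware_equals(a, b))  # same periods
print(index_aware_equals(a, c))  # shifted by one day
```

The subclass approach has the same shape: the index-specific method only needs the first branch, and the generic method keeps the array path.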
I need some guidance on how to handle this.
Background
PeriodIndex has a 'non-numpy' dtype now. When .values or .__array__() are called, the Periods are boxed, which is really slow. The underlying ints are stored in ._values.
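The boxing can be observed directly. The sketch below assumes a recent pandas, where the integer ordinals are exposed through the public `asi8` attribute; the private attribute mentioned above may hold something different depending on the pandas version, so it is not used here.

```python
import numpy as np
import pandas as pd

index = pd.period_range("2000-01-01", periods=3, freq="D")

# The index dtype is a pandas extension dtype rather than a NumPy dtype.
print(index.dtype)

# np.asarray goes through index.__array__(), which boxes each element.
boxed = np.asarray(index)
print(boxed.dtype)      # object dtype
print(type(boxed[0]))   # a pandas Period object

# The underlying integer ordinals, with no boxing involved.
ordinals = index.asi8
print(ordinals.dtype)   # int64
```

Code that only needs to compare, sort, or concatenate periods can work on the integer ordinals (together with the frequency) and skip the object array entirely.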
ref pandas-dev/pandas#14822
CC @sinhkrs @jreback