PeriodIndex causes severe slow down #1255

max-sixty · 2017-02-08T16:01:16Z

I need some guidance on how to handle this.

Background

PeriodIndex has a 'non-numpy' dtype now:

In [2]: i = pd.PeriodIndex(start=2000, freq='A', periods=10)

In [3]: i
Out[3]:
PeriodIndex(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
             '2008', '2009'],
            dtype='period[A-DEC]', freq='A-DEC')

In [6]: i.dtype
Out[6]: period[A-DEC]

When .values or .__array__() are called, the Periods are boxed, which is really slow. The underlying ints are stored in ._values:

In [25]: i.values
Out[25]:
array([Period('2000', 'A-DEC'), Period('2001', 'A-DEC'),
       Period('2002', 'A-DEC'), Period('2003', 'A-DEC'),
       Period('2004', 'A-DEC'), Period('2005', 'A-DEC'),
       Period('2006', 'A-DEC'), Period('2007', 'A-DEC'),
       Period('2008', 'A-DEC'), Period('2009', 'A-DEC')], dtype=object)

In [27]: all(i.__array__()==i.values)
Out[27]: True

# underlying:
In [28]: i._values
Out[28]: array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
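The boxing described above can be reproduced with a short sketch. It assumes a recent pandas release, where `pd.period_range` replaces the `start=`/`periods=` constructor used in the session above and `.asi8` is the public accessor for the integer ordinals; neither spelling comes from the original post.

```python
import numpy as np
import pandas as pd

# Assumption: a recent pandas, where period_range builds the index.
idx = pd.period_range("2000-01-01", periods=10, freq="D")

# Passing the index to NumPy calls __array__ implicitly, which creates
# one Python-level Period object per element (the slow, boxed path).
boxed = np.asarray(idx)

# The integer ordinals are available without any per-element boxing.
ordinals = idx.asi8

print(boxed.dtype, type(boxed[0]).__name__)
print(ordinals.dtype, ordinals[:3])
```

Any code path that hands the index to a NumPy function takes the first route, whether or not it calls `.values` explicitly.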

Problem

In pandas, we avoid calling .values directly from outside Index and instead access Index functionality through a smaller API.

But in xarray, I think there are a fair few functions that call .values or implicitly call .__array__() by passing the index into numpy.

As a result, there is a severe slow down when using PeriodIndex. As an example:

In [51]: indexes = [pd.PeriodIndex(start=str((1776 + i)), freq='A', periods=300) for i in range(50)]
In [53]: das = [xr.DataArray(range(300), coords=[index]) for index in indexes]

In [54]: %timeit xr.concat(das)

# 1 loop, best of 3: 1.38 s per loop

vs DTI:

In [55]: indexes_dt = [pd.DatetimeIndex(start=str((1776 + i)), freq='A', periods=300) for i in range(50)]
In [56]: das_dt = [xr.DataArray(range(300), coords=[index]) for index in indexes_dt]
In [57]: %timeit xr.concat(das_dt)
# 10 loops, best of 3: 69.2 ms per loop

...a 20x slowdown, on fairly short indexes

@shoyer do you have any ideas on how to resolve this? Is it feasible to avoid passing Indexes directly into numpy? I haven't looked at this deeply enough to have a view, so I was hoping you could cut through the options. Thank you.

ref pandas-dev/pandas#14822
CC @sinhkrs @jreback


shoyer · 2017-02-08T17:02:38Z

The usual place to start is with profiling. Here's what %prun xr.concat(das) gets me:
https://gist.github.com/shoyer/97dff50d5892e3437d4864d93e85ead2

So indeed, each index is getting converted into a NumPy array. Profiling suggests Variable.equals is a likely culprit (0.863/1.159 seconds) and indeed we call .data there:

xarray/xarray/core/variable.py

Line 1000 in d49014d

equiv(self.data, other.data)))

The good news is that pandas already has its own vectorized .equals method for indexes. So we should either add a special case to the generic Variable.equals method for when ._data is a PandasIndexWrapper, or, perhaps better yet, add a subclass method IndexVariable.equals that always calls .equals on the indexes instead of calling ops.array_equiv on .data.
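A minimal sketch of the second option follows. The two classes are simplified, hypothetical stand-ins written for illustration, not xarray's actual `Variable` and `IndexVariable`, and `np.array_equal` stands in for `ops.array_equiv`. The only library behavior relied on is `pandas.Index.equals`.

```python
import numpy as np
import pandas as pd


class Variable:
    """Toy stand-in for a generic variable (hypothetical, not xarray's class)."""

    def __init__(self, dims, data):
        self.dims = tuple(dims)
        self._data = data

    @property
    def data(self):
        # For a PeriodIndex this boxes every element into a Period object.
        return np.asarray(self._data)

    def equals(self, other):
        try:
            return self.dims == other.dims and bool(
                np.array_equal(self.data, other.data)
            )
        except (TypeError, AttributeError):
            return False


class IndexVariable(Variable):
    """Toy variable whose _data is always a pandas Index."""

    def __init__(self, dims, data):
        super().__init__(dims, pd.Index(data))

    def equals(self, other):
        # Delegate to the vectorized Index.equals and never touch .data.
        try:
            return self.dims == other.dims and bool(
                self._data.equals(other._data)
            )
        except (TypeError, AttributeError):
            return False


a = IndexVariable(["time"], pd.period_range("2000-01", periods=300, freq="M"))
b = IndexVariable(["time"], pd.period_range("2000-01", periods=300, freq="M"))
c = IndexVariable(["time"], pd.period_range("2001-01", periods=300, freq="M"))

print(a.equals(b), a.equals(c))
```

The override returns the same answers as the generic method (`Variable.equals(a, b)` goes through the boxed arrays) but lets pandas compare the indexes directly.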

max-sixty mentioned this issue Feb 8, 2017

PERF: Override equals in IndexVariable #1256

Merged

4 tasks

max-sixty closed this as completed in #1256 Feb 9, 2017
