PeriodIndex causes severe slow down #1255

Closed
max-sixty opened this issue Feb 8, 2017 · 1 comment · Fixed by #1256

max-sixty commented Feb 8, 2017

I need some guidance on how to handle this.

Background

PeriodIndex has a 'non-numpy' dtype now:

In [2]: i = pd.PeriodIndex(start=2000, freq='A', periods=10)

In [3]: i
Out[3]:
PeriodIndex(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
             '2008', '2009'],
            dtype='period[A-DEC]', freq='A-DEC')

In [6]: i.dtype
Out[6]: period[A-DEC]

When .values or .__array__() is called, each element is boxed into a Period object, which is really slow. The underlying ints are stored in ._values:

In [25]: i.values
Out[25]:
array([Period('2000', 'A-DEC'), Period('2001', 'A-DEC'),
       Period('2002', 'A-DEC'), Period('2003', 'A-DEC'),
       Period('2004', 'A-DEC'), Period('2005', 'A-DEC'),
       Period('2006', 'A-DEC'), Period('2007', 'A-DEC'),
       Period('2008', 'A-DEC'), Period('2009', 'A-DEC')], dtype=object)

In [27]: all(i.__array__()==i.values)
Out[27]: True

# underlying:
In [28]: i._values
Out[28]: array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
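For reference, the two representations described above can be compared side by side. This is only a minimal sketch: it assumes a recent pandas, uses pd.period_range and the public .asi8 accessor rather than the constructor call and private ._values attribute from the session above, and picks a daily frequency purely for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: pd.period_range stands in for the constructor call used above;
# the daily frequency is an arbitrary choice for illustration.
idx = pd.period_range(start="2000-01-01", periods=10, freq="D")

# Passing the index to numpy boxes every element into a Period object.
boxed = np.asarray(idx)

# .asi8 exposes the underlying integer ordinals without creating Period objects.
ordinals = idx.asi8

print(boxed.dtype)              # object
print(type(boxed[0]).__name__)  # Period
print(ordinals.dtype)           # int64
```

The object array holds one Python object per element, which is why any code path that converts the index this way pays a per-element cost.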

Problem

In pandas, we avoid calling .values directly from outside Index, and instead access Index functionality through a smaller API.

But in xarray, I think there are a fair few functions that call .values or implicitly call .__array__() by passing the index into numpy.

As a result, there is a severe slowdown when using PeriodIndex. As an example:

In [51]: indexes = [pd.PeriodIndex(start=str((1776 + i)), freq='A', periods=300) for i in range(50)]
In [53]: das = [xr.DataArray(range(300), coords=[index]) for index in indexes]

In [54]: %timeit xr.concat(das)

# 1 loop, best of 3: 1.38 s per loop

vs DTI:

In [55]: indexes_dt = [pd.DatetimeIndex(start=str((1776 + i)), freq='A', periods=300) for i in range(50)]
In [56]: das_dt = [xr.DataArray(range(300), coords=[index]) for index in indexes_dt]
In [57]: %timeit xr.concat(das_dt)
# 10 loops, best of 3: 69.2 ms per loop

...a 20x slowdown, on fairly short indexes
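To isolate the conversion step from the rest of xr.concat, something like the following could be timed. It is a rough sketch, not a reproduction of the benchmark above: it assumes a recent pandas, builds the indexes with pd.period_range at daily frequency, and the helper names box_all and ordinals_all are made up for illustration. Timings will vary by machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# 50 indexes of 300 periods each, mirroring the sizes used above.
# Assumption: daily frequency and pd.period_range are illustrative choices.
indexes = [
    pd.period_range(start=f"{1776 + i}-01-01", periods=300, freq="D")
    for i in range(50)
]


def box_all():
    """Convert every index to an object array of Period instances."""
    return [np.asarray(ix) for ix in indexes]


def ordinals_all():
    """Fetch the underlying int64 ordinals for every index."""
    return [ix.asi8 for ix in indexes]


t_boxed = timeit.timeit(box_all, number=5)
t_ints = timeit.timeit(ordinals_all, number=5)
print(f"boxed: {t_boxed:.4f}s, ordinals: {t_ints:.4f}s")
```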

@shoyer do you have any ideas of how to resolve this? Is it feasible to not pass Indexes directly into numpy? I haven't dug into the code in enough depth to have a view, so I was hoping you could cut through the options. Thank you.

ref pandas-dev/pandas#14822
CC @sinhrks @jreback


shoyer commented Feb 8, 2017

The usual place to start is with profiling. Here's what %prun xr.concat(das) gets me:
https://gist.github.com/shoyer/97dff50d5892e3437d4864d93e85ead2

So indeed, each index is getting converted into a NumPy array. Profiling suggests Variable.equals is a likely culprit (0.863/1.159 seconds) and indeed we call .data there:

equiv(self.data, other.data)))

The good news is that pandas already has its own vectorized .equals method for indexes. So we should either add a special case to the generic Variable.equals method for when ._data is a PandasIndexWrapper, or perhaps better yet add a subclass method IndexVariable.equals that always calls .equals on the indexes instead of calling ops.array_equiv on .data.
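The difference between the two comparison strategies can be sketched with pandas alone. This is not xarray's code: equals_via_numpy and equals_via_index are hypothetical helpers, and the first only approximates an elementwise array comparison on the converted data.

```python
import numpy as np
import pandas as pd


def equals_via_numpy(a, b):
    """Hypothetical helper: compare after converting to numpy arrays.

    For a PeriodIndex this boxes every element into a Period object first.
    """
    a_arr, b_arr = np.asarray(a), np.asarray(b)
    return a_arr.shape == b_arr.shape and bool((a_arr == b_arr).all())


def equals_via_index(a, b):
    """Hypothetical helper: delegate to the index's own .equals method."""
    return a.equals(b)


# Assumption: daily frequency and pd.period_range are illustrative choices.
left = pd.period_range(start="2000-01-01", periods=300, freq="D")
same = pd.period_range(start="2000-01-01", periods=300, freq="D")
shifted = pd.period_range(start="2000-01-02", periods=300, freq="D")

print(equals_via_numpy(left, same), equals_via_index(left, same))
print(equals_via_numpy(left, shifted), equals_via_index(left, shifted))
```

Both helpers agree on the result for these inputs; the second simply never materializes the object array.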
