pct_change with freq on groupby broken #11811

bobobo1618 · 2015-12-10T06:27:05Z

>>> pd.__version__
u'0.17.1'
# Column 1 is a date column, column 0 is a text column representing a stock symbol, column 12 is a price
>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[0, 1], usecols=[0, 1, 12])
# Get monthly change in value for each stock symbol
>>> rawdat.groupby(level=0).pct_change(freq='M')
<Empty>
>>> rawdat.groupby(level=0).pct_change()
<Lots of data>
>>> rawdat.loc[['AAPL']].reset_index(level=0, drop=True).pct_change(freq='M')
<Lots of data>
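
For comparison, the per-symbol call in the last line can be tried without the original file. A minimal sketch, assuming made-up month-end prices for two hypothetical symbols (AAA and BBB), a hypothetical Price column, and pd.offsets.MonthEnd() in place of the 'M' alias, whose spelling differs between pandas releases:

```python
import pandas as pd

# Made-up month-end prices for two hypothetical symbols (not the licensed data).
dates = pd.to_datetime(["2015-01-31", "2015-02-28", "2015-03-31"])
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["Symbol", "Date"])
rawdat = pd.DataFrame({"Price": [100.0, 110.0, 121.0, 50.0, 40.0, 44.0]}, index=index)

def monthly_change(frame, symbol):
    """Select one symbol, drop the symbol level, and compute month-over-month change."""
    single = frame.xs(symbol, level="Symbol")  # leaves a plain DatetimeIndex
    return single.pct_change(freq=pd.offsets.MonthEnd())

# One result per symbol, computed outside of groupby.
changes = {symbol: monthly_change(rawdat, symbol) for symbol in ["AAA", "BBB"]}
print(changes["AAA"])
print(changes["BBB"])
```

This only mirrors the call that worked in the report; it does not exercise groupby(level=0).pct_change(freq=...) itself.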

bobobo1618 · 2015-12-10T06:51:15Z

Seems resample with a custom how is broken too?

>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how='sum')
<Lots of data>
>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Empty>

sinhrks · 2015-12-10T11:00:29Z

Thanks for the report. Can you attach the content of data.csv in a small, copy-pastable format?

bobobo1618 · 2015-12-10T16:44:17Z

Unfortunately, I'm unable to provide the data itself because it's licensed (and several GB).

When trying to give you a subset, though, I discovered that the bug was only reproducible when I read more than 1032393 rows (an odd but specific number) from the CSV. If I read that many or fewer, the functions above worked perfectly. If I read more, they all broke completely:

>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[1], usecols=[0, 1, 12], nrows=1032393)
>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Data>
>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[1], usecols=[0, 1, 12], nrows=1032394)
>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Empty>

Since that number was so specific, I looked into the data and saw that at that row, there was a gap of a few months in that symbol's data:

2014-06-27  AMBCW   17.490000
2014-09-24  AMBCW   13.500000

I skipped over that with skiprows and found a similar break at:

2004-12-16  AMBI    0.3000
2005-02-28  AMBI    0.3000

So I suspect that the problem is really the gap in the data.
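
One way to look for such gaps is to measure the spacing between consecutive dates within each symbol. A minimal sketch, assuming hypothetical column names (Date, Symbol, Price) and an arbitrary 45-day threshold; the AMBCW rows are the two quoted above, while the AAPL rows are placeholders invented for illustration:

```python
import pandas as pd

# AMBCW rows are quoted from the thread; AAPL rows are placeholder values.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2014-06-26", "2014-06-27", "2014-06-27", "2014-09-24"]
        ),
        "Symbol": ["AAPL", "AAPL", "AMBCW", "AMBCW"],
        "Price": [100.0, 101.0, 17.49, 13.50],
    }
)

def find_gaps(frame, threshold="45D"):
    """Return rows dated more than `threshold` after the previous row of the same symbol."""
    frame = frame.sort_values(["Symbol", "Date"])
    # Time elapsed since the previous row of the same symbol (NaT for the first row).
    spacing = frame.groupby("Symbol")["Date"].diff()
    return frame.assign(Gap=spacing)[spacing > pd.Timedelta(threshold)]

gaps = find_gaps(df)
print(gaps)
```

The threshold is a judgment call: anything longer than the resampling period (one month here) leaves at least one bin with no rows for that symbol.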

Separately, I found that grouping by MultiIndex levels didn't work:

# Column 0 is the stock symbol.
>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[0, 1], usecols=[0, 1, 12], nrows=1000)
>>> rawdat.reset_index(level=0).groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Lots of data>
>>> rawdat.groupby(level=0).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Empty>

I can share 1000 rows with you so here's all you need to reproduce the multiindex problem.

jreback · 2015-12-12T14:26:27Z

@bobobo1618 the last one might be a bug. Can you put together a reproducible, copy-pastable example?

bobobo1618 · 2015-12-12T19:36:40Z

Replacing my function with this caused it to work:

def month_change_resample(arraylike):
    if len(arraylike) == 0:
        return 0
    return (arraylike[-1]/arraylike[0]) - 1.0

It seems some of the arraylike objects were of length zero, which would have caused an IndexError during the execution. Is outputting nothing expected behaviour in that situation?
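
The zero-length case can be checked in isolation, outside of resample. A minimal sketch, assuming the function receives a pandas Series; the name month_change, the use of .iloc for positional access, and the NaN return value for an empty window are choices made for this illustration, not the code used in the thread:

```python
import numpy as np
import pandas as pd

def month_change(window):
    """Relative change from the first to the last value of a window."""
    # An empty window has no first or last element; window.iloc[-1] would raise IndexError.
    if len(window) == 0:
        return np.nan
    return window.iloc[-1] / window.iloc[0] - 1.0

empty = pd.Series([], dtype="float64")
prices = pd.Series([10.0, 11.0, 12.5])

print(month_change(empty))   # guarded: no exception for the empty window
print(month_change(prices))  # change from 10.0 to 12.5
```

Returning NaN instead of 0 keeps an empty month distinguishable from a month in which the price did not move.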

bobobo1618 · 2015-12-12T22:55:14Z

Ah, sorry, no, it didn't: replacing the function does not fix it.

Copy-pastable example:

curl http://ix.io/mJE > data.csv
cat << EOF > test.py
import pandas as pd
rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[0, 1], usecols=[0, 1, 12], nrows=1000)
def month_change_resample(arraylike):
    if len(arraylike) == 0:
        return 0
    return (arraylike[-1]/arraylike[0]) - 1.0
print("Level result: ")
print(rawdat.groupby(level=0).resample('1M', label='right', how=month_change_resample))
print("No level result:")
print(rawdat.reset_index(level=0).groupby(['Symbol']).resample('1M', label='right', how=month_change_resample))
EOF
python test.py

jreback · 2015-12-13T20:31:43Z

@bobobo1618 a copy-pastable example is one that I can actually copy and paste.

In other words, no files should be involved, and the DataFrame should be much shorter.
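
A file-free example along these lines could build a short frame in memory and compare the two groupings directly. A minimal sketch, assuming made-up prices for two hypothetical symbols (AAA and BBB, the latter with a two-month gap) and a hypothetical Price column; because later pandas releases dropped the how= argument of resample, it groups by calendar month with to_period('M') instead, so it illustrates the intended comparison rather than reproducing the original calls:

```python
import numpy as np
import pandas as pd

# Made-up data: BBB has no rows in February or March.
rows = [
    ("AAA", "2015-01-05", 10.0), ("AAA", "2015-01-30", 11.0),
    ("AAA", "2015-02-02", 11.0), ("AAA", "2015-02-27", 12.1),
    ("BBB", "2015-01-05", 20.0), ("BBB", "2015-01-30", 18.0),
    ("BBB", "2015-04-01", 18.0), ("BBB", "2015-04-30", 27.0),
]
rawdat = pd.DataFrame(rows, columns=["Symbol", "Date", "Price"])
rawdat["Date"] = pd.to_datetime(rawdat["Date"])
rawdat = rawdat.set_index(["Symbol", "Date"])

def month_change(window):
    """Relative change from the first to the last value; NaN for an empty window."""
    if len(window) == 0:
        return np.nan
    return window.iloc[-1] / window.iloc[0] - 1.0

# Grouping driven by the MultiIndex levels.
symbols = rawdat.index.get_level_values("Symbol")
months = rawdat.index.get_level_values("Date").to_period("M")
by_level = rawdat["Price"].groupby([symbols, months]).agg(month_change)

# Grouping driven by a Symbol column, with Date left as the index.
flat = rawdat.reset_index(level=0)
by_column = flat.groupby(["Symbol", flat.index.to_period("M")])["Price"].agg(month_change)

print(by_level)
print(by_level.equals(by_column))
```

If the two results differ on a given pandas version, this frame is small enough to paste into a report as is.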

jreback · 2018-05-30T11:11:42Z

Not reproducible, and likely the same issue as #21200.

sinhrks added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and Frequency (DateOffsets) labels on Dec 10, 2015

jreback added the MultiIndex and Resample (resample method) labels on Dec 12, 2015

jreback mentioned this issue on May 25, 2018, in: BUG: groupby.pct_change() does not work properly in Pandas 0.23.0. Grouping is ignored. #21200 (closed)

jreback closed this as completed on May 30, 2018
