pct_change with freq on groupby broken #11811

Closed
bobobo1618 opened this issue Dec 10, 2015 · 8 comments
Labels
Frequency DateOffsets MultiIndex Numeric Operations Arithmetic, Comparison, and Logical operations Resample resample method

Comments

@bobobo1618

>>> pd.__version__
u'0.17.1'
# Column 1 is a date column, column 0 is a text column representing a stock symbol, column 12 is a price
>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[0, 1], usecols=[0, 1, 12])
# Get monthly change in value for each stock symbol
>>> rawdat.groupby(level=0).pct_change(freq='M')
<Empty>
>>> rawdat.groupby(level=0).pct_change()
<Lots of data>
>>> rawdat.loc[['AAPL']].reset_index(level=0, drop=True).pct_change(freq='M')
<Lots of data>
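
For comparison, here is a minimal sketch of one way to get a per-symbol monthly change without passing freq= to a grouped pct_change. The data is made up for illustration (the symbols AAA and BBB, the Price column, and the level names Symbol and Date are assumptions, not taken from data.csv), and it uses the last price in each calendar month, which may not be what the original call was meant to compute.

```python
import pandas as pd

# Hypothetical data: two symbols, one month-end price per month.
dates = pd.to_datetime(["2015-01-30", "2015-02-27", "2015-03-31"])
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["Symbol", "Date"])
prices = pd.DataFrame({"Price": [10.0, 11.0, 12.1, 20.0, 18.0, 18.0]}, index=index)

# Bucket every row by its symbol and its calendar month.
symbol = prices.index.get_level_values("Symbol")
month = prices.index.get_level_values("Date").to_period("M").rename("Month")

# Last observed price per (symbol, month).
month_end = prices.groupby([symbol, month])["Price"].last()

# Month-over-month change, computed within each symbol only.
previous = month_end.groupby(level="Symbol").shift(1)
monthly_change = month_end / previous - 1.0

print(monthly_change)
```

The first month of each symbol has no predecessor, so its change is NaN rather than being computed against the previous symbol's last row.
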
@bobobo1618
Author

Seems resample with a custom how is broken too?

>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how='sum')
<Lots of data>
>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Empty>

@sinhrks
Member

sinhrks commented Dec 10, 2015

Thanks for the report. Can you attach the contents of data.csv in a small, copy-pastable format?

@sinhrks sinhrks added Numeric Operations Arithmetic, Comparison, and Logical operations Frequency DateOffsets labels Dec 10, 2015
@bobobo1618
Author

Unfortunately, I'm unable to provide the data itself because it's licensed (and several GB).

While trying to give you a subset, though, I discovered that the bug was only reproducible when I read more than 1032393 (an odd but specific number) rows from the CSV. If I read fewer, the functions above worked perfectly. If I read more, they all broke completely:

>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[1], usecols=[0, 1, 12], nrows=1032393)
>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Data>
>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[1], usecols=[0, 1, 12], nrows=1032394)
>>> rawdat.groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Empty>

Since that number was so specific, I looked into the data and saw that at that row, there was a gap of a few months in that symbol's data:

2014-06-27  AMBCW   17.490000
2014-09-24  AMBCW   13.500000

I skipped over that with skiprows and found a similar break at:

2004-12-16  AMBI    0.3000
2005-02-28  AMBI    0.3000

So I suspect that the problem is really the gap in the data.
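
A minimal sketch of why a gap could matter, using only the two AMBCW rows quoted above: monthly resampling creates a bin for every month in the range, including the months with no rows. The series construction and the use of pd.offsets.MonthEnd() in place of the '1M' string are choices made for this illustration, not part of the original report.

```python
import pandas as pd

# The two AMBCW rows on either side of the gap.
ambcw = pd.Series(
    [17.49, 13.50],
    index=pd.to_datetime(["2014-06-27", "2014-09-24"]),
    name="AMBCW",
)

# Count the rows that fall into each month-end bin.
rows_per_month = ambcw.resample(pd.offsets.MonthEnd()).size()

print(rows_per_month)
```

July and August appear as bins containing zero rows, so any function that indexes the first or last element of a bin has nothing to index for those months.
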

I also found that grouping by MultiIndex levels didn't work:

# Column 0 is a symbol. 
>>> rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[0, 1], usecols=[0, 1, 12], nrows=1000)
>>> rawdat.reset_index(level=0).groupby(['Symbol']).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Lots of data>
>>> rawdat.groupby(level=0).resample('1M', label='right', how=lambda period: (period[-1]/period[0]) - 1.0)
<Empty>

I can share 1000 rows with you so here's all you need to reproduce the multiindex problem.

@jreback
Contributor

jreback commented Dec 12, 2015

@bobobo1618 the last might be a bug. Can you put together a copy-pastable example that is reproducible?

@jreback jreback added MultiIndex Resample resample method labels Dec 12, 2015
@bobobo1618
Author

Replacing my function with this caused it to work:

def month_change_resample(arraylike):
    if len(arraylike) == 0:
        return 0
    return (arraylike[-1]/arraylike[0]) - 1.0

It seems some of the arraylike objects were of length zero, which would have caused an IndexError during the execution. Is outputting nothing expected behaviour in that situation?
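
A hedged variant of the guard above, as a sketch only: it returns NaN for an empty bin, so that a month with no data is not reported as a 0% change, and it uses .iloc so the first and last elements are always selected by position. The name month_change and the NaN choice are assumptions made for this illustration.

```python
import math

import pandas as pd


def month_change(prices):
    """Return last/first - 1 for one bin, or NaN if the bin has no rows."""
    if len(prices) == 0:
        return float("nan")
    # Positional access, independent of the index labels.
    return prices.iloc[-1] / prices.iloc[0] - 1.0


empty_bin = pd.Series([], dtype="float64")
two_rows = pd.Series([17.49, 13.50])

print(math.isnan(month_change(empty_bin)))
print(month_change(two_rows))
```
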

@bobobo1618
Author

Ah sorry, no it didn't.

Copy pastable example:

curl http://ix.io/mJE > data.csv
cat << EOF > test.py
import pandas as pd
rawdat = pd.read_csv('./data.csv', parse_dates=[1], index_col=[0, 1], usecols=[0, 1, 12], nrows=1000)
def month_change_resample(arraylike):
    if len(arraylike) == 0:
        return 0
    return (arraylike[-1]/arraylike[0]) - 1.0
print("Level result: ")
print(rawdat.groupby(level=0).resample('1M', label='right', how=month_change_resample))
print("No level result:")
print(rawdat.reset_index(level=0).groupby(['Symbol']).resample('1M', label='right', how=month_change_resample))
EOF
python test.py

@jreback
Contributor

jreback commented Dec 13, 2015

@bobobo1618 a copy-pastable example is one that I can actually copy and paste.

IOW, no files are involved. The dataframe should be much shorter.
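
A file-free example in that spirit might look like the sketch below. The frame is made up (symbols AAA and BBB, a Price column, levels named Symbol and Date), and it uses pd.Grouper with an explicit month-end offset rather than the how= keyword from the scripts above, so it compares the two grouping styles without claiming to reproduce the original failure.

```python
import pandas as pd

# Hypothetical data: two symbols, two rows per month.
dates = pd.to_datetime(["2015-01-05", "2015-01-30", "2015-02-02", "2015-02-27"])
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["Symbol", "Date"])
rawdat = pd.DataFrame(
    {"Price": [10.0, 11.0, 11.0, 12.1, 20.0, 18.0, 18.0, 18.0]}, index=index
)


def month_change(prices):
    # Change from the first to the last row of one (symbol, month) group.
    return prices.iloc[-1] / prices.iloc[0] - 1.0


month_end = pd.offsets.MonthEnd()

# Group by the index levels directly.
by_level = rawdat.groupby(
    [pd.Grouper(level="Symbol"), pd.Grouper(level="Date", freq=month_end)]
)["Price"].agg(month_change)

# Group by a Symbol column after moving it out of the index.
by_column = rawdat.reset_index(level="Symbol").groupby(
    ["Symbol", pd.Grouper(freq=month_end)]
)["Price"].agg(month_change)

print("Level result:")
print(by_level)
print("No level result:")
print(by_column)
```
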

@jreback
Contributor

jreback commented May 30, 2018

Not reproducible, and likely the same issue as #21200.
