BUG : DataFrameGroupBy.quantile segfault #28194

Closed
benjaminriviere opened this issue Aug 28, 2019 · 10 comments
Labels
Duplicate Report (Duplicate issue or pull request), Groupby, quantile (quantile method), Segfault (Non-Recoverable Error)

Comments

@benjaminriviere

Code Sample

import pandas as pd
import numpy as np


df = pd.DataFrame(data = {
    "x": [1, 1, 1],
    "y": [np.nan, np.nan, np.nan],
    "z": [1, 2, 3]
})

# Works on SeriesGroupBy
df.groupby(["x", "y"])["z"].quantile(0.5)

# Segfault on DataFrameGroupBy
df.groupby(["x", "y"])[["z"]].quantile(0.5)

Problem description

Hello all,
I just noticed that there is still an issue with the quantile function when used on a DataFrameGroupBy object with Pandas 0.25.1. When one of the groupby key columns contains only NaN values, a segfault occurs. The code sample above reproduces the bug. This issue didn't occur with Pandas 0.24.2.

Expected Output

Pandas 0.24.2 gives this output:

Empty DataFrame
Columns: []
Index: []
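
Since a segfault kills the interpreter, it can help to run the reproducer in a child process when comparing pandas versions. The following is only a minimal sketch, not something from this report: `run_isolated` and `REPRO` are hypothetical names, and the snippet text simply restates the code sample above.

```python
import subprocess
import sys

# The reproducer from this report, kept as text so that it only ever runs
# inside a child interpreter (REPRO is a hypothetical name).
REPRO = """
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 1], "y": [np.nan, np.nan, np.nan], "z": [1, 2, 3]})
print(df.groupby(["x", "y"])[["z"]].quantile(0.5))
"""


def run_isolated(snippet, timeout=60):
    """Run `snippet` in a fresh interpreter and return (returncode, stdout, stderr).

    A non-zero return code means the child failed. On POSIX, a negative code
    is the number of the signal that killed it (a segfault is typically -11).
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Calling `run_isolated(REPRO)` in each environment should then report a crash as a return code instead of taking down the calling session.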

Output of pd.show_versions()

My computer is running Ubuntu 18.04. Below are the installed packages (in a clean virtual env). The bug doesn't seem to be related to numpy, as it also occurred with version 1.16.4.

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-58-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.1
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
@TomAugspurger
Contributor

TomAugspurger commented Aug 28, 2019

Hmm I can't reproduce on master. We fixed one of these in #27826 (this is a different example though).

@benjaminriviere
Author

Great then! I saw the fix but it didn't look related to this issue specifically.

@WillAyd
Member

WillAyd commented Aug 28, 2019

Can you try master to make sure? Seemed OK for me

@benjaminriviere
Author

I just checked on master and it works. However, there is now a segfault on quantile for SeriesGroupBy. Here's the show_versions() output:

INSTALLED VERSIONS
------------------
commit           : bc65fe6c12dc78679ba8584eee83c6e3e243b5b9
python           : 3.6.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-58-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.0+256.gbc65fe6c1
numpy            : 1.17.1
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : 0.29.13
pytest           : 5.1.1
hypothesis       : 4.34.0
sphinx           : 1.8.5
blosc            : 1.8.1
feather          : None
xlsxwriter       : None
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : 7.7.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : 1.2.1
fastparquet      : 0.3.2
gcsfs            : None
lxml.etree       : 4.4.1
matplotlib       : 3.1.1
numexpr          : 2.7.0
odfpy            : None
openpyxl         : 2.6.3
pandas_gbq       : None
pyarrow          : 0.14.1
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : None
tables           : 3.5.2
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None

@TomAugspurger
Contributor

> However, there is now a segfault on quantile for SeriesGroupBy

Can you show a failing example?

Did you make sure to recompile the C extensions before running that?

@benjaminriviere
Author

I used the same code sample:

import pandas as pd
import numpy as np


df = pd.DataFrame(data = {
    "x": [1, 1, 1],
    "y": [np.nan, np.nan, np.nan],
    "z": [1, 2, 3]
})

# Segfault on SeriesGroupBy
df.groupby(["x", "y"])["z"].quantile(0.5)

# Now works on DataFrameGroupBy
df.groupby(["x", "y"])[["z"]].quantile(0.5)

I followed the instructions given in the contribution guide and the Pandas documentation. I recompiled all the C extensions and I didn't get any error. I can try again to see if that changes something.
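
One way to rule out a stale build is to check which files Python actually imports. This is a rough sketch under my own assumptions: `describe_module` is a hypothetical helper, not part of pandas.

```python
import importlib


def describe_module(name):
    """Return the version (if any) and file path of an importable module."""
    module = importlib.import_module(name)
    return {
        "name": name,
        "version": getattr(module, "__version__", None),  # None if not defined
        "file": getattr(module, "__file__", None),  # path the module was loaded from
    }
```

With a development install, `describe_module("pandas")` should point into the source checkout, and `describe_module("pandas._libs.groupby")` should point at a compiled extension file whose modification time is no older than the last rebuild. I am assuming here that the groupby quantile code lives in that extension module.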

@jbrockmendel
Member

This does not segfault for me on master. @benjaminriviere can you confirm this is still a problem?

@SeppMe

SeppMe commented Oct 16, 2019

I ran into this issue today and can confirm the above code crashes on current master (0.26.0.dev0+583.g86e187f). I wrote a bit more about my findings in the other thread on this topic:
#28882 (comment)

jbrockmendel added the Groupby, Segfault (Non-Recoverable Error), and quantile (quantile method) labels Oct 16, 2019
@smcateer

I am getting an even odder result. I can run either of the commands fine, but running any two of them (one after the other) causes a crash:

import pandas as pd
import numpy as np


df = pd.DataFrame(data = {
    "x": [1, 1, 1],
    "y": [np.nan, np.nan, np.nan],
    "z": [1, 2, 3]
})

# If you uncomment any one of these, it runs.
# If you uncomment any two of these, it crashes the kernel.
#df.groupby(["x", "y"])["z"].quantile(0.5)
#df.groupby(["x", "y"])["z"].quantile(0.5)
#df.groupby(["x", "y"])[["z"]].quantile(0.5)
#df.groupby(["x", "y"])[["z"]].quantile(0.5)

I naively pasted in the code above and got a crash.

Not sure if this is related, but I am also seeing groupby.quantile produce wrong results (e.g. the 0.5 quantile is greater than the 0.75 quantile).
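
A quick sanity check for that kind of wrong result is to verify that, within each group, the quantile values never decrease as q increases. The helper below is a hypothetical sketch in plain Python. It assumes the results have already been collected into a dict that maps each group label to a `{q: value}` dict.

```python
def non_monotonic_groups(quantiles_by_group):
    """Return the groups whose quantile values decrease as q increases.

    `quantiles_by_group` maps a group label to a dict of {q: value}.
    Comparisons involving NaN are always False, so NaN values are never
    reported as violations.
    """
    bad = []
    for group, by_q in quantiles_by_group.items():
        # Order the values by their quantile level before comparing neighbours.
        values = [by_q[q] for q in sorted(by_q)]
        if any(later < earlier for earlier, later in zip(values, values[1:])):
            bad.append(group)
    return bad
```

For example, a group reported as `{0.5: 4.0, 0.75: 1.0}` would be flagged, while `{0.5: 2.0, 0.75: 3.0}` would not.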

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : en_AU.UTF-8
LANG             : en_AU.UTF-8
LOCALE           : None.None

pandas           : 0.25.2
numpy            : 1.16.5
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.3.1
setuptools       : 41.4.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.3 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : 1.3.10
tables           : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None

WillAyd added the Duplicate Report (Duplicate issue or pull request) label Oct 30, 2019
@WillAyd
Member

WillAyd commented Oct 30, 2019

This is a duplicate of #28882 which was just fixed on master. Should have a 0.25.3 release out tomorrow for it

WillAyd closed this as completed Oct 30, 2019

6 participants