groupby w/ only NULL-groups crashes since 0.23 #21624

Closed
crepererum opened this issue Jun 25, 2018 · 5 comments · Fixed by #29067
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions

@crepererum
Contributor

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import pandas.testing as pdt


# works:
df1 = pd.DataFrame({
    'g': [None, 'a'],
    'x': 1,
})
actual1 = df1.groupby('g')['x'].transform('sum')
expected1 = pd.Series([np.nan, 1.], name='x')

pdt.assert_series_equal(
    actual1,
    expected1,
)


# crashes:
df2 = pd.DataFrame({
    'g': [None],
    'x': 1,
})
actual2 = df2.groupby('g')['x'].transform('sum')  # crashes with "ValueError: Length of passed values is 1, index implies 0"
expected2 = pd.Series([np.nan], name='x')

pdt.assert_series_equal(
    actual2,
    expected2,
)

Problem description

groupby ignores rows whose key is NULL in any of the grouping columns. For those rows, the result of transform is NULL (NaN for floats, NaT for datetimes, None for objects). The question is what happens when there are only "NULL groups":

  • before pandas 0.23: an object column with None objects is created
  • pandas 0.23: crash (ValueError: Length of passed values is 1, index implies 0)
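
To illustrate the behavior described above, here is a minimal sketch. It assumes a recent pandas release in which groupby drops NULL keys by default; the column values are made up for the example.

```python
import numpy as np
import pandas as pd

# One row has a NULL group key, the other two share the key 'a'.
df = pd.DataFrame({"g": [None, "a", "a"], "x": [1, 2, 3]})
grouped = df.groupby("g")

# Only the non-NULL key forms a group; the NULL row belongs to no group.
print(grouped.ngroups)

# transform keeps the original index, so the NULL-key row is still present
# in the result, but it is expected to hold NaN instead of a group sum.
result = grouped["x"].transform("sum")
print(result)
```

The integer input comes back as a float column because NaN cannot be stored in an integer dtype.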

Expected Output

A NULL-column according to the input data type.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.49-moby
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.1
pytest: 3.4.0
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.0.0
pyarrow: 0.9.0
xarray: None
IPython: 5.7.0
sphinx: 1.6.7
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.0
xlrd: None
xlwt: None
xlsxwriter: 0.8.4
lxml: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.8.1
s3fs: 0.1
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@WillAyd
Member

WillAyd commented Jun 25, 2018

Thanks for the report - investigation and PRs are always welcome!

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version and removed Bug labels Jun 26, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jun 26, 2018
@jreback jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018
lopez86 added a commit to lopez86/pandas that referenced this issue Jul 10, 2018
Fixes bug where operations such as transform('sum') raise errors when only a single null group exists.
lopez86 added a commit to lopez86/pandas that referenced this issue Jul 11, 2018
@jreback jreback modified the milestones: 0.23.4, 0.24.0 Jul 20, 2018
@jreback jreback modified the milestones: 0.23.4, 0.23.5 Aug 2, 2018
@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018
@tobycheese
Contributor

I came upon this issue when I got a SIGSEGV after a groupby where one column contained some NaN values. Removing them avoided the crash.

However, I cannot see a difference in behaviour between 0.22.0 and master; both raise the ValueError for me.

@crepererum Are you sure your example does not raise that error with 0.22.0?
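
For reference, the workaround mentioned above (removing the rows with NaN keys before grouping) might look roughly like this. It is only a sketch with made-up data, not the code that triggered the SIGSEGV.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has a NaN group key.
df = pd.DataFrame({"g": ["a", np.nan, "a", "b"], "x": [1, 2, 3, 4]})

# Drop the rows whose group key is missing, then group as usual.
clean = df.dropna(subset=["g"])
totals = clean.groupby("g")["x"].transform("sum")
print(totals)
```

Note that the result only covers the rows that survive dropna, so it no longer lines up one-to-one with the original frame.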

@crepererum
Contributor Author

@tobycheese yes.

@tobycheese
Contributor

tobycheese commented Mar 12, 2019

Indeed. With Python 2.7.10 and pandas 0.22.0 there is no error; with Python 3.6.5 and pandas 0.22.0 the ValueError is raised.

@mroeschke
Member

This works on master. Could use a test.

In [83]: df2 = pd.DataFrame({
    ...:     'g': [None],
    ...:     'x': 1,
    ...: })
    ...: actual2 = df2.groupby('g')['x'].transform('sum')
    ...: expected2 = pd.Series([np.nan], name='x')  # crashes with "ValueError: Length of passed values is 1, index implies 0"
    ...:
    ...: pdt.assert_series_equal(
    ...:     actual2,
    ...:     expected2,
    ...: )

In [84]: pd.__version__
Out[84]: '0.26.0.dev0+576.gde67bb72e'
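
A regression test along these lines could cover the case. This is a sketch built from the example in the report; the test name is hypothetical and the expected float NaN column is taken from the original snippet.

```python
import numpy as np
import pandas as pd
import pandas.testing as pdt


def test_transform_with_only_null_groups():
    # GH 21624: every row has a NULL group key, so no group is formed.
    # This used to raise
    # "ValueError: Length of passed values is 1, index implies 0".
    df = pd.DataFrame({"g": [None], "x": 1})

    result = df.groupby("g")["x"].transform("sum")

    # The result keeps the input index and holds NaN for the NULL-key row.
    expected = pd.Series([np.nan], name="x")
    pdt.assert_series_equal(result, expected)
```

Run under pytest, the test passes as long as transform returns a one-element NaN column instead of raising.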

@mroeschke mroeschke added good first issue and removed Groupby Regression Functionality that used to work in a prior pandas version labels Oct 16, 2019
@mroeschke mroeschke added the Needs Tests Unit test(s) needed to prevent regressions label Oct 16, 2019
crepererum added a commit to crepererum/pandas that referenced this issue Oct 17, 2019
crepererum added a commit to crepererum/pandas that referenced this issue Oct 18, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.0 Oct 18, 2019
HawkinsBA pushed a commit to HawkinsBA/pandas that referenced this issue Oct 29, 2019
proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this issue Jan 1, 2020
7 participants