BUG: aggregate(sum) returns wrong result for certain boolean input #7666

Closed
@wheeeyeyeye

I have a DataFrame of the following form:

df = pd.DataFrame({'foo': [1, 2, 2], 'bar': [True, False, False]})

I want to group this by foo and count the number of True values in the bar column. Counting True values can be done with sum, since True is treated as 1 and False as 0:

In [7]: bar = [True, False, True, False, False]

In [8]: sum(bar)
Out[8]: 2

In [9]: sum(df['bar'])
Out[9]: 1

To group and count this:

In [16]: df.groupby('foo').aggregate(sum)
Out[16]:
       bar
foo
1     True
2    False

This output is erroneous. Expected output is:

       bar
foo
1      1
2      0

It works in the following case, where the data is changed so that not every row with foo == 2 is False:

In [18]: df = pd.DataFrame({'foo': [1, 2, 2, 2, 2], 'bar': [True, True, True, False, False]})
In [18]: df.groupby('foo').aggregate(sum)
Out[18]:
     bar
foo
1      1
2      2
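
A possible workaround, as a sketch only and not verified against pandas 0.14.0: cast the boolean column to integers before grouping, so the sum is computed on an integer dtype. This assumes the wrong result comes from the boolean dtype being carried through the aggregation, which I have not confirmed. The variable name counts is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 2], 'bar': [True, False, False]})

# Convert True/False to 1/0 first, then group the integer Series by foo
# and sum within each group.
counts = df['bar'].astype(int).groupby(df['foo']).sum()
print(counts)
```

With the frame above this should give a count of 1 for foo == 1 and 0 for foo == 2, which is the output I expected from aggregate(sum).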

Here are my installed versions:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.0
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2014.3
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
