BUG/inconsistency: pd.Series(dtype=object).sum(skipna=True) does not respect skipna=True #59764

Open
2 of 3 tasks
Code0x58 opened this issue Sep 9, 2024 · 2 comments
Labels
Bug · Closing Candidate (may be closeable, needs more eyeballs) · Reduction Operations (sum, mean, min, max, etc.)

Comments


Code0x58 commented Sep 9, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np


class X:
    def __init__(self, n, generation=0):
        self.n = n
        self.generation = generation

    def __add__(self, other):
        return X(self.n + other.n, max(self.generation, other.generation) + 1)

    def __mul__(self, other):
        return X(self.n * other.n, max(self.generation, other.generation) + 1)

    def __repr__(self):
        return f'X({self.n}, {self.generation})'
    

s = pd.Series([X(1), X(2), np.nan, X(0)])

# these implicitly have skipna=True anyway
s.sum(skipna=True)
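# -> AttributeError: 'int' object has no attribute 'n'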

Issue Description

Aggregation methods over object dtype do not respect skipna. The reproducible example above raises the exception below:

# sum, cumsum, prod, cumprod, ...
AttributeError: 'int' object has no attribute 'n'

Expected Behavior

NaN values should be skipped, especially given that all-NaN inputs are documented to produce NaN.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.12.5.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 4.15.0-213-generic
Version               : #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
machine               : x86_64
processor             :
byteorder             : little
LC_ALL                : None
LANG                  : C.UTF-8
LOCALE                : C.UTF-8

pandas                : 2.2.2
numpy                 : 2.1.1
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : None
pip                   : 24.2
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : None
IPython               : None
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
Code0x58 added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Sep 9, 2024
rhshadrach (Member) commented Sep 9, 2024

Thanks for the report. For efficiency, pandas implements skipna=True by filling NA values with 0. I think recreating a new, smaller array would be prohibitively inefficient. If you'd like your class to be operable with pandas, I'd suggest implementing addition with integers (at the very least, with the number 0).
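
For illustration, a minimal sketch of that suggestion applied to the reporter's class (the _coerce helper and the treatment of plain numbers as generation-0 values are assumptions added for this example, not part of the original code):

import numpy as np
import pandas as pd


class X:
    def __init__(self, n, generation=0):
        self.n = n
        self.generation = generation

    def _coerce(self, other):
        # Wrap plain numbers (e.g. the 0 that pandas fills NA slots with)
        # as generation-0 values.
        return other if isinstance(other, X) else X(other)

    def __add__(self, other):
        other = self._coerce(other)
        return X(self.n + other.n, max(self.generation, other.generation) + 1)

    # The reduction may pass through a plain int 0 (i.e. 0 + X(...)),
    # so the reflected operation must work too.
    __radd__ = __add__

    def __repr__(self):
        return f'X({self.n}, {self.generation})'


s = pd.Series([X(1), X(2), np.nan, X(0)])
print(s.sum(skipna=True))  # X(3, 3)

Note that the filled-in 0 still participates in the reduction (here it bumps generation from 2 to 3), which is exactly the observable side effect this report is about.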

rhshadrach added the Closing Candidate (may be closeable, needs more eyeballs) and Reduction Operations (sum, mean, min, max, etc.) labels and removed the Needs Triage label on Sep 9, 2024
Code0x58 (Author) commented Sep 10, 2024

That makes a fair bit of sense, and it works out of the box for types like the built-in complex, Fraction, and Decimal, which are probably more representative of normal use, as the relevant identity (i.e. 0 or 1) is used.

In the case of operations that produce scalars, like .sum(), .min(), etc., I'd have thought that, given the cost of operating on Python objects, omitting invocations for nulls would actually end up a little more efficient than dropping in 1 or 0 and adding that to the next encountered value. I suppose, given the behaviour on empty series for .min(), .prod(), and .sum(), you'd want special handling for NaN, 1, and 0 everywhere if everything else is unchanged, which seems fair. As a bit of an extension, if you could provide the identity value and it supported __iadd__ or whatever else, you could likely end up with somewhat better efficiency, as you can reasonably claim ownership over a provided identity parameter (or magic method?) and avoid creating intermediate objects.

I can see what you mean about the efficiency of the expanding operations like .cumsum(), .cummin(), etc., as you'd need to either copy the object or, more precisely, recreate an identity to accumulate into to produce each new value. The purist in me likes the idea of this over "you just get some default int", maybe with an interface like series.cumsum(identity_factory=X); see the sketch below.
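
Purely as an illustration, a user-space sketch of that hypothetical interface (identity_factory is the name proposed above, not a real pandas parameter; it is assumed here to be a zero-argument callable, so with the X class from the example you would pass something like lambda: X(0)):

import numpy as np
import pandas as pd


def cumsum_with_identity(s, identity_factory):
    # Start from a fresh identity, accumulate the non-NA values,
    # and leave NA positions as NaN (mirroring skipna=True).
    acc = identity_factory()
    out = []
    for v in s:
        if pd.isna(v):
            out.append(np.nan)
        else:
            acc = acc + v
            out.append(acc)
    return pd.Series(out, index=s.index)


# e.g. cumsum_with_identity(s, lambda: X(0)) with the series above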

As much as I think there could be some nice performance gains, and maybe a little less surprise than with the documentation as it stands, this particularly niche case can be solved by handling NaN, 0, and 1 as you say. The "surprise" for the very odd user could be ironed out with documentation explaining what happens with the object dtype, along with examples of handling those special cases.

That all said, as interesting as I've found it to talk about, I don't have any real interest in promoting or working on any sort of "scalars might go faster, and explicit identities for in-place ops may be faster still" or "explicit identities allow another way to do expanding" change. Even updating the documentation feels like a bit of a stretch now, as I guess this doesn't come up much; even my case only came up from trying to hack something into an existing project, so I have plenty of flexibility.
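
For anyone landing here, a minimal workaround sketch: drop the NAs yourself so pandas never fills in the int 0 (this assumes the object-dtype reduction then sums the remaining elements pairwise, which is how it behaves in the versions above):

import numpy as np
import pandas as pd


class X:  # the reporter's class, trimmed to what sum() needs
    def __init__(self, n, generation=0):
        self.n = n
        self.generation = generation

    def __add__(self, other):
        return X(self.n + other.n, max(self.generation, other.generation) + 1)

    def __repr__(self):
        return f'X({self.n}, {self.generation})'


s = pd.Series([X(1), X(2), np.nan, X(0)])

# Dropping NAs up front means the reduction only ever sees X instances.
print(s.dropna().sum())  # X(3, 2)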
