API: sum of Series of all NaN should return 0 or NaN? #9422
I guess the… However, pandas wants to propagate
so I think that you are proposing that…
Yes, we might want to change your line, I think. This change also has implications for windowed and grouped sum/prod; these might also need a corresponding change to match.
TST: fix for bottleneck >= 1.0 nansum behavior, xref #9422
* https://github.com/pydata/pandas: (26 commits)
  - disable some deps on 3.2 build
  - Fix meantim typo
  - DOC: use current ipython in doc build
  - PERF: write basic datetimes faster pandas-dev#10271
  - TST: fix for bottleneck >= 1.0 nansum behavior, xref pandas-dev#9422
  - add numba example to enhancingperf.rst
  - BUG: SparseSeries constructor ignores input data name
  - BUG: Raise TypeError only if key DataFrame is not empty pandas-dev#10126
  - ENH: groupby.apply for Categorical should preserve categories (closes pandas-dev#10138)
  - DOC: add in whatsnew/0.17.0.txt
  - DOC: move whatsnew from 0.17.0 -> 0.16.2
  - BUG: Holiday(..) with both offset and observance raises NotImplementedError pandas-dev#10217
  - BUG: Index.union cannot handle array-likes
  - BUG: SparseSeries.abs() resets name
  - BUG: Series arithmetic methods incorrectly hold name
  - ENH: Don't infer WOM-5MON if we don't support it (pandas-dev#9425)
  - BUG: Series.align resets name when fill_value is specified
  - BUG: GroupBy.get_group raises ValueError when group key contains NaT
  - Close mysql connection in TestXMySQL to prevent tests freezing
  - BUG: plot doesnt default to matplotlib axes.grid setting (pandas-dev#9792)
  - ...
Yeah, an option passed to…
… compat with numpy >= 1.8.2 and bottleneck >= 1.0, pandas-dev#9422; note that passing skipna=False will still return a NaN
@sam-s Pandas uses NaN to represent missing values (i.e., NULL in database semantics). This has its downsides, but is a well-established part of the current pandas data model (it will eventually be fixed in pandas 2.0).
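A quick illustration of that data model (a minimal sketch; nothing here is specific to this issue):

```python
import pandas as pd

# pandas coerces Python's None to NaN in a float Series:
s = pd.Series([1.0, None, 3.0])
print(s)           # the second value prints as NaN
print(s.isnull())  # [False, True, False] -- NaN is treated as missing/NULL
```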
We're now seriously considering three different options:

1. Return 0 for both an empty Series and an all-NaN Series (the numpy/bottleneck convention).
2. Return 0 for an empty Series but NaN for an all-NaN Series (the old pandas behaviour without bottleneck).
3. Return NaN for both.
My question: Is there any precedent (other than old versions of pandas) for choice 2? This may be a compromise solution that keeps both the "result should be zero" and "result should be missing" camps happy, but distinguishing between "empty" and "missing" is something we don't do anywhere else in pandas, so this seems like a huge gotcha to me. I think the status quo was worse than making either choice.
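To make the gotcha concrete, a minimal sketch of why "empty" and "all-missing" collapse into the same case once NaNs are skipped:

```python
import numpy as np
import pandas as pd

empty = pd.Series([], dtype=float)
all_nan = pd.Series([np.nan, np.nan])

# Once missing values are skipped, the two cases are indistinguishable:
print(len(empty.dropna()))    # 0
print(len(all_nan.dropna()))  # 0

# Choice 2 would nevertheless give empty.sum() == 0 but all_nan.sum() == NaN.
```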
@shoyer so your rationale is that…
@jreback I believe the information is not lost, but shifted from one case to another. If we decided, instead, to return zero for an all-NaN sum… At the same time, I believe that, regardless of the decision, people would complain, citing mathematics and yelling 'obviously'. If mathematics were a person, she would perhaps be surprised at how often she was mentioned by both camps.
@jorisvandenbossche: you say… I don't think this comparison is fully correct, as it is comparing two different kinds of sums (the pandas…
@kawochen: could you please explain how…
@sam-s I think we all understand the mathematical issues. That's not the only basis on which this decision is being made.
I was only making an observation that both camps cited math, and I was not wrong. @brazilbean said: …
@sam-s The sum of an empty set being zero is a convention. It is convenient, but it doesn't follow from any definition of (+, R). When it is used in any math book I've read, it is explicitly defined to be zero. Do I want it to be 0? Yes, I think that'd be prettier. Do I think it needs to be? No.
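For what it's worth, the convention does show up in Python's own standard library (illustrative check; `math.prod` needs Python >= 3.8):

```python
import math

print(sum([]))        # 0 -- the empty sum is the additive identity
print(math.prod([]))  # 1 -- the empty product is the multiplicative identity

# Operations without an identity element refuse instead of guessing:
try:
    max([])
except ValueError as err:
    print(err)  # max() arg is an empty sequence
```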
@kawochen: The sum of an empty set being zero is a convention.
@kawochen: When it is used in any math book I've read, it is explicitly defined to be zero.
@brazilbean: From a mathematics-theory perspective, returning 0 as the sum of "no numbers" is wrong.
@shoyer: I think we all understand the mathematical issues.
@sam-s Please stop. This is not the place to prove that others are wrong.
@sam-s I am not confused. (+, R) is a binary operation on R, not a mysterious unary operation on lists.
@kawochen: I have no desire to fight. I apologize if I insulted you or anyone else. PS: Again, I understand that math is not the only rationale for making software design decisions (moreover, R with NaN/Inf/etc. is not a field). However, a claim was made that…
@sam-s pandas is not a mathematics library, so these mathematical arguments are not persuasive. The issue is comparing all-null data versus empty data. The change that was made was to make the empty data behavior consistent across all reductions. This has nothing to do with mathematical arithmetic theory. Also please stop using NaN as a straw man. NaN is null in pandas.
@wesm: I am replying to claims that…

@wesm: The change that was made was to make the empty data behavior consistent across all reductions.
Here you go:
Do these answers all seem reasonable to you?
This is what I mean by consistency. The question is not whether…
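The snippet quoted above did not survive extraction; as an illustration only (not the original code), the kind of comparison being made is:

```python
import pandas as pd

empty = pd.Series([], dtype=float)

# In pandas 0.21, every reduction over an empty Series returns NaN:
print(empty.mean())  # nan
print(empty.std())   # nan
print(empty.max())   # nan
print(empty.sum())   # nan in pandas 0.21 -- the value this thread debates
```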
@wesm: I am chagrined by your approach. Now, returning nan/null makes no sense whatsoever, except for a superficial similarity between unrelated cases. You speak of the "semantics of an all-null Series versus a length-0 Series":
If you have a list…
(please do not get hung up on "unknown" vs "missing" terminology, use whatever makes sense to you)

Both of the above two approaches to treating nulls (1: return null, 2: ignore nulls) are only relevant to situations where we have nulls. But there are no nulls in an empty set, so this is not relevant!

When we have no nulls, returning null means that the operation on the data supplied is undefined. IOW, there is no way to produce a number which would satisfy the basic properties of the operation in question. But this is not the case here! Let us see: associativity implies that sum(A + B) = sum(A) + sum(B); taking B = [] gives sum(A) = sum(A) + sum([]), which forces sum([]) = 0.

Thank you very much for listening to me and replying. Again, while I do understand that math is not the only consideration, I beg you to remember that your customer is an applied mathematician like yours truly and we have certain expectations from the basic math operations. Please do not surprise us like this! ;-)

Thanks again!
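The associativity argument can be checked mechanically; a small sketch in plain Python:

```python
def check_sum_invariant(a, b):
    # For sum to satisfy sum(a + b) == sum(a) + sum(b) for all lists,
    # taking b == [] forces sum([]) == 0, the additive identity.
    assert sum(a + b) == sum(a) + sum(b)

check_sum_invariant([1, 2, 3], [4, 5])
check_sum_invariant([1, 2, 3], [])  # holds only because sum([]) == 0
print("invariant holds")
```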
This is reasonable from a mathematical perspective, but again is not the choice made by databases. In databases (and pandas) nulls can be introduced into empty results quite easily from joins, and in general there is no careful distinction.
@jorisvandenbossche already pointed out the error.
In other words, you are mixing (the sometimes overloaded)… All reasonable return values will have their own supporters, and that's OK --
Perhaps another justification: adding up things a certain number of times is called multiplication. My point isn't to prove that an empty sum is zero; that's been settled for hundreds of years, and doesn't really need our help. My point is that if you define it as anything else, huge numbers of inconsistencies will pop up all over the place, and the solution will be to either add special-case code all over the place, or to stop using…

Is it possible that people are sticking to their guns here because they don't want to back down? Or they want it acknowledged that one could see things that way, even if it's not the best decision? I'm just having a hard time seeing a logical reason for this discussion and I'm trying to think of psychological reasons, I guess.

The justifications for…
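A concrete version of this multiplication argument, as a plain-Python sketch:

```python
def sum_of_copies(x, n):
    # "Adding x up n times" should equal n * x for every n,
    # including n == 0, which is exactly the empty sum.
    return sum([x] * n)

assert sum_of_copies(7, 3) == 3 * 7
assert sum_of_copies(7, 0) == 0 * 7  # true only if the empty sum is 0
print("consistent with multiplication")
```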
Let the…
@kenahoo Well, I said more than a year ago that I would like…
How do your fields require that an empty sum be non-null? I am always quite willing to simply not deal with empty time series and empty data sets, but if you could provide a closer-to-real-life example of how it breaks things, it would really help the discussion.

@sam-s provided a useful data point: that he is used to checking the presence of…

We shouldn't talk about the sum of an empty…
I don't know about your fields, but I quite like the fact that… I personally think…

This consideration isn't even new. It is hard to keep operations consistent in the real numbers extended by positive infinity, negative infinity, and not-a-number. Now, why is no one bothered by the fact that with…

Our voices have been heard. Let's only respond when we have new substance to bring to this discussion.
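The truncated example is lost, but the kind of inconsistency being alluded to is easy to exhibit (illustrative, using numpy):

```python
import numpy as np

# Arithmetic on the reals extended with inf and nan already breaks
# familiar identities, and few users complain:
print(np.inf - np.inf)   # nan, although x - x == 0 for every real x
print(0.0 * np.inf)      # nan, although 0 * x == 0 for every real x
print(np.nan == np.nan)  # False, although x == x for every real x
```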
I said several messages ago that there's a valid argument for both a sum that yields 0 on no data and one that yields null. I suggest we consider making a 0.21.1 release with this improvement and reverting the behavior of…

I'm locking this thread because I think the discussion ceased to be productive some time ago.
@wesm this is maybe not the best place, but there are still some things to discuss.

First, there is the question whether we want to give the user the choice between the options themselves (for those who want the other option than the one we choose), e.g. the suggestion of another method (…).

Second, I would be fine with having the empty sum return 0, but I personally feel this is very inconsistent with…
@jorisvandenbossche I agree with you, let's try to make a final decision at the dev meeting. Two other proposals based on precedent from databases: …
Also, for what it's worth: I did some searching for justifications for the SQL behavior, and the best I came up with was this scathing critique. This isn't really decisive, but it's clear there are lots of SQL users who aren't happy with this behavior. |
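For concreteness, SQLite exhibits the SQL behaviour under discussion, and even ships both conventions side by side (SUM vs. TOTAL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x REAL)")

# SQL-standard SUM returns NULL (None in Python) over zero rows...
print(con.execute("SELECT SUM(x) FROM t").fetchone()[0])    # None

# ...while SQLite's TOTAL returns 0.0, matching the numpy convention.
print(con.execute("SELECT TOTAL(x) FROM t").fetchone()[0])  # 0.0
```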
In the case of…, I think what we've suggested here is:
We can discuss on the call tomorrow.
As @shoyer listed above (#9422 (comment)), I think we should consider the three possible options tomorrow:

1. Empty / all-NaN sum is 0.
2. Empty sum is 0, all-NaN sum is NaN.
3. Empty / all-NaN sum is NaN.
And each of the three options might come with a different way to circumvent the default behaviour (a coalesce-like function, a separate method, a new keyword, ...).
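For reference, the "new keyword" route is roughly what pandas later shipped: from 0.22.0 on the empty/all-NaN sum defaults to 0, and a min_count keyword recovers NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())             # 0.0 -- the default from pandas 0.22.0 on
print(s.sum(min_count=1))  # nan -- require at least one non-NA value
```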
To all subscribed here, note that there is a mailing list discussion about this: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html
Summary
The question is what the sum of a Series of all NaNs should return (which is equivalent to an empty Series after skipping the NaNs): NaN or 0?
The reason this is a discussion point: pandas' internal nansum implementation returns NaN, but when bottleneck is installed, pandas uses bottleneck's implementation of nansum, which returns 0 (for versions >= 1.0).
Bottleneck changed the behaviour from returning NaN to returning 0 to model it after numpy's nansum function.
This has the very annoying consequence that you get different behaviour depending on whether bottleneck (which is only an optional dependency) is installed.
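A minimal reproduction of the discrepancy (assuming a pandas of that era, where Series.sum() dispatches to bottleneck's nansum whenever bottleneck is importable):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# numpy's nansum (like bottleneck >= 1.0) treats an all-NaN array as empty:
print(np.nansum(s.values))  # 0.0

# pandas' internal nansum propagated the NaN instead, so at the time of
# this issue s.sum() was NaN without bottleneck but 0.0 with it installed.
print(s.sum())
```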
So the decision we need to make is to either:

- work around bottleneck's new behaviour so that pandas keeps returning NaN, or
- change pandas' own nansum to return 0, matching numpy and bottleneck.
Original title: nansum in bottleneck 1.0 will return 0 for all NaN arrays instead of NaN
xref pydata/bottleneck#96
xref #9421
This matches a change from numpy 1.8 -> 1.9.
We should address this for pandas 0.16.
Should we work around the new behavior (probably the simplest choice) or change nansum in pandas?