Breaking changes for sum / prod of empty / all-NA #18921
Conversation
```python
idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
pd.Series([1, 2], index=idx).resample("12H").sum()

pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
```
TODO: Add .rolling and .expanding with all-NA and min_count=0.
pandas/tests/frame/test_analytics.py
Outdated
```python
def wrapper(x):
    return alternative(x.values)

if skipna_alternative:
```
This seemed like the easiest way of getting these tests to pass. skipna_alternative is only used in sum and prod, and is nansum or nanprod.
this is ok
```python
('mean', np.mean),
('median', lambda x: np.median(x) if len(x) > 0 else np.nan),
('var', lambda x: np.var(x, ddof=1)),
('add', lambda x: np.sum(x) if len(x) > 0 else np.nan),
```
I moved these to their own tests in e390d1c#diff-1c9535bf6ad763b307a780e37cc4185cR833 due to #18869
```python
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize(
    "method", ['sum', 'mean', 'median', 'std', 'var'])
```
TODO: make sure I have coverage for this elsewhere... I don't remember why I removed it here.
These are handled up above in test_empty.
pandas/tests/test_base.py
Outdated
```python
def test_nanops(self):
@pytest.mark.parametrize('klass', [Index, Series])
@pytest.mark.parametrize('op', ['max', 'min'])
def test_nanops_max_min(self, op, klass):
```
Hmm, I thought I did this parametrization in #18876. I can move it there if you want.
Actually, these seem to be a remnant of an earlier version where I was adding min_count everywhere. I'll just remove it.
pandas/tests/test_nanops.py
Outdated
```python
for skipna in [False, True]:
    targartempval = targarval if skipna else targarnanval
    try:
        if skipna and empty_targfunc and pd.isna(targartempval).all():
```
I'm reasonably sure I refactored these correctly, but would appreciate a close review here. empty_targfunc should be like nansum or nanprod, in which case we want to call it on targartempval. I should double check why just passing nansum for targfunc didn't work though...
use the imported isna
```python
assert notna(result[4])

# min_periods=0
result0 = get_result(arr, 20, min_periods=0)
```
I had to disable this subset of tests, since I don't think it's true that min_periods=0 is the same as min_periods=1.
so make new ones that make this diff clear (i agree they should be different).
Done, added a keyword to control this check, since it is true for most operations. sum is tested up above in test_missing_minp_zero.
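To make the difference concrete, here is a pure-Python sketch of the semantics being discussed (a hypothetical helper, not pandas' actual Cython implementation): with `min_periods=0` an all-NaN window yields the empty sum, 0, while `min_periods=1` turns it into NaN.

```python
import math

def rolling_sum(values, window, min_periods):
    """Sketch of a skipna rolling sum with a min_periods threshold."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        valid = [v for v in chunk if not math.isnan(v)]
        # min_periods=0 keeps an all-NaN window valid (empty sum -> 0),
        # while min_periods=1 turns it into NaN.
        out.append(sum(valid) if len(valid) >= min_periods else math.nan)
    return out

nan = math.nan
print(rolling_sum([nan, nan, 1.0], window=2, min_periods=0))  # [0, 0, 1.0]
print(rolling_sum([nan, nan, 1.0], window=2, min_periods=1))  # first two are nan
```

The two calls differ only on the all-NaN windows, which is exactly the case the disabled tests assumed was identical.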
Force-pushed e390d1c to 8cbd59a
pandas/_libs/window.pyx
Outdated
```python
if minp == 0:
    # in get_window_indexer, we ensure that minp >= 1. That's fine for
    # all cases except nobs = 0 (all missing values) and minp=0. For
```
all of the minp logic is handled externally to window.pyx (in window.py)
Are you saying to move this there? I don't think that's possible without rewriting roll_sum. All of its logic assumes that minp is at least 1. The only thing I need the user-provided minp for is passing it to calc_sum later.
then it should be changed. This is way too hacky. ALL logic related to minp is passed in.
```python
result.fill(fill_value)
return result
if values.size == 0 and kwds.get('min_count') is None:
    # We are empty, returning NA for our type
```
isn’t min_count always passed?
No, just when the caller is sum or prod. Methods like mean don't pass it.
I find this very error prone. The internal functions can always pass min_count I think, so it will be not-None. or at the very least comment on this.
The issue here is passing through the correct arguments to the nanop. `pandas.core.nanops.sum` takes `min_count`, while `mean` doesn't. The function calling `f` knows that, but `f` doesn't. So I would either:

1. refactor `f` to not do this special casing, and have each nanop properly handle size-0 arrays
2. add an unused `min_count` to each nanop, and always pass it through
3. have `f` add `min_count` to `kwds` when it's not None.

I think 1 is the best option, but will take a bit of time. 3 doesn't strike me as any better than the current method.
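For illustration, option 3 could be sketched roughly as below. All three functions here are hypothetical stand-ins, not pandas' real internals: the dispatcher inspects the nanop's signature and only forwards `min_count` when the op accepts it.

```python
import inspect
import math

def nansum(values, min_count=0):
    # hypothetical stand-in for a sum-like nanop that takes min_count
    valid = [v for v in values if not math.isnan(v)]
    return math.nan if len(valid) < min_count else sum(valid)

def nanmean(values):
    # hypothetical stand-in for a mean-like nanop (no min_count)
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan

def call_nanop(op, values, **kwds):
    """Drop min_count for ops whose signature does not accept it."""
    if 'min_count' in kwds and 'min_count' not in inspect.signature(op).parameters:
        del kwds['min_count']
    return op(values, **kwds)

print(call_nanop(nansum, [1.0, math.nan], min_count=1))  # 1.0
print(call_nanop(nanmean, [1.0, 3.0], min_count=1))      # 2.0
```

Signature inspection avoids adding an unused keyword everywhere, at the cost of a more magical dispatch, which is roughly the trade-off debated above.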
I think you could just pass it thru in make_stat_function? e.g. you always just pass

```python
# not sum/prod (I know you have a separate function for this)
min_count=-1
return self._reduce(f, name, axis=axis, skipna=skipna,
                    numeric_only=numeric_only, min_count=min_count)
```
That'd be option 2. I'd have to add a min_count keyword to each of our methods in pandas.core.nanops, which I don't think we want.
```python
('prod', 1),
])
def test_sum_prod_nanops(self, method, unit):
    idx = ['a', 'b', 'c']
```
you could add tests for Timedelta here as well (should work)
```python
])
def test_empty(self, method, unit, use_bottleneck):
    with pd.option_context("use_bottleneck", use_bottleneck):
        # GH 9422
```
add this issue number
pandas/tests/test_resample.py
Outdated
```python
pad = DataFrame([[np.nan, np.nan, np.nan, np.nan]], index=[3],

if func == 'sum':
    fill_value = 0
```
can u parametrize on func name and fill_value
This turned up a slight inconsistency between `groupby([grouper, key]).agg` and `groupby(key).resample().agg`: the first doesn't upsample to include "unobserved" values, while the latter does. An example:

```python
df = pd.DataFrame({
    'A': [1] * 6 + [2] * 6,
    'B': pd.to_datetime(['2017'] * 3 +
                        ['2019'] * 3 +
                        ['2017'] * 3 +
                        ['2019'] * 3),
    'C': [1] * 12
})
a = df.groupby([pd.Grouper(key='B', freq='AS'), 'A']).C.count()
b = df.set_index("B").sort_index().groupby("A").resample("AS").C.count()
```

```
In [53]: a  # groupby([grouper, key]).agg
Out[53]:
B           A
2017-01-01  1    3
            2    3
2019-01-01  1    3
            2    3
Name: C, dtype: int64

In [54]: b  # groupby(key).resample().agg
Out[54]:
A  B
1  2017-01-01    3
   2018-01-01    0
   2019-01-01    3
2  2017-01-01    3
   2018-01-01    0
   2019-01-01    3
Name: C, dtype: int64
```

I've use…
Force-pushed 8cbd59a to 1902af6
Hello @TomAugspurger! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 29, 2017 at 12:25 Hours UTC
Force-pushed 1902af6 to c213aeb
RE: Grouper/resample: that does look inconsistent. I think the resample behavior should be used for both?
Somehow I could give an explanation for the difference in behaviour between the groupby and resample example (resample: conform to a new index given the frequency -> full time series; groupby: make groups of my data based on the given frequency -> only those groups contained in the data). But of course there are also good reasons to have this consistent.
Reviewed only the whatsnew note.
In general, for the "changes" part of the whatsnew, should we also note somehow that whether this changed for you compared to pre-0.21 behaviour depends on whether bottleneck was installed or not (although that is maybe only for the main sum and not for groupby or resample?)
doc/source/whatsnew/v0.22.0.txt
Outdated
```
* We've added a ``min_count`` parameter to ``.sum`` and ``.prod`` to control
  the minimum number of valid values for the result to be valid. If fewer than
  ``min_count`` valid values are present, the result is NA. The default is
  ``0``. To restore the 0.21 behavior, use ``min_count=1``.
```
I would explicitly say in the summary here that this result is np.nan
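The documented rule can be modeled in a few lines of plain Python. `sum_with_min_count` below is a hypothetical sketch of the semantics only, not pandas' implementation:

```python
import math

def sum_with_min_count(values, min_count=0, skipna=True):
    """NA-aware sum: NaN when fewer than min_count valid values remain."""
    valid = [v for v in values if not math.isnan(v)] if skipna else list(values)
    if len(valid) < min_count:
        return math.nan
    return sum(valid)

print(sum_with_min_count([]))                       # 0   (new 0.22 default)
print(sum_with_min_count([math.nan]))               # 0   (new 0.22 default)
print(sum_with_min_count([math.nan], min_count=1))  # nan (the 0.21 behavior)
```

With `min_count=0` the empty and all-NA cases both collapse to the empty sum, 0; raising `min_count` to 1 restores the NaN result.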
doc/source/whatsnew/v0.22.0.txt
Outdated
```
``min_count`` valid values are present, the result is NA. The default is
``0``. To restore the 0.21 behavior, use ``min_count=1``.

Some background: In pandas 0.21.1, we fixed a long-standing inconsistency
```
0.21.1 -> 0.21.0 ?
doc/source/whatsnew/v0.22.0.txt
Outdated
```python
grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
pd.Series([1, 2]).groupby(grouper).sum()

pd.Series([1, 2]).groupby(groupuer).sum(min_count=1)
```
groupuer -> grouper
doc/source/whatsnew/v0.22.0.txt
Outdated
```
.. code-block:: ipython

In [1]: import pandas as pd; import numpy as np;
```
this line is not needed
doc/source/whatsnew/v0.22.0.txt
Outdated
```
s.resample('2d').sum(min_count=1)

Upsampling in particular is affected, as this will introduce all-NaN series even
```
Should we emphasize here somewhere this change is only for sum/prod, and not for all other aggregation functions? (I know it is already said above, but I think we have to stress enough it that those examples only apply for specific cases)
I'm not sure. We have two tests that relied on that equivalence to construct the expected result.
Force-pushed 46590dd to d5e475f
pandas/_libs/window.pyx
Outdated
```python
ndarray[int64_t] start, end
ndarray[double_t] output

if minp == 0:
```
@jreback thoughts on this approach? no_min is a boolean that indicates whether or not to consider min_p when choosing if the result is valid in calc_sum
I really don't like this. Why can't you just dispatch on minp==0 above
e.g. line 416
if nobs >= minp or minp==0
?
That's what I had earlier with the minp2 stuff.
The problem there was that in roll_sum, minp gets set to min(minp, 1), so we don't know if the user specified min_periods=0.
hmm I don't see that. maybe can just remove that?
It looks bigger than it is because I had to ensure VariableWindowIndexer passes through floor. The main change is just using floor=0 in roll_sum and iterating like range(min(1, minp), ...) instead of just range(minp, ...).
Force-pushed a034a70 to b8d1f2b
The only failing tests are the equivalence tests between …

@jorisvandenbossche could you explain a bit further? I don't quite understand. IMO, saying …
doc/source/whatsnew/v0.22.0.txt
Outdated
```
Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The
summary is that

* The sum of an all-NA or empty series is now 0
```
maybe add Series in capitalize with double-backticks.
possibly also use double-backticks around all-NA (or maybe italics)
doc/source/whatsnew/v0.22.0.txt
Outdated
```
Based on feedback, we've partially reverted those changes. The default sum for
all-NA and empty series is now 0 (1 for ``prod``).

*pandas 0.21*
```
pandas 0.21.x
doc/source/whatsnew/v0.22.0.txt
Outdated
```
Returning ``NaN`` was the default behavior for pandas 0.20.3 without bottleneck
installed.

Note that this affects some other places in the library:
```
I would break out these to separate sub-sections I think (e.g the numbered ones)
pandas/core/generic.py
Outdated
```python
_sum_examples = """\
Examples
--------
By default, the sum of an empty series is ``0``.
```
add all-NaN as well
```python
def _na_for_min_count(values, axis):
    # we either return np.nan or pd.NaT
```
can you add doc-string
```python
assert_series_equal(result2, expected)

def test_groupby_resample_equivalence(self):
    # Currently failing
```
this is failing?
It's a simplification of test_timegrouper_with_reg_groups, they're failing for the same reason (#18921 (comment))
```python
columns=['A', 'B', 'C', 'D'])
normal_result = getattr(normal_grouped, func)()
dt_result = getattr(dt_grouped, func)()
```
was this test moved above?
I don't think so, just parametrized.
Returning to #18921 (comment), it's only an issue when there are multiple groupers:

```
In [14]: df = pd.DataFrame({"A": pd.to_datetime(['2015', '2017']), "B": [1, 1]})

In [15]: df.groupby(pd.Grouper(key="A", freq="AS")).B.count()
Out[15]:
A
2015-01-01    1
2016-01-01    0
2017-01-01    1
Freq: AS-JAN, Name: B, dtype: int64

In [16]: df.groupby([pd.Grouper(key="A", freq="AS"), [0, 0]]).B.count()
Out[16]:
A
2015-01-01  0    1
2017-01-01  0    1
Name: B, dtype: int64
```

Still trying to track down the appropriate spot for the fix. Will we want to include this bugfix in 0.22, or save it for 0.23.0?
Fixing this is turning into quite the rabbit hole :/
IIUC, the issue is in how the new groups are determined:

```
In [9]: df = pd.DataFrame({"A": pd.to_datetime(['2015', '2017']), "B": [1, 1]})

In [10]: gr1 = df.groupby(pd.Grouper(key="A", freq="AS")).B

In [11]: gr2 = df.groupby([pd.Grouper(key="A", freq="AS"), [0, 0]])

In [12]: gr1.grouper.result_index
Out[12]: DatetimeIndex(['2015-01-01', '2016-01-01', '2017-01-01'], dtype='datetime64[ns]', name='A', freq='AS-JAN')

In [13]: gr2.grouper.result_index
Out[13]:
MultiIndex(levels=[[2015-01-01 00:00:00, 2016-01-01 00:00:00, 2017-01-01 00:00:00], [0]],
           labels=[[0, 2], [0, 0]],
           names=['A', None])
```

So:

```
In [21]: gr2.grouper.groups
Out[21]:
{(Timestamp('2015-01-01 00:00:00'), 0): Int64Index([0], dtype='int64'),
 (Timestamp('2017-01-01 00:00:00'), 0): Int64Index([1], dtype='int64')}
```

I don't know if that's the root problem yet, or just a symptom.
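A minimal plain-Python illustration of the gap (hypothetical data mirroring the example above): the full cartesian product of the two levels contains the unobserved 2016 bin, while the observed combinations do not.

```python
from itertools import product

dates = ['2015-01-01', '2016-01-01', '2017-01-01']  # bins implied by freq="AS"
second_level = [0]
observed = [('2015-01-01', 0), ('2017-01-01', 0)]   # what grouper.groups holds

# Full cartesian product of the levels vs. only the observed combinations:
full = list(product(dates, second_level))
missing = [combo for combo in full if combo not in observed]
print(missing)  # [('2016-01-01', 0)] -- the bin dropped from the result
```

Reindexing the result to the full product (as resample effectively does) is what would restore the 2016 row with a count of 0.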
Codecov Report

```
@@            Coverage Diff            @@
##           master   #18921     +/-   ##
=========================================
- Coverage    91.6%   91.56%   -0.05%
=========================================
  Files         150      150
  Lines       48939    48977      +38
=========================================
+ Hits        44833    44845      +12
- Misses       4106     4132      +26
```

Continue to review full report at Codecov.
just a couple of tiny comments
doc/source/whatsnew/v0.22.0.txt
Outdated
```
Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-*NA* series depending on whether or not bottleneck
was installed. See :ref:`whatsnew_0210.api_breaking.bottleneck`_. At the same
time, we changed the sum and prod of an empty Series to also be ``NaN``.
```
double backticks around Series.
```
pd.Series([]).sum()
pd.Series([np.nan]).sum()

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
```
say matches numpy? (I know you are saying np.nansum, but can't hurt to actually say numpy)
doc/source/whatsnew/v0.22.0.txt
Outdated
```
also matches the behavior of ``np.nansum`` on empty and all-*NA* arrays.

To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
```
skipna=True
doc/source/whatsnew/v0.22.0.txt
Outdated
```
To have the sum of an empty series return ``NaN``, use the ``min_count``
keyword. Thanks to the ``skipna`` parameter, the ``.sum`` on an all-*NA*
series is conceptually the same as on an empty. The ``min_count`` parameter
refers to the minimum number of *valid* values required for a non-NA sum
```
valid -> not-null
```
pd.Series([1, 2]).groupby(grouper).sum()

To restore the 0.21 behavior of returning ``NaN`` of unobserved groups,
use ``min_count>=1``.
```
do we actually tests min_count > 1?
Yeah, I added a few more in frame/test_analytics.py, test_resample.py, and test_groupby.py.
doc/source/whatsnew/v0.22.0.txt
Outdated
```
^^^^^^^^^^^^^^^^^^^^^

Rolling and expanding already have a ``min_periods`` keyword that behaves
similarly to ``min_count``. The only case that changes is when doing a rolling
```
similar
```
@@ -1,14 +1,213 @@
.. _whatsnew_0220:
```
don't u need this?
tiny comment. ok merging on green.
OK, all green. I'm going to …

All green. Here we go.
* API: Change the sum of all-NA / all-Empty sum / prod * Max, not min * Update whatsnew * Parametrize test * Minor cleanups * Refactor skipna_alternative * Split test * Added issue * More updates * linting * linting * Added skips * Doc fixup * DOC: More whatsnew (cherry picked from commit dedfce9)
I'm not sure I understand the rationale for making the `prod` of all-NA values equal 1. Consider a compounded percent-change calculation:

```
>>> s = pd.Series([0, 0.05, 0.12])
>>> (1 + s).prod() - 1
0.17600000000000016  # ok

>>> s = pd.Series([0, 0.0, 0.0])
>>> (1 + s).prod() - 1
0.0  # ok

>>> s = pd.Series([np.nan, np.nan, np.nan])
>>> (1 + s).prod() - 1
0.0  # how do you interpret this?
```

IMO percent calculation will be difficult to interpret when the product of all-NA values is defined this way. Is there a write-up on this, in case I'm misunderstanding this?
The rationale is that this is consistent with how sum() works: 0 is the additive identity, and 1 is the multiplicative identity. It would be confusing to use a different rule. There are absolutely use cases where these definitions for sum/prod are not desired, but it's impossible for the default behavior to work in all cases. Your use case of calculating percent change sounds like a good use for …
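The identity argument can be checked directly in plain Python (`math.prod` is available from Python 3.8):

```python
import math

# 0 is the additive identity and 1 is the multiplicative identity,
# so the empty sum is 0 and the empty product is 1.
assert sum([]) == 0
assert math.prod([]) == 1

# Folding one more element into a result that starts from the
# identity never changes that element's contribution:
assert sum([5]) == 0 + 5
assert math.prod([5]) == 1 * 5
```

With skipna semantics, an all-NA input reduces to the empty input, so it inherits the same identity values.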
Version 0.22.0 * tag 'v0.22.0': (777 commits) RLS: v0.22.0 DOC: Fix min_count docstring (pandas-dev#19005) DOC: More 0.22.0 updates (pandas-dev#19002) TST: Remove pow test in expressions COMPAT: Avoid td.skip decorator DOC: 0.22.0 release docs (pandas-dev#18983) DOC: Include 0.22.0 whatsnew Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921) ENH: Added a min_count keyword to stat funcs (pandas-dev#18876) RLS: v0.21.1 DOC: Add date to whatsnew (pandas-dev#18740) DOC: Include 0.21.1 whatsnew DOC: Update relase notes (pandas-dev#18739) CFG: Ignore W503 DOC: fix options table (pandas-dev#18730) ENH: support non default indexes in writing to Parquet (pandas-dev#18629) BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960) Parquet: Add error message for no engine found (pandas-dev#18717) BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652) DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690) ...
Changes the defaults for `min_count` so that `sum([])` and `sum([np.nan])` are 0 by default, and NaN with `min_count>=1`.

I'd recommend looking at only the latest commit until #18876 is merged. I'll probably force push changes here to keep all the relevant changes in the last commit until #18876 is in, rebase on that, and then start pushing changes regularly.
cc @jreback @jorisvandenbossche @shoyer @wesm