Calculate normalized freqs in value_counts correctly when bins is not None #26923

Closed
wants to merge 5 commits into from

Conversation

krsnik93
Contributor

@codecov

codecov bot commented Jun 18, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@df3d9b2).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master   #26923   +/-   ##
=========================================
  Coverage          ?   91.86%           
=========================================
  Files             ?      180           
  Lines             ?    50745           
  Branches          ?        0           
=========================================
  Hits              ?    46619           
  Misses            ?     4126           
  Partials          ?        0
Flag        Coverage Δ
#multiple   90.46% <100%> (?)
#single     41.12% <0%> (?)

Impacted Files              Coverage Δ
pandas/core/algorithms.py   94.72% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df3d9b2...81bbfa6. Read the comment docs.

Contributor

@jreback jreback left a comment

can you add a whatsnew note, reshape section of bug fixes

@pytest.mark.parametrize('dropna, vals, index', [
    (False, [0.6, 0.2, 0.2], [np.nan, 2.0, 1.0]),
    (True, [0.5, 0.5], [2.0, 1.0])])
def test_value_counts_normalized(self, dropna, vals, index):
    # GH12558
    s = Series([1, 2, np.nan, np.nan, np.nan])
    dtypes = (np.float64, np.object, 'M8[ns]')

can you pull the dtypes into another level of parameterization
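
A minimal sketch of what stacking a second `parametrize` decorator could look like, so that pytest generates one test per (dtype, dropna) combination instead of looping over a tuple inside the test. This is an illustration only, not the PR's actual test: the body is hypothetical, `self` and `index` are dropped for brevity, no bins are used, and the `'M8[ns]'` dtype is left out on the assumption that its casting behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd
import pytest


# Stacked decorators: pytest runs the test once per dtype for every
# (dropna, vals) pair, so each combination is reported separately.
@pytest.mark.parametrize('dtype', [np.float64, object])
@pytest.mark.parametrize('dropna, vals', [
    (False, [0.2, 0.2, 0.6]),
    (True, [0.5, 0.5])])
def test_value_counts_normalized(dropna, vals, dtype):
    # Hypothetical body: only the sorted frequencies are compared, so the
    # check does not depend on how tied counts are ordered in the result.
    s = pd.Series([1, 2, np.nan, np.nan, np.nan]).astype(dtype)
    result = s.value_counts(normalize=True, dropna=dropna)
    assert np.allclose(sorted(result.to_numpy()), vals)
```

With this layout a failure names the offending dtype in the test id, which is harder to see when the dtypes are iterated inside a single test.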

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 19, 2019
@pep8speaks

pep8speaks commented Jun 19, 2019

Hello @krsnik93! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-09 07:34:18 UTC

@jreback jreback added this to the 0.25.0 milestone Jun 21, 2019
@jreback
Contributor

jreback commented Jun 21, 2019

@krsnik93 can you merge master & add a whatsnew note (bug fix in reshaping is ok). ping on green.

@krsnik93
Contributor Author

krsnik93 commented Jun 21, 2019

I have a failing test in tests/groupby/test_value_counts.py. The test compares Series.value_counts against SeriesGroupBy.value_counts. For this issue, I modified Series.value_counts and I think the change is correct. However, I think SeriesGroupBy.value_counts should change as well. This is the troubling bit:

# for compat. with libgroupby.value_counts need to ensure every
# bin is present at every index level, null filled with zeros
diff = np.zeros(len(out), dtype='bool')
for lab in labels[:-1]:
    diff |= np.r_[True, lab[1:] != lab[:-1]]
ncat, nbin = diff.sum(), len(levels[-1])
left = [np.repeat(np.arange(ncat), nbin),
        np.tile(np.arange(nbin), ncat)]
right = [diff.cumsum() - 1, labels[-1]]
_, idx = _get_join_indexers(left, right, sort=False, how='left')
out = np.where(idx != -1, out[idx], 0)

For some reason, it takes the per-group counts of unbinned data (values that are either NaN or outside the bin breaks) and resets those counts to 0 (last line). I do not understand why this is done, and the whole method is rather hard to read, so I am not able to make the required changes.
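
One possible reading of the quoted snippet, sketched on toy data: it builds the full grid of (group, bin) pairs and looks up an observed count for each pair, filling in 0 where a pair was never observed. The sketch below is an assumption-laden illustration, not the pandas implementation: a plain dictionary lookup stands in for the private `_get_join_indexers` call, and `fill_missing_bins`, `group_ids`, `bin_codes` and `counts` are hypothetical names.

```python
import numpy as np


def fill_missing_bins(group_ids, bin_codes, counts, nbin):
    """Return counts on the full (group, bin) grid, 0 where unobserved."""
    ncat = int(group_ids.max()) + 1
    # Observed (group, bin) pairs mapped to their counts.
    observed = {
        (g, b): c
        for g, b, c in zip(group_ids.tolist(), bin_codes.tolist(),
                           counts.tolist())
    }
    # Full grid, mirroring the np.repeat / np.tile pair in the snippet.
    grid_groups = np.repeat(np.arange(ncat), nbin).tolist()
    grid_bins = np.tile(np.arange(nbin), ncat).tolist()
    # Left-join style lookup: pairs missing from `observed` become 0.
    return np.array([observed.get((g, b), 0)
                     for g, b in zip(grid_groups, grid_bins)])


group_ids = np.array([0, 0, 1, 1])
bin_codes = np.array([0, -1, 1, -1])  # -1: value fell into no bin
counts = np.array([2, 3, 1, 4])
filled = fill_missing_bins(group_ids, bin_codes, counts, nbin=2)
print(filled)
```

In this sketch the grid only contains bin codes `0..nbin-1`, so pairs carrying the code `-1` have no slot in the output and their counts do not survive the lookup. Whether that is exactly what happens inside `SeriesGroupBy.value_counts` is not confirmed here.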

@TomAugspurger
Contributor

TomAugspurger commented Jun 24, 2019

cc @jschendel if you have time to take a look at #26923 (comment).

@jreback jreback removed this from the 0.25.0 milestone Jun 28, 2019
@WillAyd
Member

WillAyd commented Sep 5, 2019

@krsnik93 can you merge master and fix up merge conflicts?

@WillAyd
Member

WillAyd commented Sep 20, 2019

Closing as this looks stale, but @krsnik93 please ping if you'd like to continue and we can reopen it.

@WillAyd WillAyd closed this Sep 20, 2019
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

value_counts unexpected behaviour - bins and dropna
5 participants