BUG: preserve categorical & sparse types when grouping / pivot #27071

Merged
merged 11 commits into from
Jun 27, 2019

Conversation

jreback
Contributor

@jreback jreback commented Jun 27, 2019

closes #18502
replaces #26550

@jreback jreback added Groupby Compat pandas objects compatability with Numpy or Python functions Categorical Categorical Data Type labels Jun 27, 2019
@jreback jreback added this to the 0.25.0 milestone Jun 27, 2019
@jreback
Contributor Author

jreback commented Jun 27, 2019

@WillAyd #27071 (comment)

That test is not appropriate for checking the results of this change, as most of those ops don't work on ordered categoricals; I have covered the most common ones (first/last/min/max) above.
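
For readers following along, the dtype preservation for `first` can be checked with a small sketch like the one below. The frame, column names, and values are made up for illustration and are not the tests added in this PR; it assumes pandas 0.25.0 or later.

```python
import pandas as pd

# Hypothetical data: an ordered categorical column and an integer grouping key.
cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
df = pd.DataFrame({"payload": [-1, -2, -1, -2], "col": cat})

# Take the first value of "col" per group and inspect the resulting dtype.
first = df.groupby("payload")["col"].first()
print(first.dtype)
```

With this change the result should keep the ordered categorical dtype rather than being coerced to object; `last`, `min`, and `max` can be swapped in the same way.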

@codecov

codecov bot commented Jun 27, 2019

Codecov Report

Merging #27071 into master will decrease coverage by 1.38%.
The diff coverage is 93.1%.

@@            Coverage Diff             @@
##           master   #27071      +/-   ##
==========================================
- Coverage   92.04%   90.66%   -1.39%     
==========================================
  Files         180      180              
  Lines       50714    50727      +13     
==========================================
- Hits        46680    45991     -689     
- Misses       4034     4736     +702
Flag Coverage Δ
#multiple 90.66% <93.1%> (-0.02%) ⬇️
#single ?
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 88.48% <100%> (-0.86%) ⬇️
pandas/core/groupby/ops.py 96% <100%> (ø) ⬆️
pandas/core/nanops.py 94.76% <100%> (ø) ⬆️
pandas/core/groupby/groupby.py 97.32% <100%> (+0.15%) ⬆️
pandas/core/internals/construction.py 96.21% <100%> (+0.25%) ⬆️
pandas/core/internals/blocks.py 94.95% <71.42%> (-0.19%) ⬇️
pandas/core/computation/pytables.py 62.5% <0%> (-27.75%) ⬇️
pandas/io/pytables.py 64.86% <0%> (-25.44%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/computation/common.py 84.21% <0%> (-5.27%) ⬇️
... and 10 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d94146c...424c466. Read the comment docs.

@codecov

codecov bot commented Jun 27, 2019

Codecov Report

Merging #27071 into master will increase coverage by 50.06%.
The diff coverage is 93.1%.

@@             Coverage Diff             @@
##           master   #27071       +/-   ##
===========================================
+ Coverage   41.96%   92.02%   +50.06%     
===========================================
  Files         180      180               
  Lines       50707    50727       +20     
===========================================
+ Hits        21277    46681    +25404     
+ Misses      29430     4046    -25384
Flag Coverage Δ
#multiple 90.66% <93.1%> (?)
#single 41.85% <24.13%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 88.48% <100%> (+73.66%) ⬆️
pandas/core/groupby/ops.py 96% <100%> (+76.23%) ⬆️
pandas/core/nanops.py 94.76% <100%> (+63.17%) ⬆️
pandas/core/groupby/groupby.py 97.32% <100%> (+73.45%) ⬆️
pandas/core/internals/construction.py 96.21% <100%> (+31.81%) ⬆️
pandas/core/internals/blocks.py 94.95% <71.42%> (+41.89%) ⬆️
pandas/core/computation/pytables.py 90.24% <0%> (+0.3%) ⬆️
pandas/io/pytables.py 90.3% <0%> (+0.96%) ⬆️
pandas/core/panel.py 17.8% <0%> (+1.7%) ⬆️
pandas/util/_test_decorators.py 93.84% <0%> (+4.61%) ⬆️
... and 138 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1673cf...9b8f2b4. Read the comment docs.

try:
    result = self._holder._from_sequence(
        np.asarray(result).ravel(), dtype=dtype)
Contributor

I'm a bit concerned by the asarray here. Is that just so we can do the .ravel?

Consider a silly example like

df.groupby('key').apply(lambda x: x.array)

Will that end up hitting this, and so calling asarray and converting to ndarray?
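
The concern can be illustrated with a minimal sketch, assuming a plain `Categorical` stands in for whatever extension array reaches this code path (the values are hypothetical, and `_from_sequence` is a private constructor used here only because the snippet under review uses it):

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])

# np.asarray materialises the values as a plain ndarray; the categorical
# dtype (including the unused category "c") is not carried by the result.
values = np.asarray(cat)
print(type(values).__name__)

# Rebuilding through _from_sequence with the original dtype, as in the
# snippet under review, restores the extension type.
rebuilt = type(cat)._from_sequence(values.ravel(), dtype=cat.dtype)
print(rebuilt.dtype == cat.dtype)
```

So the round trip only recovers the dtype because it is passed back in explicitly; the intermediate ndarray itself no longer knows about it.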

Contributor Author

does this make sense?

df.groupby('A').B.apply(lambda x: x.array)
(Pdb) p df
   A                         B
0  1 2000-01-01 18:00:00-06:00
1  1 2000-01-01 18:00:00-06:00
2  2                       NaT
3  2                       NaT
4  3 1999-12-31 18:00:00-06:00
5  3 1999-12-31 18:00:00-06:00
6  1 2000-01-01 18:00:00-06:00
7  4 2000-01-02 18:00:00-06:00
(Pdb) p result
A
1    [2000-01-01 18:00:00-06:00, 2000-01-01 18:00:0...
2                                           [NaT, NaT]
3    [1999-12-31 18:00:00-06:00, 1999-12-31 18:00:0...
4                          [2000-01-02 18:00:00-06:00]
Name: B, dtype: object

@jreback
Contributor Author

jreback commented Jun 27, 2019

@WillAyd @jbrockmendel ?

@jbrockmendel
Member

Does this also preserve the dtypes under transpose?

Member

@WillAyd WillAyd left a comment

lgtm

@jreback
Contributor Author

jreback commented Jun 27, 2019

@jbrockmendel

Does this also preserve the dtypes under transpose?

No, that's a more general issue.

@jreback
Contributor Author

jreback commented Jun 27, 2019

home/vsts/work/1/s/doc/source/whatsnew/v0.25.0.rst:856: WARNING: Unknown interpreted text role "method".

@jorisvandenbossche any idea what this means?

@jreback jreback merged commit ce86c21 into pandas-dev:master Jun 27, 2019
result = ts.resample('3T').mean()
expected = Series([1, 4, 7],
                  index=pd.date_range('1/1/2000', periods=3, freq='3T'),
                  dtype='Int64')
Member

@jreback @jorisvandenbossche why does this return Int64 here? I would expect float64 or Float64.

e.g. if we do ts[-1] += 1 before the resample, the mean comes back as float64.

Member

Yes, this should be Float64, because it is only by accident that the results are all integer-like.
This is one of the cases that I listed in #37494
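
The point about value-dependent dtypes can be seen with a rough sketch like the following. The input series is an assumption (nine consecutive integers at one-minute spacing, chosen so the 3-minute means come out as 1, 4, 7, matching the expected values in the test), and `'3min'` is used instead of `'3T'` because the `T` alias is deprecated in recent pandas. The dtypes printed depend on the pandas version, so none are asserted here.

```python
import pandas as pd

# Assumed input: nine consecutive integers at one-minute spacing.
index = pd.date_range("2000-01-01", periods=9, freq="min")
ts = pd.Series(range(9), index=index, dtype="Int64")

# Every 3-minute bin here happens to have an integer-valued mean.
even = ts.resample("3min").mean()

# Bump the last value so the final bin's mean is no longer integral.
bumped = ts.copy()
bumped.iloc[-1] += 1
uneven = bumped.resample("3min").mean()

print(even.dtype, list(even))
print(uneven.dtype, list(uneven))
```

If the two results print different dtypes, the output dtype depends on the values rather than on the input dtype, which is the behaviour being questioned in this thread.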

Labels
Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions Groupby
Development

Successfully merging this pull request may close these issues.

Categorical dtype doesn't survive groupby of first, max, min, value_counts etc.: unwanted coercion to object
5 participants