BUG: grouping with categorical interval columns #34164

Closed
Tracked by #7
antipisa opened this issue May 13, 2020 · 11 comments
Labels
good first issue, Needs Tests (Unit test(s) needed to prevent regressions)

Comments

@antipisa

antipisa commented May 13, 2020

Versions:
pandas 1.0.3
numpy 1.18.1

There is a bug in the pandas 1.x release (seen here on 1.0.3) that prevents grouping by a categorical interval column together with another column.

import numpy as np
import pandas as pd
pd.set_option("use_inf_as_na",True)
t = pd.DataFrame({"x":np.random.randn(100), 'w':np.random.choice(list("ABC"), 100)})
qq = pd.qcut(t['x'], q=np.linspace(0,1,5))

This works and gives the expected result:
t.groupby([qq])['x'].agg('mean')

x
(-10.001, -1.0]   -1.431893
(-1.0, 0.0]       -0.423564
(0.0, 1.0]         0.461174
(1.0, 10.0]        1.662297
Name: x, dtype: float64

This raises a TypeError:
t.groupby([qq,'w'])['x'].agg('mean')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-6d7782f17653> in <module>
----> 1 t.groupby([qq,'w'])['x'].agg('mean')

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    245 
    246         if isinstance(func, str):
--> 247             return getattr(self, func)(*args, **kwargs)
    248 
    249         elif isinstance(func, abc.Iterable):

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
   1223         nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
   1224         return self._cython_agg_general(
-> 1225             "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
   1226         )
   1227 

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    907             raise DataError("No numeric types to aggregate")
    908 
--> 909         return self._wrap_aggregated_output(output)
    910 
    911     def _python_agg_general(self, func, *args, **kwargs):

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _wrap_aggregated_output(self, output)
    384             output=output, index=self.grouper.result_index
    385         )
--> 386         return self._reindex_output(result)._convert(datetime=True)
    387 
    388     def _wrap_transformed_output(

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
   2481         levels_list = [ping.group_index for ping in groupings]
   2482         index, _ = MultiIndex.from_product(
-> 2483             levels_list, names=self.grouper.names
   2484         ).sortlevel()
   2485 

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
    551 
    552         codes = cartesian_product(codes)
--> 553         return MultiIndex(levels, codes, sortorder=sortorder, names=names)
    554 
    555     @classmethod

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
    278 
    279         if verify_integrity:
--> 280             new_codes = result._verify_integrity()
    281             result._codes = new_codes
    282 

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
    366 
    367         codes = [
--> 368             self._validate_codes(level, code) for level, code in zip(levels, codes)
    369         ]
    370         new_codes = FrozenList(codes)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
    366 
    367         codes = [
--> 368             self._validate_codes(level, code) for level, code in zip(levels, codes)
    369         ]
    370         new_codes = FrozenList(codes)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _validate_codes(self, level, code)
    302         to a level with missing values (NaN, NaT, None).
    303         """
--> 304         null_mask = isna(level)
    305         if np.any(null_mask):
    306             code = np.where(null_mask[code], -1, code)

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in isna(obj)
    124     Name: 1, dtype: bool
    125     """
--> 126     return _isna(obj)
    127 
    128 

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in _isna_old(obj)
    181         return False
    182     elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass, ABCExtensionArray)):
--> 183         return _isna_ndarraylike_old(obj)
    184     elif isinstance(obj, ABCGeneric):
    185         return obj._constructor(obj._data.isna(func=_isna_old))

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in _isna_ndarraylike_old(obj)
    281         else:
    282             result = np.empty(shape, dtype=bool)
--> 283             vec = libmissing.isnaobj_old(values.ravel())
    284             result[:] = vec.reshape(shape)
    285 

TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got Categorical)
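
For anyone following the traceback: the failing call is isna applied to one of the levels that _reindex_output hands to MultiIndex.from_product. Below is a minimal sketch of that step in isolation. It assumes the level for a categorical grouper is a CategoricalIndex over the interval categories, which is my reading of the traceback and not something checked against the 1.0.3 source. The seed and variable names are made up for illustration, and the use_inf_as_na option is left out, so this only shows what the step returns on a working install.

```python
import numpy as np
import pandas as pd

# Seeded data so the bins are reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(100), name="x")
qq = pd.qcut(x, q=np.linspace(0, 1, 5))

# Assumption: the grouping level is a CategoricalIndex holding each
# interval category exactly once.
codes = np.arange(len(qq.cat.categories))
level = pd.CategoricalIndex(
    pd.Categorical.from_codes(codes, dtype=qq.dtype), name="x"
)

# This mirrors `null_mask = isna(level)` from _validate_codes in the traceback.
null_mask = pd.isna(level)
print(type(level).__name__, null_mask)
```

If this step raises the same TypeError on your install, the problem is in isna for categorical data and not in groupby itself.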

antipisa added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels May 13, 2020
@dsaxton
Member

dsaxton commented May 14, 2020

This seems to work on master for me:

[ins] In [1]: import numpy as np

[ins] In [2]: import pandas as pd

[ins] In [3]: pd.set_option("use_inf_as_na",True)

[ins] In [4]: t = pd.DataFrame({"x":np.random.randn(100), 'w':np.random.choice(list("ABC"), 100)})

[ins] In [5]: qq = pd.qcut(t['x'], q=np.linspace(0,1,5))

[ins] In [6]: t.groupby([qq,'w'])['x'].agg('mean')
Out[6]: 
x                              w
(-2.7649999999999997, -0.736]  A   -1.412247
                               B   -1.319972
                               C   -1.108550
(-0.736, -0.114]               A   -0.351454
                               B   -0.388151
                               C   -0.404094
(-0.114, 0.587]                A    0.134442
                               B    0.235705
                               C    0.406392
(0.587, 1.864]                 A    1.056471
                               B    0.973123
                               C    1.189502
Name: x, dtype: float64

dsaxton removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 14, 2020
@antipisa
Author

Here is my env:

Package                Version   
---------------------- ----------
appnope                0.1.0     
asn1crypto             1.2.0     
attrs                  19.3.0    
backcall               0.1.0     
bleach                 3.1.5     
certifi                2019.11.28
cffi                   1.13.0    
chardet                3.0.4     
conda                  4.8.2     
conda-package-handling 1.6.0     
cryptography           2.8       
decorator              4.4.2     
defusedxml             0.6.0     
entrypoints            0.3       
idna                   2.8       
importlib-metadata     1.6.0     
ipykernel              5.2.1     
ipython                7.14.0    
ipython-genutils       0.2.0     
ipywidgets             7.5.1     
jedi                   0.17.0    
Jinja2                 2.11.2    
jsonschema             3.2.0     
jupyter                1.0.0     
jupyter-client         6.1.3     
jupyter-console        6.1.0     
jupyter-core           4.6.3     
MarkupSafe             1.1.1     
mistune                0.8.4     
mkl-fft                1.0.15    
mkl-random             1.1.0     
mkl-service            2.3.0     
nbconvert              5.6.1     
nbformat               5.0.6     
notebook               6.0.3     
numpy                  1.18.1    
packaging              20.3      
pandas                 1.0.3     
pandocfilters          1.4.2     
parso                  0.7.0     
pexpect                4.8.0     
pickleshare            0.7.5     
pip                    19.3.1    
prometheus-client      0.7.1     
prompt-toolkit         3.0.5     
ptyprocess             0.6.0     
pycosat                0.6.3     
pycparser              2.19      
Pygments               2.6.1     
pyOpenSSL              19.0.0    
pyparsing              2.4.7     
pyq                    4.2.1     
pyrsistent             0.16.0    
PySocks                1.7.1     
python-dateutil        2.8.1     
pytz                   2020.1    
pyzmq                  19.0.1    
qtconsole              4.7.4     
QtPy                   1.9.0     
requests               2.22.0    
ruamel-yaml            0.15.46   
Send2Trash             1.5.0     
setuptools             41.4.0    
six                    1.12.0    
terminado              0.8.3     
testpath               0.4.4     
tornado                6.0.4     
tqdm                   4.36.1    
traitlets              4.3.3     
urllib3                1.24.2    
wcwidth                0.1.9     
webencodings           0.5.1     
wheel                  0.33.6    
widgetsnbextension     3.5.1     
zipp                   3.1.0     

mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug label Aug 7, 2021
@banditelol

Also works for me on 1.3.5
(image attached)
Should this issue be closed?

@jreback
Contributor

jreback commented Mar 27, 2022

would take a regression test to close

@banditelol
Copy link

banditelol commented Mar 27, 2022

Do you mean writing a regression test for the correct behavior, like this one, to make sure the behavior stays correct?

def test_groupby_return_type():
    # GH2893, return a reduced type
    df1 = DataFrame(
        [
            {"val1": 1, "val2": 20},
            {"val1": 1, "val2": 19},
            {"val1": 2, "val2": 27},
            {"val1": 2, "val2": 12},
        ]
    )

    def func(dataf):
        return dataf["val2"] - dataf["val2"].mean()

    with tm.assert_produces_warning(FutureWarning):
        result = df1.groupby("val1", squeeze=True).apply(func)
    assert isinstance(result, Series)

    df2 = DataFrame(
        [
            {"val1": 1, "val2": 20},
            {"val1": 1, "val2": 19},
            {"val1": 1, "val2": 27},
            {"val1": 1, "val2": 12},
        ]
    )

    def func(dataf):
        return dataf["val2"] - dataf["val2"].mean()

    with tm.assert_produces_warning(FutureWarning):
        result = df2.groupby("val1", squeeze=True).apply(func)
    assert isinstance(result, Series)

    # GH3596, return a consistent type (regression in 0.11 from 0.10.1)
    df = DataFrame([[1, 1], [1, 1]], columns=["X", "Y"])
    with tm.assert_produces_warning(FutureWarning):
        result = df.groupby("X", squeeze=False).count()
    assert isinstance(result, DataFrame)

@jreback
Contributor

jreback commented Mar 27, 2022

yes
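
For reference, a regression test for this particular report could look roughly like the sketch below. It is not the test that ended up in pandas. The function names and the seed are hypothetical, and the expected size of 12 assumes all three letters show up in the sample. The use_inf_as_na option is only switched on if the installed pandas still has it, on the assumption that newer releases deprecate or drop it. The KeyError catch relies on pandas' OptionError being a KeyError subclass.

```python
import contextlib
import warnings

import numpy as np
import pandas as pd


def groupby_interval_and_label_mean(seed=0):
    """Group by a qcut (categorical interval) key plus a plain column."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {"x": rng.standard_normal(100), "w": rng.choice(list("ABC"), 100)}
    )
    bins = pd.qcut(df["x"], q=np.linspace(0, 1, 5))

    with contextlib.ExitStack() as stack:
        # Silence deprecation warnings about the option on newer versions.
        stack.enter_context(warnings.catch_warnings())
        warnings.simplefilter("ignore", FutureWarning)
        try:
            # The original report sets this option before grouping.
            stack.enter_context(pd.option_context("use_inf_as_na", True))
        except KeyError:
            pass  # assumption: the option is gone in this pandas version
        return df.groupby([bins, "w"], observed=False)["x"].agg("mean")


def test_groupby_categorical_interval_with_other_column():
    # Hypothetical test name; the call raised TypeError in the original report.
    result = groupby_interval_and_label_mean()
    assert isinstance(result.index, pd.MultiIndex)
    assert list(result.index.names) == ["x", "w"]
    # observed=False expands to the full product: 4 bins x 3 letters.
    assert len(result) == 12


test_groupby_categorical_interval_with_other_column()
```

A real test in the pandas suite would more likely build an explicit expected Series and compare with tm.assert_series_equal instead of checking only the shape.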

@banditelol

Noted, can I take on this issue?

@banditelol

After experimenting in a notebook, it seems I can reproduce the problem under these conditions:

  • Using pandas 1.0.3 (no problem on 1.0.4)
  • MultiIndexing the DataFrame
  • Using a column with CategoricalDtype, not IntervalDtype
  • Using observed=False as groupby's argument

It seems the current test suite already takes this into account with the following lines:
    def test_observed(observed):
        # multiple groupers, don't re-expand the output space
        # of the grouper
        # gh-14942 (implement)
        # gh-10132 (back-compat)
        # gh-8138 (back-compat)
        # gh-8869

        cat1 = Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"], ordered=True)
        cat2 = Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"], ordered=True)
        df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
        df["C"] = ["foo", "bar"] * 2

        # multiple groupers with a non-cat
        gb = df.groupby(["A", "B", "C"], observed=observed)
        exp_index = MultiIndex.from_arrays(
            [cat1, cat2, ["foo", "bar"] * 2], names=["A", "B", "C"]
        )
        expected = DataFrame({"values": Series([1, 2, 3, 4], index=exp_index)}).sort_index()
        result = gb.sum()
        if not observed:
            expected = cartesian_product_for_groupers(
                expected, [cat1, cat2, ["foo", "bar"]], list("ABC"), fill_value=0
            )

        tm.assert_frame_equal(result, expected)

        gb = df.groupby(["A", "B"], observed=observed)
        exp_index = MultiIndex.from_arrays([cat1, cat2], names=["A", "B"])
        expected = DataFrame({"values": [1, 2, 3, 4]}, index=exp_index)
        result = gb.sum()
        if not observed:
            expected = cartesian_product_for_groupers(
                expected, [cat1, cat2], list("AB"), fill_value=0
            )

        tm.assert_frame_equal(result, expected)

I want to run this unit test against pandas 1.0.3. Do you have any recommendation for how to do that?
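
Since the failure in the traceback goes through _reindex_output, which as far as I can tell only expands the index when observed=False, two workarounds might help anyone stuck on an affected version. Neither has been tried on 1.0.3, so treat this as a sketch under that assumption. The seed and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
t = pd.DataFrame(
    {"x": rng.standard_normal(100), "w": rng.choice(list("ABC"), 100)}
)
qq = pd.qcut(t["x"], q=np.linspace(0, 1, 5))

# Option 1: only keep observed category combinations, so the index
# is not re-expanded to the full Cartesian product.
observed_only = t.groupby([qq, "w"], observed=True)["x"].mean()

# Option 2: drop the categorical dtype by grouping on the string
# form of the intervals.
as_str = t.groupby([qq.astype(str), "w"])["x"].mean()

print(observed_only)
print(as_str)
```

Both variants only return combinations that actually occur in the data, so bins with no rows for some letter are missing from the result instead of showing up as NaN.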

@PrimeF
Contributor

PrimeF commented Apr 20, 2023

I've added a test for this older bug which, as stated above, was already fixed a while back.

@mcgeestocks
Contributor

Looks like the closed PR #52818 should have closed this issue. Pinging @phofl for visibility.

@aibanez7

Hello,
Could this task be closed?
Thank you,

8 participants