BUG: grouping with categorical interval columns #34164

Tracked by #7
antipisa opened this issue May 13, 2020 · 11 comments
good first issue Needs Tests Unit test(s) needed to prevent regressions



antipisa commented May 13, 2020

pandas 1.0.3
numpy 1.18.1

There is a bug in the pandas 1.x releases that prevents grouping by a categorical interval column together with another column.

import numpy as np
import pandas as pd
t = pd.DataFrame({"x":np.random.randn(100), 'w':np.random.choice(list("ABC"), 100)})
qq = pd.qcut(t['x'], q=np.linspace(0,1,5))

This works and gives the expected result:

x
(-10.001, -1.0]   -1.431893
(-1.0, 0.0]       -0.423564
(0.0, 1.0]         0.461174
(1.0, 10.0]        1.662297
Name: x, dtype: float64

This raises a TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-43-6d7782f17653> in <module>
----> 1 t.groupby([qq,'w'])['x'].agg('mean')

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/ in aggregate(self, func, *args, **kwargs)
    246         if isinstance(func, str):
--> 247             return getattr(self, func)(*args, **kwargs)
    249         elif isinstance(func, abc.Iterable):

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/ in mean(self, *args, **kwargs)
   1223         nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
   1224         return self._cython_agg_general(
-> 1225             "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
   1226         )

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/ in _cython_agg_general(self, how, alt, numeric_only, min_count)
    907             raise DataError("No numeric types to aggregate")
--> 909         return self._wrap_aggregated_output(output)
    911     def _python_agg_general(self, func, *args, **kwargs):

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/ in _wrap_aggregated_output(self, output)
    384             output=output, index=self.grouper.result_index
    385         )
--> 386         return self._reindex_output(result)._convert(datetime=True)
    388     def _wrap_transformed_output(

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/ in _reindex_output(self, output, fill_value)
   2481         levels_list = [ping.group_index for ping in groupings]
   2482         index, _ = MultiIndex.from_product(
-> 2483             levels_list, names=self.grouper.names
   2484         ).sortlevel()

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/ in from_product(cls, iterables, sortorder, names)
    552         codes = cartesian_product(codes)
--> 553         return MultiIndex(levels, codes, sortorder=sortorder, names=names)
    555     @classmethod

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/ in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
    279         if verify_integrity:
--> 280             new_codes = result._verify_integrity()
    281             result._codes = new_codes

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/ in _verify_integrity(self, codes, levels)
    367         codes = [
--> 368             self._validate_codes(level, code) for level, code in zip(levels, codes)
    369         ]
    370         new_codes = FrozenList(codes)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/ in <listcomp>(.0)
    367         codes = [
--> 368             self._validate_codes(level, code) for level, code in zip(levels, codes)
    369         ]
    370         new_codes = FrozenList(codes)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/ in _validate_codes(self, level, code)
    302         to a level with missing values (NaN, NaT, None).
    303         """
--> 304         null_mask = isna(level)
    305         if np.any(null_mask):
    306             code = np.where(null_mask[code], -1, code)

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/ in isna(obj)
    124     Name: 1, dtype: bool
    125     """
--> 126     return _isna(obj)

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/ in _isna_old(obj)
    181         return False
    182     elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass, ABCExtensionArray)):
--> 183         return _isna_ndarraylike_old(obj)
    184     elif isinstance(obj, ABCGeneric):
    185         return obj._constructor(obj._data.isna(func=_isna_old))

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/ in _isna_ndarraylike_old(obj)
    281         else:
    282             result = np.empty(shape, dtype=bool)
--> 283             vec = libmissing.isnaobj_old(values.ravel())
    284             result[:] = vec.reshape(shape)

TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got Categorical)
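
The final message says a Categorical reached a routine that expected a NumPy array. As a quick check, the sketch below confirms that `pd.qcut` returns a categorical Series whose categories are intervals, which is the kind of grouping key involved here. The evenly spaced input is an assumption made for determinism; the report used random data.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random column in the report (assumption).
x = pd.Series(np.linspace(-3.0, 3.0, 100), name="x")

# Quartile bins, built the same way as in the report.
qq = pd.qcut(x, q=np.linspace(0, 1, 5))

# The result is categorical, and its categories form an IntervalIndex.
is_categorical = isinstance(qq.dtype, pd.CategoricalDtype)
has_interval_categories = isinstance(qq.cat.categories, pd.IntervalIndex)
n_bins = len(qq.cat.categories)

print(is_categorical, has_interval_categories, n_bins)
```

So the key is a categorical column with interval categories rather than a plain interval-typed column.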

@antipisa antipisa added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 13, 2020

dsaxton commented May 14, 2020

This seems to work on master for me:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.set_option("use_inf_as_na", True)

In [4]: t = pd.DataFrame({"x": np.random.randn(100), "w": np.random.choice(list("ABC"), 100)})

In [5]: qq = pd.qcut(t["x"], q=np.linspace(0, 1, 5))

In [6]: t.groupby([qq, "w"])["x"].agg("mean")
x                              w
(-2.7649999999999997, -0.736]  A   -1.412247
                               B   -1.319972
                               C   -1.108550
(-0.736, -0.114]               A   -0.351454
                               B   -0.388151
                               C   -0.404094
(-0.114, 0.587]                A    0.134442
                               B    0.235705
                               C    0.406392
(0.587, 1.864]                 A    1.056471
                               B    0.973123
                               C    1.189502
Name: x, dtype: float64

@dsaxton dsaxton removed the Needs Triage Issue that has not been reviewed by a pandas team member label May 14, 2020
Copy link

Here is my env:

Package                Version   
---------------------- ----------
appnope                0.1.0     
asn1crypto             1.2.0     
attrs                  19.3.0    
backcall               0.1.0     
bleach                 3.1.5     
certifi                2019.11.28
cffi                   1.13.0    
chardet                3.0.4     
conda                  4.8.2     
conda-package-handling 1.6.0     
cryptography           2.8       
decorator              4.4.2     
defusedxml             0.6.0     
entrypoints            0.3       
idna                   2.8       
importlib-metadata     1.6.0     
ipykernel              5.2.1     
ipython                7.14.0    
ipython-genutils       0.2.0     
ipywidgets             7.5.1     
jedi                   0.17.0    
Jinja2                 2.11.2    
jsonschema             3.2.0     
jupyter                1.0.0     
jupyter-client         6.1.3     
jupyter-console        6.1.0     
jupyter-core           4.6.3     
MarkupSafe             1.1.1     
mistune                0.8.4     
mkl-fft                1.0.15    
mkl-random             1.1.0     
mkl-service            2.3.0     
nbconvert              5.6.1     
nbformat               5.0.6     
notebook               6.0.3     
numpy                  1.18.1    
packaging              20.3      
pandas                 1.0.3     
pandocfilters          1.4.2     
parso                  0.7.0     
pexpect                4.8.0     
pickleshare            0.7.5     
pip                    19.3.1    
prometheus-client      0.7.1     
prompt-toolkit         3.0.5     
ptyprocess             0.6.0     
pycosat                0.6.3     
pycparser              2.19      
Pygments               2.6.1     
pyOpenSSL              19.0.0    
pyparsing              2.4.7     
pyq                    4.2.1     
pyrsistent             0.16.0    
PySocks                1.7.1     
python-dateutil        2.8.1     
pytz                   2020.1    
pyzmq                  19.0.1    
qtconsole              4.7.4     
QtPy                   1.9.0     
requests               2.22.0    
ruamel-yaml            0.15.46   
Send2Trash             1.5.0     
setuptools             41.4.0    
six                    1.12.0    
terminado              0.8.3     
testpath               0.4.4     
tornado                6.0.4     
tqdm                   4.36.1    
traitlets              4.3.3     
urllib3                1.24.2    
wcwidth                0.1.9     
webencodings           0.5.1     
wheel                  0.33.6    
widgetsnbextension     3.5.1     
zipp                   3.1.0     

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Aug 7, 2021
Copy link

Also works for me on 1.3.5
Should this issue be closed?


jreback commented Mar 27, 2022

would take a regression test to close
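
A regression test for the reported call could look roughly like the sketch below. It is not the test that was eventually added to pandas; the test name, the deterministic input data, and the explicit `observed=False` are assumptions made so the example is reproducible across pandas versions.

```python
import numpy as np
import pandas as pd


def test_groupby_qcut_interval_with_second_key():
    # Deterministic data (assumption): the report used unseeded random values.
    df = pd.DataFrame(
        {
            "x": np.linspace(-3.0, 3.0, 100),
            "w": np.tile(list("ABC"), 34)[:100],
        }
    )
    # Categorical Series with interval categories (quartile bins).
    qq = pd.qcut(df["x"], q=np.linspace(0, 1, 5))

    # This is the call that raised TypeError in the report. observed=False is
    # passed explicitly because the default differs between pandas versions.
    result = df.groupby([qq, "w"], observed=False)["x"].agg("mean")

    # The call should return a Series indexed by (interval, w) pairs.
    assert isinstance(result, pd.Series)
    assert result.index.nlevels == 2
    assert list(result.index.names) == ["x", "w"]
    # 4 quartile bins times 3 labels.
    assert len(result) == 12
    return result


result = test_groupby_qcut_interval_with_second_key()
```

The assertions check only the shape and labels of the result, since the point is that the call completes instead of raising.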


banditelol commented Mar 27, 2022

Do you mean writing a regression test for the correct behavior, like this one, to make sure the behavior stays correct?

import pandas._testing as tm
from pandas import DataFrame, Series


def test_groupby_return_type():
    # GH2893, return a reduced type
    # (squeeze is deprecated, hence the expected FutureWarning)
    df1 = DataFrame(
        [
            {"val1": 1, "val2": 20},
            {"val1": 1, "val2": 19},
            {"val1": 2, "val2": 27},
            {"val1": 2, "val2": 12},
        ]
    )

    def func(dataf):
        return dataf["val2"] - dataf["val2"].mean()

    with tm.assert_produces_warning(FutureWarning):
        result = df1.groupby("val1", squeeze=True).apply(func)
    assert isinstance(result, Series)

    df2 = DataFrame(
        [
            {"val1": 1, "val2": 20},
            {"val1": 1, "val2": 19},
            {"val1": 1, "val2": 27},
            {"val1": 1, "val2": 12},
        ]
    )

    def func(dataf):
        return dataf["val2"] - dataf["val2"].mean()

    with tm.assert_produces_warning(FutureWarning):
        result = df2.groupby("val1", squeeze=True).apply(func)
    assert isinstance(result, Series)

    # GH3596, return a consistent type (regression in 0.11 from 0.10.1)
    df = DataFrame([[1, 1], [1, 1]], columns=["X", "Y"])
    with tm.assert_produces_warning(FutureWarning):
        result = df.groupby("X", squeeze=False).count()
    assert isinstance(result, DataFrame)


jreback commented Mar 27, 2022


Copy link

Noted, can I take on this issue?

Copy link

After experimenting in a notebook, it seems I can reproduce the problem under these conditions:

  • Using pandas 1.0.3 (no problem on 1.0.4)
  • Grouping a DataFrame by multiple keys (MultiIndex result)
  • Using a column of type CategoricalDtype, not IntervalDtype
  • Using observed=False as groupby's argument
    And it seems the current test suite has already taken this into account with the following lines:
    import pandas._testing as tm
    from pandas import Categorical, DataFrame, MultiIndex, Series

    # `observed` is a pytest fixture and cartesian_product_for_groupers is a
    # helper, both taken from the pandas test suite.
    def test_observed(observed):
        # multiple groupers, don't re-expand the output space
        # of the grouper
        # gh-14942 (implement)
        # gh-10132 (back-compat)
        # gh-8138 (back-compat)
        # gh-8869
        cat1 = Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"], ordered=True)
        cat2 = Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"], ordered=True)
        df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
        df["C"] = ["foo", "bar"] * 2

        # multiple groupers with a non-cat
        gb = df.groupby(["A", "B", "C"], observed=observed)
        exp_index = MultiIndex.from_arrays(
            [cat1, cat2, ["foo", "bar"] * 2], names=["A", "B", "C"]
        )
        expected = DataFrame({"values": Series([1, 2, 3, 4], index=exp_index)}).sort_index()
        result = gb.sum()
        if not observed:
            expected = cartesian_product_for_groupers(
                expected, [cat1, cat2, ["foo", "bar"]], list("ABC"), fill_value=0
            )

        tm.assert_frame_equal(result, expected)

        gb = df.groupby(["A", "B"], observed=observed)
        exp_index = MultiIndex.from_arrays([cat1, cat2], names=["A", "B"])
        expected = DataFrame({"values": [1, 2, 3, 4]}, index=exp_index)
        result = gb.sum()
        if not observed:
            expected = cartesian_product_for_groupers(
                expected, [cat1, cat2], list("AB"), fill_value=0
            )

        tm.assert_frame_equal(result, expected)

    I want to run this unit test against pandas 1.0.3. Do you have any recommendations for doing so?
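
The `observed=False` condition above is plausibly the relevant one, because in the traceback the failure happens in `_reindex_output`, where the result is expanded to the full product of grouping levels via `MultiIndex.from_product`. A minimal sketch of that expansion, using small categoricals modelled on the quoted test (the variable names are illustrative):

```python
import pandas as pd

# Each categorical declares one category ("z", "y") that never occurs.
cat1 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"])
cat2 = pd.Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"])
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# observed=True keeps only the category combinations present in the data.
observed_sum = df.groupby(["A", "B"], observed=True)["values"].sum()

# observed=False expands to the cartesian product of all categories (3 x 3),
# filling the combinations that never occur with 0 for sum().
full_sum = df.groupby(["A", "B"], observed=False)["values"].sum()

print(len(observed_sum))  # 4
print(len(full_sum))      # 9
```

With `observed=True` the expansion step is skipped, which would fit the observation that only the `observed=False` path triggered the error.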


PrimeF commented Apr 20, 2023

I've added a test for this older bug which, as stated above, was already fixed a while back.

Copy link

It looks like the closed PR #52818 should have closed this issue. Pinging @phofl for visibility.

Copy link

Could this task be closed?
Thank you,
