INT: include NA values in Categorical.categories #37930

Open
jbrockmendel opened this issue Nov 18, 2020 · 6 comments
Labels
Categorical (Categorical Data Type); Internals (Related to non-user accessible pandas implementation); PDEP missing values (Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint)

Comments

@jbrockmendel
Member

There are a bunch of places in CategoricalIndex where we check something like:

codes = self.categories.get_indexer(target)
if (codes == -1).any():
    do_something()

This leads to ambiguity, as a -1 code can indicate an NA value that is present, or a non-NA value that is not among self.categories. Having to sort out which we're looking at is a hassle which we should try to avoid.
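To make the ambiguity concrete, here is a minimal sketch of the extra bookkeeping a caller currently needs. The variable names (`na_hits`, `not_in_categories`) are made up for illustration and are not taken from the pandas code base; the only pandas calls used are `get_indexer` and `pd.isna`.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["A", "B", np.nan, "C"])
target = pd.Index(["A", np.nan, "D"], dtype=object)

# Both the NA entry and the unknown label "D" come back as -1.
codes = cat.categories.get_indexer(target)
missing = codes == -1

# Illustrative disambiguation: split the -1 positions by whether
# the corresponding target value is itself NA.
is_na = np.asarray(pd.isna(target))
na_hits = missing & is_na               # NA values in the target
not_in_categories = missing & ~is_na    # non-NA values absent from categories
```

The second mask is what most of the `(codes == -1).any()` checks presumably care about, but that is an assumption about intent rather than something the existing code states.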

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member Categorical Categorical Data Type Internals Related to non-user accessible pandas implementation and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 18, 2020
@jreback
Contributor

jreback commented Nov 18, 2020

we don't allow na values in the constructor by definition so this should be unambiguous

@jbrockmendel
Member Author

> we don't allow na values in the constructor by definition so this should be unambiguous

which constructor doesn't allow NA values? I can do

import numpy as np
import pandas as pd

cat = pd.Categorical(["A", "B", np.nan, "C"])

target1 = [np.nan]
target2 = ["D"]

>>> cat.categories.get_indexer(target1)
array([-1])
>>> cat.categories.get_indexer(target2)
array([-1])

@jreback
Contributor

jreback commented Nov 18, 2020

check the categories

@jorisvandenbossche
Member

When Categorical was originally added, it did support missing values in its categories (which means there were basically two ways to have missing values: a -1 in the codes, or a missing value in the categories). But shortly after, we changed that to only allow a single way, i.e. -1 in the codes, and thus disallow missing values in the categories.

General constructors like pd.Categorical(["A", "B", np.nan, "C"]) will convert the missing values to -1 in the codes, and specialized constructors check that the categories don't have nans:

In [15]: pd.Categorical.from_codes([0, 1, 2], categories=["a", "b", None])
...
~/scipy/pandas/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
    502 
    503             if categories.hasnans:
--> 504                 raise ValueError("Categorical categories cannot be null")
    505 
    506             if not categories.is_unique:

ValueError: Categorical categories cannot be null
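The other half of that description, the general constructor mapping missing values to -1 in the codes, can be checked with a short sketch. It assumes only the public `codes`, `categories`, and `from_codes` API; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["A", "B", np.nan, "C"])

# The NaN never reaches the categories; it is stored as code -1.
codes = cat.codes.tolist()
categories = cat.categories.tolist()
has_nan_in_categories = bool(cat.categories.hasnans)

# from_codes accepts -1 directly as the missing-value sentinel.
from_codes_cat = pd.Categorical.from_codes([0, 1, -1], categories=["a", "b"])
from_codes_na = from_codes_cat.isna().tolist()
```

Here `codes` should come out as `[0, 1, -1, 2]` with categories `["A", "B", "C"]`, so the only representation of a missing value is the -1 code.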

@TomAugspurger
Contributor

TomAugspurger commented Nov 18, 2020

There's another issue somewhere discussing this, but I can't find it right now. At the time, we (I?) decided against including NA-like values in the categories.

@jbrockmendel
Member Author

> There's another issue somewhere discussing this, but I can't find it right now. At the time, we (I?) decided against including NA-like values in the categories.

I'm not finding it either, but I think there was something about trying to support a specific NA value or possibly multiple distinct NA types.


4 participants