Allow creating Categorical with pre-defined categories #10705

Closed
stinodego opened this issue Aug 23, 2023 · 13 comments · Fixed by #12954
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

@stinodego
Member

stinodego commented Aug 23, 2023

Problem description

Blocker for #10701

For the interchange protocol, we need to be able to reconstruct a categorical column from the physical representation and a string column containing the categories, ideally in a zero-copy manner.

import polars as pl

physical = pl.Series([0, 1, 0, 2], dtype=pl.UInt32)
categories = pl.Series(["a", "b", "c"])

dtype = pl.Categorical(categories=categories)
result = physical.cast(dtype)

print(result)
shape: (4,)
Series: '' [cat]
[
        "a"
        "b"
        "a"
        "c"
]
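
For readers unfamiliar with the physical representation, the reconstruction requested above amounts to a dictionary lookup: each integer code indexes into the list of categories. The following is a minimal pure-Python sketch of that idea only. It is not Polars API, it makes no attempt at zero-copy, and the function name `decode_categorical` is made up for illustration.

```python
def decode_categorical(physical, categories):
    """Map each integer code to its category string (None stays None)."""
    n = len(categories)
    out = []
    for code in physical:
        if code is None:
            out.append(None)
            continue
        if not 0 <= code < n:
            raise ValueError(f"code {code} is out of range for {n} categories")
        out.append(categories[code])
    return out


result = decode_categorical([0, 1, 0, 2], ["a", "b", "c"])
print(result)  # ['a', 'b', 'a', 'c']
```

The range check matters for the interchange use case: a physical column that refers to a code with no matching category cannot be reconstructed faithfully.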

This will probably involve a quite significant overhaul of the Categorical type. I would consider implementing the following requested feature in one go:

stinodego added the enhancement and accepted labels on Aug 23, 2023
github-project-automation bot moved this to Ready in Backlog on Aug 23, 2023
stinodego moved this from Ready to Next in Backlog on Aug 23, 2023
@mcrumiller
Contributor

mcrumiller commented Aug 24, 2023

I try to avoid bringing Matlab in here (I'm an ex-Matlab user), but they have a few implementations here and there that I really like. One of those is their datetime system with duration and calendarduration distinction, and the other is their categoricals. For polars, since we're restricting to strings, we don't need special category names, they can just be the initial strings themselves. But I would still propose the following:

s = pl.Series(
    ['a', 'b', 'c'],
    dtype=pl.Categorical(
        values=['g', 'a', 'b', 'd', 'e', 'f'],  # define the ordering here, if ordinal=True
        ordinal=True  # if False, can only perform equality comparison
    )
)

I prefer the ordinal as a boolean variable. Categories in reality, like colors or countries, are not ordered. If they are ordered, then the user can set their ordering. As a consequence, ordered categoricals come with .cat.reorder_categories which re-orders the categories, or .cat.rename_categories, which redefines the labels. In either case, there is no distinction between "lexical" and "physical," as that is an interior implementation detail.
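
To make the proposed semantics concrete, here is a small pure-Python sketch of how an `ordinal` flag could gate comparisons. It is only an illustration of the idea, not Polars behaviour; the helper name `make_less_than` is hypothetical.

```python
def make_less_than(values, ordinal):
    """Return a comparison function that ranks items by their position in `values`."""
    rank = {v: i for i, v in enumerate(values)}

    def less_than(a, b):
        if not ordinal:
            # Unordered categoricals only support equality checks.
            raise TypeError("unordered categorical supports only equality comparison")
        return rank[a] < rank[b]

    return less_than


lt = make_less_than(["g", "a", "b", "d", "e", "f"], ordinal=True)
print(lt("g", "a"))  # True: 'g' comes first in the declared order
print(lt("b", "a"))  # False
```

Note that the comparison follows the declared order rather than the lexical order of the strings, which is why 'g' sorts before 'a' here.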

@avimallu
Contributor

I agree with @mcrumiller. R has a similar implementation and I do agree that it is far superior to what is currently available across Python (I haven't used pandas's categoricals as much, but I do remember them being a little cumbersome compared to R). Specifically, the following operations might become feasible if categories are implemented in this way:

  1. If an order is defined, then in a to_dummies equivalent, the first value is always the lowest ordinal value in the categorical data.
  2. The tidyverse's forcats package has very good implementations for reordering categories flexibly, lumping them together (such as retaining only the top-occurring N or top-occurring p% of them), anonymizing them (very useful when dealing with sensitive data), expanding the categories with more values, unifying different categories, etc.

Where can this be useful for Polars? I've got a few starting advantages:

  1. Defining a specific order when pivoting a categorical ordered column (say I want low, medium, high together).
  2. Grouping by category always ensures that all underlying values are present in the DataFrame - extremely useful when creating reports that go into Excel or a dashboarding tool.
  3. Being able to lump categories together will make it easy to summarize when there are many categories ("show me the top 5 segments with highest sales").
  4. Combining categories from different dataframes created in separate instances.
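
As a rough illustration of the lumping idea in point 3, here is a minimal pure-Python sketch. It is not Polars or forcats API; the function name `lump_top_n` and the "Other" label are made up for this example.

```python
from collections import Counter


def lump_top_n(values, n, other="Other"):
    """Keep the n most frequent categories and replace the rest with `other`."""
    counts = Counter(values)
    keep = {v for v, _ in counts.most_common(n)}
    return [v if v in keep else other for v in values]


segments = ["a", "a", "a", "b", "b", "c", "d"]
print(lump_top_n(segments, 2))
# ['a', 'a', 'a', 'b', 'b', 'Other', 'Other']
```

How ties at the cut-off should be resolved is a design choice this sketch leaves to `Counter.most_common`.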

stinodego self-assigned this on Aug 24, 2023
stinodego moved this from Next to In progress in Backlog on Aug 24, 2023
@alexander-beedie
Collaborator

alexander-beedie commented Aug 25, 2023

One bonus point; once we support Series creation from a fixed set of categories, we should also be able to validate that no new values are subsequently added to that Series - eg: it behaves as an Enum, where any values not originally specified are invalid and would raise an error. This would be incredibly useful/powerful in many contexts 😎👍

@mcrumiller
Contributor

mcrumiller commented Aug 25, 2023

> we should also be able to validate that no new values are subsequently added to that Series

@alexander-beedie can you give me a reason why this is useful? I rather like not having to explicitly do add_categories, unless the series has ordinal=True in which case yes, it would be very useful.

In pandas, for many of my projects I often write up a quick categorical_concat or something similar which 1) first checks for new values, 2) adds those categories, and 3) concats. It's a bit cumbersome sometimes.
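
For context, the three steps of such a helper look roughly like the sketch below. It is written in plain Python over (codes, categories) pairs rather than against pandas, so the signature is an assumption made for illustration.

```python
def categorical_concat(left_codes, left_cats, right_codes, right_cats):
    """Concatenate two dictionary-encoded columns, extending the categories as needed."""
    merged = list(left_cats)
    index = {c: i for i, c in enumerate(merged)}
    # Steps 1 and 2: find values that are new on the right and add them as categories.
    for c in right_cats:
        if c not in index:
            index[c] = len(merged)
            merged.append(c)
    # Step 3: remap the right-hand codes to the merged categories, then concatenate.
    remapped = [index[right_cats[code]] for code in right_codes]
    return list(left_codes) + remapped, merged


codes, cats = categorical_concat([0, 1], ["a", "b"], [0, 1, 0], ["c", "a"])
print(codes)  # [0, 1, 2, 0, 2]
print(cats)   # ['a', 'b', 'c']
```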

@stevenlis

@mcrumiller I'm curious about having two categories: "ordered" and "unordered". In my understanding, each value in a categorical variable is assigned a numerical code (e.g., 1, 2, 3...) whether it's ordered or not. If we care about the codes, we can define an order; otherwise, we can let Polars handle it without concern. Therefore, I don't see the necessity of explicitly specifying "ordinal." To me, once you define the values=['g', 'a', 'b', 'd', 'e', 'f'], you are defining an order. Hence, I find "ordinal" somewhat redundant. I have used "categorical" extensively in pandas and have never understood the distinction. I've always been curious. Have you encountered an example in your workflow in which you must define an unordered categorical?

@mcrumiller
Contributor

@stevenlis it's more of an implementation detail I suppose, but tracking an ordered set is substantially harder than an unordered set and requires many more allocations once you start messing with merging, adding, uniting, etc. An unordered set has no notion of sorting, comparison operations (except for equality), and a lot of computation can be avoided with unordered sets.

It's also useful to know whether your data represents an ordered set or an unordered set, although yes, usually you know if your set is ordered or not.

@stevenlis

Oh, I see. Are there any disadvantages to having the following API without ordinal?

# ordered
s = pl.Series(
    ['a', 'b', 'c'], dtype=pl.Categorical(order=['g', 'a', 'b', 'd', 'e', 'f'])
)

# unordered
s = pl.Series(
    ['a', 'b', 'c'], dtype=pl.Categorical
)

@alexander-beedie
Collaborator

alexander-beedie commented Aug 26, 2023

> @alexander-beedie can you give me a reason why this is useful? I rather like not having to explicitly do add_categories, unless the series has ordinal=True in which case yes, it would be very useful.

I'm not suggesting it always constrains the values to those initially declared, but that it should be able to in order to act more like an Enum; this is useful in a myriad of places. It can act as cheap/integrated validation (for example: you have data relating to trading a fixed set of Asian currencies and somebody accidentally concatenates trades against EUR - it could immediately raise an error in this scenario and prevent errors or incorrect results later in the pipeline, etc).

We could have Enum as a specialisation of Categorical, so it's easy to see at-a-glance what the intended behaviour is, or perhaps as an option on Categorical, along the lines of fixed=True or frozen=True. Either approach would be reasonable.
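
The validation behaviour described here can be sketched in a few lines of plain Python. This is only an illustration of the Enum-like idea under assumed names (`FixedCategories`, `encode`), not an existing or proposed Polars API, and the currency codes are sample values.

```python
class FixedCategories:
    """A frozen set of categories: encoding any other value raises an error."""

    def __init__(self, categories):
        self._codes = {c: i for i, c in enumerate(categories)}

    def encode(self, values):
        unknown = sorted({v for v in values if v not in self._codes})
        if unknown:
            raise ValueError(f"values not in the fixed categories: {unknown}")
        return [self._codes[v] for v in values]


asian_ccy = FixedCategories(["JPY", "HKD", "SGD"])
print(asian_ccy.encode(["JPY", "SGD", "JPY"]))  # [0, 2, 0]

try:
    asian_ccy.encode(["JPY", "EUR"])
except ValueError as exc:
    print(exc)  # the stray EUR trade is rejected at the point of entry
```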

@mcrumiller
Contributor

@alexander-beedie yeah I like the fixed version. We could call it protected, having it be true by default when ordinal=True.

stinodego removed their assignment on Aug 29, 2023
@Wainberg
Contributor

Wainberg commented Sep 4, 2023

This is a really vital enhancement! Couple of thoughts:

  1. This goes a long way towards fully supporting "Semantic filtering of Enum columns" (#7636). I think the main things left are to support comparisons like df.filter(pl.col('month') >= 'Jun') and "Implement str namespace functions on categoricals" (#9773).

  2. This should enable joining categorical columns if they have the same dtype. Right now the following code gives a StringCacheMismatchError: cannot compare categoricals coming from different sources:

a = pl.DataFrame({'a': pl.Series(['foo', 'bar', 'baz'], dtype=pl.Categorical)})
b = pl.DataFrame({'b': pl.Series(['foo', 'bar', 'baz'], dtype=pl.Categorical)})
a.join(b, left_on='a', right_on='b')

After this enhancement (#10705) is implemented, this should become possible:

dtype = pl.Categorical(['foo', 'bar', 'baz'])
a = pl.DataFrame({'a': pl.Series(['foo', 'bar', 'baz'], dtype=dtype)})
b = pl.DataFrame({'b': pl.Series(['foo', 'bar', 'baz'], dtype=dtype)})
a.join(b, left_on='a', right_on='b')

  3. Instead of pl.Categorical(categories=pl.Series(['a', 'b', 'c'])), it would be really clean to be able to write pl.Categorical(['a', 'b', 'c']) or even pl.Categorical('a', 'b', 'c'). pl.Categorical could represent an arbitrarily-ordered categorical as @stevenlis suggests (there doesn't have to be an explicit distinction between ordered and unordered like there is in pandas).
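
To illustrate why a shared dtype would help with the join case, the plain-Python sketch below contrasts codes assigned independently per column with codes taken from one agreed list of categories. It is a conceptual model only and does not describe how Polars assigns codes internally; `encode_local` and `encode_shared` are hypothetical names.

```python
def encode_local(values):
    """Assign codes in order of first appearance, as an independent source might."""
    cats, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(cats)
            cats.append(v)
        codes.append(index[v])
    return codes, cats


def encode_shared(values, categories):
    """Assign codes from a list of categories agreed on up front."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


a_codes, a_cats = encode_local(["foo", "bar", "baz"])
b_codes, b_cats = encode_local(["baz", "foo", "bar"])
print(a_codes == b_codes, a_cats == b_cats)  # True False

shared = ["foo", "bar", "baz"]
print(encode_shared(["baz", "foo", "bar"], shared))  # [2, 0, 1]
```

With independent encodings, equal codes can refer to different strings, so the codes cannot be compared directly. With shared categories, equal codes always mean equal strings.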

stinodego moved this from In progress to Next in Backlog on Sep 6, 2023
@lorentzenchr
Contributor

lorentzenchr commented Sep 7, 2023

To give a concrete use case: Machine Learning models that support categoricals usually rely on the exact physical representation. For instance, value "cat" is always represented by, say, 2 and value "dog" always by, say, 3.
The data then needs to be processed in advance, e.g. in an ML pipeline, to comply with this requirement. That's more or less what scikit-learn's OrdinalEncoder does. It would be really helpful to be able to implement that behavior within polars itself. Written in steps:

  • Training:
    Learn and store the dtype or exact mapping from logical to physical values.
  • Predicting:
    Make sure that the learned mapping is applied, do nothing if dtype already conforms to the requirement.
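
The two steps above can be sketched in plain Python as follows. This is a simplified stand-in for what an ordinal encoder does, not scikit-learn or Polars code; the function names and the choice to raise on unseen categories are assumptions made for illustration.

```python
def fit_mapping(training_values):
    """Training: learn a stable value -> code mapping in order of first appearance."""
    mapping = {}
    for v in training_values:
        mapping.setdefault(v, len(mapping))
    return mapping


def apply_mapping(values, mapping):
    """Predicting: reuse the learned mapping and refuse categories never seen in training."""
    try:
        return [mapping[v] for v in values]
    except KeyError as exc:
        raise ValueError(f"unseen category: {exc.args[0]!r}") from None


mapping = fit_mapping(["cat", "dog", "cat", "bird"])
print(mapping)                                         # {'cat': 0, 'dog': 1, 'bird': 2}
print(apply_mapping(["dog", "bird", "dog"], mapping))  # [1, 2, 1]
```

Storing the learned mapping alongside the model is what guarantees that "dog" gets the same code at prediction time as it did during training.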

@lorentzenchr
Contributor

#6003 might also be related.

@jmakov

jmakov commented Nov 11, 2023

This is currently broken - can't write to Arrow dataset and then read the data again.

stinodego added this to the 1.0.0 milestone on Nov 23, 2023
stinodego moved this from Next to Candidate in Backlog on Nov 30, 2023
c-peters self-assigned this on Nov 30, 2023
c-peters moved this from Candidate to In progress in Backlog on Dec 1, 2023
github-project-automation bot moved this from In progress to Done in Backlog on Dec 8, 2023

9 participants