Allow creating `Categorical` with pre-defined categories #10705

stinodego · 2023-08-23T22:38:57Z

Problem description

Blocker for #10701

For the interchange protocol, we need to be able to reconstruct a categorical column from the physical representation and a string column containing the categories, ideally in a zero-copy manner.

import polars as pl

physical = pl.Series([0, 1, 0, 2], dtype=pl.UInt32)
categories = pl.Series(["a", "b", "c"])

dtype = pl.Categorical(categories=categories)
result = physical.cast(dtype)

print(result)

shape: (4,)
Series: '' [cat]
[
        "a"
        "b"
        "a"
        "c"
]
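
The requested behaviour amounts to a dictionary decode: each physical code is an index into the categories. A minimal pure-Python sketch of those semantics follows. The helper name `decode_categorical` is hypothetical and not part of polars, and a real implementation would aim to avoid the copy made here.

```python
def decode_categorical(physical, categories):
    """Map each physical code to its category label (hypothetical helper)."""
    for code in physical:
        # Every code must point at an existing category.
        if not 0 <= code < len(categories):
            raise ValueError(f"code {code} has no matching category")
    return [categories[code] for code in physical]


physical = [0, 1, 0, 2]
categories = ["a", "b", "c"]
result = decode_categorical(physical, categories)
print(result)
```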

This will probably involve a quite significant overhaul of the Categorical type. I would consider implementing the following requested feature in one go:

Allow "lexical" and "physical" as arguments to pl.Categorical #7162


mcrumiller · 2023-08-24T00:29:39Z

I try to avoid bringing Matlab in here (I'm an ex-Matlab user), but they have a few implementations here and there that I really like. One of those is their datetime system with duration and calendarduration distinction, and the other is their categoricals. For polars, since we're restricting to strings, we don't need special category names, they can just be the initial strings themselves. But I would still propose the following:

s = pl.Series(
    ['a', 'b', 'c'],
    dtype=pl.Categorical(
        values=['g', 'a', 'b', 'd', 'e', 'f'],  # define the ordering here, if ordinal=True
        ordinal=True  # if False, can only perform equality comparison
    )
)

I prefer the ordinal as a boolean variable. Categories in reality, like colors or countries, are not ordered. If they are ordered, then the user can set their ordering. As a consequence, ordered categoricals come with .cat.reorder_categories which re-orders the categories, or .cat.rename_categories, which redefines the labels. In either case, there is no distinction between "lexical" and "physical," as that is an interior implementation detail.
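
As a rough illustration of the proposed `ordinal` flag, the sketch below allows ordering comparisons only when `ordinal=True` and restricts unordered categories to equality. The class `SketchCategorical` and its methods are hypothetical stand-ins, not polars API.

```python
class SketchCategorical:
    """Hypothetical stand-in for the proposed dtype, not a polars class."""

    def __init__(self, values, ordinal=False):
        self.values = list(values)
        self.ordinal = ordinal
        # Position in `values` defines the rank used for ordered comparison.
        self._rank = {label: i for i, label in enumerate(self.values)}

    def equal(self, left, right):
        # Equality is always available, ordered or not.
        return self._rank[left] == self._rank[right]

    def less_than(self, left, right):
        if not self.ordinal:
            raise TypeError("unordered categorical supports only equality")
        return self._rank[left] < self._rank[right]


order = ["g", "a", "b", "d", "e", "f"]
ordered = SketchCategorical(order, ordinal=True)
unordered = SketchCategorical(order, ordinal=False)

# "g" precedes "a" in the declared order, so this comparison holds.
print(ordered.less_than("g", "a"))
```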

avimallu · 2023-08-24T01:40:09Z

I agree with @mcrumiller. R has a similar implementation and I do agree that it is far superior to what is currently available across Python (I haven't used pandas's categoricals as much, but I do remember them being a little cumbersome compared to R). Specifically, the following operations might become feasible if categories are implemented in this way:

If an order is defined, then in a to_dummies equivalent, the first value is always the lowest ordinal value in the categorical data.
The tidyverse's forcats package has very good implementations for reordering categories flexibly, lumping them together (such as retaining only the top N or top p% most frequent), anonymizing them (very useful when dealing with sensitive data), expanding the categories with more values, unifying different categories, etc.

Where can this be useful for Polars? I've got a few starting advantages:

Defining a specific order when pivoting a categorical ordered column (say I want low, medium, high together).
Grouping by category always ensures that all underlying values are present in the DataFrame - extremely useful when creating reports that go into Excel or a dashboarding tool.
Being able to lump categories together will make it easy to summarize when there are many categories ("show me the top 5 segments with highest sales").
Combining categories from different dataframes created in separate instances.
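
Two of the ideas above, lumping rare categories together and reporting every declared category even when it has no rows, can be sketched in plain Python. The function names `lump_top_n` and `count_all_categories` are made up for illustration; they only loosely mirror what forcats offers and are not polars API.

```python
from collections import Counter


def lump_top_n(labels, n, other="Other"):
    """Keep the n most frequent labels and replace the rest with `other`."""
    keep = {label for label, _ in Counter(labels).most_common(n)}
    return [label if label in keep else other for label in labels]


def count_all_categories(labels, categories):
    """Count rows per category, including categories that never occur."""
    counts = Counter(labels)
    return {category: counts.get(category, 0) for category in categories}


segments = ["retail", "retail", "online", "online", "online", "wholesale", "export"]
lumped = lump_top_n(segments, n=2)
report = count_all_categories(["low", "high", "low"], ["low", "medium", "high"])
print(lumped)
print(report)
```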

alexander-beedie · 2023-08-25T05:35:22Z

One bonus point; once we support Series creation from a fixed set of categories, we should also be able to validate that no new values are subsequently added to that Series - eg: it behaves as an Enum, where any values not originally specified are invalid and would raise an error. This would be incredibly useful/powerful in many contexts 😎👍

mcrumiller · 2023-08-25T14:48:01Z

> we should also be able to validate that no new values are subsequently added to that Series

@alexander-beedie can you give me a reason why this is useful? I rather like not having to explicitly do add_categories, unless the series has ordinal=True in which case yes, it would be very useful.

In pandas, for many projects I often write up a quick categorical_concat or something similar which 1) first checks for new values, 2) adds those categories, and 3) concats. It's a bit cumbersome sometimes.
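
The three steps can be sketched without pandas, treating a categorical as a pair of integer codes and a list of categories. This is only an illustration of the workflow; `categorical_concat` here is a hypothetical helper and the (codes, categories) representation is an assumption of the sketch.

```python
def categorical_concat(left, right):
    """Concatenate two (codes, categories) pairs (hypothetical helper)."""
    left_codes, left_cats = left
    right_codes, right_cats = right

    merged = list(left_cats)
    index = {label: i for i, label in enumerate(merged)}
    for label in right_cats:
        # 1) check for values the left-hand side does not know yet
        if label not in index:
            # 2) add those categories at the end
            index[label] = len(merged)
            merged.append(label)

    # 3) re-map the right-hand codes onto the merged categories and concat
    remapped = [index[right_cats[code]] for code in right_codes]
    return list(left_codes) + remapped, merged


codes, cats = categorical_concat(([0, 1, 0], ["a", "b"]), ([0, 1], ["b", "c"]))
print(codes, cats)
```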

stevenlis · 2023-08-25T23:48:09Z

@mcrumiller I'm curious about having two categories: "ordered" and "unordered". In my understanding, each value in a categorical variable is assigned a numerical code (e.g., 1, 2, 3...) regardless of whether it's ordered or not. If we care about the codes, we can define an order; otherwise, we can let Polars handle it without concern. Therefore, I don't see the necessity of explicitly specifying "ordinal." To me, once you define values=['g', 'a', 'b', 'd', 'e', 'f'], you are defining an order. Hence, I find "ordinal" somewhat redundant. I have used "categorical" extensively in pandas and have never understood the distinction. I've always been curious. Have you encountered an example in your workflow in which you must define an unordered categorical?

mcrumiller · 2023-08-25T23:53:47Z

@stevenlis it's more of an implementation detail I suppose, but tracking an ordered set is substantially harder than an unordered set and requires many more allocations once you start messing with merging, adding, uniting, etc. An unordered set has no notion of sorting, comparison operations (except for equality), and a lot of computation can be avoided with unordered sets.

It's also useful to know whether your data represents an ordered set or an unordered set, although yes, usually you know if your set is ordered or not.

stevenlis · 2023-08-26T00:08:12Z

Oh, I see. Are there any disadvantages to having the following API without ordinal?

# ordered
s = pl.Series(
    ['a', 'b', 'c'], dtype=pl.Categorical(order=['g', 'a', 'b', 'd', 'e', 'f'])
)

# unordered
s = pl.Series(
    ['a', 'b', 'c'], dtype=pl.Categorical
)

alexander-beedie · 2023-08-26T12:02:04Z

> @alexander-beedie can you give me a reason why this is useful? I rather like not having to explicitly do add_categories, unless the series has ordinal=True in which case yes, it would be very useful.

I'm not suggesting it always constrains the values to those initially declared, but that it should be able to in order to act more like an Enum; this is useful in a myriad of places. It can act as cheap/integrated validation (for example: you have data relating to trading a fixed set of Asian currencies and somebody accidentally concatenates trades against EUR - it could immediately raise an error in this scenario and prevent errors or incorrect results later in the pipeline, etc).

We could have Enum as a specialisation of Categorical, so it's easy to see at-a-glance what the intended behaviour is, or perhaps as an option on Categorical, along the lines of fixed=True or frozen=True. Either approach would be reasonable.
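
To make the Enum-like validation concrete, here is a small pure-Python sketch in which encoding fails as soon as a value outside the declared set shows up. `FixedCategories` is a hypothetical name used only for this illustration, and the currency codes are example data.

```python
class FixedCategories:
    """Hypothetical fixed category set that rejects unknown values."""

    def __init__(self, categories):
        self.categories = tuple(categories)
        self._codes = {label: code for code, label in enumerate(self.categories)}

    def encode(self, labels):
        # Collect every offending value so the error message is complete.
        unknown = sorted({label for label in labels if label not in self._codes})
        if unknown:
            raise ValueError(f"values not in the fixed categories: {unknown}")
        return [self._codes[label] for label in labels]


currencies = FixedCategories(["JPY", "HKD", "SGD"])
codes = currencies.encode(["JPY", "SGD", "JPY"])
print(codes)

try:
    currencies.encode(["JPY", "EUR"])  # EUR was never declared
except ValueError as exc:
    print(exc)
```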

mcrumiller · 2023-08-27T19:47:51Z

@alexander-beedie yeah I like the fixed version. We could call it protected, having it be true by default when ordinal=True.

Wainberg · 2023-09-04T02:21:07Z

This is a really vital enhancement! Couple of thoughts:

This goes a long way towards fully supporting Semantic filtering of Enum columns #7636. I think the main things left are to support comparisons like df.filter(pl.col('month') >= 'Jun') and Implement str namespace functions on categoricals #9773.
This should enable joining categorical columns if they have the same dtype. Right now the following code gives a StringCacheMismatchError: cannot compare categoricals coming from different sources:

a = pl.DataFrame({'a': pl.Series(['foo', 'bar', 'baz'], dtype=pl.Categorical)})
b = pl.DataFrame({'b': pl.Series(['foo', 'bar', 'baz'], dtype=pl.Categorical)})
a.join(b, left_on='a', right_on='b')

After this enhancement (#10705) is implemented, this should become possible:

dtype = pl.Categorical(['foo', 'bar', 'baz'])
a = pl.DataFrame({'a': pl.Series(['foo', 'bar', 'baz'], dtype=dtype)})
b = pl.DataFrame({'b': pl.Series(['foo', 'bar', 'baz'], dtype=dtype)})
a.join(b, left_on='a', right_on='b')

Instead of pl.Categorical(categories=pl.Series(['a', 'b', 'c'])), it would be really clean to be able to write pl.Categorical(['a', 'b', 'c']) or even pl.Categorical('a', 'b', 'c'). pl.Categorical could represent an arbitrarily-ordered categorical as @stevenlis suggests (there doesn't have to be an explicit distinction between ordered and unordered like there is in pandas).

lorentzenchr · 2023-09-07T10:33:13Z

To give a concrete use case: Machine Learning models that support categoricals usually rely on the exact physical representation. For instance, value "cat" always is represented by, say, 2 and value "dog" always by, say, 3.
The data then needs to be processed in advance, e.g. in a ML pipeline, to comply with this requirement. That's more or less what scikit-learn's OrdinalEncoder does. It would be really helpful to be able to implement that behavior within polars itself. Written in steps:

Training:
    Learn and store the dtype or exact mapping from logical to physical values.
Predicting:
    Make sure that the learned mapping is applied, do nothing if dtype already conforms to the requirement.
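
The training and prediction steps above can be sketched in plain Python. The function names `fit_mapping` and `apply_mapping` are hypothetical, and mapping unseen values to -1 is an assumption of this sketch rather than anything the thread specifies.

```python
def fit_mapping(training_labels):
    """Training: learn a stable label -> code mapping in order of first appearance."""
    mapping = {}
    for label in training_labels:
        # setdefault keeps the first code assigned to each label.
        mapping.setdefault(label, len(mapping))
    return mapping


def apply_mapping(labels, mapping, unknown=-1):
    """Predicting: reuse the learned mapping; unseen labels get `unknown`."""
    return [mapping.get(label, unknown) for label in labels]


mapping = fit_mapping(["cat", "dog", "cat", "bird"])
encoded = apply_mapping(["dog", "bird", "fish"], mapping)
print(mapping)
print(encoded)
```

Storing the mapping (or a dtype that carries it) is what guarantees that "dog" gets the same physical value at prediction time as it did during training.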

lorentzenchr · 2023-09-07T10:35:13Z

#6003 might also be related.

jmakov · 2023-11-11T23:42:35Z

This is currently broken - can't write to Arrow dataset and then read the data again.

stinodego added the enhancement (New feature or an improvement of an existing feature) and accepted (Ready for implementation) labels Aug 23, 2023

github-project-automation bot added this to Backlog Aug 23, 2023

github-project-automation bot moved this to Ready in Backlog Aug 23, 2023

stinodego moved this from Ready to Next in Backlog Aug 23, 2023

stinodego mentioned this issue Aug 23, 2023

feat(python): Implement from_dataframe natively (interchange protocol) #10701

Merged

stinodego self-assigned this Aug 24, 2023

stinodego moved this from Next to In progress in Backlog Aug 24, 2023

stinodego removed their assignment Aug 29, 2023

stinodego moved this from In progress to Next in Backlog Sep 6, 2023

stinodego mentioned this issue Oct 14, 2023

categorical dtypes with explicit value->label mappings #6003

Closed

c-peters mentioned this issue Oct 18, 2023

feat: Add new Enum categorical data type which allows a fixed set of categories #11822

Merged

stinodego added this to the 1.0.0 milestone Nov 23, 2023

stinodego moved this from Next to Candidate in Backlog Nov 30, 2023

c-peters self-assigned this Nov 30, 2023

c-peters moved this from Candidate to In progress in Backlog Dec 1, 2023

c-peters mentioned this issue Dec 8, 2023

feat(rust): Cast to an enum from int #12954

Merged

ritchie46 closed this as completed in #12954 Dec 8, 2023

github-project-automation bot moved this from In progress to Done in Backlog Dec 8, 2023
