Allow creating Categorical with pre-defined categories #10705
Comments
I try to avoid bringing Matlab in here (I'm an ex-Matlab user), but they have a few implementations here and there that I really like. One of those is their datetime system with
s = pl.Series(
    ['a', 'b', 'c'],
    dtype=pl.Categorical(
        values=['g', 'a', 'b', 'd', 'e', 'f'],  # define the ordering here, if ordinal=True
        ordinal=True  # if False, can only perform equality comparison
    )
)
I prefer the |
I agree with @mcrumiller. R has a similar implementation and I do agree that it is far superior to what is currently available across Python (I haven't used pandas's categoricals as much, but I do remember them being a little cumbersome compared to R). Specifically, the following operations might become feasible if categories are implemented in this way:
Where can this be useful for Polars? I've got a few starting advantages:
|
One bonus point; once we support Series creation from a fixed set of categories, we should also be able to validate that no new values are subsequently added to that Series - eg: it behaves as an |
@alexander-beedie can you give me a reason why this is useful? I rather like not having to explicitly do In pandas, for many of my projects I often write up a quick |
@mcrumiller I'm curious about having two categories: "ordered" and "unordered". In my understanding, each value in a categorical variable is assigned a numerical code (e.g., 1, 2, 3...) regardless of whether it's ordered or not. If we care about the codes, we can define an order; otherwise, we can let Polars handle it without concern. Therefore, I don't see the necessity of explicitly specifying "ordinal." To me, once you define the |
@stevenlis it's more of an implementation detail I suppose, but tracking an ordered set is substantially harder than an unordered set and requires many more allocations once you start messing with merging, adding, uniting, etc. An unordered set has no notion of sorting or comparison operations (except for equality), so a lot of computation can be avoided with unordered sets. It's also useful to know whether your data represents an ordered set or an unordered set, although yes, usually you know if your set is ordered or not. |
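The distinction between ordered and unordered categories can be sketched in plain Python. This is only a conceptual model, not how Polars implements it: `CatValue` and `make_values` are hypothetical names, and the assumption is that an unordered value supports equality only while an ordered value also compares by its position in the category list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CatValue:
    """One categorical value: an integer code plus an ordered/unordered flag."""

    code: int
    ordered: bool

    def __lt__(self, other: CatValue) -> bool:
        # Only ordered categories define "<"; equality comes from the dataclass.
        if not (self.ordered and other.ordered):
            raise TypeError("unordered categories support only equality comparison")
        return self.code < other.code


def make_values(
    values: Sequence[str], categories: Sequence[str], ordered: bool
) -> list[CatValue]:
    """Encode values against a declared category list (hypothetical helper)."""
    index = {cat: code for code, cat in enumerate(categories)}
    return [CatValue(index[v], ordered) for v in values]


g, a = make_values(["g", "a"], ["g", "a", "b"], ordered=True)
print(g < a)  # True: 'g' is declared before 'a'
```

With `ordered=False`, `==` still works, but `<` raises `TypeError`.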
Oh, I see. Are there any disadvantages to having the following API without
# ordered
s = pl.Series(
    ['a', 'b', 'c'], dtype=pl.Categorical(order=['g', 'a', 'b', 'd', 'e', 'f'])
)
# unordered
s = pl.Series(
    ['a', 'b', 'c'], dtype=pl.Categorical
)
|
I'm not suggesting it always constrains the values to those initially declared, but that it should be able to in order to act more like an We could have Enum as a specialisation of Categorical, so it's easy to see at-a-glance what the intended behaviour is, or perhaps as an option on Categorical, along the lines of |
@alexander-beedie yeah I like the fixed version. We could call it |
This is a really vital enhancement! Couple of thoughts:
a = pl.DataFrame({'a': pl.Series(['foo', 'bar', 'baz'], dtype=pl.Categorical)})
b = pl.DataFrame({'b': pl.Series(['foo', 'bar', 'baz'], dtype=pl.Categorical)})
a.join(b, left_on='a', right_on='b')
After this enhancement (#10705) is implemented, this should become possible:
dtype = pl.Categorical(['foo', 'bar', 'baz'])
a = pl.DataFrame({'a': pl.Series(['foo', 'bar', 'baz'], dtype=dtype)})
b = pl.DataFrame({'b': pl.Series(['foo', 'bar', 'baz'], dtype=dtype)})
a.join(b, left_on='a', right_on='b')
|
To give a concrete use case: Machine Learning models that support categoricals usually rely on the exact physical representation. For instance, value "cat" always is represented by, say, 2 and value "dog" always by, say, 3.
|
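A small pure-Python sketch of why this matters, under the assumption that codes are otherwise assigned in order of first appearance (the helper names are hypothetical and this does not reflect any library's internals): the same string can get different codes in two datasets unless the categories are declared up front.

```python
from __future__ import annotations

from typing import Sequence


def codes_by_appearance(values: Sequence[str]) -> list[int]:
    """Assign codes in order of first appearance (an inferred dictionary)."""
    index: dict[str, int] = {}
    return [index.setdefault(v, len(index)) for v in values]


def codes_fixed(values: Sequence[str], categories: Sequence[str]) -> list[int]:
    """Assign codes from a declared category list, independent of the data."""
    index = {cat: code for code, cat in enumerate(categories)}
    return [index[v] for v in values]


train = ["cat", "dog", "cat"]
test = ["dog", "cat"]

# Inferred codes disagree between the two datasets:
print(codes_by_appearance(train))  # [0, 1, 0]  -> "dog" is 1
print(codes_by_appearance(test))   # [0, 1]     -> "dog" is 0

# Declared categories keep "cat" at 2 and "dog" at 3 everywhere:
categories = ["bird", "fish", "cat", "dog"]
print(codes_fixed(train, categories))  # [2, 3, 2]
print(codes_fixed(test, categories))   # [3, 2]
```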
#6003 might also be related. |
This is currently broken - can't write to Arrow dataset and then read the data again. |
Problem description
Blocker for #10701
For the interchange protocol, we need to be able to reconstruct a categorical column from the physical representation and a string column containing the categories, ideally in a zero-copy manner.
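As a rough illustration of that requirement, the sketch below rebuilds a categorical column from a buffer of integer codes and a list of category strings without copying the codes. It uses only the standard library; `DictColumn` is a hypothetical class and is neither the interchange protocol's API nor Polars's.

```python
from __future__ import annotations

from array import array
from typing import Sequence


class DictColumn:
    """Dictionary-encoded column: a view over the codes plus the categories."""

    def __init__(self, codes: array, categories: Sequence[str]) -> None:
        # memoryview wraps the existing buffer; the codes are not copied.
        self.codes = memoryview(codes)
        self.categories = categories

    def __len__(self) -> int:
        return len(self.codes)

    def __getitem__(self, i: int) -> str:
        return self.categories[self.codes[i]]

    def to_list(self) -> list[str]:
        return [self.categories[code] for code in self.codes]


physical = array("I", [1, 2, 1])  # physical representation
column = DictColumn(physical, ["g", "a", "b"])
print(column.to_list())  # ['a', 'b', 'a']
```

Because the column holds a view, writing `physical[0] = 0` afterwards changes `column[0]` to `'g'`, which shows that no copy was made.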
This will probably involve a quite significant overhaul of the Categorical type. I would consider implementing the following requested feature in one go:
pl.Categorical
#7162