Implement str namespace functions on categoricals
#9773
Comments
I wonder if this is more efficient than round tripping the cast on all the values...
I imagine that's roughly how to implement the feature, although I'd guess you probably want to do it inside a StringCache, but I'm not really sure.
Hi @mcrumiller, is this still planned for categoricals and enums?
I really don't want to support this. I don't think categoricals should behave like strings semantically. From an implementation perspective it would also be terrible, as every operation updates your datatype, which just doesn't make sense to me.
@ritchie46 that's disappointing--categoricals are efficient ways of storing a small number of strings many times, and it's a shame to not be able to leverage that efficiency for string operations. I have plenty of use cases for this--for example, a (limited) set of 4-character codes where I only need the first 3 characters. I can manage with casting to String and back but there seemed to be a good amount of support for this issue.
Do you mean the set of categories may change after the operation? This happens with a lot of categorical operations.
Then we should improve the efficiency of the string storage. We can intern the strings with views. We should not misuse a data type that is not meant for that and needs global synchronization of categories. This global synchronization already causes enough complications as it is. Applying expressions that change your datatype depending on the data should be minimized at all costs.
(I'm new to polars, and still in the learning phase. I found out about this issue while googling how to perform a categorical label remapping. Please forgive me if this comment is out of scope. BTW, thanks for this awesome project.)
I agree, provided that categorical types are restricted to using only opaque labels. In real-world applications, labels often have intrinsic meaning or a specific interpretation, so string manipulation could be useful and desired. Another example:
Fun fact:
Ok, I have come around and agree with you about string ops that change the categories. @ritchie46 what about queries about the values? Going through the
If you think this is reasonable, I can open a new issue and use some of my existing work for these.
We could discuss if those methods fit in the categorical namespace. Which I think some do. I don't want to implement them independently though. We should be able to call these methods on the categories only and then use the categories to materialize. This can all use the same code path as it is essentially a
Yep--that's what I had done for I'll close this and perhaps open a new issue for a much smaller subset of methods that would make sense to have a simple fast path. Thanks for the feedback @ritchie46.
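The idea discussed in the last few comments (evaluate a query on the categories only, then materialize the per-row result through the codes) can be sketched in plain Python. This is only an illustration of the technique, not Polars code: `category_mask`, the list-based `categories`/`codes` layout, and the sample data are all hypothetical, and nulls are assumed to be absent.

```python
def category_mask(categories, codes, predicate):
    """Evaluate `predicate` once per category, then expand to one value per row.

    `categories` is the list of unique labels and `codes` holds, for each row,
    the index of its label in `categories` (a dictionary-encoded column).
    This layout is an assumption made for illustration only.
    """
    # One predicate call per distinct label, regardless of the number of rows.
    per_category = [predicate(label) for label in categories]
    # Materialize: each row looks up the precomputed answer by its code.
    return [per_category[code] for code in codes]


categories = ["apple", "avocado", "banana"]  # hypothetical labels
codes = [2, 0, 0, 1, 2]                      # five rows referencing those labels
mask = category_mask(categories, codes, lambda s: s.startswith("a"))
print(mask)
```

Because a query like this only reads the labels, the set of categories is untouched, which is why it avoids the datatype-changing concern raised earlier in the thread.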
Problem description
All string functions in the str namespace should be able to work extremely fast on categorical columns by manipulating the labels alone and expanding the result. However, right now categoricals must first be cast to pl.Utf8 and then cast back, which is wasteful.
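To make "manipulating the labels alone and expanding the result" concrete, here is a minimal plain-Python sketch. It does not use or describe Polars internals: `map_categories` and the list-based `categories`/`codes` representation are hypothetical, and nulls are assumed to be absent. It also shows the point raised in the comments that the set of categories can shrink after the operation, because distinct labels may map to the same result.

```python
def map_categories(categories, codes, fn):
    """Apply `fn` once per category and rebuild a deduplicated dictionary.

    Returns (new_categories, new_codes). The dictionary-encoded layout
    (unique labels plus per-row integer codes) is assumed for illustration.
    """
    new_categories = []  # unique transformed labels, in first-seen order
    index = {}           # transformed label -> its new code
    remap = []           # old code -> new code
    for label in categories:
        new_label = fn(label)
        if new_label not in index:
            index[new_label] = len(new_categories)
            new_categories.append(new_label)
        remap.append(index[new_label])
    # Rows are only touched as integers; no per-row string work is done.
    new_codes = [remap[code] for code in codes]
    return new_categories, new_codes


categories = ["ABCD", "ABCE", "XYZW"]  # hypothetical 4-character codes
codes = [0, 1, 2, 0, 1, 0]
new_categories, new_codes = map_categories(categories, codes, lambda s: s[:3])
expanded = [new_categories[code] for code in new_codes]
print(new_categories, new_codes, expanded)
```

In this sketch the string function runs three times (once per label) rather than six times (once per row), and the two labels that share a 3-character prefix collapse into a single category.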