Implement `str` namespace functions on categoricals #9773

mcrumiller · 2023-07-07T19:03:38Z

Problem description

All string functions in the str namespace should be able to work extremely fast on categorical columns by manipulating the labels alone and expanding the result. However, right now categoricals must first be cast to pl.Utf8 and then cast back, which is wasteful:

import polars as pl

s = pl.Series(["test", "hello", "hello", "hello", "test"], dtype=pl.Categorical)

# current requirement
s2 = s.cast(pl.Utf8).str.slice(1).cast(pl.Categorical)

# proposed
s2 = s.str.slice(1)


deanm0000 · 2023-07-11T15:27:53Z

I wonder if this is more efficient than round tripping the cast on all the values...

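# Apply the string operation once per unique value, then join the result
# back onto the full column (the unnamed Series becomes column '').
(s
    .to_frame()
    .join(
        s.to_frame().unique().with_columns(
            pl.col('').cast(pl.Utf8()).str.slice(1)
                      .cast(pl.Categorical()).alias('new')),
        on='')
    .get_column('new').alias('')
    )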

I imagine that's roughly how to implement the feature, although I'd guess you'd probably want to do it inside a StringCache, but I'm not really sure.

stevenlis · 2024-08-31T22:47:03Z

Hi @mcrumiller, is this still planned for categoricals and enums?

ritchie46 · 2024-12-01T10:24:54Z

I really don't want to support this. I don't think categorical should behave like strings semantically.

From an implementation perspective it would also be terrible as every operation updates your datatype, which just doesn't make sense to me.

mcrumiller · 2024-12-01T16:27:06Z

@ritchie46 that's disappointing--categoricals are efficient ways of storing a small number of strings many times, and it's a shame to not be able to leverage that efficiency for string operations. I have plenty of use cases for this--for example, a (limited) set of 4-character codes where I only need the first 3 characters. I can manage with casting to String and back but there seemed to be a good amount of support for this issue.

> From an implementation perspective it would also be terrible as every operation updates your datatype, which just doesn't make sense to me.

Do you mean the set of categories may change after the operation? This happens with a lot of categorical operations.

ritchie46 · 2024-12-02T08:42:22Z

> it's a shame to not be able to leverage that efficiency for string operations

Then we should improve efficiency on the string storage. We can intern the strings with views. We should not misuse a data type that is not meant for that and needs global synchronization of categories. This global synchronization already causes enough complications as it is. Applying expressions that change your datatype depending on the data should be minimized at all costs.

miccoli · 2024-12-02T16:20:39Z

(I'm new to polars, and still in the learning phase. I found this issue while googling how to perform a categorical label remapping. Please forgive me if this comment is out of scope. BTW, thanks for this awesome project.)

> I don't think categorical should behave like strings semantically.

I agree, provided that categorical types are restricted to using only opaque labels. In real-world applications, labels often have an intrinsic meaning or a specific interpretation, so string manipulation can be useful and desirable.

Another example:

import polars as pl

s = pl.Series(["A1", "A0", "A0", "B1", "B0", "Z"], dtype=pl.Categorical)

# current requirement
b = s.is_in({c for c in s.cat.get_categories() if c.startswith("A")})

# proposed
b2 = s.str.starts_with("A")

Fun fact: s.map_elements(lambda x: x.startswith("A"), return_dtype=pl.Boolean) raises this warning, which actually supports this enhancement:

PolarsInefficientMapWarning: 
Series.map_elements is significantly slower than the native series API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - s.map_elements(lambda x: ...)
with this one instead:
  + s.str.starts_with('A')

mcrumiller · 2024-12-02T17:19:14Z

Ok, I have come around and agree with you about string ops that change the categories.

@ritchie46 what about queries about the values? Going through the str API, I see the following that I think would be reasonable to implement on categoricals using a fast path:

contains/contains_any
count_matches
decode/encode
starts_with/ends_with
find/find_many
len_bytes/len_chars
to_date/to_datetime/to_<x>

If you think this is reasonable, I can open a new issue and use some of my existing work for these.

ritchie46 · 2024-12-03T07:49:26Z

We could discuss whether those methods fit in the categorical namespace, which I think some do. I don't want to implement them independently though. We should be able to call these methods on the categories only and then use the categories to materialize. This can all use the same code path as it is essentially a str.method + gather, and thus would not require any code bloat.
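
For readers following along, the "str.method + gather" idea can be sketched in plain Python. This is only an illustration of the concept on a hand-rolled dictionary encoding, not Polars' actual implementation. The helper names `encode` and `apply_on_categories` are hypothetical, and nulls are not handled.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Dictionary-encode values into (categories, codes) in first-seen order."""
    index: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        # A value seen for the first time gets the next free code.
        codes.append(index.setdefault(value, len(index)))
    return list(index), codes


def apply_on_categories(
    categories: list[str], codes: list[int], func: Callable[[str], T]
) -> list[T]:
    """Evaluate func once per category, then gather the results by code."""
    per_category = [func(category) for category in categories]
    return [per_category[code] for code in codes]


categories, codes = encode(["A1", "A0", "A0", "B1", "B0", "Z"])
starts_with_a = apply_on_categories(categories, codes, lambda c: c.startswith("A"))
lengths = apply_on_categories(categories, codes, len)
```

The function runs once per distinct label (five times here) rather than once per row, and the result is a plain boolean or integer column, so the categorical data type itself is never modified.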

mcrumiller · 2024-12-03T14:40:34Z

Yep--that's what I had done for head/tail/slice--they went through str.<method> on the categories alone, but required a remap too, which I can tell we don't want. I'll put something together for len_bytes/len_chars first (something simple). I think contains/starts_with/ends_with and len_bytes/len_chars are probably the only methods above that people would care about, possibly in addition to the to_<x> methods which have an obvious fast path.

I'll close this and perhaps open a new issue for a much smaller subset of methods that would make sense to have a simple fast path. Thanks for the feedback @ritchie46.
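
As a rough illustration of why head/tail/slice needed a remap, the plain-Python sketch below slices the labels of a hand-rolled dictionary encoding. It is not the code from the linked pull request, and `slice_categories` is a hypothetical helper. Once labels are sliced, distinct categories can collapse into the same string, so the category set must be deduplicated and every code rewritten.

```python
def slice_categories(
    categories: list[str], codes: list[int], offset: int
) -> tuple[list[str], list[int]]:
    """Slice each label from offset, then deduplicate labels and remap codes."""
    new_index: dict[str, int] = {}
    old_to_new: list[int] = []
    for category in categories:
        sliced = category[offset:]
        # Labels that become identical after slicing share one new code.
        old_to_new.append(new_index.setdefault(sliced, len(new_index)))
    return list(new_index), [old_to_new[code] for code in codes]


# Categories and codes for ["test", "hello", "hello", "hello", "test"].
new_categories, new_codes = slice_categories(["test", "hello"], [0, 1, 1, 1, 0], 1)

# "A0"[1:] and "B0"[1:] are both "0", so two old categories merge into one.
merged_categories, merged_codes = slice_categories(
    ["A0", "B0", "A1"], [0, 1, 2, 1], 1
)
```

The second call shows the data-dependent change discussed above: three input categories become two, so the resulting set of categories depends on the values rather than only on the operation.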

mcrumiller added the enhancement New feature or an improvement of an existing feature label Jul 7, 2023

stinodego added this to Backlog Jul 12, 2023

github-project-automation bot moved this to Untriaged in Backlog Jul 12, 2023

stinodego moved this from Untriaged to Ready in Backlog Jul 12, 2023

deanm0000 mentioned this issue Jul 13, 2023

Categorical column with numeric data casting issues #9185

Closed

2 tasks

stinodego removed this from Backlog Jul 14, 2023

stinodego added the accepted Ready for implementation label Jul 14, 2023

github-project-automation bot added this to Backlog Jul 14, 2023

github-project-automation bot moved this to Ready in Backlog Jul 14, 2023

Wainberg mentioned this issue Sep 4, 2023

Allow creating Categorical with pre-defined categories #10705

Closed

Wainberg mentioned this issue Nov 23, 2023

Allow string operations on Categorical types #7647

Closed

stinodego added the A-dtype-categorical Area: categorical data type label Jan 26, 2024

cmdlineluser mentioned this issue Jul 11, 2024

Broken and inconsistent API for dealing with Categorical variables #17576

Closed

2 tasks

cmdlineluser mentioned this issue Sep 12, 2024

Expr.str.replace does not work on Categorical anymore #18717

Open

2 tasks

mcrumiller mentioned this issue Nov 30, 2024

feat: Implement str.slice/str.head/str.tail for categoricals #20080

Closed

ritchie46 removed the accepted Ready for implementation label Dec 1, 2024

mcrumiller closed this as completed Dec 3, 2024

github-project-automation bot moved this from Ready to Done in Backlog Dec 3, 2024

mcrumiller mentioned this issue Dec 9, 2024

feat: Add cat.len_chars and cat.len_bytes #20211

Merged
