Implement str namespace functions on categoricals #9773

Closed
mcrumiller opened this issue Jul 7, 2023 · 9 comments
Labels
A-dtype-categorical Area: categorical data type enhancement New feature or an improvement of an existing feature

Comments

@mcrumiller
Contributor

Problem description

All string functions in the str namespace should be able to work extremely fast on categorical columns by manipulating the labels alone and expanding the result. However, right now categoricals must first be cast to pl.Utf8 and then cast back, which is wasteful:

import polars as pl

s = pl.Series(["test", "hello", "hello", "hello", "test"], dtype=pl.Categorical)

# current requirement
s2 = s.cast(pl.Utf8).str.slice(1).cast(pl.Categorical)

# proposed
s2 = s.str.slice(1)
mcrumiller added the enhancement label on Jul 7, 2023
@deanm0000
Collaborator

deanm0000 commented Jul 11, 2023

I wonder if this is more efficient than round-tripping the cast on all the values...

(s
    .to_frame()
    .join(
        s.to_frame().unique().with_columns(
            pl.col('').cast(pl.Utf8()).str.slice(1)
                      .cast(pl.Categorical()).alias('new')), 
        on='')
    .get_column('new').alias('')
    )

I imagine that's roughly how to implement the feature, although I'd guess you'd probably want to do it inside a StringCache. I'm not really sure.

@stevenlis

Hi @mcrumiller, is this still planned for categoricals and enums?

@ritchie46
Member

ritchie46 commented Dec 1, 2024

I really don't want to support this. I don't think categorical should behave like strings semantically.

From an implementation perspective it would also be terrible as every operation updates your datatype, which just doesn't make sense to me.

@mcrumiller
Contributor Author

mcrumiller commented Dec 1, 2024

@ritchie46 that's disappointing--categoricals are efficient ways of storing a small number of strings many times, and it's a shame to not be able to leverage that efficiency for string operations. I have plenty of use cases for this--for example, a (limited) set of 4-character codes where I only need the first 3 characters. I can manage with casting to String and back but there seemed to be a good amount of support for this issue.

> From an implementation perspective it would also be terrible as every operation updates your datatype, which just doesn't make sense to me.

Do you mean the set of categories may change after the operation? This happens with a lot of categorical operations.

@ritchie46
Member

> it's a shame to not be able to leverage that efficiency for string operations

Then we should improve efficiency on the string storage. We can intern the strings with views. We should not misuse a data type that is not meant for that and needs global synchronization of categories. This global synchronization already causes enough complication as it is. Applying expressions that change your datatype depending on the data should be minimized at all costs.

@miccoli

miccoli commented Dec 2, 2024

(I'm new to polars, and still in the learning phase. I found out about this issue while googling how to perform a categorical label remapping. Please forgive me if this comment is out of scope. BTW, thanks for this awesome project.)

> I don't think categorical should behave like strings semantically.

I agree, provided that categorical types are restricted to using only opaque labels. In real-world applications, labels often have intrinsic meaning or a specific interpretation, so string manipulation can be useful and desirable.

Another example:

import polars as pl

s = pl.Series(["A1", "A0", "A0", "B1", "B0", "Z"], dtype=pl.Categorical)

# current requirement
b = s.is_in({c for c in s.cat.get_categories() if c.startswith("A")})

# proposed
b2 = s.str.starts_with("A")

Fun fact: s.map_elements(lambda x: x.startswith("A"), return_dtype=pl.Boolean) raises this warning, which actually supports this enhancement:

PolarsInefficientMapWarning: 
Series.map_elements is significantly slower than the native series API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - s.map_elements(lambda x: ...)
with this one instead:
  + s.str.starts_with('A')

@mcrumiller
Contributor Author

mcrumiller commented Dec 2, 2024

Ok, I have come around and agree with you about string ops that change the categories.

@ritchie46 what about queries about the values? Going through the str API, I see the following that I think would be reasonable to implement on categoricals using a fast path:

  • contains/contains_any
  • count_matches
  • decode/encode
  • starts_with/ends_with
  • find/find_many
  • len_bytes/len_chars
  • to_date/to_datetime/to_<x>

If you think this is reasonable, I can open a new issue and use some of my existing work for these.

@ritchie46
Member

We could discuss whether those methods fit in the categorical namespace; I think some do. I don't want to implement them independently though. We should be able to call these methods on the categories only and then use the categories to materialize. This can all use the same code path as it is essentially a str.method + gather, and thus would not require any code bloat.
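
The "str.method + gather" idea can be modeled in plain Python. This is only a rough sketch, not polars internals: it assumes the column is stored as a list of unique category strings plus one integer code per row, and `apply_on_categories` is a hypothetical helper name chosen for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def apply_on_categories(
    categories: list[str], codes: list[int], func: Callable[[str], T]
) -> list[T]:
    """Evaluate func once per category, then gather the results by row code."""
    # The "str.method" step: one call per unique label, not per row.
    per_category = [func(label) for label in categories]
    # The "gather" step: look up each row's result through its code.
    return [per_category[code] for code in codes]


# Assumed dictionary-encoded form of ["A1", "A0", "A0", "B1", "B0", "Z"].
categories = ["A1", "A0", "B1", "B0", "Z"]
codes = [0, 1, 1, 2, 3, 4]

starts_with_a = apply_on_categories(categories, codes, lambda label: label.startswith("A"))
len_chars = apply_on_categories(categories, codes, len)
print(starts_with_a)  # [True, True, True, False, False, False]
print(len_chars)  # [2, 2, 2, 2, 2, 1]
```

In this model the string function runs five times (once per category) regardless of how many rows there are, and the category list and codes are left untouched.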

@mcrumiller
Contributor Author

Yep--that's what I had done for head/tail/slice--they went through str.<method> on the categories alone, but required a remap too, which I can tell we don't want. I'll put something together for len_bytes/len_chars first (something simple). I think contains/starts_with/ends_with and len_bytes/len_chars are probably the only methods above that people would care about, possibly in addition to the to_<x> methods which have an obvious fast path.

I'll close this and perhaps open a new issue for a much smaller subset of methods that would make sense to have a simple fast path. Thanks for the feedback @ritchie46.
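
To illustrate why a method like slice needs a remap while a predicate or length query does not, here is a plain-Python sketch on an assumed layout of unique categories plus integer row codes. The helper `slice_categories` and the 4-character codes are made up for illustration; this is not the code from the earlier head/tail/slice work.

```python
def slice_categories(
    categories: list[str], codes: list[int], length: int
) -> tuple[list[str], list[int]]:
    """Truncate every category to `length` characters and rebuild the encoding."""
    new_categories: list[str] = []
    index_of: dict[str, int] = {}  # truncated label -> position in new_categories
    old_to_new: list[int] = []  # old category index -> new category index
    for label in categories:
        short = label[:length]
        if short not in index_of:
            index_of[short] = len(new_categories)
            new_categories.append(short)
        old_to_new.append(index_of[short])
    # The remap: every row code must be translated to the new category index.
    new_codes = [old_to_new[code] for code in codes]
    return new_categories, new_codes


categories = ["ABCD", "ABCE", "XYZW"]
codes = [0, 1, 2, 1, 0]
new_categories, new_codes = slice_categories(categories, codes, 3)
print(new_categories)  # ['ABC', 'XYZ']
print(new_codes)  # [0, 0, 1, 0, 0]
```

Two distinct categories collapse into one here, so both the category set and the row codes change depending on the data, which is the behaviour objected to earlier in the thread.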
