Expr.str.replace does not work on Categorical anymore #18717

Open
2 tasks done
douglas-raillard-arm opened this issue Sep 12, 2024 · 3 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@douglas-raillard-arm
Contributor

douglas-raillard-arm commented Sep 12, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.LazyFrame({"a": ["z", "x", "c"]})
df = df.with_columns(pl.col("a").cast(pl.Categorical))
df.with_columns(pl.col("a").str.replace("z", "zzzz")).collect()

Log output

Traceback (most recent call last):
  File "testpolars6.py", line 6, in <module>
    df.with_columns(pl.col("a").str.replace("z", "zzzz")).collect()
  File "venv-3.12/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: expected String type, got: cat

Issue description

Categorical dtype used to work with Expr.str.replace but does not anymore.

I'm not sure if it's a bug or if it's a documentation issue, as this was not mentioned in the changelog. If that's the expected behavior, a breaking change entry would probably be for the best.

Expected behavior

Depending on the viewpoint:
A. This should fail as on 1.7.0, since a Categorical is not a String. I suppose Expr.cat could grow str-like methods to keep things separate.

B. This should work as on 1.6.0, since Categoricals are kind of like Strings. After all:

In Polars a categorical is defined as a string column which is encoded by a dictionary
https://docs.pola.rs/user-guide/concepts/data-types/categoricals/

Installed versions

--------Version info---------
Polars:              1.7.0
Index type:          UInt32
Platform:            Linux-5.15.0-105-generic-x86_64-with-glibc2.31
Python:              3.12.5 (main, Aug 17 2024, 16:46:07) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                2.1.1
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

@ritchie46
Member

In 1.6 it incorrectly upcast your Categorical to String. This happened silently, without you being aware of it. I highly doubt this was what you wanted.

If it was, you can now explicitly cast to String before calling the str namespace.

It's indeed "A. This should fail as on 1.7.0, since a Categorical is not a String"

github-actions bot pushed a commit to ARM-software/lisa that referenced this issue Sep 12, 2024
Force <1.7.0 until some issues are figured out:
* problem on readthedocs
* categorical issue: pola-rs/polars#18717
@douglas-raillard-arm
Contributor Author

Thanks, I'll make the changes then. So to clarify:

  • Generic expressions like Expr.is_in() or pl.col('mycategory') == <string> are usually expected to work on categoricals
  • Anything in the Expr.str namespace is off-limits, unless of course we .cast(pl.String) first.

For that specific Expr.str.replace() issue, considering I only need to remap the category to a new set of names, I considered:

  1. Expr.replace_strict(): it works, but gives a String output. I can specify return_dtype, but will that apparent round trip to String wreck performance?
  2. Expr.replace(): gives the correct schema, but fails when collecting:
*** polars.exceptions.InvalidOperationError: casting to a categorical with rev map is not allowed

What would be the recommended way?

@ritchie46
Member

> Generic expressions like Expr.is_in() or pl.col('mycategory') == <string> are usually expected to work on categoricals
> Anything in the Expr.str namespace is off-limits, unless of course we .cast(pl.String) first.

Yes.
