Categorical column with numeric data casting issues #9185
Comments
Yeah, it looks like if you cast directly to a numerical type, it's casting the underlying physical value:
s = pl.Series(["100", "500", "200"], dtype=pl.Categorical)
s.cast(pl.Float64)
# shape: (3,)
# Series: '' [f64]
# [
# 0.0
# 1.0
# 2.0
# ]
s.to_physical()
# shape: (3,)
# Series: '' [u32]
# [
# 0
# 1
# 2
# ]
Whereas if you cast to a string, it doesn't:
s.cast(pl.Utf8)
# shape: (3,)
# Series: '' [str]
# [
# "100"
# "500"
# "200"
# ]
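To make the behavior above concrete, here is a toy model of dictionary encoding in plain Python. It is only a sketch: the names `categories`, `codes`, `cast_physical`, and `cast_via_string` are made up for illustration, and it assumes codes are handed out in order of first appearance, which matches the 0, 1, 2 output shown above but is not a statement about how Polars is implemented internally.

```python
# Toy model of a categorical column; not Polars' actual implementation.
values = ["100", "500", "200"]

# Assumption: each distinct string gets an integer code in order of first appearance.
categories = []  # code -> string label
code_of = {}     # string label -> code
codes = []       # the "physical" representation of the column
for v in values:
    if v not in code_of:
        code_of[v] = len(categories)
        categories.append(v)
    codes.append(code_of[v])

# Casting the physical representation converts the codes, not the labels.
cast_physical = [float(c) for c in codes]

# Looking up the string label first recovers the intended numbers.
cast_via_string = [float(categories[c]) for c in codes]

print(cast_physical)    # [0.0, 1.0, 2.0]
print(cast_via_string)  # [100.0, 500.0, 200.0]
```

The first result mirrors `s.cast(pl.Float64)` in the comment above, and the second mirrors casting to a string before casting to a number.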
Maybe casting a categorical to a numerical type should raise an exception, the same way casting a numerical to a categorical does?
Should it be an error, or should it adopt the behavior here?
This seems to be the same problem I ran into. I was trying to cast categorical data into an int like this:
s = pl.Series(['1','2','3'], dtype=pl.Categorical)
s.cast(int)
--
i64
0
1
2
I ended up with this workaround:
s.cast(str).cast(int)
--
i64
1
2
3
It's a shame because (at least in my case)
Confirmed, I hit the same issue when casting directly to Int8. Context: it turns out that the cast function casts the underlying physical values of ['1','2','3'] rather than the strings themselves. Currently I have to work around it like this:
Suggestion:
Session Info
Also, the ordering of categoricals should be flagged somewhere conspicuous, because it may cause unexpected behavior when sorting a column of this type. Not only in the
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
If a numerical column is saved as a Categorical in a parquet file, or as a string in a database, we may end up in a situation where the column is read as a Categorical column.
This column then behaves weirdly. If we cast it to float, it produces very large numbers (I suspect they might be the categorical index?). But if we convert it to Utf8 first and then convert to float, it works as expected.
Reproducible example
Expected behavior
Installed versions