Categorical column with numeric data casting issues #9185

Closed
2 tasks done
manujosephv opened this issue Jun 2, 2023 · 6 comments · Fixed by #13957
Labels: A-dtype-categorical (Area: categorical data type), accepted (Ready for implementation), bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

manujosephv commented Jun 2, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

If a numerical column is saved as a Categorical in a Parquet file, or as a string in a database, it can end up being read back as a Categorical column.

Such a column then behaves unexpectedly. Casting it directly to a float produces very large numbers (I suspect these are the categorical indices). Converting it to Utf8 first and then to float works as expected.

Reproducible example

import polars as pl
import numpy as np

l = [ "0.69845702",  "0.69317475",  "2.43642724", "-0.95303469",  "0.60684237",
        "0.69049258",  "0.65931574", "-0.55459717",  "1.87082195", "-0.80401786"]

cat_series = pl.Series('cat_series', l).cast(pl.Categorical)

print(cat_series.cast(pl.Float32).max())
# Output 1179219.0 is wrong

print(cat_series.cast(pl.Utf8).cast(pl.Float32).max())
# Output 2.436427354812622 is right

Expected behavior

import polars as pl
import numpy as np

l = [ "0.69845702",  "0.69317475",  "2.43642724", "-0.95303469",  "0.60684237",
        "0.69049258",  "0.65931574", "-0.55459717",  "1.87082195", "-0.80401786"]

cat_series = pl.Series('cat_series', l).cast(pl.Categorical)

print(cat_series.cast(pl.Float32).max())
# Output 2.436427354812622 is right

print(cat_series.cast(pl.Utf8).cast(pl.Float32).max())
# Output 2.436427354812622 is right

Installed versions

---Version info---
Polars: 0.17.11
Index type: UInt32
Platform: Linux-5.4.0-1091-gke-x86_64-with-centos-7.9.2009-Core
Python: 3.7.0 (default, Oct  9 2018, 10:31:47) 
[GCC 7.3.0]
---Optional dependencies---
numpy: 1.21.5
pandas: 1.1.5
pyarrow: 6.0.1
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.1.0
matplotlib: 3.5.3
xlsx2csv: <not installed>
xlsxwriter: <not installed>
@manujosephv manujosephv added bug Something isn't working python Related to Python Polars labels Jun 2, 2023
cmdlineluser (Contributor) commented

Yeah, it looks like if you cast directly to a numerical type, it's casting the underlying physical value:

s = pl.Series(["100", "500", "200"], dtype=pl.Categorical)

s.cast(pl.Float64)

# shape: (3,)
# Series: '' [f64]
# [
# 	0.0
# 	1.0
# 	2.0
# ]

s.to_physical()

# shape: (3,)
# Series: '' [u32]
# [
# 	0
# 	1
# 	2
# ]

Whereas if you cast to a string, it doesn't:

s.cast(pl.Utf8)

# shape: (3,)
# Series: '' [str]
# [
# 	"100"
# 	"500"
# 	"200"
# ]
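
To illustrate why the two casts differ, here is a minimal pure-Python model of dictionary encoding. It is only a sketch: `encode_categorical` is a hypothetical helper, not a Polars API, and it assumes codes are assigned in first-seen order starting at 0. That matches the output above, but the much larger values in the original report show that codes are not always this small.

```python
def encode_categorical(values):
    """Dictionary-encode strings; return (categories, codes) in first-seen order.

    Hypothetical helper for illustration only, not part of Polars.
    """
    categories = []   # unique labels, in order of first appearance
    index = {}        # label -> integer code
    codes = []        # one integer code per input value
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return categories, codes


categories, codes = encode_categorical(["100", "500", "200"])

# A direct numeric cast that only sees the physical codes:
cast_of_codes = [float(code) for code in codes]

# A cast that goes through the string labels first:
cast_of_labels = [float(categories[code]) for code in codes]

print(cast_of_codes)
print(cast_of_labels)
```

In this model the direct cast never looks at the labels at all, which is why going through a string type first recovers the intended numbers.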

datenzauberai (Contributor) commented Jun 6, 2023

Maybe casting a categorical to a numerical type should raise an exception the same way as casting a numerical to a categorical does?

deanm0000 (Collaborator) commented Jul 13, 2023

Should it be an error or should it adopt the behavior here

jss367 commented Aug 25, 2023

This seems to be the same problem I ran into. I was trying to cast categorical data into an int like this:

s = pl.Series(['1','2','3'], dtype=pl.Categorical)
s.cast(int)

--
i64
0
1
2

I ended up with this workaround:

s.cast(str).cast(int)

--
i64
1
2
3

It's a shame because (at least in my case) pl.Series.cut returns categorical data, and when you convert them directly to ints you get an off-by-one error.

qbao96xb commented Aug 29, 2023

Confirmed: I hit the same issue when casting directly to Int8.

Context:
I bin the column recency into three ranges labelled ['1','2','3'] using qcut. When I cast the new column (named recency_score) directly, the outputs are [1, 0, 2] for the labels '1', '2' and '3' respectively.

It turns out that cast operates on the underlying physical values of ['1','2','3'] rather than on the labels. For now I work around it like this:

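# assumes: import polars as pl
df.with_columns(
  # cast to Utf8 first so the labels, not the physical codes, are parsed as integers
  recency_score=(pl.col('recency') * -1).qcut(quantiles=[0.7, 0.9], labels=['1', '2', '3']).cast(pl.Utf8).cast(pl.Int8)
  )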

Suggestion:

  1. Update the docstring & user guide to alert users.
  2. Consider checking the input dtype and casting correctly, e.g. via x.cast(pl.Utf8).cast(dtype) when the user requests an integer or float dtype.
Session Info
-----
polars==0.18.15
IPython             8.14.0
jupyter_client      8.3.0
jupyter_core        5.3.1
jupyterlab          4.0.5

Python 3.11.4 (main, Jul  5 2023, 08:40:20) [Clang 14.0.6 ]
macOS-13.5-arm64-arm-64bit
-----

xuJ14 commented Sep 13, 2023

The ordering of categoricals should also be flagged somewhere conspicuous, not only in the set_ordering function's documentation, because it can cause unexpected behavior when sorting a column of this type.
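
The sorting concern can be pictured with a small pure-Python model. This is an assumption-laden sketch, not Polars code: it assumes a categorical that is sorted by its physical codes, with codes assigned in first-seen order, and contrasts that with sorting by the labels themselves.

```python
labels = ["b", "c", "a", "b"]

# Assumed model: codes follow first-seen order (b -> 0, c -> 1, a -> 2).
categories = list(dict.fromkeys(labels))
codes = [categories.index(label) for label in labels]

# Sorting by physical code keeps the categories in first-seen order.
physical_sorted = [categories[code] for code in sorted(codes)]

# Sorting by the labels gives alphabetical order instead.
lexical_sorted = sorted(labels)

print(physical_sorted)
print(lexical_sorted)
```

The two results differ whenever the data does not happen to arrive in alphabetical order, which is the surprise described above.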

@stinodego stinodego added needs triage Awaiting prioritization by a maintainer A-dtype-categorical Area: categorical data type P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Jan 13, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 19, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jan 26, 2024
@c-peters c-peters added the accepted Ready for implementation label Jan 29, 2024
@c-peters c-peters self-assigned this Jan 29, 2024
9 participants