Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.factorize should not upconvert unique values unnecessarily #41132

Merged
merged 4 commits into from
Apr 26, 2021

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Apr 24, 2021

The uniques returned from pd.factorize should keep their original dtype, where possible.

>>> arr =np.array([1,2], dtype="int8")
>>> pd.factorize(arr)
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))  # master
(array([0, 1], dtype=int64), array([1, 2], dtype=int8))  # this PR

elif is_float_dtype(values):
return ensure_float64(values), np.dtype("float64")
dtype = getattr(values, "dtype", np.dtype("float64"))
return ensure_float64(values), dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this code was written before we had non-64 hashtables. i expect a lot of this casting can now be avoided

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably yeah, and if we can avoid them, it'll speed some things up nicely.

But that's for a different PR? I got a feeling there's a lot of detail hidden in that change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough

@topper-123 topper-123 added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Apr 24, 2021
@topper-123 topper-123 added this to the 1.3 milestone Apr 24, 2021
@topper-123 topper-123 added the Bug label Apr 24, 2021
@topper-123 topper-123 changed the title BUG: pd.factorize upconverted unique values (from e.g. int8 -> int64) BUG: pd.factorize should not upconvert unique values unnecessarily Apr 26, 2021
@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Apr 26, 2021
@jreback jreback merged commit 99e2ee3 into pandas-dev:master Apr 26, 2021
@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

thanks @topper-123

@topper-123 topper-123 deleted the factorize_numpy_dtypes branch April 26, 2021 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants