ENH: allow groupby (and drop_duplicates) on columns containing unhashable types #41759

Closed
ezerkar opened this issue Jun 1, 2021 · 6 comments
Labels: duplicated (duplicated, drop_duplicates), Enhancement, hashing (hash_pandas_object)

Comments

@ezerkar

ezerkar commented Jun 1, 2021

Is your feature request related to a problem?

Sort of. Currently you cannot call groupby on a column containing unhashable types (e.g. dicts).

Describe the solution you'd like

An easy workaround is to group by that column cast to str and then map the strings back to their original objects.
I'm wondering whether pandas could provide this process built in, so that users can group by unhashable types when they want to.

Describe alternatives you've considered

Add a try/except that falls back to hash(str(x)) when hash(x) fails, or convert the column to str and add the original values back afterwards.
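The string-key workaround described above can be sketched roughly as follows. This is only an illustration, not a proposed pandas API: the frame, the column names `meta` and `value`, and the helper names `key` and `lookup` are all made up for the example. Note that `str()` of a dict depends on insertion order, so two equal dicts built in a different order would land in different groups.

```python
import pandas as pd

# Hypothetical example data: "meta" holds dicts, which are unhashable.
df = pd.DataFrame(
    {
        "meta": [{"a": 1}, {"b": 2}, {"a": 1}],
        "value": [10, 20, 30],
    }
)

# Build a hashable string key for every row of the unhashable column.
key = df["meta"].map(str).rename("meta_key")

# Remember which original object each string key stands for.
lookup = dict(zip(key, df["meta"]))

# Group on the string key instead of the dict column.
sums = df["value"].groupby(key).sum()

# Map the string keys back to the original objects.
result = sums.reset_index()
result["meta"] = result["meta_key"].map(lookup.get)
result = result[["meta", "value"]]
```

Here `result` has one row per distinct dict, with the summed `value` next to the original object rather than its string form.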

@mzeitlin11
Member

Thanks for the request @ezerkar! I'd be -1 on this - I think this would add considerable complexity for uncertain payoff. This sounds like a workaround better done on the user side - implicitly hashing mutable objects sounds like something which could lead to very confusing behavior. On the user side, you can make sure mutable objects aren't being mutated and breaking hash invariants.
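A small plain-Python sketch of the hazard mentioned here, assuming a fallback that keys objects by `str(x)` as the issue suggests. The helper `str_hash_key` is hypothetical and exists only for this illustration; it is not something pandas provides.

```python
def str_hash_key(obj):
    """Hypothetical helper: key an unhashable object by its string form."""
    return str(obj)


row = [1, 2, 3]

# Record the list in a set by its string key.
seen = {str_hash_key(row)}

# Mutate the list after it has been keyed.
row.append(4)

# The same object now produces a different key, so the lookup misses.
still_found = str_hash_key(row) in seen
```

After the mutation `still_found` is `False`: the stored key no longer matches the object it was derived from, which is the kind of silently broken invariant an implicit fallback could introduce.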

@ezerkar
Author

ezerkar commented Jun 1, 2021

Thanks for your comment.
Let me disagree here. Having mutable objects in DataFrame cells is possible (explode, for instance, is designed for exactly that situation), and [[1,2,3], [1,2,3]] are duplicates by any definition, so why should someone expect an error when running drop_duplicates on a df that happens to have lists in one of its columns?
I get that unhashable values do not fit the regular implementation of groupby, and it makes sense that handling mutable objects would be slower, but I don't think it should raise an error.
At any rate please refer me to the duplicate

Thanks

@mzeitlin11
Member

> At any rate please refer me to the duplicate

That label just refers to the fact that this is related to drop_duplicates :)

> why should someone expect an error when running drop_duplicates on a df that happens to have lists in one of its columns?

I think this makes sense given that it is default Python behavior, e.g.

>>> set([[1]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

I think hashing mutable objects might surprise users who have a mutable object in their frame by mistake (which a failure would make very clear). If the intent actually is to drop duplicates of something mutable, I think it makes sense for the user to have to explicitly define how the hashing should be done (e.g. by converting to string first).
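As a rough sketch of what "explicitly define how the hashing should be done" could look like on the user side: the example below converts each list to a tuple rather than a string, which is a swapped-in variant of the string conversion mentioned above and assumes the list elements are themselves hashable. The frame and the names `id`, `tags`, and `key` are invented for the illustration.

```python
import pandas as pd

# Hypothetical example data: "tags" holds lists, which are unhashable.
df = pd.DataFrame({"id": [1, 2, 3], "tags": [[1, 2, 3], [1, 2, 3], [4]]})

# The user decides how rows are compared: here, by the tuple form of each list.
key = df["tags"].map(tuple)

# Keep the first row for every distinct key; the original lists are untouched.
deduped = df[~key.duplicated()]
```

The second row is dropped because its list equals the first one, and the surviving rows still hold the original list objects.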

@mzeitlin11
Member

(As a side note, if your use case is specifically about better list support, this would probably be supported if we ever have a dedicated ListDType (xref #35176).)

@ezerkar
Author

ezerkar commented Jun 1, 2021

Thanks for clarifying the label, LOL.
While I see your point, in my opinion the basic logic of groupby/drop_duplicates does not necessarily involve hashing, so it should not fail. At the very least, the documentation should reflect this.
But I guess that if leaving the situation as is has its pros, we should not make any changes.
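To illustrate the point that duplicate detection does not strictly require hashing, here is a minimal plain-Python sketch that relies only on `==`. The function name `drop_duplicates_by_equality` is hypothetical, and this is not how pandas implements `drop_duplicates`. The cost is quadratic in the number of items, which is the slowdown acknowledged earlier in the thread.

```python
def drop_duplicates_by_equality(items):
    """Keep the first occurrence of each item, comparing with == only."""
    unique = []
    for item in items:
        # Linear scan over what has been kept so far: no hashing involved.
        if not any(item == kept for kept in unique):
            unique.append(item)
    return unique


rows = [[1, 2, 3], [1, 2, 3], {"a": 1}, {"a": 1}, [4]]
unique_rows = drop_duplicates_by_equality(rows)
```

Lists and dicts are handled alike because only equality is used, but every new item is compared against all items kept so far.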

@mroeschke
Member

Thanks for the report, but agreed, I would be -1 as well due to the unexpected behavior; it would be best if an external extension array contained the logic to support this case. Closing since this enhancement request hasn't gotten support from the core team, but happy to reopen if there is revived interest.


3 participants