Unstable hashtable / duplicated algo for object dtype #27035
In the class methods, we check that the object is hashable (before doing a [...]); see pandas/pandas/_libs/hashtable_class_helper.pxi.in, lines 1019 to 1034 in f0919f2,
while in the function implementations (such as duplicated), we don't do that check. E.g. pandas/pandas/_libs/hashtable_func_helper.pxi.in, lines 153 to 156 in f0919f2.
The "pymap" hashtable needs the values to be hashable, but doesn't actually check that? Should we add a [...]? cc @jreback
@jorisvandenbossche yeah, you could try that here, but this may introduce a performance issue.
From a quick test, it seems that adding a [...] So that's a considerable slowdown. A "proper" fix might be to check in the actual [...] An alternative could be doing a "best effort" check by hashing the first element. So if you have a full object series of unhashable objects, that would at least be caught. But it is not that nice that whether the error is raised depends on the order of the values (e.g. if you have a missing value in the first location).
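The trade-off between a full check and the "best effort" check can be sketched in plain Python. This is only an illustration, not the pandas implementation: `check_all_hashable`, `check_first_hashable` and `raises_type_error` are hypothetical helper names, and plain lists stand in for unhashable objects.

```python
def check_all_hashable(values):
    """Full check: call hash() on every element (one extra hash call per value)."""
    for v in values:
        hash(v)  # raises TypeError for unhashable objects


def check_first_hashable(values):
    """Best-effort check: hash only the first element, then stop."""
    for v in values:
        hash(v)
        break


def raises_type_error(check, values):
    """Return True if running `check` on `values` raises TypeError."""
    try:
        check(values)
    except TypeError:
        return True
    return False


# Lists are unhashable; None stands in for a missing value.
all_unhashable = [[1], [2], [3]]
missing_first = [None, [2], [3]]

print(raises_type_error(check_first_hashable, all_unhashable))  # True
print(raises_type_error(check_first_hashable, missing_first))   # False
print(raises_type_error(check_all_hashable, missing_first))     # True
```

The second call shows the order dependence: with a hashable missing value in the first position, the best-effort check lets the unhashable values through.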
xref #12693, where other unhashable objects raise (which is probably the correct behavior with the current implementation, or perhaps it should raise a NotImplementedError, and #12693 should maybe be an enhancement request instead). However, the code in the OP is no longer unstable AFAICT.
Indeed it does with a DataFrame with more than one column: pd.Series(a).to_frame().assign(dup=lambda x: x[0]).duplicated()
Looks like this was fixed sometime after 1.2.5.
From a flaky test in geopandas, I observed the following behaviour:
So you see that sometimes it works, sometimes it does not work.
I am also not fully sure how the object hashtable works (assuming duplicated uses the hashtable), as the shapely Point objects are not hashable.
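As a minimal pure-Python sketch of what "not hashable" means for a hash-based container: `FakePoint` below is a hypothetical stand-in, not shapely's `Point`, and a plain `dict` is used in place of the pandas object hashtable. It assumes only standard Python 3 semantics, where defining `__eq__` without `__hash__` makes instances unhashable.

```python
class FakePoint:
    """Hypothetical stand-in for an unhashable geometry object."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    # Defining __eq__ without __hash__ sets __hash__ to None in Python 3,
    # which makes instances unhashable.
    def __eq__(self, other):
        return isinstance(other, FakePoint) and (self.x, self.y) == (other.x, other.y)


p = FakePoint(1, 1)

try:
    hash(p)
    hashable = True
except TypeError:
    hashable = False

try:
    seen = {p: 0}  # a dict needs hash(p) to place the key
    dict_ok = True
except TypeError:
    dict_ok = False

print(hashable)  # False
print(dict_ok)   # False
```

A container that calls `hash()` on its keys raises on the first such object, so a code path that accepts these values silently cannot be relying on a checked `hash()` call.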