
Unstable hashtable / duplicated algo for object dtype #27035

Open

jorisvandenbossche opened this issue Jun 25, 2019 · 6 comments
Labels
Bug, duplicated (duplicated, drop_duplicates)

Comments

jorisvandenbossche (Member) commented Jun 25, 2019

From a flaky test in geopandas, I observed the following behaviour:

In [1]: pd.__version__
Out[1]: '0.25.0.dev0+791.gf0919f272'

In [2]: from shapely.geometry import Point 

In [3]: a = np.array([Point(1, 1), Point(1, 1)], dtype=object) 

In [4]: pd.Series(a).duplicated()
Out[4]: 
0    False
1     True
dtype: bool

In [6]: print(pd.Series(a).duplicated()) 
   ...: print(pd.Series(a).duplicated())
0    False
1     True
dtype: bool
0    False
1    False
dtype: bool

So you can see that sometimes the duplicate is detected and sometimes it is not; the result is not deterministic.

I am also not fully sure how the object hashtable works (assuming duplicated uses the hashtable), as the shapely Point objects are not hashable:

In [9]: pd.Series(a).unique()
...
TypeError: unhashable type: 'Point'
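
For context, a minimal check of the hashability itself (a sketch assuming Shapely 1.x, where BaseGeometry disables hashing):

from shapely.geometry import Point

p = Point(1, 1)
print(p.__hash__)   # None on Shapely 1.x: hashing is explicitly disabled
hash(p)             # raises TypeError: unhashable type: 'Point'
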
jorisvandenbossche added the Bug and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels on Jun 25, 2019
jorisvandenbossche (Member, Author):

In the class methods, we check that the object is hashable (before doing a kh_put_pymap). Eg in PyObjectHashTable.unique:

for i in range(n):
    val = values[i]
    hash(val)
    if ignore_na and ((val != val or val is None)
                      or (use_na_value and val == na_value)):
        # if missing values do not count as unique values (i.e. if
        # ignore_na is True), skip the hashtable entry for them, and
        # replace the corresponding label with na_sentinel
        labels[i] = na_sentinel
        continue
    k = kh_get_pymap(self.table, <PyObject*>val)
    if k == self.table.n_buckets:
        # k hasn't been seen yet
        k = kh_put_pymap(self.table, <PyObject*>val, &ret)

while in the templated function implementations (such as duplicated), we don't do that check. E.g.:

{{if dtype == 'object'}}
for i from n > i >= 0:
    kh_put_{{ttype}}(table, <PyObject*>values[i], &ret)
    out[i] = ret == 0

The "pymap" hashtable needs the values to be hashable, but doesn't actually check that? Should we add a hash(val) in those functions as well?

cc @jreback

jreback (Contributor) commented Jun 25, 2019

@jorisvandenbossche yeah you could try that here

but this may introduce a performance issue

jorisvandenbossche (Member, Author):

From a quick test, adding a hash(val) call in the for loop that fills the hashtable gives a ~20% slowdown (45ms -> 55ms for a string Series of 3 million elements).

So that's a considerable slowdown.
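
For reference, a rough sketch of how such a timing could be reproduced (illustrative sizes and names only, not the exact benchmark used above):

import time
import numpy as np
import pandas as pd

# Illustrative timing only: a 3-million-element object Series of strings,
# exercising the object-dtype duplicated path; absolute numbers will differ.
values = np.array(["item_%d" % (i % 100_000) for i in range(3_000_000)], dtype=object)
s = pd.Series(values)

start = time.perf_counter()
s.duplicated()
print("duplicated: %.3fs" % (time.perf_counter() - start))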

A "proper" fix might be to check in the actual khash C code for a return value of -1 of PyObject_Hash, but I would rather not start meddling with that implementation.

An alternative could be a "best effort" check that hashes only the first element. An object Series consisting entirely of unhashable objects would then at least be caught. But it is not great that whether the error is raised depends on the order of the values (e.g. if the first location holds a missing value, the check is skipped); see the sketch below.
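
A rough sketch of what such a best-effort check could look like (hypothetical helper, not existing pandas code):

def _check_first_hashable(values) -> None:
    # Hypothetical "best effort" guard: hash only the first element, so a
    # column made up entirely of unhashable objects fails fast, but an
    # unhashable object sitting behind e.g. a leading NaN still slips through.
    if len(values):
        hash(values[0])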

jorisvandenbossche added a commit to geopandas/geopandas that referenced this issue on Jun 28, 2019
jorisvandenbossche added this to the Contributions Welcome milestone on Jul 6, 2019
mroeschke added the duplicated (duplicated, drop_duplicates) label and removed the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) label on May 13, 2020
simonjayhawkins (Member):

as the shapely Point objects are not hashable:

xref #12693, where other unhashable objects raise (which is probably the correct behavior with the current implementation, or perhaps they should raise a NotImplementedError, in which case #12693 should maybe be an enhancement request instead).

however, the code in the OP is no longer unstable AFAICT

simonjayhawkins (Member):

xref #12693, where other unhashable objects raise

indeed it does raise with a DataFrame with more than one column:

pd.Series(a).to_frame().assign(dup=lambda x: x[0]).duplicated()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 pd.Series(a).to_frame().assign(dup=lambda x: x[0]).duplicated()

File ~/pandas/pandas/core/frame.py:6527, in DataFrame.duplicated(self, subset, keep)
   6525 else:
   6526     vals = (col.values for name, col in self.items() if name in subset)
-> 6527     labels, shape = map(list, zip(*map(f, vals)))
   6529     ids = get_group_index(
   6530         labels,
   6531         # error: Argument 1 to "tuple" has incompatible type "List[_T]";
   (...)
   6535         xnull=False,
   6536     )
   6537     result = self._constructor_sliced(duplicated(ids, keep), index=self.index)

File ~/pandas/pandas/core/frame.py:6495, in DataFrame.duplicated.<locals>.f(vals)
   6494 def f(vals) -> tuple[np.ndarray, int]:
-> 6495     labels, shape = algorithms.factorize(vals, size_hint=len(self))
   6496     return labels.astype("i8", copy=False), len(shape)

File ~/pandas/pandas/core/algorithms.py:738, in factorize(values, sort, na_sentinel, size_hint)
    736 else:
    737     values = np.asarray(values)  # convert DTA/TDA/MultiIndex
--> 738     codes, uniques = factorize_array(
    739         values, na_sentinel=na_sentinel, size_hint=size_hint
    740     )
    742 if sort and len(uniques) > 0:
    743     uniques, codes = safe_sort(
    744         uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
    745     )

File ~/pandas/pandas/core/algorithms.py:547, in factorize_array(values, na_sentinel, size_hint, na_value, mask)
    544 hash_klass, values = _get_hashtable_algo(values)
    546 table = hash_klass(size_hint or len(values))
--> 547 uniques, codes = table.factorize(
    548     values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
    549 )
    551 # re-cast e.g. i8->dt64/td64, uint8->bool
    552 uniques = _reconstruct_data(uniques, original.dtype, original)

File ~/pandas/pandas/_libs/hashtable_class_helper.pxi:5303, in pandas._libs.hashtable.PyObjectHashTable.factorize()

File ~/pandas/pandas/_libs/hashtable_class_helper.pxi:5219, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'Point'

simonjayhawkins (Member) commented Jun 11, 2022

however, the code in the OP is no longer unstable AFAICT

looks like this was fixed sometime after 1.2.5
