
Unstable hashtable / duplicated algo for object dtype #27035

Open

jorisvandenbossche opened this issue Jun 25, 2019 · 6 comments
Labels
Bug, duplicated (duplicated, drop_duplicates)

Comments

jorisvandenbossche (Member) commented Jun 25, 2019

From a flaky test in geopandas, I observed the following behaviour:

In [1]: pd.__version__
Out[1]: '0.25.0.dev0+791.gf0919f272'

In [2]: from shapely.geometry import Point 

In [3]: a = np.array([Point(1, 1), Point(1, 1)], dtype=object) 

In [4]: pd.Series(a).duplicated()
Out[4]: 
0    False
1     True
dtype: bool

In [6]: print(pd.Series(a).duplicated()) 
   ...: print(pd.Series(a).duplicated())
0    False
1     True
dtype: bool
0    False
1    False
dtype: bool

So you can see that sometimes the duplicate is detected and sometimes it is not; the result is not deterministic.

I am also not fully sure how the object hashtable works (assuming duplicated uses the hashtable), as the shapely Point objects are not hashable:

In [9]: pd.Series(a).unique()
...
TypeError: unhashable type: 'Point'
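
For context, a minimal check of the hashability itself (a sketch assuming Shapely 1.x, where BaseGeometry disables hashing):

from shapely.geometry import Point

p = Point(1, 1)
print(p.__hash__)   # None on Shapely 1.x: hashing is explicitly disabled
hash(p)             # raises TypeError: unhashable type: 'Point'
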
jorisvandenbossche added the Bug and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels on Jun 25, 2019
jorisvandenbossche (Member, Author):

In the class methods, we check that the object is hashable (before doing a kh_put_pymap). Eg in PyObjectHashTable.unique:

for i in range(n):
    val = values[i]
    hash(val)
    if ignore_na and ((val != val or val is None)
                      or (use_na_value and val == na_value)):
        # if missing values do not count as unique values (i.e. if
        # ignore_na is True), skip the hashtable entry for them, and
        # replace the corresponding label with na_sentinel
        labels[i] = na_sentinel
        continue
    k = kh_get_pymap(self.table, <PyObject*>val)
    if k == self.table.n_buckets:
        # k hasn't been seen yet
        k = kh_put_pymap(self.table, <PyObject*>val, &ret)

while in the templated function implementations (such as duplicated), we don't do that check. E.g.:

{{if dtype == 'object'}}
for i from n > i >= 0:
    kh_put_{{ttype}}(table, <PyObject*>values[i], &ret)
    out[i] = ret == 0

The "pymap" hashtable needs the values to be hashable, but doesn't actually check that? Should we add a hash(val) in those functions as well?

cc @jreback

jreback (Contributor) commented Jun 25, 2019

@jorisvandenbossche yeah you could try that here

but this may introduce a performance issue

jorisvandenbossche (Member, Author):

From a quick test, adding a hash(val) call in the for loop that fills the hashtable gives a ~20% slowdown (45ms -> 55ms for a string Series of 3 million elements).

So that's a considerable slowdown.
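
For reference, a rough sketch of how such a timing could be reproduced (illustrative sizes and names only, not the exact benchmark used above):

import time
import numpy as np
import pandas as pd

# Illustrative timing only: a 3-million-element object Series of strings,
# exercising the object-dtype duplicated path; absolute numbers will differ.
values = np.array(["item_%d" % (i % 100_000) for i in range(3_000_000)], dtype=object)
s = pd.Series(values)

start = time.perf_counter()
s.duplicated()
print("duplicated: %.3fs" % (time.perf_counter() - start))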

A "proper" fix might be to check in the actual khash C code for a return value of -1 of PyObject_Hash, but I would rather not start meddling with that implementation.

An alternative could be a "best effort" check that hashes only the first element. An object Series consisting entirely of unhashable objects would then at least be caught. But it is not great that whether the error is raised depends on the order of the values (e.g. if the first location holds a missing value, the check is skipped); see the sketch below.
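
A rough sketch of what such a best-effort check could look like (hypothetical helper, not existing pandas code):

def _check_first_hashable(values) -> None:
    # Hypothetical "best effort" guard: hash only the first element, so a
    # column made up entirely of unhashable objects fails fast, but an
    # unhashable object sitting behind e.g. a leading NaN still slips through.
    if len(values):
        hash(values[0])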

jorisvandenbossche added a commit to geopandas/geopandas that referenced this issue on Jun 28, 2019
jorisvandenbossche added this to the Contributions Welcome milestone on Jul 6, 2019
mroeschke added the duplicated (duplicated, drop_duplicates) label and removed the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) label on May 13, 2020
simonjayhawkins (Member):

as the shapely Point objects are not hashable:

xref #12693, where other unhashable objects raise (which is probably the correct behavior with the current implementation, or perhaps they should raise a NotImplementedError, in which case #12693 should maybe be an enhancement request instead).

however, the code in the OP is no longer unstable AFAICT

simonjayhawkins (Member):

xref #12693, where other unhashable objects raise

indeed it does raise with a DataFrame with more than one column:

pd.Series(a).to_frame().assign(dup=lambda x: x[0]).duplicated()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 pd.Series(a).to_frame().assign(dup=lambda x: x[0]).duplicated()

File ~/pandas/pandas/core/frame.py:6527, in DataFrame.duplicated(self, subset, keep)
   6525 else:
   6526     vals = (col.values for name, col in self.items() if name in subset)
-> 6527     labels, shape = map(list, zip(*map(f, vals)))
   6529     ids = get_group_index(
   6530         labels,
   6531         # error: Argument 1 to "tuple" has incompatible type "List[_T]";
   (...)
   6535         xnull=False,
   6536     )
   6537     result = self._constructor_sliced(duplicated(ids, keep), index=self.index)

File ~/pandas/pandas/core/frame.py:6495, in DataFrame.duplicated.<locals>.f(vals)
   6494 def f(vals) -> tuple[np.ndarray, int]:
-> 6495     labels, shape = algorithms.factorize(vals, size_hint=len(self))
   6496     return labels.astype("i8", copy=False), len(shape)

File ~/pandas/pandas/core/algorithms.py:738, in factorize(values, sort, na_sentinel, size_hint)
    736 else:
    737     values = np.asarray(values)  # convert DTA/TDA/MultiIndex
--> 738     codes, uniques = factorize_array(
    739         values, na_sentinel=na_sentinel, size_hint=size_hint
    740     )
    742 if sort and len(uniques) > 0:
    743     uniques, codes = safe_sort(
    744         uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
    745     )

File ~/pandas/pandas/core/algorithms.py:547, in factorize_array(values, na_sentinel, size_hint, na_value, mask)
    544 hash_klass, values = _get_hashtable_algo(values)
    546 table = hash_klass(size_hint or len(values))
--> 547 uniques, codes = table.factorize(
    548     values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
    549 )
    551 # re-cast e.g. i8->dt64/td64, uint8->bool
    552 uniques = _reconstruct_data(uniques, original.dtype, original)

File ~/pandas/pandas/_libs/hashtable_class_helper.pxi:5303, in pandas._libs.hashtable.PyObjectHashTable.factorize()

File ~/pandas/pandas/_libs/hashtable_class_helper.pxi:5219, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'Point'

simonjayhawkins (Member) commented Jun 11, 2022

however, the code in the OP is no longer unstable AFAICT

looks like this was fixed sometime after 1.2.5
