
BUG: DataFrame.drop_duplicates confuses NULL bytes #34551

Open
marco-neumann-by opened this issue Jun 3, 2020 · 12 comments
Labels
Bug · duplicated (duplicated, drop_duplicates) · hashing (hash_pandas_object) · Strings (string extension data type and string data)

Comments

marco-neumann-by commented Jun 3, 2020

Code Sample, a copy-pastable example

import pandas as pd
import pandas.testing as pdt

df = pd.DataFrame({"col": ["", "\0"]})
ser = df["col"].copy()

ser_actual = ser.drop_duplicates()
ser_expected = pd.Series(["", "\0"], name="col")
pdt.assert_series_equal(ser_actual, ser_expected)  # passes

df_actual = df.drop_duplicates()
df_expected = pd.DataFrame({"col": ["", "\0"]})
pdt.assert_frame_equal(df_actual, df_expected)  # fails, only a single row left

Problem description

The test fails; note in particular the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates.
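On affected versions, one possible workaround (a sketch, not an official recommendation) is to build the mask with Series.duplicated, which takes the object-dtype hash table path and does not truncate at NUL bytes:

```python
import pandas as pd

df = pd.DataFrame({"col": ["", "\0"]})

# Series.duplicated goes through PyObjectHashTable, which hashes the full
# Python string, so both rows survive the mask.
deduped = df[~df["col"].duplicated()]
print(deduped)
```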

Expected Output

Test passes.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@marco-neumann-by marco-neumann-by added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 3, 2020
@jorisvandenbossche (Member)

@marco-neumann-jdas Thanks for the report! That's indeed clearly a bug.

The difference can also be seen in duplicated:

In [88]: df.duplicated()  
Out[88]: 
0    False
1     True
dtype: bool

In [89]: ser.duplicated()   
Out[89]: 
0    False
1    False
Name: col, dtype: bool

Now, under the hood, it seems that for DataFrame this is based on the factorize algorithm, and for Series on a custom duplicated algorithm:

In [97]: pd.core.algorithms.duplicated(ser)  
Out[97]: array([False, False])

In [98]: pd.core.algorithms.factorize(ser) 
Out[98]: (array([0, 0]), Index([''], dtype='object'))

And here you can see that factorize only sees a single unique value. So I think the bug is inside the factorize code.

And then, for factorize we actually have a string-specialized hash table (while duplicated only uses a general object-dtype specialization). So for factorize, comparing those two hash tables:

In [99]: from pandas._libs.hashtable import StringHashTable, PyObjectHashTable 

In [100]: table = StringHashTable(len(ser))  

In [101]: table.factorize(ser.values)    
Out[101]: (array([''], dtype=object), array([0, 0]))

In [102]: table = PyObjectHashTable(len(ser)) 

In [103]: table.factorize(ser.values)  
Out[103]: (array(['', '\x00'], dtype=object), array([0, 1]))

indeed shows a difference.

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 4, 2020
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Jun 4, 2020
@jorisvandenbossche (Member)

Ah, #32993 seems to be a similar report.

@MJafarMashhadi (Contributor)

It comes down to the use of get_c_string:

cdef inline const char* get_c_string(str py_string) except NULL:
return get_c_string_buf_and_size(py_string, NULL)
which ignores the returned string length.
I replaced that with PyUnicode_AsUTF8AndSize and got the string length as an integer in the _unique method of StringHashTable.
It stores all the values in the vecs array and then passes them one by one to kh_get_str, which assumes the input string is a C-style null-terminated one.

v = vecs[i]
k = kh_get_str(self.table, v)

The very existence of get_c_string, which ignores the string length, is the source of all the trouble. Since it is only used in StringHashTable, I reckon it exists because khash's kh_get_str doesn't accept a string length as input.

I couldn't go any further because I got a bit confused about khash; the definition of kh_get_str is there but I couldn't find its implementation. I would appreciate your input @jorisvandenbossche

Is it an external library or is it a part of pandas? Can we make changes to that? If it is an external library and we cannot change it, what is the best way to go? I am guessing reporting the issue to khash developers and putting some warnings in docs and todo comments in the code about it.


Quick and dirty debugging with a print:

pandas/_libs/hashtable_class_helper.pxi.in:810

try:
    v = get_c_string_buf_and_size(<str>val, &string_length)
except UnicodeEncodeError:
    v = get_c_string_buf_and_size(<str>repr(val), &string_length)
print('what hasher sees:', v, 'what it actually is:', v[:string_length])

Output:

$ python -c 'import pandas as pd; pd.DataFrame({"col": ["a\0Jason", "a\0Kyle"]}).duplicated()'               
what hasher sees: b'a' what it actually is: b'a\x00Jason'
what hasher sees: b'a' what it actually is: b'a\x00Kyle'
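The truncation shown above is the generic C-string problem rather than anything pandas-specific. A minimal, pandas-free illustration using ctypes (a char* carries no length field, so reading it back stops at the first NUL byte):

```python
import ctypes

# Round-trip bytes through a C char*: c_char_p.value reads the pointer
# back as a NUL-terminated string, so everything after "\x00" is lost.
for raw in (b"a\x00Jason", b"a\x00Kyle"):
    seen = ctypes.c_char_p(raw).value
    print(raw, "-> what a char* consumer sees:", seen)  # b'a' in both cases
```

Both inputs collapse to b'a', which is exactly why a hash table keyed on char* (such as khash's string table) cannot distinguish them.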

@mroeschke mroeschke added duplicated duplicated, drop_duplicates Strings String extension data type and string data and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 7, 2021
@josteinbf commented Oct 14, 2021

I've seen the same problem with groupby:

import pandas as pd

print(pd._version.get_versions())

assert (pd.Series(['\x00']) != pd.Series([''])).all()           # ok
assert len(pd.Series(['\x00', '']).to_frame('col1').groupby('col1')) == 2   # fails

My pandas version:

{'date': '2021-09-12T11:02:49+0100', 'dirty': False, 'error': None, 'full-revisionid': '73c68257545b5f8530b7044f56647bd2db92e2ba', 'version': '1.3.3'}

Also, numpy has an issue which seems very similar, see numpy/numpy#20118; however, in numpy this is considered a consequence of how numpy stores strings, and not something that will be changed.
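For reference, the numpy behavior mentioned above is easy to observe: fixed-width unicode arrays pad each element with NUL characters, so trailing NULs are indistinguishable from padding and are stripped when elements are read back:

```python
import numpy as np

# dtype '<U1': each element is stored fixed-width and NUL-padded, so a
# stored "\x00" reads back the same as an empty string.
a = np.array(["", "\x00"])
print(repr(a[0]), repr(a[1]))
print(a[0] == a[1])  # the two elements compare equal
```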

@cr-perry commented Jan 16, 2022

I'm experiencing the same issue with the creation of a pd.MultiIndex. Given two distinct input values that are identical up to their null character, the index maps them to a single code value, and both then end up assigned the first string value.

My research trail led me to factorize, StringHashTable and (the elusive) kh_get_str. Not sure how to proceed but happy to help (and of course, add my +1 for the issue).

My reproduction:

>>> a = pd.MultiIndex.from_tuples([('test\x001',), ('test\x002',)])
>>> print(a.levels, a.codes)
[['test1']] [[0, 0]]

note: pandas 1.0.1

@DeanLa (Contributor) commented Mar 8, 2022

I'm experiencing something similar with nunique, groupby and value_counts.

s = pd.Series(['foo', 'b\x00a\x00z\x00', 'b\x00az'])
s.nunique() # result: 2
s.value_counts().shape[0] # result: 3
s.to_frame('bar').assign(value=1).groupby('bar').value.count().shape[0] # result: 2

Note that which answer is "correct" is open to debate. What's important here is consistency.

pandas : 1.3.5
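As a reference point for which answer should be considered correct: Python's own hash-based containers use the string's length rather than NUL termination, so by plain Python semantics all three values are distinct:

```python
# str.__hash__ and == treat "\x00" as an ordinary character, so a set
# keeps all three values apart.
values = ["foo", "b\x00a\x00z\x00", "b\x00az"]
print(len(set(values)))  # 3
```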

@Maaslak commented Mar 10, 2022

I'm having a similar issue with MultiIndex. I suppose that during deduplication, characters after \x00 are not examined. As a result, column names that differ only in characters after that point are considered the same.

>>> list(pd.MultiIndex.from_tuples([('dstrings', '\x00\x01'), ('dstrings', '\x00\xff')]))
[('dstrings', '\x00\x01'), ('dstrings', '\x00\x01')]

@MJafarMashhadi (Contributor) commented Mar 10, 2022

Anywhere a string is used in a hash map/table (indices, groupbys, column names (which are also indices), count-unique, etc.) this will occur, since the hashing logic is implemented in C, where strings are null-terminated. Refer to my previous comment: #34551 (comment)

@simonjayhawkins (Member)

> note the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates

Note that since #45534 the behavior is consistent for a Series and a single column DataFrame, but the issue still persists with a multi-column DataFrame cc @phofl
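The split described here can be checked directly (a sketch; the multi-column result depends on whether the running pandas version is still affected):

```python
import pandas as pd

single = pd.DataFrame({"col": ["", "\0"]})
multi = pd.DataFrame({"col": ["", "\0"], "other": [1, 2]})

# Single-column frames share the Series implementation since #45534,
# so both rows are kept.
print(len(single.drop_duplicates()))  # 2

# Multi-column frames still run through the older code path and may
# collapse to a single row on affected versions.
print(len(multi.drop_duplicates()))
```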

@phofl (Member) commented Jun 13, 2022

This is because I aligned the implementation of both cases with each other. Multi-column drop_duplicates still runs through the previous code paths.

@simonjayhawkins (Member)

> And then, for factorize we actually have a string-specialized hash table (while duplicated only has a general object dtype specialization that is used). So for factorize, comparing those two hashtables:

There is an open issue to use StringHashTable for value_counts / duplicated with strings (#14860), which should address this inconsistency.

@phofl (Member) commented Jun 13, 2022

We are using factorize under the hood for duplicated, which already uses StringHashTable.


10 participants