BUG: DataFrame.drop_duplicates confuses NULL bytes #34551
Comments
@marco-neumann-jdas Thanks for the report! That's indeed clearly a bug. The difference can also be seen further down the stack: under the hood, it seems that for DataFrame this is based on a different code path than for Series, and comparing the two indeed shows a difference.
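A hedged sketch of the kind of difference being described, assuming the DataFrame path hashes string columns through factorize/StringHashTable while plain Python hashing sees the full string (the values here are illustrative):

```python
import numpy as np
import pandas as pd

# Two distinct strings that differ only after the embedded NUL byte.
values = np.array(["a\x00b", "a\x00c"], dtype=object)

# Python-level hashing sees the full strings and keeps both ...
print(len(set(values)))  # 2

# ... while factorize, a building block of the DataFrame code path, hashes
# them through StringHashTable and can conflate them on affected versions.
codes, uniques = pd.factorize(values)
print(len(uniques))  # may be 1 on affected versions
```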
Ah, #32993 seems to be a similar report.
It goes down to using
pandas/pandas/_libs/tslibs/util.pxd Lines 212 to 213 in 0159cba
in the _unique method of StringHashTable. I replaced that with PyUnicode_AsUTF8AndSize and got the string length as an integer. _unique stores all the values in the vecs array and then passes them one by one to kh_get_str, which assumes the input string is a C-style null-terminated one:
pandas/pandas/_libs/hashtable_class_helper.pxi.in Lines 822 to 823 in 0159cba
I couldn't go any further because I got a bit confused about khash: is it an external library or is it a part of pandas? Can we make changes to it? If it is an external library and we cannot change it, what is the best way to go? I am guessing reporting the issue to the khash developers and putting some warnings in the docs and TODO comments in the code about it.
Quick and dirty debugging with a print at pandas/_libs/hashtable_class_helper.pxi.in:810.
Output:
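As a Python-level illustration of the NUL-termination assumption described above (ctypes is used here purely for demonstration; it is not what pandas or khash actually calls):

```python
import ctypes

# UTF-8 bytes of a string with an embedded NUL byte.
raw = "b\x00az".encode("utf-8")

# A C consumer that treats the buffer as a NUL-terminated char* only ever
# sees the part before the first NUL byte ...
print(ctypes.c_char_p(raw).value)  # b'b'

# ... while the real payload is longer, which is the length that
# PyUnicode_AsUTF8AndSize reports.
print(len(raw))  # 4
```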
I've seen the same problem with groupby:

```python
import pandas as pd

print(pd._version.get_versions())

assert (pd.Series(['\x00']) != pd.Series([''])).all()  # ok
assert len(pd.Series(['\x00', '']).to_frame('col1').groupby('col1')) == 2  # fails
```

My pandas version:
Also, numpy has an issue that seems very similar; see numpy/numpy#20118. However, in numpy this is considered a consequence of how numpy stores strings, and not something that will be changed.
I'm experiencing the same issue. My research trail led me to this issue. My reproduction:

note: pandas 1.0.1
I'm experiencing something similar:

```python
import pandas as pd

s = pd.Series(['foo', 'b\x00a\x00z\x00', 'b\x00az'])
s.nunique()  # result: 2
s.value_counts().shape[0]  # result: 3
s.to_frame('bar').assign(value=1).groupby('bar').value.count().shape[0]  # result: 2
```

Note that which answer is "correct" is up for debate. What's important here is consistency.

pandas : 1.3.5
I'm having a similar issue with MultiIndex. I suppose that during deduplication, characters after \x00 are not analyzed; as a result, column names that differ only in the characters after the NUL byte are considered to be the same.
Anywhere that a string is used in a hash map/table (indices, groupbys, column names (which are also indices), count unique, etc.), this will occur, since the hashing logic is implemented in C, where strings are null-terminated. Refer to my previous comment: #34551 (comment)
This is because I aligned the implementation of both cases with each other. Multi-column drop_duplicates still runs through the previous code paths.
There is an open issue to use StringHashTable for value_counts / duplicated with strings (#14860), which should address this inconsistency.
We are using factorize under the hood for duplicated, which already uses StringHashTable.
Code Sample, a copy-pastable example
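A minimal sketch of such a test (an illustrative reconstruction, not necessarily the reporter's exact snippet; the values and assertions are assumptions):

```python
import pandas as pd

# Two distinct strings that differ only after the embedded NUL byte.
values = ["a\x00b", "a\x00c"]

s = pd.Series(values)
df = s.to_frame("col")

# Series.drop_duplicates keeps both values ...
assert len(s.drop_duplicates()) == 2

# ... but on affected versions DataFrame.drop_duplicates collapses them,
# because the column is hashed through C code that stops at the first NUL byte.
assert len(df.drop_duplicates()) == 2  # fails on affected versions
```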
Problem description
Test fails; note especially the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates.

Expected Output

Test passes.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None