BUG: drop_duplicates on categorical not preserving NA #44405

Closed
3 tasks done
phofl opened this issue Nov 12, 2021 · 3 comments
Labels
Bug; Categorical (Categorical Data Type); Duplicate Report (Duplicate issue or pull request); Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

Comments

@phofl
Member

phofl commented Nov 12, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

ser = pd.Series(
    pd.Categorical(
        [True, False, True, False, pd.NA], categories=[True, False], ordered=True
    )
)
ser.drop_duplicates()

Issue Description

NA gets converted to NaN

0     True
1    False
4      NaN
dtype: category
Categories (2, object): [True < False]

Expected Behavior

Returning NA

0     True
1    False
4       NA
dtype: category
Categories (2, object): [True < False]

Installed Versions

INSTALLED VERSIONS
------------------
commit : 01b86ed
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-38-generic
Version : #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0.dev0+1085.g01b86edbbb
numpy : 1.21.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.1
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.28.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.1
gcsfs : 2021.05.0
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.2
sqlalchemy : 1.4.25
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

@phofl phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 12, 2021
@phofl phofl added Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Nov 12, 2021
@jorisvandenbossche
Member

Note that this is not actually related to drop_duplicates. The initial categorical Series you created already shows "NaN" in its repr.

Missing values in a Categorical are stored as -1 in the integer codes (and not in the categories, so this does not depend on the actual data type of the categories).
And we don't yet have general support for a "nullable" Categorical (i.e. using NA and NA semantics for its missing values, and allowing a nullable array to be stored for the categories).
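To illustrate the point about the integer codes, here is a minimal sketch based on the example from the issue. The variable names (`cat`, `codes`, `deduped`) are only for illustration, and the exact repr of the missing entry may differ between pandas versions.

```python
import pandas as pd

cat = pd.Categorical(
    [True, False, True, False, pd.NA], categories=[True, False], ordered=True
)

# The missing entry is encoded as -1 in the integer codes,
# independent of the dtype of the categories.
codes = cat.codes.tolist()
print(codes)  # [0, 1, 0, 1, -1]

# The categories themselves do not contain a missing value.
print(list(cat.categories))

# isna() flags the entry whose code is -1.
print(cat.isna().tolist())

# drop_duplicates keeps the first occurrence of each value, including the missing one;
# the missing entry is already shown as NaN before this call.
ser = pd.Series(cat)
deduped = ser.drop_duplicates()
print(deduped.index.tolist())
```

Since the original `pd.NA` is never stored, only the code -1, the information about which missing-value sentinel was passed in is lost at construction time, not in `drop_duplicates`.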

@jorisvandenbossche jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Nov 12, 2021
@phofl
Member Author

phofl commented Nov 12, 2021

Should we relabel this as an enhancement for a general nullable categorical? I had a quick look under NA - MaskedArrays and have not seen a duplicate.

@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Nov 14, 2021
@simonjayhawkins simonjayhawkins added the duplicated duplicated, drop_duplicates label Jun 10, 2022
@simonjayhawkins
Member

Should we relabel this as an enhancement for a general nullable categorical? I had a quick look under NA - MaskedArrays and have not seen a duplicate.

I think this is covered by #43836 and #29962. Closing to keep the discussion in fewer places.

@simonjayhawkins simonjayhawkins added Duplicate Report Duplicate issue or pull request and removed duplicated duplicated, drop_duplicates labels Jun 11, 2022

4 participants