API: handling of missing values in Index.__contains__ #59765

Open
jorisvandenbossche opened this issue Sep 9, 2024 · 2 comments
Labels
API - Consistency Internal Consistency of API/Behavior Index Related to the Index class or subclasses Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@jorisvandenbossche
Member

The table below gives an overview of the result of:

missing_value in idx

i.e. how Index.__contains__ handles various missing value sentinels as input for the different data types.

| dtype | None | nan | <NA> | NaT |
|:---|:---|:---|:---|:---|
| object-none | True | False | False | False |
| object-nan | False | True | False | False |
| object-NA | False | False | True | False |
| datetime | True | True | True | True |
| period | True | True | True | True |
| timedelta | True | True | True | True |
| float64 | False | True | False | False |
| categorical | True | True | True | True |
| interval | True | True | True | False |
| nullable_int | False | False | True | False |
| nullable_float | False | False | True | False |
| string-python | False | False | False | False |
| string-pyarrow | False | False | False | False |
| str-python | False | False | False | False |

The last three rows, which contain not a single True, are especially problematic; this looks like a bug in StringDtype.

More generally, the behavior is quite inconsistent:

  • For object dtype, we require exact match
  • For datetimelike and categorical, we match any missing-like
  • For interval, we match any missing-like except NaT (not even for an interval dtype with a datetimelike subtype)
  • For float we only match NaN
  • For nullable dtypes (int/float), we only match NA
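For comparison, here is a minimal sketch of what a uniform "match any missing-like" rule could look like, written outside of `Index.__contains__`. The helper name `contains_missing` is hypothetical and not part of pandas; it only combines `pd.isna` and `Index.hasnans`, and it assumes the input is a scalar.

```python
import numpy as np
import pandas as pd


def contains_missing(idx: pd.Index, value: object) -> bool:
    """Hypothetical helper (not a pandas API): True if the scalar `value`
    is missing-like and `idx` holds at least one missing entry."""
    # pd.isna treats None, np.nan, pd.NA and pd.NaT as missing scalars.
    if not pd.isna(value):
        return False
    # Index.hasnans reports whether the index contains any missing value.
    return bool(idx.hasnans)


float_idx = pd.Index([2.0, np.nan], dtype="float64")
object_idx = pd.Index(["a", None], dtype=object)
full_idx = pd.Index([1.0, 2.0], dtype="float64")

sentinels = [None, np.nan, pd.NA, pd.NaT]
float_results = [contains_missing(float_idx, s) for s in sentinels]
object_results = [contains_missing(object_idx, s) for s in sentinels]
full_results = [contains_missing(full_idx, s) for s in sentinels]
print(float_results, object_results, full_results)
```

This deliberately ignores which sentinel is stored, so it is the opposite extreme of the exact-match behavior seen for object dtype; whether that is the desired semantics is exactly the open question here.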

The code to generate the table above:

import numpy as np
import pandas as pd

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan))
}

results = []

for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT]:
        res = val in data
        results.append((dtype, str(val), res))
        
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(columns=df["val"].unique(), index=df["dtype"].unique())

print(df_overview.astype(str).to_markdown())

cc @jbrockmendel I would have expected that we already had issues about this, but I didn't immediately find anything.

@jorisvandenbossche jorisvandenbossche added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Index Related to the Index class or subclasses API - Consistency Internal Consistency of API/Behavior labels Sep 9, 2024
@jbrockmendel
Member

jbrockmendel commented Sep 9, 2024

I'm not aware of a dedicated issue for this either. I think at one point I made a PR trying to make more of the EA subclasses use is_valid_na_for, but that got tabled pending the nan-vs-na topic.

For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. np.timedelta64("NaT") in my_datetimeindex should always be False). Also, Decimal("NaN") should be handled correctly.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Sep 10, 2024

> For the datetimelike cases I think/hope that mismatched NaTs will return False (i.e. np.timedelta64("NaT") in my_datetimeindex should always be False).

Indeed, np.timedelta64("NaT") and np.datetime64("NaT") only give True for a timedelta index and a datetime index, respectively. All other index dtypes return False for those, with one exception: categorical.
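To make "mismatched NaT" concrete: both NumPy scalars are missing values, but they carry different dtype kinds, which is what a kind-aware lookup can key on. The sketch below uses only NumPy; the helper name `nat_kind` is hypothetical, and this is not a claim about how pandas implements the check internally.

```python
import numpy as np


def nat_kind(value):
    """Hypothetical helper: return 'M' for a datetime64 NaT,
    'm' for a timedelta64 NaT, and None for anything else."""
    if isinstance(value, (np.datetime64, np.timedelta64)) and np.isnat(value):
        # dtype.kind is 'M' for datetime64 and 'm' for timedelta64.
        return value.dtype.kind
    return None


dt_nat = np.datetime64("NaT")
td_nat = np.timedelta64("NaT")
valid_dt = np.datetime64("2024-01-01")

print(nat_kind(dt_nat), nat_kind(td_nat), nat_kind(valid_dt))  # M m None
```

A datetime index would then accept only kind 'M' and a timedelta index only kind 'm', which matches the datetime and timedelta rows of the expanded table.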

> Also Decimal("NaN") should be handled correctly.

In the sense that it is not matched in general (again, except for categorical ..). But it also does not seem to be matched for an object-dtype index that contains such a decimal: Decimal("NaN") in pd.Index([Decimal("2.0"), Decimal("NaN")], dtype=object) gives False.
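As background, plain Python containers already behave this way for Decimal("NaN"), because NaN never compares equal to itself and `in` on a list falls back to identity. The sketch below uses only the standard library and says nothing about how pandas performs the object-dtype lookup; the variable names are illustrative.

```python
from decimal import Decimal

nan_a = Decimal("NaN")
nan_b = Decimal("NaN")  # a second, distinct NaN object
values = [Decimal("2.0"), nan_a]

# Quiet NaN never compares equal, not even to another NaN.
nans_equal = nan_a == nan_b

# list.__contains__ checks identity first, so the very same object is found ...
same_object_found = nan_a in values
# ... but a freshly created NaN is not.
other_object_found = nan_b in values

# An explicit NaN test sidesteps the equality problem.
any_decimal_nan = any(isinstance(v, Decimal) and v.is_nan() for v in values)

print(nans_equal, same_object_found, other_object_found, any_decimal_nan)
# False True False True
```

So an equality- or identity-based match on object dtype cannot find a newly constructed Decimal("NaN"); matching it would need an explicit missing-value check of this kind.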


Expanded table:

| dtype | None | nan | <NA> | NaT | np.datetime64('NaT') | np.timedelta64('NaT') | Decimal('NaN') |
|:---|:---|:---|:---|:---|:---|:---|:---|
| object-none | True | False | False | False | False | False | False |
| object-nan | False | True | False | False | False | False | False |
| object-NA | False | False | True | False | False | False | False |
| object-decimal-NaN | False | False | False | False | False | False | False |
| datetime | True | True | True | True | True | False | False |
| period | True | True | True | True | False | False | False |
| timedelta | True | True | True | True | False | True | False |
| float64 | False | True | False | False | False | False | False |
| categorical | True | True | True | True | True | True | True |
| interval | True | True | True | False | False | False | False |
| nullable_int | False | False | True | False | False | False | False |
| nullable_float | False | False | True | False | False | False | False |
| string-python | False | False | False | False | False | False | False |
| string-pyarrow | False | False | False | False | False | False | False |
| str-python | False | False | False | False | False | False | False |

import numpy as np
import pandas as pd
from decimal import Decimal

# from conftest.py
indices_dict = {
    "object-none": pd.Index(["a", None], dtype=object),
    "object-nan": pd.Index(["a", np.nan], dtype=object),
    "object-NA": pd.Index(["a", pd.NA], dtype=object),
    "object-decimal-NaN": pd.Index(["a", Decimal("NaN")], dtype=object),
    "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
    "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
    "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
    "float64": pd.Index([2.0, np.nan], dtype="float64"),
    "categorical": pd.CategoricalIndex(["a", None]),
    "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
    "nullable_int": pd.Index([2, None], dtype="Int64"),
    "nullable_float": pd.Index([2.0, None], dtype="Float32"),
    "string-python": pd.Index(["a", None], dtype="string[python]"),
    "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
    "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan))
}

results = []

for dtype, data in indices_dict.items():
    for val in [None, np.nan, pd.NA, pd.NaT, np.datetime64("NaT"), np.timedelta64("NaT"), Decimal("NaN")]:
        res = val in data
        results.append((dtype, repr(val), res))
        
df = pd.DataFrame(results, columns=["dtype", "val", "result"])
df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(columns=df["val"].unique(), index=df["dtype"].unique())

print(df_overview.astype(str).to_markdown())
