-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG (string dtype): looking up missing value in the Index fails #59879
Comments
PR #60329 fixes the main bug mentioned above ( But besides fixing the most obvious bug, there is still the question about how to deal with backwards compatibility. >>> pd.options.future.infer_string = False
# this shows `get_loc`, but this then applies to `ser[None]` or `None in idx` etc
>>> pd.Index(["a", "b", None]).get_loc(None)
2 because the index is inferred to be However, with the future string dtype enabled, we no longer infer it as object dtype, but as string dtype and at that point coerce any missing values (None, pd.NA) to NaN. As a result, if we are strict about the index lookup (only the exact sentinel gives a match), the same example as above but with the future string dtype enabled will fail: >>> pd.options.future.infer_string = True
>>> pd.Index(["a", "b", None]).get_loc(None)
...
KeyError: None This is a backwards incompatible change for people that were using To preserve backwards compatibility, I think it would make sense to also coerce missing values to NaN as input for the indexer ( To make it more complex: first, how missing value sentinels are matched in indexing operations is quite inconsistent right now across dtypes (i.e. some other dtypes like datetime-like and categorical also coerce any missing value sentinel, and so are not strict), see #59765 for a detailed overview. Secondly, the above discussion is for scalar lookup ( >>> pd.options.future.infer_string = True
>>> idx = pd.Index(["a", "b", None]) # gets inferred as `str` dtype
>>> idx.get_indexer(["a", None])
array([0, 2]) Here we also infer cc @pandas-dev/pandas-core |
In the very long term, I think only I'm +1 on special casing |
This isn't how that works today though, right? >>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", np.nan]).get_loc(None)
KeyError: None I think there is something that is going to be lost in translation either way when going from object-backed storage to a real string data type, but I think it is the easiest for users to understand if we are more strict about how missing values are handled |
It depends on how you look at it. For object dtype we don't do any coercing of the input in constructors, so we also don't do that for the input for the indexer ( It is true that whatever choice we make, there is some change in behaviour ( And in addition, keeping |
Ah this is a great point. OK I am fine with continuing along that path then. Long term, it might get tricky with wanting to store None and missing values alongside one another in an object index, but that probably makes sense to solve alongside with the disambiguation of np.nan and pd.NA with floating point dtypes that has been discussed in PDEP 16 |
…44195) ### Rationale for this change With pandas' [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html) proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype). This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (`pd.options.future.infer_string = True`). To prepare for that, we should start using that string dtype in `to_pandas()` conversions when that option is enabled. ### What changes are included in this PR? - If pandas >= 3.0 is used or the pandas option is enabled, ensure that `to_pandas()` calls use the default string dtype of pandas for string-like columns (string, large_string, string_view) ### Are these changes tested? It is tested in the pandas-nightly crossbow build. There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879) ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** Depending on the version of pandas, `to_pandas()` will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side. * GitHub Issue: #43683 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…44195) ### Rationale for this change With pandas' [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html) proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype). This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (`pd.options.future.infer_string = True`). To prepare for that, we should start using that string dtype in `to_pandas()` conversions when that option is enabled. ### What changes are included in this PR? - If pandas >= 3.0 is used or the pandas option is enabled, ensure that `to_pandas()` calls use the default string dtype of pandas for string-like columns (string, large_string, string_view) ### Are these changes tested? It is tested in the pandas-nightly crossbow build. There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879) ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** Depending on the version of pandas, `to_pandas()` will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side. * GitHub Issue: #43683 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…44195) ### Rationale for this change With pandas' [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html) proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype). This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (`pd.options.future.infer_string = True`). To prepare for that, we should start using that string dtype in `to_pandas()` conversions when that option is enabled. ### What changes are included in this PR? - If pandas >= 3.0 is used or the pandas option is enabled, ensure that `to_pandas()` calls use the default string dtype of pandas for string-like columns (string, large_string, string_view) ### Are these changes tested? It is tested in the pandas-nightly crossbow build. There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879) ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** Depending on the version of pandas, `to_pandas()` will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side. * GitHub Issue: #43683 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
#59765 is the general issue about consistency in index lookups for missing values, but shorter term, we at least need to fix being able to lookup missing values for the future string dtype.
With object dtype, this works fine:
But when enabling the string dtype, neither None or np.nan work as key:
The text was updated successfully, but these errors were encountered: