
PDEP-16: Consistent missing value handling #58988

Draft: jorisvandenbossche wants to merge 2 commits into main
Conversation

@jorisvandenbossche (Member) commented Jun 12, 2024

Note: this is only a draft, partial proposal. It already lists the main points of the proposal at a high level, but we still need to add many of the details, specifically about backwards compatibility and how we can move towards the end goal (I think those last aspects will eventually require the most discussion; the devil is in the details here ;)).

But I wanted to open this PR already to start the discussion on the main high-level points of the proposal (we can already discuss those and see whether there is agreement), to have a reference for this NA proposal (since it has come up in many other tangentially related discussions as well), and to push myself to actually write up the remaining sections ;)

A summary of all (potentially breaking) behaviour changes that I am aware of can be found here: #58243 (comment)

Abstract

This proposal aims to unify the missing value handling across all dtypes:

  • All data types support missing values and use pd.NA exclusively as the user-facing missing value indicator.
  • All data types implement consistent missing value "semantics" corresponding to the current nullable dtypes using pd.NA.

This proposal tries to provide a high-level way forward and is not meant to address all implementation details (i.e. all dtypes should follow the above principles, but we don't need to decide in this PDEP how exactly this is achieved for every single dtype).
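For readers new to the distinction, a minimal sketch of what these pd.NA "semantics" mean in practice (behaviour of current released pandas, comparing the NumPy-based and nullable float dtypes):

```python
import numpy as np
import pandas as pd

# NumPy-based float64: np.nan marks missing values, but comparisons
# against a missing value return plain False.
s_numpy = pd.Series([1.0, np.nan], dtype="float64")
print((s_numpy == 1.0).tolist())     # [True, False]

# Nullable Float64: pd.NA propagates through comparisons instead,
# yielding a nullable boolean result.
s_nullable = pd.Series([1.0, pd.NA], dtype="Float64")
print((s_nullable == 1.0).tolist())  # [True, <NA>]
```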

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@jorisvandenbossche jorisvandenbossche added the PDEP pandas enhancement proposal label Jun 12, 2024
@WillAyd (Member) left a comment

Thanks for starting this

> Currently, pandas handles missing data differently for different data types. We
> use different types to indicate that a value is missing: ``np.nan`` for
> floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
> strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
(Member)

Maybe after PDEP-14 gets accepted we should revise this to say "str" uses np.nan and "string" uses pd.NA. Boolean data types do not directly support missing values, so are often cast to object as a result.
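For illustration, the difference as it exists today (the object-dtype fallback keeps the original sentinel, while the "string" extension dtype already uses pd.NA):

```python
import pandas as pd

# Object-dtype strings keep whatever sentinel was passed in:
s_obj = pd.Series(["a", None])
print(s_obj.dtype, s_obj[1])   # object None

# The "string" extension dtype normalizes missing values to pd.NA:
s_str = pd.Series(["a", None], dtype="string")
print(s_str.dtype, s_str[1])   # string <NA>
```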

> data. Some other data types, such as integer and bool, cannot store missing data
> or are cast to float or object dtype. In addition, pandas 1.0 introduced a new
> missing value sentinel, ``pd.NA``, which is being used for the experimental
> nullable integer, float, boolean, and string data types, and more recently also
(Member)

Nullable integer data types came out in 0.24.0, back in January 2019. I think after all that time, having survived two major releases (soon 3) without any substantial change, we should stop referring to these as experimental.
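To make the inconsistency described in the quoted passage concrete, a quick illustration (default dtypes in current pandas releases):

```python
import pandas as pd

# Each kind of data ends up with a different missing value representation:
print(pd.Series([1.5, None]).dtype)    # float64 -> missing value becomes np.nan
print(pd.Series([1, None]).dtype)      # float64 -> int is cast to float to hold np.nan
print(pd.Series([True, None]).dtype)   # object  -> bool falls back to object, keeps None
print(pd.Series([pd.Timestamp("2024-01-01"), None]).dtype)  # datetime64[ns] -> pd.NaT

# The nullable dtypes use pd.NA everywhere:
print(pd.Series([1, None], dtype="Int64")[1])   # <NA>
```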

> 4. For backwards compatibility, existing missing value indicators like `NaN` and
> `NaT` will be interpreted as `pd.NA` when introduced in user input, IO or
> through operations (to ensure it keeps being considered as missing).
> Specifically for floating dtypes, in practice this means a float column can
(Member)

What prevents us from allowing both NaN and NA for floats? I understand that historically the major use case when assigning np.nan has probably been to introduce a missing value, and that should map to pd.NA under this PDEP, but what prevents an operation like 0/0 from still introducing NaN?
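For what it's worth, the two concepts do have well-defined, distinct semantics, which is what makes this question non-trivial; a small sketch (IEEE 754 NaN vs pd.NA's three-valued logic, both observable today):

```python
import numpy as np
import pandas as pd

# NaN is a well-defined *result of a computation* (IEEE 754 "invalid operation"):
with np.errstate(invalid="ignore"):
    print(np.float64(0.0) / np.float64(0.0))   # nan

# pd.NA means "value unknown" and follows Kleene (three-valued) logic:
print(pd.NA | True)      # True   (unknown OR True is True)
print(pd.NA & False)     # False  (unknown AND False is False)
print(pd.NA == pd.NA)    # <NA>   (comparing two unknowns is unknown)

# NaN, by contrast, compares unequal to everything, including itself:
print(np.nan == np.nan)  # False
```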

(Member)

tbh I'd also appreciate it if we could argue about this one a bit longer.

I'm glad that this PR says

> Potentially distinguishing NA and NaN is left for a separate discussion.

because I think it's fairly important.

Surely there's perf overhead to having to check whether values are NaN, as opposed to just preserving validity masks?

I'm all for aiming for nullable dtypes by default, and am hoping this is done properly (as in, giving up on the historical hack of treating NaN as a missing value indicator). Add a few nan_as_null arguments in a few places, and enjoy consistency with Arrow-based tools.
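For reference, this is roughly how Arrow handles that boundary today; pyarrow's from_pandas flag opts in to treating NaN as null at conversion time:

```python
import numpy as np
import pyarrow as pa

data = np.array([1.0, np.nan])

# By default, NaN is kept as an ordinary float value (no nulls):
print(pa.array(data).null_count)                    # 0

# With pandas semantics, NaN becomes a null at the boundary:
print(pa.array(data, from_pandas=True).null_count)  # 1
```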

@WillAyd (Member) commented Jun 13, 2024

We discussed this on the team call yesterday so trying to paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me.

An assignment like ser.iloc[0] = np.nan could mean either "assign a missing value" or "assign the NaN sentinel". The latter is more technically correct, but owing to our history it probably means the former in most cases. So continuing with that assumption in the near term reduces the amount of churn for end users.

Longer term, it would for sure be nice to get to a place where something like that actually assigns NaN and a user would have to use pd.NA exclusively for missing values, but we probably need a release or two of the proposal in its current state before we can disentangle that specifically for the float data type. We might even need to add new methods like is_nan instead of isna for that purpose.

Given those considerations are specific to one data type, we might be better off living with that issue through the execution of this PDEP and leaving it to a subsequent PDEP that clarifies that behavior just for float data types down the road. So we certainly won't be at the ideal state here, but we will at least be moving in the right direction.
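A sketch of the two stages being described; the near-term behaviour matches today's nullable dtypes, while is_nan is purely a hypothetical method name floated above, not an existing pandas API:

```python
import numpy as np
import pandas as pd

# Near term (this PDEP): assigning np.nan into a nullable float column
# is interpreted as assigning a missing value, for backwards compatibility.
ser = pd.Series([1.0, 2.0], dtype="Float64")
ser.iloc[0] = np.nan
print(ser.isna().tolist())   # [True, False]

# Hypothetical longer-term behaviour, if NaN and NA get disentangled:
# ser.iloc[0] = np.nan   # would store an actual IEEE NaN value
# ser.iloc[1] = pd.NA    # would store a missing value
# ser.is_nan()           # [True, False] -- hypothetical, does not exist today
# ser.isna()             # [False, True]
```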

(Contributor)

> We discussed this on the team call yesterday so trying to paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me

Good summary.

> to the current nullable dtypes using `pd.NA` (i.e. regarding behaviour in
> comparisons, see below for details).
>
> 3. As a consequence, pandas will move to nullable extension arrays by default
(Member)

I've lost track of whether we already implemented this or not, but why do we want to do this generically? It seems like this is provided for compatibility with NumPy for the primitive types, but as we expand to non-NumPy types, is it really necessary?

(Member, author)

I don't get the question here; can you try to clarify? What exactly are we doing "generically"? Or what do you understand that we are providing for compatibility with NumPy?

(Member)

The 2D block structure is for compatibility with NumPy, right? So I'm curious what the perceived advantage would be of having a 2D extension array of Arrow strings, lists, etc.

(Member, author)

Ah, to be clear, your comment is not about this first sentence then, only about the last sentence below? ("To preserve the default 2D block structure of the DataFrame internals, the ExtensionArray interface will be extended to support 2D arrays.")

My assumption is that we will only make use of 2D ExtensionArrays for data types where the default dtype now is also 2D (i.e. mostly the numerical ones, maybe the datetimes, but not, for example, the Arrow dtypes).

The reason this is included (although it might sound like an implementation detail, and the PDEP otherwise says to leave out implementation details) is that the choice to go all-in on extension arrays and dtypes (which is the consequence of using "nullable extension arrays by default for all dtypes") would, if we stick to the current 1D EAs, be an implicit choice to drop 2D blocks. That is a change we should not make implicitly, but very explicitly. So this PDEP makes it explicit by mentioning it, while choosing not to change the 2D block capability of pandas in this PDEP.
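To make the 2D-blocks point concrete, a small illustration using pandas' private block manager internals (implementation details that can change between versions):

```python
import pandas as pd

df = pd.DataFrame({f"c{i}": range(3) for i in range(4)})

# NumPy-backed dtypes: the four int64 columns consolidate into one 2D block.
print(len(df._mgr.blocks))             # 1
print(df._mgr.blocks[0].values.shape)  # (4, 3)

# Current 1D ExtensionArrays: one block per column, no consolidation.
df_nullable = df.astype("Int64")
print(len(df_nullable._mgr.blocks))    # 4
```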

(Member)

My understanding from previous discussion was that all EAs would support 2D, which would let us rip out a bunch of complexity in the code. In the case of our ArrowEA, I expect we would back it by a pa.Table instead of a pa.ChunkedArray (but that detail is definitely out of scope for the PDEP). Most other cases are ndarray-backed, so easy to do.

(Member)

What are the 2D operations we are trying to optimize for? Especially since Arrow uses bitmasks and we presumably move towards that approach, handling them in a 2D fashion seems rather challenging.
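For context on the bitmask point, Arrow's validity bitmap is a per-array (1D) buffer; a quick look via pyarrow:

```python
import pyarrow as pa

arr = pa.array([1, None, 3])
print(arr.null_count)        # 1

# buffers() returns [validity_bitmap, data] for a primitive array:
# the validity bitmap is a 1D bitmask attached to this single array;
# Arrow has no native 2D layout for validity.
validity, data = arr.buffers()
print(validity is not None)  # True -- a per-array bitmask buffer
```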

(Contributor)

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Labels: PDEP pandas enhancement proposal, Stale

6 participants