PDEP-16: Consistent missing value handling #58988
Conversation
Thanks for starting this
> Currently, pandas handles missing data differently for different data types. We use different types to indicate that a value is missing: ``np.nan`` for floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
Maybe after PDEP-14 gets accepted we should revise this to say "str" uses np.nan, "string" uses pd.NA. Boolean data types do not directly support missing values, so are often cast to object as a result
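For reference, a quick sketch of the current inconsistency between sentinels being discussed here, using the default NumPy-backed dtypes and the nullable "string" dtype:

```python
import numpy as np
import pandas as pd

# Default float dtype uses NaN as the missing value sentinel
s_float = pd.Series([1.0, None])
assert np.isnan(s_float.iloc[1])

# Datetime-like data uses pd.NaT
s_dt = pd.Series(pd.to_datetime(["2024-01-01", None]))
assert s_dt.iloc[1] is pd.NaT

# The nullable "string" dtype already uses pd.NA
s_str = pd.Series(["a", None], dtype="string")
assert s_str.iloc[1] is pd.NA

# Integers cannot hold NaN, so a missing value forces a cast to float;
# booleans with a missing value fall back to object dtype
assert pd.Series([1, None]).dtype == np.float64
assert pd.Series([True, None]).dtype == object
```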
> data. Some other data types, such as integer and bool, cannot store missing data or are cast to float or object dtype. In addition, pandas 1.0 introduced a new missing value sentinel, ``pd.NA``, which is being used for the experimental nullable integer, float, boolean, and string data types, and more recently also
Nullable integer data types came out in 0.24.0 back in January 2019. I think after all that time and having survived two major releases (soon 3) without any substantial change, we should stop referring to these as experimental
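As context, a minimal illustration of the nullable integer dtype referenced in this comment (capital-"I" `"Int64"`), which keeps integer storage and represents missing values as `pd.NA` via a validity mask instead of casting to float:

```python
import pandas as pd

# The nullable "Int64" dtype (note the capital "I") keeps integer storage
# and represents missing values as pd.NA via a separate validity mask
s = pd.Series([1, 2, None], dtype="Int64")
assert str(s.dtype) == "Int64"
assert s.iloc[2] is pd.NA
assert s.dropna().tolist() == [1, 2]
```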
> 4. For backwards compatibility, existing missing value indicators like `NaN` and `NaT` will be interpreted as `pd.NA` when introduced in user input, IO or through operations (to ensure it keeps being considered as missing). Specifically for floating dtypes, in practice this means a float column can
What prevents us from allowing both NaN and NA for floats? I understand that the major use case historically when assigning `np.nan` has probably been to introduce a missing value, and that should map to `pd.NA` with this PDEP, but what prevents an operation like 0/0 from still introducing NaN?
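To make the question concrete, a small sketch of today's behaviour with the existing nullable `Float64` dtype: `np.nan` in *input* is interpreted as missing, while an operation like 0/0 still produces a genuine float NaN in the result data:

```python
import numpy as np
import pandas as pd

# np.nan provided as input is interpreted as missing (pd.NA)
s = pd.Series([1.0, np.nan], dtype="Float64")
assert s.iloc[1] is pd.NA

# ...whereas 0/0 inside an operation still yields a float NaN in the data
z = pd.Series([0.0, 1.0], dtype="Float64")
res = z / z  # first element is 0/0 -> NaN, second is 1/1 -> 1.0
assert res.iloc[1] == 1.0
```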
tbh I'd also appreciate it if we could argue about this one a bit longer

glad that this PR says

> Potentially distinguishing NA and NaN is left for a separate discussion.

because I think it's fairly important

surely there's perf overhead to having to check whether values are NaN, as opposed to just preserving validity masks?

i'm all for aiming for nullable dtypes by default, and am hoping that this is done properly (as in, giving up on the historical hack of treating NaN as a missing value indicator). add a few `nan_as_null` arguments in a few places, and enjoy consistency with arrow-based tools
We discussed this on the team call yesterday so trying to paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me.

An assignment like `ser.iloc[0] = np.nan` could mean either "assign a missing value" or "assign the NaN sentinel". The latter is more technically correct, but owing to our history this probably means the former in most cases. So continuing with that assumption in the near term reduces the amount of churn for end users.

Longer term it would for sure be nice to get to a place where something like that actually assigns NaN and a user would have to use `pd.NA` exclusively for missing values, but we probably need a release or two of the proposal in its current state before we can disentangle that specifically for the float data type. We might even need to add new methods like `is_nan` instead of `isna` for that purpose.

Given those considerations are specific to one data type, we might be better off living with that issue through the execution of this PDEP and leaving it to a subsequent PDEP that clarifies that behavior just for float data types down the road. So we certainly won't be at the ideal state here, but at least moving in the right direction.
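A sketch of the near-term behaviour described in this comment, under the backwards-compatibility rule (note that `is_nan` mentioned above is a hypothetical future method, not an existing pandas API):

```python
import numpy as np
import pandas as pd

# Under the backwards-compatibility rule, assigning np.nan to a nullable
# series is interpreted as "assign a missing value"
s = pd.Series([1.0, 2.0], dtype="Float64")
s.iloc[0] = np.nan
assert s.iloc[0] is pd.NA
assert s.isna().tolist() == [True, False]
```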
> We discussed this on the team call yesterday so trying to paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me

Good summary
> to the current nullable dtypes using `pd.NA` (i.e. regarding behaviour in comparisons, see below for details).
>
> 3. As a consequence, pandas will move to nullable extension arrays by default
I've lost track of whether we already implemented this or not, but why do we want to do this generically? Seems like this is provided for compatibility with NumPy for the primitive types, but as we expand to non-NumPy types is it really necessary?
I don't get the question here, can you try to clarify? What exactly are we doing "generically"? Or what do you understand that we are providing for compatibility with numpy?
The 2D block structure is for compatibility with NumPy, right? So I'm curious what would be the perceived advantage of having a 2D extension array of Arrow strings, lists, etc...
Ah, to be clear your comment is not about this first sentence then, only about the last sentence below? (" To preserve the default 2D block structure of the DataFrame internals, the ExtensionArray interface will be extended to support 2D arrays.")
My assumption is that we will only make use of 2D ExtensionArrays for data types where the default dtype now is also 2D (i.e. mostly the numerical ones, maybe the datetimes, but so not for example the arrow dtypes).
The reason that this is included (although it might sound like an implementation detail, and the PDEP otherwise says to leave out implementation details) is that the choice to go all-in on extension arrays and dtypes (which is the consequence of using "nullable extension arrays by default for all dtypes") would, if we stick to the current 1d EAs, be an implicit choice to drop 2d blocks. And that is a change we should not make implicitly, but very explicitly. So this PDEP makes it explicit by mentioning it, while choosing not to change the 2d block capability of pandas in this PDEP.
My understanding from previous discussion was that all EAs would support 2D, which would let us rip out a bunch of complexity in the code. In the case of our ArrowEA I expect we would back it by a pa.Table instead of pa.ChunkedArray (but that detail is definitely out of scope for the pdep). Most other cases are ndarray-backed so easy to do.
What are the 2D operations we are trying to optimize for? I think especially since Arrow uses bitmasks and we assumedly move towards that approach, handling that in a 2D fashion seems rather challenging
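For readers following along, a small illustration of the current block layout this sub-thread is about. Note that `_mgr` is private pandas API, used here only to inspect the storage layout:

```python
import pandas as pd

# Two NumPy float columns are consolidated into a single 2D block, while
# the nullable (extension array) column lives in its own 1D array
df = pd.DataFrame({
    "a": [1.0, 2.0],
    "b": [3.0, 4.0],
    "c": pd.array([5, None], dtype="Int64"),
})

# _mgr is private API; this only illustrates the 2D-vs-1D storage split
assert df._mgr.nblocks == 2
two_d_blocks = [blk for blk in df._mgr.blocks if blk.values.ndim == 2]
assert len(two_d_blocks) == 1  # the consolidated float64 block
assert two_d_blocks[0].values.shape == (2, 2)  # 2 columns x 2 rows
```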
Note: this is only a draft, partial proposal. It already lists the main points of the proposal at a high level, but of course we need to add a lot of the details and specifically about backwards compatibility and how we can move towards the end goal (I think those last aspects will actually require the most discussion eventually, the devil is in the details here ;)).
But, I thought to already open this PR to start the discussion on those main high-level points of the proposal (we can already discuss those and see if there is agreement about that), and to have a reference for this NA proposal (since it has come up in a lot of other tangentially related discussions as well), and also to push myself to actually write up those other sections as well ;)
A summary of all (potentially breaking) behaviour changes that I am aware of can be found here: #58243 (comment)
Abstract

This proposal aims to unify the missing value handling across all dtypes, using `pd.NA` exclusively as the user-facing missing value indicator.
This proposal tries to provide a high level way forward, and is not meant to address all implementation details (i.e. all dtypes should follow the above principles, but we don't need to decide in this PDEP for every single dtype how this is exactly achieved)
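As a concrete illustration of the unified semantics this proposal targets, here is how `pd.NA` already behaves with the existing nullable dtypes (propagating through comparisons rather than silently comparing unequal):

```python
import pandas as pd

# pd.NA propagates through comparisons instead of evaluating to False
assert (pd.NA == 1) is pd.NA
assert pd.isna(pd.NA)

# Comparisons on nullable dtypes yield the nullable boolean dtype,
# with NA propagated elementwise
s = pd.Series([1, None], dtype="Int64")
mask = s > 0
assert str(mask.dtype) == "boolean"
assert mask.tolist() == [True, pd.NA]
```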
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage