
PDEP-16: Consistent missing value handling #58988

Draft: jorisvandenbossche wants to merge 2 commits into main
Conversation

@jorisvandenbossche (Member) commented Jun 12, 2024

Note: this is only a draft, partial proposal. It already lists the main points of the proposal at a high level, but we still need to add many of the details, specifically about backwards compatibility and how we can move towards the end goal (I think those last aspects will eventually require the most discussion; the devil is in the details here ;)).

But I wanted to open this PR already to start the discussion on the main high-level points of the proposal (we can already discuss those and see whether there is agreement), to have a reference for this NA proposal (since it has come up in many other tangentially related discussions as well), and to push myself to actually write up the remaining sections ;)

A summary of all (potentially breaking) behaviour changes that I am aware of can be found here: #58243 (comment)

Abstract

This proposal aims to unify the missing value handling across all dtypes:

  • All data types support missing values and use pd.NA exclusively as the user-facing missing value indicator.
  • All data types implement consistent missing value "semantics" corresponding to the current nullable dtypes using pd.NA.

This proposal tries to provide a high-level way forward and is not meant to address all implementation details (i.e. all dtypes should follow the above principles, but we don't need to decide in this PDEP how exactly this is achieved for every single dtype).
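For readers new to the distinction, a minimal sketch of what these pd.NA "semantics" mean in practice (behaviour of current released pandas, comparing the NumPy-based and nullable float dtypes):

```python
import numpy as np
import pandas as pd

# NumPy-based float64: np.nan marks missing values, but comparisons
# against a missing value return plain False.
s_numpy = pd.Series([1.0, np.nan], dtype="float64")
print((s_numpy == 1.0).tolist())     # [True, False]

# Nullable Float64: pd.NA propagates through comparisons instead,
# yielding a nullable boolean result.
s_nullable = pd.Series([1.0, pd.NA], dtype="Float64")
print((s_nullable == 1.0).tolist())  # [True, <NA>]
```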

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@jorisvandenbossche jorisvandenbossche added the PDEP pandas enhancement proposal label Jun 12, 2024
@WillAyd (Member) left a comment

Thanks for starting this

> Currently, pandas handles missing data differently for different data types. We
> use different types to indicate that a value is missing: ``np.nan`` for
> floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
> strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
(Member)

Maybe after PDEP-14 gets accepted we should revise this to say "str" uses np.nan and "string" uses pd.NA. Boolean data types do not directly support missing values, so are often cast to object as a result.
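For illustration, the difference as it exists today (the object-dtype fallback keeps the original sentinel, while the "string" extension dtype already uses pd.NA):

```python
import pandas as pd

# Object-dtype strings keep whatever sentinel was passed in:
s_obj = pd.Series(["a", None])
print(s_obj.dtype, s_obj[1])   # object None

# The "string" extension dtype normalizes missing values to pd.NA:
s_str = pd.Series(["a", None], dtype="string")
print(s_str.dtype, s_str[1])   # string <NA>
```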

> data. Some other data types, such as integer and bool, cannot store missing data
> or are cast to float or object dtype. In addition, pandas 1.0 introduced a new
> missing value sentinel, ``pd.NA``, which is being used for the experimental
> nullable integer, float, boolean, and string data types, and more recently also
(Member)

Nullable integer data types came out in 0.24.0, back in January 2019. I think after all that time, having survived two major releases (soon 3) without any substantial change, we should stop referring to these as experimental.
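To make the inconsistency described in the quoted passage concrete, a quick illustration (default dtypes in current pandas releases):

```python
import pandas as pd

# Each kind of data ends up with a different missing value representation:
print(pd.Series([1.5, None]).dtype)    # float64 -> missing value becomes np.nan
print(pd.Series([1, None]).dtype)      # float64 -> int is cast to float to hold np.nan
print(pd.Series([True, None]).dtype)   # object  -> bool falls back to object, keeps None
print(pd.Series([pd.Timestamp("2024-01-01"), None]).dtype)  # datetime64[ns] -> pd.NaT

# The nullable dtypes use pd.NA everywhere:
print(pd.Series([1, None], dtype="Int64")[1])   # <NA>
```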

> 4. For backwards compatibility, existing missing value indicators like `NaN` and
> `NaT` will be interpreted as `pd.NA` when introduced in user input, IO or
> through operations (to ensure it keeps being considered as missing).
> Specifically for floating dtypes, in practice this means a float column can
(Member)

What prevents us from allowing both NaN and NA for floats? I understand that historically the major use case when assigning np.nan has probably been to introduce a missing value, and that should map to pd.NA under this PDEP, but what prevents an operation like 0/0 from still introducing NaN?
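For what it's worth, the two concepts do have well-defined, distinct semantics, which is what makes this question non-trivial; a small sketch (IEEE 754 NaN vs pd.NA's three-valued logic, both observable today):

```python
import numpy as np
import pandas as pd

# NaN is a well-defined *result of a computation* (IEEE 754 "invalid operation"):
with np.errstate(invalid="ignore"):
    print(np.float64(0.0) / np.float64(0.0))   # nan

# pd.NA means "value unknown" and follows Kleene (three-valued) logic:
print(pd.NA | True)      # True   (unknown OR True is True)
print(pd.NA & False)     # False  (unknown AND False is False)
print(pd.NA == pd.NA)    # <NA>   (comparing two unknowns is unknown)

# NaN, by contrast, compares unequal to everything, including itself:
print(np.nan == np.nan)  # False
```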

(Member)

tbh I'd also appreciate it if we could argue about this one a bit longer.

I'm glad that this PR says

> Potentially distinguishing NA and NaN is left for a separate discussion.

because I think it's fairly important.

Surely there's perf overhead to having to check whether values are NaN, as opposed to just preserving validity masks?

I'm all for aiming for nullable dtypes by default, and am hoping this is done properly (as in, giving up on the historical hack of treating NaN as a missing value indicator). Add a few nan_as_null arguments in a few places, and enjoy consistency with Arrow-based tools.
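For reference, this is roughly how Arrow handles that boundary today; pyarrow's from_pandas flag opts in to treating NaN as null at conversion time:

```python
import numpy as np
import pyarrow as pa

data = np.array([1.0, np.nan])

# By default, NaN is kept as an ordinary float value (no nulls):
print(pa.array(data).null_count)                    # 0

# With pandas semantics, NaN becomes a null at the boundary:
print(pa.array(data, from_pandas=True).null_count)  # 1
```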

@WillAyd (Member) commented Jun 13, 2024

We discussed this on the team call yesterday so trying to paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me.

An assignment like ser.iloc[0] = np.nan could mean either "assign a missing value" or "assign the NaN sentinel". The latter is more technically correct, but owing to our history it probably means the former in most cases. So continuing with that assumption in the near term reduces the amount of churn for end users.

Longer term, it would for sure be nice to get to a place where something like that actually assigns NaN and a user would have to use pd.NA exclusively for missing values, but we probably need a release or two of the proposal in its current state before we can disentangle that specifically for the float data type. We might even need to add new methods like is_nan instead of isna for that purpose.

Given those considerations are specific to one data type, we might be better off living with that issue through the execution of this PDEP and leaving it to a subsequent PDEP that clarifies that behavior just for float data types down the road. So we certainly won't be at the ideal state here, but we will at least be moving in the right direction.
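A sketch of the two stages being described; the near-term behaviour matches today's nullable dtypes, while is_nan is purely a hypothetical method name floated above, not an existing pandas API:

```python
import numpy as np
import pandas as pd

# Near term (this PDEP): assigning np.nan into a nullable float column
# is interpreted as assigning a missing value, for backwards compatibility.
ser = pd.Series([1.0, 2.0], dtype="Float64")
ser.iloc[0] = np.nan
print(ser.isna().tolist())   # [True, False]

# Hypothetical longer-term behaviour, if NaN and NA get disentangled:
# ser.iloc[0] = np.nan   # would store an actual IEEE NaN value
# ser.iloc[1] = pd.NA    # would store a missing value
# ser.is_nan()           # [True, False] -- hypothetical, does not exist today
# ser.isna()             # [False, True]
```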

(Contributor)

> We discussed this on the team call yesterday so trying to paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me

Good summary.

> to the current nullable dtypes using `pd.NA` (i.e. regarding behaviour in
> comparisons, see below for details).
>
> 3. As a consequence, pandas will move to nullable extension arrays by default
(Member)

I've lost track of whether we already implemented this or not, but why do we want to do this generically? It seems like this is provided for compatibility with NumPy for the primitive types, but as we expand to non-NumPy types, is it really necessary?

(Member, author)

I don't get the question here; can you try to clarify? What exactly are we doing "generically"? Or what do you understand that we are providing for compatibility with NumPy?

(Member)

The 2D block structure is for compatibility with NumPy, right? So I'm curious what the perceived advantage would be of having a 2D extension array of Arrow strings, lists, etc.

(Member, author)

Ah, to be clear, your comment is not about this first sentence then, only about the last sentence below? ("To preserve the default 2D block structure of the DataFrame internals, the ExtensionArray interface will be extended to support 2D arrays.")

My assumption is that we will only make use of 2D ExtensionArrays for data types where the default dtype now is also 2D (i.e. mostly the numerical ones, maybe the datetimes, but not, for example, the Arrow dtypes).

The reason this is included (although it might sound like an implementation detail, and the PDEP otherwise says to leave out implementation details) is that the choice to go all-in on extension arrays and dtypes (which is the consequence of using "nullable extension arrays by default for all dtypes") would, if we stick to the current 1D EAs, be an implicit choice to drop 2D blocks. That is a change we should not make implicitly, but very explicitly. So this PDEP makes it explicit by mentioning it, while choosing not to change the 2D block capability of pandas in this PDEP.
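To make the 2D-blocks point concrete, a small illustration using pandas' private block manager internals (implementation details that can change between versions):

```python
import pandas as pd

df = pd.DataFrame({f"c{i}": range(3) for i in range(4)})

# NumPy-backed dtypes: the four int64 columns consolidate into one 2D block.
print(len(df._mgr.blocks))             # 1
print(df._mgr.blocks[0].values.shape)  # (4, 3)

# Current 1D ExtensionArrays: one block per column, no consolidation.
df_nullable = df.astype("Int64")
print(len(df_nullable._mgr.blocks))    # 4
```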

(Member)

My understanding from previous discussion was that all EAs would support 2D, which would let us rip out a bunch of complexity in the code. In the case of our ArrowEA, I expect we would back it by a pa.Table instead of a pa.ChunkedArray (but that detail is definitely out of scope for the PDEP). Most other cases are ndarray-backed, so easy to do.

(Member)

What are the 2D operations we are trying to optimize for? Especially since Arrow uses bitmasks and we presumably move towards that approach, handling them in a 2D fashion seems rather challenging.
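For context on the bitmask point, Arrow's validity bitmap is a per-array (1D) buffer; a quick look via pyarrow:

```python
import pyarrow as pa

arr = pa.array([1, None, 3])
print(arr.null_count)        # 1

# buffers() returns [validity_bitmap, data] for a primitive array:
# the validity bitmap is a 1D bitmask attached to this single array;
# Arrow has no native 2D layout for validity.
validity, data = arr.buffers()
print(validity is not None)  # True -- a per-array bitmask buffer
```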

(Contributor)

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Labels: PDEP pandas enhancement proposal, Stale

6 participants