# PDEP-13: The pandas Logical Type System

- Created: 27 Apr 2024
- Status: Under discussion
- Discussion: [#58141](https://github.com/pandas-dev/pandas/issues/58141)
- Author: [Will Ayd](https://github.com/willayd)
- Revision: 3

## Abstract

This PDEP proposes a logical type system for pandas which will decouple user semantics (i.e. _this column should be an integer_) from the pandas implementation/internals (i.e. _we will use NumPy/pyarrow/X to store this array_). By decoupling these through a logical type system, the expectation is that this PDEP will:

* Abstract underlying library differences from end users
* More clearly define the types of data pandas supports
* Allow pandas developers more flexibility to manage type implementations
* Pave the way for continued adoption of Arrow in the pandas code base

## Background

When pandas was originally built, the data types that it exposed were a subset of the NumPy type system. Starting back in version 0.23.0, pandas introduced Extension types, which it also began to use internally as a way of creating arrays instead of exclusively relying upon NumPy. Over the course of the 1.x releases, pandas began using pyarrow for string storage and in version 1.5.0 introduced the high level ``pd.ArrowDtype`` wrapper.

While these new type systems have brought about many great features, they have surfaced three major problems.

### Problem 1: Inconsistent Type Naming / Behavior

There is no better example of our current type system being problematic than strings. Let's assess the number of string dtype variations a user could request (this is a non-exhaustive list):

```python
dtype=object
dtype=str
dtype="string"
dtype=pd.StringDtype()
dtype=pd.StringDtype("python", na_value=np.nan)
dtype=pd.StringDtype("python", na_value=pd.NA)
dtype=pd.StringDtype("pyarrow")
dtype="string[pyarrow]"
dtype="string[pyarrow_numpy]" # added in 2.1, deprecated in 2.3
dtype=pd.ArrowDtype(pa.string())
dtype=pd.ArrowDtype(pa.large_string())
```

``dtype="string"`` was the first truly new string implementation starting back in pandas 0.23.0, and it is a common pitfall for new users not to understand that there is a huge difference between that and ``dtype=str``. The pyarrow strings have trickled in in more recent releases, but also are very difficult to reason about. The fact that ``dtype="string[pyarrow]"`` is not the same as ``dtype=pd.ArrowDtype(pa.string()`` or ``dtype=pd.ArrowDtype(pa.large_string())`` was a surprise [to the author of this PDEP](https://github.com/pandas-dev/pandas/issues/58321).

While some of these are aliases, the main reason why we have so many different string dtypes is that we have historically used NumPy and built custom missing value solutions around the ``np.nan`` marker, which are incompatible with ``pd.NA``. Our ``pd.StringDtype()`` uses the ``pd.NA`` sentinel, as do our pyarrow-based solutions; bridging these into one unified solution has proven challenging.

To try and smooth over the different missing value semantics and how they affect the underlying type system, the status quo has always been to add another string dtype. With PDEP-14 we now have a "compatibility" string of ``pd.StringDtype("python|pyarrow", na_value=np.nan)`` that makes a best effort to move users towards all the benefits of PyArrow strings (assuming pyarrow is installed) while retaining backwards-compatible missing value handling with ``np.nan`` as the missing value marker. The usage of the ``pd.StringDtype`` in this manner is a good stepping stone towards the goals of this PDEP, although it is stuck in an "in-between" state without other types following suit.

For instance, if a user calls ``Series.value_counts()`` on a Series with ``pd.StringDtype()``, the type of the returned Series can vary wildly, and in non-obvious ways:

```python
>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=pd.NA)).value_counts().dtype
Int64Dtype()
>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=pd.NA)).value_counts().dtype
int64[pyarrow]
>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=np.nan)).value_counts().dtype
dtype('int64')
>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).value_counts().dtype
dtype('int64')
```

It is also worth noting that different methods return different data types. For a pyarrow-backed string with ``pd.NA``, ``Series.value_counts()`` returns an ``int64[pyarrow]`` dtype but ``Series.str.len()`` returns a ``pd.Int64Dtype()``.
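
For illustration (a sketch of the behavior described above; exact reprs depend on the pandas version):

```python
>>> s = pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=pd.NA))
>>> s.value_counts().dtype
int64[pyarrow]
>>> s.str.len().dtype
Int64Dtype()
```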

A logical type system can help us abstract all of these issues. At the end of the day, this PDEP assumes a user wants a string data type. If they call ``Series.value_counts()`` against a Series of that type with missing data, they should get back a Series with an integer data type.

### Problem 2: Inconsistent Constructors

The second problem is that the conventions for constructing types from the various _backends_ are inconsistent. Let's review string aliases used to construct certain types:

| logical type | NumPy   | pandas extension | pyarrow                  |
|--------------|---------|------------------|--------------------------|
| int          | "int64" | "Int64"          | "int64[pyarrow]"         |
| string       | N/A     | "string"         | N/A                      |
| datetime     | N/A     | "datetime64[us]" | "timestamp[us][pyarrow]" |

"string[pyarrow]" is excluded from the above table because it is misleading; while "int64[pyarrow]" definitely gives you a pyarrow backed string, "string[pyarrow]" gives you a pandas extension array which itself then uses pyarrow. Subtleties like this then lead to behavior differences (see [issue 58321](https://github.com/pandas-dev/pandas/issues/58321)).

If you wanted to try and be more explicit about using pyarrow, you could use the ``pd.ArrowDtype`` wrapper. But this unfortunately exposes gaps when trying to use that pattern across all backends:

| logical type | NumPy    | pandas extension | pyarrow                           |
|--------------|----------|------------------|-----------------------------------|
| int          | np.int64 | pd.Int64Dtype()  | pd.ArrowDtype(pa.int64())         |
| string       | N/A      | pd.StringDtype() | pd.ArrowDtype(pa.string())        |
| datetime     | N/A      | ???              | pd.ArrowDtype(pa.timestamp("us")) |

It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()``, but no such type exists (there is a ``pd.DatetimeTZDtype``, which requires a timezone; see [#24662](https://github.com/pandas-dev/pandas/issues/24662)).
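
For example (current behavior; the exact error message may vary by version):

```python
import pandas as pd

pd.DatetimeTZDtype(unit="ns", tz="UTC")  # OK: timezone-aware datetime dtype
pd.DatetimeTZDtype(unit="ns")            # raises TypeError: a timezone is required
```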

### Problem 3: Lack of Clarity on Type Support

The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could use a pandas datetime64 and a pyarrow timestamp interchangeably, but that is not always true. Another example is NumPy fixed-length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).

## Assessing the Current Type System(s)

A best effort at visualizing the current type system(s), covering types that we currently "support" or reasonably may want to support, is shown [in this comment](https://github.com/pandas-dev/pandas/issues/58141#issuecomment-2047763186). Note that this does not include the ["pyarrow_numpy"](https://github.com/pandas-dev/pandas/pull/58451) string data type or the string data type that uses the NumPy 2.0 variable-length string data type (see [comment](https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080798081)), as they are under active discussion.

## Proposal

### Proposed Logical Types

Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas support at least all of the following _logical_ types, excluding any type widths for brevity. These define what users may request; they do not require that every storage backend implement each one (a usage sketch follows the list):

- Signed Integer
- Unsigned Integer
- Floating Point
- Decimal
- Boolean
- Date
- Datetime
- Duration
- CalendarInterval
- Period
- Binary
- String
- Dict
- List
- Struct
- Interval
- Object
- Null

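As a sketch of what requesting one of these logical types could look like (``pd.DateDtype()`` is a spelling proposed later in this PDEP, not current API; today a user must pick a backend explicitly):

```python
import datetime

import pandas as pd
import pyarrow as pa

values = [datetime.date(2024, 4, 27)]

# Today: the backend leaks into the type request
pd.Series(values, dtype=pd.ArrowDtype(pa.date32()))

# Proposed: one logical type; pandas picks the storage
pd.Series(values, dtype=pd.DateDtype())  # hypothetical, not yet implemented
```
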
One of the major problems this PDEP has tried to highlight is the historical tendency of our team to "create more types" to solve existing problems. To minimize the need for that, this PDEP proposes re-using our existing extension types where possible, and only adding new ones where they do not exist.

The existing extension types which will become our "logical types" are:

- pd.StringDtype()
- pd.IntXXDtype()
- pd.UIntXXDtype()
- pd.FloatXXDtype()
- pd.BooleanDtype()
- pd.PeriodDtype(freq)
- pd.IntervalDtype()

To satisfy all of the types highlighted above, this would require the addition of:

- pd.DecimalDtype()
- pd.DateDtype()
- pd.DatetimeDtype(unit, tz)
- pd.DurationDtype()
- pd.CalendarIntervalDtype()
- pd.BinaryDtype()
- pd.DictDtype()
- pd.ListDtype()
- pd.StructDtype()
- pd.ObjectDtype()
- pd.NullDtype()

The storage / backend behind each of these types is left as an implementation detail. The fact that ``pd.StringDtype()`` may be backed by Arrow while ``pd.PeriodDtype()`` continues to be a custom solution is of no concern to the end user. Over time this will allow us to adopt more Arrow behind the scenes without breaking the front end for our users, while _still_ giving us the flexibility to produce data types that Arrow will not implement (e.g. ``pd.ObjectDtype()``).

The methods of each logical type are expected in turn to yield another logical type. This can enable us to smooth over differences between the NumPy and Arrow worlds, while also leveraging the best of both backends. To illustrate, let's look at some methods where the return type today deviates for end users depending on whether they are using NumPy-backed or Arrow-backed data types. The equivalent PDEP-13 logical data type is presented in the last column:

| Method | NumPy-backed result | Arrow Backed result type | PDEP-13 result type |
|--------------------|---------------------|--------------------------|--------------------------------|
| Series.str.len() | np.float64 | pa.int64() | pd.Int64Dtype() |
| Series.str.split() | object              | pa.list_(pa.string())    | pd.ListDtype(pd.StringDtype()) |
| Series.dt.date | object | pa.date32() | pd.DateDtype() |

The ``Series.dt.date`` example is worth an extra look - with a PDEP-13 logical type system in place we would theoretically have the ability to keep our default ``pd.DatetimeDtype()`` backed by our current NumPy-based array but leverage pyarrow for the ``Series.dt.date`` solution, rather than having to implement a DateArray ourselves.
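
A sketch of that idea (``pd.DatetimeDtype`` and ``pd.DateDtype`` are proposed spellings, not current API):

```python
import pandas as pd

# Hypothetical behavior under this PDEP:
s = pd.Series(["2024-04-27"], dtype=pd.DatetimeDtype("ns"))  # proposed logical type
s.dt.date.dtype  # pd.DateDtype(); pandas could back this with pyarrow's date32
                 # even if the parent datetime array is NumPy-backed
```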

To implement this PDEP, we expect all of the logical types to have at least the following metadata:

* storage: Either "numpy" or "pyarrow". Describes the library used to create the data buffer
* physical_type: Can expose the physical type being used, where more than one is possible. As an example, StringDtype could return pa.string_view, BinaryDtype could distinguish pa.binary from pa.large_binary, and DecimalDtype decimal128 from decimal256; for types with a single physical representation this metadata may be omitted
* na_value: Either pd.NA, np.nan, or pd.NaT.
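
For illustration, ``storage`` and ``na_value`` already exist on ``pd.StringDtype`` today, while ``physical_type`` is proposed:

```python
import pandas as pd

dtype = pd.StringDtype("pyarrow")
dtype.storage        # "pyarrow"
dtype.na_value       # pd.NA
dtype.physical_type  # proposed attribute; e.g. pa.string or pa.string_view
```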

While these attributes are exposed as construction arguments to end users, users are highly discouraged from trying to control them directly. Put explicitly, this PDEP allows a user to request a ``pd.XXXDtype(storage="numpy")`` to ask for a NumPy-backed array, if possible. While pandas may respect that during construction, operations against that data make no guarantee that the storage backend will persist, giving pandas the freedom to convert to whichever storage is internally optimal (Arrow will typically be preferred).

### Missing Value Handling

Missing value handling is a tricky area, as developers are split between ``pd.NA`` semantics and ``np.nan`` semantics, and the transition path from one to the other is not always clear. This PDEP does not aim to "solve" that issue per se (for that discussion, please refer to PDEP-16), but aims to provide a go-forward path that strikes a reasonable balance between backwards compatibility and a consistent missing value approach in the future.

This PDEP proposes that the default missing value for logical types is ``pd.NA``. The reasoning is two-fold:

1. We are in many cases re-using extension types as logical types, and those mostly use ``pd.NA`` (the string dtype and datetimes are the exceptions)
2. For new logical types that have nothing to do with NumPy, using np.nan as a missing value marker is an odd fit

However, to help with backwards compatibility for users that heavily rely on the semantics of ``np.nan`` or ``pd.NaT``, an option of ``pd.na_value = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be:

| Logical Type       | Default Missing Value | Legacy Missing Value |
|--------------------|-----------------------|----------------------|
| pd.BooleanDtype()  | pd.NA                 | np.nan               |
| pd.IntXXDtype()    | pd.NA                 | np.nan               |
| pd.FloatXXDtype()  | pd.NA                 | np.nan               |
| pd.StringDtype()   | pd.NA                 | np.nan               |
| pd.DatetimeDtype() | pd.NA                 | pd.NaT               |
| pd.DurationDtype() | pd.NA                 | pd.NaT               |
| pd.PeriodDtype()   | pd.NA                 | pd.NaT               |
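
A sketch of the intended effect (hypothetical spelling; per the review discussion, a global option would likely be routed through ``pd.set_option``):

```python
import pandas as pd

pd.na_value = "legacy"  # hypothetical, not current API

s = pd.Series([1.0, None], dtype=pd.Float64Dtype())
s[1]  # np.nan under "legacy"; pd.NA by default
```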

However, all data types for which there is no legacy NumPy-backed equivalent will continue to use ``pd.NA``, even in "legacy" mode. Legacy mode is provided only for backwards compatibility; going forward, ``pd.NA`` usage is encouraged to give users one exclusive missing value indicator and to better align with the goals of PDEP-16.

### Transitioning from Current Constructors

To maintain a consistent path forward, with the implementation of this PDEP _all_ existing constructors are expected to map to the logical types. This means that providing ``np.int64`` as the data type argument makes no guarantee that you actually get a NumPy-managed storage buffer; pandas reserves the right to optimize as it sees fit and may decide instead to use PyArrow.

The theory behind this is that the majority of users are not expecting anything in particular from NumPy when they say ``dtype=np.int64``. The expectation is that a user just wants _integer_ data; the ``np.int64`` spelling is a legacy of pandas' evolution.
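
Sketching the proposed mapping (hypothetical behavior; today ``dtype=np.int64`` guarantees a NumPy-backed array):

```python
import numpy as np
import pandas as pd

# Hypothetical behavior under this PDEP:
s = pd.Series([1, 2, 3], dtype=np.int64)
s.dtype  # a logical integer type (e.g. pd.Int64Dtype()); the storage backend
         # becomes an implementation detail that pandas is free to choose
```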

This PDEP makes no guarantee that we will stay that way forever; it is certainly reasonable that, in the future, we deprecate and fully stop supporting backend-specific constructors like ``np.int64`` or ``pd.ArrowDtype(pa.int64())``. However, for the execution of this PDEP, such an initiative is not in scope.

## PDEP-13 History

- 27 April 2024: Initial version
- 10 May 2024: First revision
- 01 Aug 2024: Revisions for PDEP-14 and PDEP-16