From 344d58bfdaaed9f11c1697037b3fd3b14f7e7502 Mon Sep 17 00:00:00 2001 From: Will Ayd Date: Sat, 27 Apr 2024 12:36:03 -0400 Subject: [PATCH 1/5] PDEP-13: The pandas Logical Type System --- web/pandas/pdeps/0013-logical-type-system.md | 161 +++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 web/pandas/pdeps/0013-logical-type-system.md diff --git a/web/pandas/pdeps/0013-logical-type-system.md b/web/pandas/pdeps/0013-logical-type-system.md new file mode 100644 index 0000000000000..8c686732dcf8d --- /dev/null +++ b/web/pandas/pdeps/0013-logical-type-system.md @@ -0,0 +1,161 @@ +# PDEP-13: The pandas Logical Type System + +- Created: 27 Apr 2024 +- Status: Draft +- Discussion: [#58141](https://github.com/pandas-dev/pandas/issues/58141) +- Author: [Will Ayd](https://github.com/willayd), +- Revision: 1 + +## Abstract + +This PDEP proposes a logical type system for pandas to abstract underlying library differences from end users, clarify the scope of pandas type support, and give pandas developers more flexibility to manage the implementation of types. + +## Background + +When pandas was originally built, the data types that it exposed were a subset of the NumPy type system. Starting back in version 0.23.0, pandas introduced Extension types, which it also began to use internally as a way of creating arrays instead of exclusively relying upon NumPy. Over the course of the 1.x releases, pandas began using pyarrow for string storage and in version 1.5.0 introduced the high level ``pd.ArrowDtype`` wrapper. + +While these new type systems have brought about many great features, they have surfaced three major problems. The first is that we put the onus on users to understand the differences of the physical type implementations. 
Consider the many ways pandas allows you to create a "string":
+
+```python
+dtype=object
+dtype=str
+dtype="string"
+dtype=pd.StringDtype()
+dtype=pd.StringDtype("pyarrow")
+dtype="string[pyarrow]"
+dtype="string[pyarrow_numpy]"
+dtype=pd.ArrowDtype(pa.string())
+```
+
+Keeping track of all of these iterations and their subtle differences is difficult even for [core maintainers](https://github.com/pandas-dev/pandas/issues/58321).
+
+The second problem is that the conventions for constructing types from a given type backend are inconsistent. Let's review string aliases used to construct certain types:
+
+| logical type | NumPy | pandas extension | pyarrow |
+|--------------|-------|------------------|---------|
+| int | "int64" | "Int64" | "int64[pyarrow]" |
+| string | N/A | "string" | N/A |
+| datetime | N/A | "datetime64[us]" | "timestamp[us][pyarrow]" |
+
+"string[pyarrow]" is excluded from the above table because it is misleading; while "int64[pyarrow]" definitely gives you a pyarrow backed integer, "string[pyarrow]" gives you a pandas extension array which itself then uses pyarrow, which can introduce behavior differences (see [issue 58321](https://github.com/pandas-dev/pandas/issues/58321)).
+
+If you wanted to try and be more explicit about using pyarrow, you could use the ``pd.ArrowDtype`` wrapper. But this unfortunately exposes gaps when trying to use that pattern across all backends:
+
+| logical type | NumPy | pandas extension | pyarrow |
+|--------------|-------|------------------|---------|
+| int | np.int64 | pd.Int64Dtype() | pd.ArrowDtype(pa.int64()) |
+| string | N/A | pd.StringDtype() | pd.ArrowDtype(pa.string()) |
+| datetime | N/A | ??? | pd.ArrowDtype(pa.timestamp("us")) |
+
+It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).
+
+The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example.
It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).
+
+## Assessing the Current Type System(s)
+
+A best effort at visualizing the current type system(s) with types that we currently "support" or reasonably may want to is shown [in this comment](https://github.com/pandas-dev/pandas/issues/58141#issuecomment-2047763186). Note that this does not include the ["pyarrow_numpy"](https://github.com/pandas-dev/pandas/pull/58451) string data type or the string data type that uses the NumPy 2.0 variable length string data type (see [comment](https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080798081)) as they are under active discussion.
+
+## Proposal
+
+Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas supports at least all of the following _logical_ types, excluding any type widths for brevity:
+
+ - Signed Integer
+ - Unsigned Integer
+ - Floating Point
+ - Fixed Point
+ - Boolean
+ - Date
+ - Datetime
+ - Duration
+ - Interval
+ - Period
+ - Binary
+ - String
+ - Dictionary
+ - List
+ - Struct
+
+To ensure we maintain all of the current functionality of our existing type system(s), a base type structure would need to look something like:
+
+```python
+class BaseType:
+
+    @property
+    def dtype_backend(self) -> Literal["pandas", "numpy", "pyarrow"]:
+        """
+        Library is responsible for the array implementation
+        """
+        ...
+
+    @property
+    def physical_type(self):
+        """
+        How does the backend physically implement this logical type? i.e. our
+        logical type may be a "string" and we are using pyarrow underneath -
+        is it a pa.string(), pa.large_string(), pa.string_view() or something else?
+        """
+        ...
+
+    @property
+    def missing_value_marker(self) -> pd.NA|np.nan:
+        """
+        Sentinel used to denote missing values
+        """
+        ...
+```
+
+The theory behind this PDEP is that most users /should not care/ about the physical type that is being used. But if the abstraction our logical type provides is too much, a user could at least inspect and potentially configure which physical type to use.
+
+With regards to how we may expose such types to end users, there are two currently recognized proposals. The first would use factory functions to create the logical type, i.e. something like:
+
+```python
+pd.Series(["foo", "bar", "baz"], dtype=pd.string())  # assumed common case
+pd.Series(["foo", "bar", "baz"], dtype=pd.string(missing_value_marker=np.nan))
+pd.Series(["foo", "bar", "baz"], dtype=pd.string(physical_type=pa.string_view()))
+```
+
+Another approach would be to use classes:
+
+```python
+pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype())
+pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(missing_value_marker=np.nan))
+pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(physical_type=pa.string_view()))
+```
+Note that the class-based approach would reuse existing classes like ``pd.StringDtype`` but change their purpose, whereas the factory function would more explicitly be a new approach. This is an area that requires more discussion amongst the team.
+
+## String Type Arguments
+
+This PDEP proposes that we maintain only a small subset of string arguments that can be used to construct logical types. Those string arguments are:
+
+ - intXX
+ - uintXX
+ - floatXX
+ - string
+ - datetime64[unit]
+ - datetime64[unit, tz]
+
+However, new code should be encouraged to use the logical constructors outlined previously. Particularly for aggregate types, trying to encode all of the information into a string can become unwieldy.
Instead, keyword argument use should be encouraged:
+
+```python
+pd.Series(dtype=pd.list(value_type=pd.string()))
+```
+
+## Bridging Type Systems
+
+An interesting question arises when a user constructs two logical types with differing physical types. If one is backed by NumPy and the other is backed by pyarrow, what should happen?
+
+This PDEP proposes the following backends should be prioritized in the following order (1. is the highest priority):
+
+ 1. Arrow
+ 2. pandas
+ 3. NumPy
+
+One reason for this is that Arrow represents the most efficient and assumedly least-lossy physical representations. An obvious example comes when a pyarrow int64 array with missing data gets added to a NumPy int64 array; casting to the latter would lose data. Another reason is that Arrow represents the fastest growing ecosystem of tooling, and the PDEP author believes improving pandas's interoperability within that landscape is extremely important.
+
+Aside from the backend, the C standard rules for [implicit conversion](https://en.cppreference.com/w/c/language/conversion) should apply to the data buffer, i.e. adding a pyarrow int8 array to a NumPy uint64 array should produce a pyarrow uint64 array.
+
+For more expensive conversions, pandas retains the right to throw warnings or even error out when two of the same logical type with differing physical types are added. For example, attempting to do string concatenation of string arrays backed by pyarrow and Python objects may throw a ``PerformanceWarning``, or maybe even a ``MemoryError`` if such a conversion exhausts the available system resources.
+
+The ``BaseType`` proposed above also has a property for the ``missing_value_marker``. Operations that use two logical types with different missing value markers should raise, as there is no clear way to prioritize between the various sentinels.
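The two resolution rules above - backend priority first, then C-style implicit conversion of the buffer type - can be sketched in plain Python. This is purely an illustrative sketch of the proposal; the function names and the simplified integer-only promotion logic are invented for this example and are not part of any pandas API:

```python
import re

# 1 is the highest priority, per the ordering proposed above
BACKEND_PRIORITY = {"pyarrow": 1, "pandas": 2, "numpy": 3}

def resolve_backend(left: str, right: str) -> str:
    """Pick the result backend for a binary op between two arrays."""
    return min(left, right, key=BACKEND_PRIORITY.__getitem__)

def resolve_buffer_type(left: str, right: str) -> str:
    """Simplified version of C's usual arithmetic conversions for integer
    dtypes: widen to the larger width, and stay unsigned whenever the
    unsigned operand has at least the width of the signed one."""
    def parse(dtype: str) -> tuple:
        m = re.fullmatch(r"(u?)int(\d+)", dtype)
        if m is None:
            raise ValueError(f"unsupported dtype: {dtype}")
        return m.group(1) == "u", int(m.group(2))

    left_unsigned, left_width = parse(left)
    right_unsigned, right_width = parse(right)
    unsigned = (left_unsigned and left_width >= right_width) or (
        right_unsigned and right_width >= left_width
    )
    return f"{'u' if unsigned else ''}int{max(left_width, right_width)}"

# pyarrow int8 + numpy uint64 -> pyarrow uint64, as described above
print(resolve_backend("pyarrow", "numpy"))    # pyarrow
print(resolve_buffer_type("int8", "uint64"))  # uint64
```

Note that this intentionally follows C rather than NumPy promotion: NumPy would promote ``int8 + uint64`` to ``float64``, while the C rules the PDEP references keep the result at ``uint64``.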
+ +## PDEP-11 History + +- 27 April 2024: Initial version From 38381a2da8cd00508a5645a9ad3d3f7db552f1ba Mon Sep 17 00:00:00 2001 From: Will Ayd Date: Sat, 27 Apr 2024 12:38:06 -0400 Subject: [PATCH 2/5] Markdown fix --- web/pandas/pdeps/0013-logical-type-system.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/web/pandas/pdeps/0013-logical-type-system.md b/web/pandas/pdeps/0013-logical-type-system.md index 8c686732dcf8d..17e493d4e3b2b 100644 --- a/web/pandas/pdeps/0013-logical-type-system.md +++ b/web/pandas/pdeps/0013-logical-type-system.md @@ -102,7 +102,7 @@ class BaseType: ... ``` -The theory behind this PDEP is that most users /should not care/ about the physical type that is being used. But if the abstraction our logical type provides is too much, a user could at least inspect and potentially configure which physical type to use. +The theory behind this PDEP is that most users _should not_ care about the physical type that is being used. But if the abstraction our logical type provides is too much, a user could at least inspect and potentially configure which physical type to use. With regards to how we may expose such types to end users, there are two currently recognized proposals. The first would use factory functions to create the logical type, i.e. 
something like: From 3b947657a4c48a653889a8e915984a59c942c8e6 Mon Sep 17 00:00:00 2001
From: Will Ayd
Date: Fri, 10 May 2024 17:22:49 -0400
Subject: [PATCH 3/5] Revision 1

---
 web/pandas/pdeps/0013-logical-type-system.md | 156 ++++++++++++-------
 1 file changed, 96 insertions(+), 60 deletions(-)

diff --git a/web/pandas/pdeps/0013-logical-type-system.md b/web/pandas/pdeps/0013-logical-type-system.md
index 17e493d4e3b2b..d8b6e802f4dc7 100644
--- a/web/pandas/pdeps/0013-logical-type-system.md
+++ b/web/pandas/pdeps/0013-logical-type-system.md
@@ -1,20 +1,29 @@
 # PDEP-13: The pandas Logical Type System
 
 - Created: 27 Apr 2024
-- Status: Draft
+- Status: Under discussion
 - Discussion: [#58141](https://github.com/pandas-dev/pandas/issues/58141)
 - Author: [Will Ayd](https://github.com/willayd),
-- Revision: 1
+- Revision: 2
 
 ## Abstract
 
-This PDEP proposes a logical type system for pandas to abstract underlying library differences from end users, clarify the scope of pandas type support, and give pandas developers more flexibility to manage the implementation of types.
+This PDEP proposes a logical type system for pandas which will decouple user semantics (i.e. _this column should be an integer_) from the pandas implementation/internal (i.e. _we will use NumPy/pyarrow/X to store this array_). By decoupling these through a logical type system, the expectation is that this PDEP will:
+
+ * Abstract underlying library differences from end users
+ * More clearly define the types of data pandas supports
+ * Allow pandas developers more flexibility to manage type implementations
+ * Pave the way for continued adoption of Arrow in the pandas code base
 
 ## Background
 
 When pandas was originally built, the data types that it exposed were a subset of the NumPy type system. Starting back in version 0.23.0, pandas introduced Extension types, which it also began to use internally as a way of creating arrays instead of exclusively relying upon NumPy.
Over the course of the 1.x releases, pandas began using pyarrow for string storage and in version 1.5.0 introduced the high level ``pd.ArrowDtype`` wrapper.
 
-While these new type systems have brought about many great features, they have surfaced three major problems. The first is that we put the onus on users to understand the differences of the physical type implementations. Consider the many ways pandas allows you to create a "string":
+While these new type systems have brought about many great features, they have surfaced three major problems.
+
+### Problem 1: Inconsistent Type Naming / Behavior
+
+There is no better example of our current type system being problematic than strings. Let's assess the number of string iterations a user could create (this is a non-exhaustive list):
 
 ```python
 dtype=object
@@ -25,18 +34,29 @@ dtype=pd.StringDtype("pyarrow")
 dtype="string[pyarrow]"
 dtype="string[pyarrow_numpy]"
 dtype=pd.ArrowDtype(pa.string())
+dtype=pd.ArrowDtype(pa.large_string())
 ```
 
-Keeping track of all of these iterations and their subtle differences is difficult even for [core maintainers](https://github.com/pandas-dev/pandas/issues/58321).
+``dtype="string"`` was the first truly new string implementation starting back in pandas 0.23.0, and it is a common pitfall for new users not to understand that there is a huge difference between that and ``dtype=str``. The pyarrow strings have trickled in in more recent memory, but also are very difficult to reason about. The fact that ``dtype="string[pyarrow]"`` is not the same as ``dtype=pd.ArrowDtype(pa.string())`` or ``dtype=pd.ArrowDtype(pa.large_string())`` was a surprise [to the author of this PDEP](https://github.com/pandas-dev/pandas/issues/58321).
+ +While some of these are aliases, the main reason why we have so many different string dtypes is because we have historically used NumPy and created custom missing value solutions around the ``np.nan`` marker, which are incompatible with the ``pd.NA`` sentinel introduced a few years back. Our ``pd.StringDtype()`` uses the pd.NA sentinel, as do our pyarrow based solutions; bridging these into one unified solution has proven challenging. + +To try and smooth over the different missing value semantics and how they affect the underlying type system, the status quo has been to add another string dtype. ``string[pyarrow_numpy]`` was an attempt to use pyarrow strings but adhere to NumPy nullability semantics, under the assumption that the latter offers maximum backwards compatibility. However, being the exclusive data type that uses pyarrow for storage but NumPy for nullability handling, this data type just adds more inconsistency to how we handle missing data, a problem we have been attempting to solve back since discussions around pandas2. The name ``string[pyarrow_numpy]`` is not descriptive to end users, and unless it is inferred requires users to explicitly ``.astype("string[pyarrow_numpy]")``, again putting a burden on end users to know what ``pyarrow_numpy`` means and to understand the missing value semantics of both systems. -The second problem is that the conventions for constructing types from a given type backend are inconsistent. Let's review string aliases used to construct certain types: +PDEP-14 has been proposed to smooth over that and change our ``pd.StringDtype()`` to be an alias for ``string[pyarrow_numpy]``. This would at least offer some abstraction to end users who just want strings, but on the flip side would be breaking behavior for users that have already opted into ``dtype="string"`` or ``dtype=pd.StringDtype()`` and the related pd.NA missing value marker for the prior 4 years of their existence. 
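The incompatibility between the two sentinels discussed above is easy to demonstrate; a small illustrative snippet, assuming NumPy and pandas are installed:

```python
import numpy as np
import pandas as pd

# np.nan follows IEEE 754 semantics: comparisons with it are simply False
print(np.nan == np.nan)  # False

# pd.NA instead propagates through comparisons
print(pd.NA == pd.NA)    # <NA>
```

Code written around one set of semantics (e.g. ``x == x`` as a missingness check) silently misbehaves under the other, which is why bridging the two has proven so difficult.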
+
+A logical type system can help us abstract all of these issues. At the end of the day, this PDEP assumes a user wants a string data type. If they call ``Series.str.len()`` against a Series of that type with missing data, they should get back a Series with an integer data type.
+
+### Problem 2: Inconsistent Constructors
+
+The second problem is that the conventions for constructing types from the various _backends_ are inconsistent. Let's review string aliases used to construct certain types:
 
 | logical type | NumPy | pandas extension | pyarrow |
 |--------------|-------|------------------|---------|
 | int | "int64" | "Int64" | "int64[pyarrow]" |
 | string | N/A | "string" | N/A |
 | datetime | N/A | "datetime64[us]" | "timestamp[us][pyarrow]" |
 
-"string[pyarrow]" is excluded from the above table because it is misleading; while "int64[pyarrow]" definitely gives you a pyarrow backed integer, "string[pyarrow]" gives you a pandas extension array which itself then uses pyarrow, which can introduce behavior differences (see [issue 58321](https://github.com/pandas-dev/pandas/issues/58321)).
+"string[pyarrow]" is excluded from the above table because it is misleading; while "int64[pyarrow]" definitely gives you a pyarrow backed integer, "string[pyarrow]" gives you a pandas extension array which itself then uses pyarrow. Subtleties like this then lead to behavior differences (see [issue 58321](https://github.com/pandas-dev/pandas/issues/58321)).
 
 If you wanted to try and be more explicit about using pyarrow, you could use the ``pd.ArrowDtype`` wrapper. But this unfortunately exposes gaps when trying to use that pattern across all backends:
 
@@ -47,6 +67,8 @@ If you wanted to try and be more explicit about using pyarrow, you could use the
 It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).
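The constructor gap described above is easy to confirm against a current pandas installation (a quick illustrative check):

```python
import pandas as pd

# A timezone-aware extension dtype exists, and it requires a timezone...
dtype = pd.DatetimeTZDtype(tz="UTC")
print(dtype)  # datetime64[ns, UTC]

# ...but there is no timezone-naive pd.DatetimeDtype counterpart today
print(hasattr(pd, "DatetimeDtype"))  # False
```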
+
+### Problem 3: Lack of Clarity on Type Support
+
 The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).
 
 ## Assessing the Current Type System(s)
 
@@ -55,6 +77,8 @@ A best effort at visualizing the current type system(s) with types that we curre
 
 ## Proposal
 
+### Proposed Logical Types
+
 Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas supports at least all of the following _logical_ types, excluding any type widths for brevity:
 
  - Signed Integer
  - Unsigned Integer
  - Floating Point
  - Fixed Point
  - Boolean
  - Date
  - Datetime
  - Duration
- - Interval
+ - CalendarInterval
  - Period
  - Binary
  - String
- - Dictionary
+ - Map
  - List
  - Struct
+ - Interval
+ - Object
+
+One of the major problems this PDEP has tried to highlight is the historical tendency of our team to "create more types" to solve existing problems. To minimize the need for that, this PDEP proposes re-using our existing extension types where possible, and only adding new ones where they do not exist.
+
+The existing extension types which will become our "logical types" are:
+
+ - pd.StringDtype()
+ - pd.IntXXDtype()
+ - pd.UIntXXDtype()
+ - pd.FloatXXDtype()
+ - pd.BooleanDtype()
+ - pd.PeriodDtype(freq)
+ - pd.IntervalDtype()
+
+To satisfy all of the types highlighted above, this would require the addition of:
+
+ - pd.DecimalDtype()
+ - pd.DateDtype()
+ - pd.DatetimeDtype(unit, tz)
+ - pd.DurationDtype()
+ - pd.CalendarIntervalDtype()
+ - pd.BinaryDtype()
+ - pd.MapDtype()  # or pd.DictDtype()
+ - pd.ListDtype()
+ - pd.StructDtype()
+ - pd.ObjectDtype()
+
+The storage / backend behind each of these types is left as an implementation detail. The fact that ``pd.StringDtype()`` may be backed by Arrow while ``pd.PeriodDtype()`` continues to be a custom solution is of no concern to the end user. Over time this will allow us to adopt more Arrow behind the scenes without breaking the front end for our end users, but _still_ giving us the flexibility to produce data types that Arrow will not implement (e.g. ``pd.ObjectDtype()``).
 
-To ensure we maintain all of the current functionality of our existing type system(s), a base type structure would need to look something like:
+The methods of each logical type are expected in turn to yield another logical type. This can enable us to smooth over differences between the NumPy and Arrow world, while also leveraging the best of both backends. To illustrate, let's look at some methods where the return type today deviates for end users depending on whether they are using NumPy-backed data types or Arrow-backed data types.
The equivalent PDEP-13 logical data type is presented as the last column:
+
+| Method | NumPy-backed result | Arrow Backed result type | PDEP-13 result type |
+|--------------------|---------------------|--------------------------|--------------------------------|
+| Series.str.len() | np.float64 | pa.int64() | pd.Int64Dtype() |
+| Series.str.split() | object | pa.list(pa.string()) | pd.ListDtype(pd.StringDtype()) |
+| Series.dt.date | object | pa.date32() | pd.DateDtype() |
+
+The ``Series.dt.date`` example is worth an extra look - with a PDEP-13 logical type system in place we would theoretically have the ability to keep our default ``pd.DatetimeDtype()`` backed by our current NumPy-based array but leverage pyarrow for the ``Series.dt.date`` solution, rather than having to implement a DateArray ourselves.
+
+While this PDEP proposes reusing existing extension types, it also necessitates extending those types with extra metadata:
 
 ```python
 class BaseType:
 
     @property
-    def dtype_backend(self) -> Literal["pandas", "numpy", "pyarrow"]:
+    def data_manager(self) -> Literal["numpy", "pyarrow"]:
         """
-        Library is responsible for the array implementation
+        Who manages the data buffer - NumPy or pyarrow
         """
         ...
 
     @property
     def physical_type(self):
         """
-        How does the backend physically implement this logical type? i.e. our
-        logical type may be a "string" and we are using pyarrow underneath -
-        is it a pa.string(), pa.large_string(), pa.string_view() or something else?
+        For logical types which may have different implementations, what is the
+        actual implementation? For pyarrow strings this may mean pa.string() versus
+        pa.large_string() versus pa.string_view(); for NumPy this may mean object
+        or their 2.0 string implementation.
         """
         ...
 
     @property
-    def missing_value_marker(self) -> pd.NA|np.nan:
+    def na_marker(self) -> pd.NA|np.nan|pd.NaT:
         """
         Sentinel used to denote missing values
         """
         ...
 ```
 
-The theory behind this PDEP is that most users _should not_ care about the physical type that is being used.
But if the abstraction our logical type provides is too much, a user could at least inspect and potentially configure which physical type to use.
+``na_marker`` is expected to be read-only (see next section). For advanced users that have a particular need for a storage type, they may be able to construct the data type via ``pd.StringDtype(data_manager=np)`` to assert NumPy managed storage. While the PDEP allows constructing in this fashion, operations against that data make no guarantees that they will respect the storage backend and are free to convert to whichever storage the internals of pandas considers optimal (Arrow will typically be preferred).
 
-With regards to how we may expose such types to end users, there are two currently recognized proposals. The first would use factory functions to create the logical type, i.e. something like:
-
-```python
-pd.Series(["foo", "bar", "baz"], dtype=pd.string())  # assumed common case
-pd.Series(["foo", "bar", "baz"], dtype=pd.string(missing_value_marker=np.nan))
-pd.Series(["foo", "bar", "baz"], dtype=pd.string(physical_type=pa.string_view()))
-```
-
-Another approach would be to use classes:
-
-```python
-pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype())
-pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(missing_value_marker=np.nan))
-pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(physical_type=pa.string_view()))
-```
-Note that the class-based approach would reuse existing classes like ``pd.StringDtype`` but change their purpose, whereas the factory function would more explicitly be a new approach. This is an area that requires more discussion amongst the team.
-
-## String Type Arguments
-
-This PDEP proposes that we maintain only a small subset of string arguments that can be used to construct logical types.
Those string arguments are: - - - intXX - - uintXX - - floatXX - - string - - datetime64[unit] - - datetime64[unit, tz] - -However, new code should be encouraged to use the logical constructors outlined previously. Particularly for aggregate types, trying to encode all of the information into a string can become unwieldy. Instead, keyword argument use should be encouraged: - -```python -pd.Series(dtype=pd.list(value_type=pd.string())) -``` +### Missing Value Handling -## Bridging Type Systems +Missing value handling is a tricky area as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear. -An interesting question arises when a user constructs two logical types with differing physical types. If one is backed by NumPy and the other is backed by pyarrow, what should happen? +Because this PDEP proposes reuse of the existing pandas extension type system, the default missing value marker will consistently be ``pd.NA``. However, to help with backwards compatibility for users that heavily rely on the equality semantics of np.nan, an option of ``pd.na_marker = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be: -This PDEP proposes the following backends should be prioritized in the following order (1. is the highest priority): +| Logical Type | Default Missing Value | Legacy Missing Value | +| pd.BooleanDtype() | pd.NA | np.nan | +| pd.IntXXType() | pd.NA | np.nan | +| pd.FloatXXType() | pd.NA | np.nan | +| pd.StringDtype() | pd.NA | np.nan | +| pd.DatetimeType() | pd.NA | pd.NaT | - 1. Arrow - 2. pandas - 3. NumPy +However, all data types for which there is no legacy NumPy-backed equivalent will continue to use ``pd.NA``, even in "legacy" mode. Legacy is provided only for backwards compatibility, but pd.NA usage is encouraged going forward to give users one exclusive missing value indicator. 
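The default/legacy mapping in the table above can be written out as plain data to make the rule explicit. This is an illustrative sketch of the proposed behavior only - ``pd.na_marker`` and the ``na_marker`` helper below do not exist in pandas today:

```python
import numpy as np
import pandas as pd

# Proposed default vs. "legacy" sentinels; the legacy column mirrors what
# today's NumPy-backed dtypes produce. Logical types with no NumPy-backed
# equivalent keep pd.NA even in legacy mode.
NA_MARKERS = {
    "boolean":  {"default": pd.NA, "legacy": np.nan},
    "int":      {"default": pd.NA, "legacy": np.nan},
    "float":    {"default": pd.NA, "legacy": np.nan},
    "string":   {"default": pd.NA, "legacy": np.nan},
    "datetime": {"default": pd.NA, "legacy": pd.NaT},
    "list":     {"default": pd.NA, "legacy": pd.NA},  # no legacy equivalent
}

def na_marker(logical_type: str, mode: str = "default"):
    """Return the sentinel a logical type would use under each mode."""
    return NA_MARKERS[logical_type][mode]
```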
-One reason for this is that Arrow represents the most efficient and assumedly least-lossy physical representations. An obvious example comes when a pyarrow int64 array with missing data gets added to a NumPy int64 array; casting to the latter would lose data. Another reason is that Arrow represents the fastest growing ecosystem of tooling, and the PDEP author believes improving pandas's interoperability within that landscape is extremely important. +### Transitioning from Current Constructors -Aside from the backend, the C standard rules for [implicit conversion](https://en.cppreference.com/w/c/language/conversion) should apply to the data buffer, i.e. adding a pyarrow int8 array to a NumPy uint64 array should produce a pyarrow uint64 array. +To maintain a consistent path forward, _all_ constructors with the implementation of this PDEP are expected to map to the logical types. This means that providing ``np.int64`` as the data type argument makes no guarantee that you actually get a NumPy managed storage buffer; pandas reserves the right to optimize as it sees fit and may decide instead to just pyarrow. -For more expensive conversions, pandas retains the right to throw warnings or even error out when two of the same logical type with differing physical types is added. For example, attempting to do string concatenation of string arrays backed by pyarrow and Python objects may throw a ``PerformanceWarning``, or maybe even a ``MemoryError`` if such a conversion exhausts the available system resources. +The theory behind this is that the majority of users are not expecting anything particular from NumPy to happen when they say ``dtype=np.int64``. The expectation is that a user just wants _integer_ data, and the ``np.int64`` specification owes to the legacy of pandas' evolution. -The ``BaseType`` proposed above also has a property for the ``missing_value_marker``. 
Operations that use two logical types with different missing value markers should raise, as there is no clear way to prioritize between the various sentinels. +This PDEP makes no guarantee that we will stay that way forever; it is certainly reasonable that a few years down the road we deprecate and fully stop support for backend-specifc constructors like ``np.int64`` or ``pd.ArrowDtype(pa.int64())``. However, for the execution of this PDEP, such an initiative is not in scope. ## PDEP-11 History - 27 April 2024: Initial version +- 10 May 2024: First revision From 258178d13ab1597711da7b95dff0c4e993e50d24 Mon Sep 17 00:00:00 2001 From: Will Ayd Date: Mon, 20 May 2024 11:35:49 -0400 Subject: [PATCH 4/5] Added NullDtype --- web/pandas/pdeps/0013-logical-type-system.md | 1 + 1 file changed, 1 insertion(+) diff --git a/web/pandas/pdeps/0013-logical-type-system.md b/web/pandas/pdeps/0013-logical-type-system.md index d8b6e802f4dc7..39611176463c9 100644 --- a/web/pandas/pdeps/0013-logical-type-system.md +++ b/web/pandas/pdeps/0013-logical-type-system.md @@ -98,6 +98,7 @@ Derived from the hierarchical visual in the previous section, this PDEP proposes - Struct - Interval - Object + - Null One of the major problems this PDEP has tried to highlight is the historical tendency of our team to "create more types" to solve existing problems. To minimize the need for that, this PDEP proposes re-using our existing extension types where possible, and only adding new ones where they do not exist. 
From af9da9771ce739c95c7c4c5df922d9f2545ed7c8 Mon Sep 17 00:00:00 2001 From: Will Ayd Date: Thu, 1 Aug 2024 11:47:13 -0400 Subject: [PATCH 5/5] Updates for PDEP-14 and PDEP-16 --- web/pandas/pdeps/0013-logical-type-system.md | 90 ++++++++++---------- 1 file changed, 44 insertions(+), 46 deletions(-) diff --git a/web/pandas/pdeps/0013-logical-type-system.md b/web/pandas/pdeps/0013-logical-type-system.md index 39611176463c9..b9fcd533cf1e7 100644 --- a/web/pandas/pdeps/0013-logical-type-system.md +++ b/web/pandas/pdeps/0013-logical-type-system.md @@ -4,7 +4,7 @@ - Status: Under discussion - Discussion: [#58141](https://github.com/pandas-dev/pandas/issues/58141) - Author: [Will Ayd](https://github.com/willayd), -- Revision: 2 +- Revision: 3 ## Abstract @@ -30,22 +30,37 @@ dtype=object dtype=str dtype="string" dtype=pd.StringDtype() +dtype=pd.StringDtype("python", na_value=np.nan) +dtype=pd.StringDtype("python", na_value=pd.NA) dtype=pd.StringDtype("pyarrow") dtype="string[pyarrow]" -dtype="string[pyarrow_numpy]" +dtype="string[pyarrow_numpy]" # added in 2.1, deprecated in 2.3 dtype=pd.ArrowDtype(pa.string()) dtype=pd.ArrowDtype(pa.large_string()) ``` -``dtype="string"`` was the first truly new string implementation starting back in pandas 0.23.0, and it is a common pitfall for new users not to understand that there is a huge difference between that and ``dtype=str``. The pyarrow strings have trickled in in more recent memory, but also are very difficult to reason about. The fact that ``dtype="string[pyarrow]"`` is not the same as ``dtype=pd.ArrowDtype(pa.string()`` or ``dtype=pd.ArrowDtype(pa.large_string())`` was a surprise [to the author of this PDEP](https://github.com/pandas-dev/pandas/issues/58321). +``dtype="string"`` was the first truly new string implementation starting back in pandas 0.23.0, and it is a common pitfall for new users not to understand that there is a huge difference between that and ``dtype=str``. 
+The pyarrow strings have trickled in over more recent releases, but they are also very difficult to reason about. The fact that ``dtype="string[pyarrow]"`` is not the same as ``dtype=pd.ArrowDtype(pa.string())`` or ``dtype=pd.ArrowDtype(pa.large_string())`` was a surprise [to the author of this PDEP](https://github.com/pandas-dev/pandas/issues/58321).

-While some of these are aliases, the main reason why we have so many different string dtypes is because we have historically used NumPy and created custom missing value solutions around the ``np.nan`` marker, which are incompatible with the ``pd.NA`` sentinel introduced a few years back. Our ``pd.StringDtype()`` uses the pd.NA sentinel, as do our pyarrow based solutions; bridging these into one unified solution has proven challenging.
+While some of these are aliases, the main reason why we have so many different string dtypes is because we have historically used NumPy and created custom missing value solutions around the ``np.nan`` marker, which are incompatible with ``pd.NA``. Our ``pd.StringDtype()`` uses the pd.NA sentinel, as do our pyarrow based solutions; bridging these into one unified solution has proven challenging.

-To try and smooth over the different missing value semantics and how they affect the underlying type system, the status quo has been to add another string dtype. ``string[pyarrow_numpy]`` was an attempt to use pyarrow strings but adhere to NumPy nullability semantics, under the assumption that the latter offers maximum backwards compatibility. However, being the exclusive data type that uses pyarrow for storage but NumPy for nullability handling, this data type just adds more inconsistency to how we handle missing data, a problem we have been attempting to solve back since discussions around pandas2.
-The name ``string[pyarrow_numpy]`` is not descriptive to end users, and unless it is inferred requires users to explicitly ``.astype("string[pyarrow_numpy]")``, again putting a burden on end users to know what ``pyarrow_numpy`` means and to understand the missing value semantics of both systems.

+To try and smooth over the different missing value semantics and how they affect the underlying type system, the status quo has always been to add another string dtype. With PDEP-14 we now have a "compatibility" string of ``pd.StringDtype("python|pyarrow", na_value=np.nan)`` that makes a best effort to move users towards all the benefits of PyArrow strings (assuming pyarrow is installed) while retaining backwards-compatible missing value handling with ``np.nan`` as the missing value marker. The usage of the ``pd.StringDtype`` in this manner is a good stepping stone towards the goals of this PDEP, although it is stuck in an "in-between" state without other types following suit.

-PDEP-14 has been proposed to smooth over that and change our ``pd.StringDtype()`` to be an alias for ``string[pyarrow_numpy]``. This would at least offer some abstraction to end users who just want strings, but on the flip side would be breaking behavior for users that have already opted into ``dtype="string"`` or ``dtype=pd.StringDtype()`` and the related pd.NA missing value marker for the prior 4 years of their existence.
+For instance, if a user calls ``Series.value_counts()`` on the ``pd.StringDtype()``, the type of the returned Series can vary wildly, and in non-obvious ways:

-A logical type system can help us abstract all of these issues. At the end of the day, this PDEP assumes a user wants a string data type. If they call ``Series.str.len()`` against a Series of that type with missing data, they should get back a Series with an integer data type.
+```python
+>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=pd.NA)).value_counts().dtype
+Int64Dtype()
+>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=pd.NA)).value_counts().dtype
+int64[pyarrow]
+>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=np.nan)).value_counts().dtype
+Int64Dtype()
+>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).value_counts().dtype
+dtype('int64')
+```
+
+It is also worth noting that different methods will return different data types. For a pyarrow-backed string with pd.NA, ``Series.value_counts()`` returns an ``int64[pyarrow]`` dtype but ``Series.str.len()`` returns a ``pd.Int64Dtype()``.
+
+A logical type system can help us abstract all of these issues. At the end of the day, this PDEP assumes a user wants a string data type. If they call ``Series.value_counts()`` against a Series of that type with missing data, they should get back a Series with an integer data type.

 ### Problem 2: Inconsistent Constructors

@@ -69,7 +84,7 @@ It would stand to reason in this approach that you could use a ``pd.DatetimeDtyp

 ### Problem 3: Lack of Clarity on Type Support

-The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)).
+The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true.
+Another example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).

 ## Assessing the Current Type System(s)

@@ -84,7 +99,7 @@ Derived from the hierarchical visual in the previous section, this PDEP proposes
   - Signed Integer
   - Unsigned Integer
   - Floating Point
-  - Fixed Point
+  - Decimal
   - Boolean
   - Date
   - Datetime
@@ -93,7 +108,7 @@ Derived from the hierarchical visual in the previous section, this PDEP proposes
   - Period
   - Binary
   - String
-  - Map
+  - Dict
   - List
   - Struct
   - Interval
@@ -120,10 +135,11 @@ To satisfy all of the types highlighted above, this would require the addition o
   - pd.Duration()
   - pd.CalendarInterval()
   - pd.BinaryDtype()
-  - pd.MapDtype() # or pd.DictDtype()
+  - pd.DictDtype()
   - pd.ListDtype()
   - pd.StructDtype()
   - pd.ObjectDtype()
+  - pd.NullDtype()

 The storage / backend to each of these types is left as an implementation detail. The fact that ``pd.StringDtype()`` may be backed by Arrow while ``pd.PeriodDtype()`` continues to be a custom solution is of no concern to the end user. Over time this will allow us to adopt more Arrow behind the scenes without breaking the front end for our end users, but _still_ giving us the flexibility to produce data types that Arrow will not implement (e.g. ``pd.ObjectDtype()``).

@@ -137,43 +153,24 @@ The methods of each logical type are expected in turn to yield another logical t

 The ``Series.dt.date`` example is worth an extra look - with a PDEP-13 logical type system in place we would theoretically have the ability to keep our default ``pd.DatetimeDtype()`` backed by our current NumPy-based array but leverage pyarrow for the ``Series.dt.date`` solution, rather than having to implement a DateArray ourselves.
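As a concrete illustration of the ``Series.dt.date`` gap described above: because pandas has no date logical type today, the accessor degrades to a NumPy object array of ``datetime.date`` scalars. A short sketch of current behavior (assuming pandas 2.x defaults):

```python
import pandas as pd

# With no date logical type, ``Series.dt.date`` falls back to a NumPy object
# array holding ``datetime.date`` scalars; under a logical type system this
# could instead be routed to a pyarrow-backed date array with no API change.
ser = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
print(ser.dt.date.dtype)  # object
```

The user-facing call would not change; only the physical storage behind the returned logical type would.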
-While this PDEP proposes reusing existing extension types, it also necessitates extending those types with extra metadata:
+To implement this PDEP, we expect all of the logical types to have at least the following metadata:

-```python
-class BaseType:
-
-    @property
-    def data_manager -> Literal["numpy", "pyarrow"]:
-        """
-        Who manages the data buffer - NumPy or pyarrow
-        """
-        ...
-
-    @property
-    def physical_type:
-        """
-        For logical types which may have different implementations, what is the
-        actual implementation? For pyarrow strings this may mean pa.string() versus
-        pa.large_string() versus pa.string_view(); for NumPy this may mean object
-        or their 2.0 string implementation.
-        """
-        ...
-
-    @property
-    def na_marker -> pd.NA|np.nan|pd.NaT:
-        """
-        Sentinel used to denote missing values
-        """
-        ...
-```
+  * storage: Either "numpy" or "pyarrow". Describes the library used to create the data buffer
+  * physical_type: Can expose the physical type being used. As an example, StringDtype could return pa.string_view
+  * na_value: Either pd.NA, np.nan, or pd.NaT

-``na_marker`` is expected to be read-only (see next section). For advanced users that have a particular need for a storage type, they may be able to construct the data type via ``pd.StringDtype(data_manager=np)`` to assert NumPy managed storage. While the PDEP allows constructing in this fashion, operations against that data make no guarantees that they will respect the storage backend and are free to convert to whichever storage the internals of pandas considers optimal (Arrow will typically be preferred).
+While these attributes are exposed as construction arguments to end users, users are highly discouraged from trying to control them directly. Put explicitly, this PDEP allows a user to request a ``pd.XXXDtype(storage="numpy")`` to request a NumPy-backed array, if possible.
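The ``storage`` keyword proposed here already has a working precedent: ``pd.StringDtype`` accepts one today. A minimal sketch of the existing argument (note that the generalized ``pd.XXXDtype(storage=...)`` spelling is the proposal, not an existing API):

```python
import pandas as pd

# ``pd.StringDtype`` already exposes a ``storage`` argument, which this PDEP
# would generalize across all logical types.
dtype = pd.StringDtype(storage="python")
print(dtype.storage)  # python
```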
+While pandas may respect that during construction, operations against that data make no guarantees that the storage backend will be persisted through, giving pandas the freedom to convert to whichever storage is internally optimal (Arrow will typically be preferred).

 ### Missing Value Handling

-Missing value handling is a tricky area as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear.
+Missing value handling is a tricky area as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear. This PDEP does not aim to "solve" that issue per se (for that discussion, please refer to PDEP-16), but aims to provide a go-forward path that strikes a reasonable balance between backwards compatibility and a consistent missing value approach in the future.
+
+This PDEP proposes that the default missing value for logical types is ``pd.NA``. The reasoning is two-fold:
+
+ 1. We are in many cases re-using extension types as logical types, which mostly use pd.NA (``StringDtype`` and datetimes are the exceptions)
+ 2. For new logical types that have nothing to do with NumPy, using np.nan as a missing value marker is an odd fit

-Because this PDEP proposes reuse of the existing pandas extension type system, the default missing value marker will consistently be ``pd.NA``. However, to help with backwards compatibility for users that heavily rely on the equality semantics of np.nan, an option of ``pd.na_marker = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be:
+However, to help with backwards compatibility for users that heavily rely on the semantics of ``np.nan`` or ``pd.NaT``, an option of ``pd.na_value = "legacy"`` can be set.
+This would mean that the missing value indicator for logical types would be:

 | Logical Type | Default Missing Value | Legacy Missing Value |
 | pd.BooleanDtype() | pd.NA | np.nan |
@@ -182,17 +179,18 @@ Because this PDEP proposes reuse of the existing pandas extension type system, t
 | pd.StringDtype() | pd.NA | np.nan |
 | pd.DatetimeDtype() | pd.NA | pd.NaT |

-However, all data types for which there is no legacy NumPy-backed equivalent will continue to use ``pd.NA``, even in "legacy" mode. Legacy is provided only for backwards compatibility, but pd.NA usage is encouraged going forward to give users one exclusive missing value indicator.
+However, all data types for which there is no legacy NumPy-backed equivalent will continue to use ``pd.NA``, even in "legacy" mode. Legacy is provided only for backwards compatibility, but ``pd.NA`` usage is encouraged going forward to give users one exclusive missing value indicator and better align with the goals of PDEP-16.

 ### Transitioning from Current Constructors

-To maintain a consistent path forward, _all_ constructors with the implementation of this PDEP are expected to map to the logical types. This means that providing ``np.int64`` as the data type argument makes no guarantee that you actually get a NumPy managed storage buffer; pandas reserves the right to optimize as it sees fit and may decide instead to just pyarrow.
+To maintain a consistent path forward, _all_ constructors with the implementation of this PDEP are expected to map to the logical types. This means that providing ``np.int64`` as the data type argument makes no guarantee that you actually get a NumPy managed storage buffer; pandas reserves the right to optimize as it sees fit and may decide instead to use PyArrow. The theory behind this is that the majority of users are not expecting anything particular from NumPy to happen when they say ``dtype=np.int64``.
+The expectation is that a user just wants _integer_ data, and the ``np.int64`` specification owes to the legacy of pandas' evolution.

-This PDEP makes no guarantee that we will stay that way forever; it is certainly reasonable that a few years down the road we deprecate and fully stop support for backend-specifc constructors like ``np.int64`` or ``pd.ArrowDtype(pa.int64())``. However, for the execution of this PDEP, such an initiative is not in scope.
+This PDEP makes no guarantee that we will stay that way forever; it is certainly reasonable that, in the future, we deprecate and fully stop support for backend-specific constructors like ``np.int64`` or ``pd.ArrowDtype(pa.int64())``. However, for the execution of this PDEP, such an initiative is not in scope.

-## PDEP-11 History
+## PDEP-13 History

 - 27 April 2024: Initial version
 - 10 May 2024: First revision
+- 01 Aug 2024: Revisions for PDEP-14 and PDEP-16