
PDEP-13: The pandas Logical Type System #58455

Open: wants to merge 5 commits into main
Conversation

@WillAyd (Member) commented Apr 27, 2024

@pandas-dev/pandas-core

@WillAyd requested a review from datapythonista as a code owner, April 27, 2024 16:36

## Proposal

Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas support at least all of the following _logical_ types, excluding any type widths for brevity:
Member

IIUC we already support all of these, just not with all three of the storages. Is this paragraph suggesting that we implement all of these for all three (four if we include sparse) storages?

Member Author (@WillAyd, Apr 28, 2024)

Not at all - this is just suggesting that these be the logical types we offer, whether through the factory functions or class-based design that follows.

In today's world it is the responsibility of the user to know what may or may not be offered by the various backends and pick from them accordingly. Dates/datetimes are a good example: a user has to explicitly pick from some combination of pd.ArrowDtype(pa.date32()), pd.ArrowDtype(pa.timestamp(<unit>)), and datetime64[unit], since we don't offer any logical abstraction.

In the model of this PDEP, we would offer something to the effect of pd.date() and pd.datetime(), abstracting away the type implementation details. This would be helpful for us as developers to solve things like #58315.
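
(For illustration, a minimal sketch of what such logical constructors could dispatch to with today's building blocks; the `datetime`/`date` factory names and signatures are hypothetical and not settled by this PDEP.)

```python
# Hypothetical sketch only; names and dispatch rules are not part of the PDEP text.
import numpy as np
import pandas as pd
import pyarrow as pa

def datetime(unit="ns", tz=None):
    """Logical 'datetime': the caller states intent, pandas picks the physical dtype."""
    if tz is not None:
        return pd.DatetimeTZDtype(unit=unit, tz=tz)
    return np.dtype(f"datetime64[{unit}]")

def date():
    """Logical 'date': today only pyarrow offers a true date implementation."""
    return pd.ArrowDtype(pa.date32())

ser = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]), dtype=datetime())
```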

This PDEP proposes the following backends should be prioritized in the following order (1. is the highest priority):

1. Arrow
2. pandas
Member

maybe call this "masked"? calling it "pandas" seems to preference it as being canonical

Member Author

Does pandas extension make more sense? Masked is maybe too specific, since this also refers to things like dictionary-encoded / categorical

...

@property
def physical_type(self):
Member

are there use cases for this other than the pyarrow strings? i.e. is it really needed for the general case?

Member Author

Good question. Other areas where I think it would matter would be:

  • Differentiation between decimal128/decimal256
  • Any parametrized type (e.g. datetime types with different units / timezones, container types with different value types)

Member

makes sense, thanks. is there a reason to include the decimalN case but not (int|float|uint)N? is that distinct from itemsize?

It's still unclear to me why this is needed in the general case. The way it is presented here suggests this is something I access on my dtype object, so in the DatetimeTZDtype case I would check dtype.physical_type.tz instead of dtype.tz? Or I could imagine standardizing the dtype constructors so you would write dtype = pd.datetime(physical_type=something_that_specifies_tz) instead of dtype=pd.datetime(tz=tz). Neither of these seems great, so I feel like I'm missing something.

Member Author

That's a good question about the decimal case. If coming from a database world it's not uncommon to have to pick between different integer sizes, but I think it is rare to pick between different decimal sizes. With decimals you would usually just select your precision, from which we could dispatch to a 128 or 256 bit implementation...but yea, I'm not sure of the best way to approach that.

Doing something like pd.datetime(physical_type=something_that_specifies_tz) is definitely not what this PDEP suggests users should do. This PDEP suggests the 99% use case would be pd.datetime(tz=tz) and a user should not care if we use pandas or pyarrow underneath that.

Using #58315 as an example again, if we used the ser.dt.date accessor on a series with a logical type of pd.datetime(), then the user should expect a pd.date() back. That gives us a lot of flexibility as library authors - maybe we still want pd.datetime() to be backed by our datetime extension by default, but we could then use pyarrow's date implementation to back pd.date() by default rather than having to build that from scratch.

In today's world we are either asking users to explicitly use pa.timestamp() to get back a true date type from ser.dt.date, or they use the traditional pandas extension type and get back dtype=object when they use the accessor
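
(For concreteness, the inconsistency described here is visible today; output comments are approximate and assume pandas 2.x with pyarrow installed.)

```python
import pandas as pd
import pyarrow as pa

ts = pd.to_datetime(["2024-01-01 12:00", "2024-01-02 13:30"])

# NumPy-backed datetimes: .dt.date falls back to object dtype
ser_np = pd.Series(ts)
ser_np.dt.date.dtype                                      # object

# Arrow-backed timestamps: .dt.date returns a true date type
ser_pa = ser_np.astype(pd.ArrowDtype(pa.timestamp("ns")))
ser_pa.dt.date.dtype                                      # date32[day][pyarrow]
```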

Member

> This PDEP suggests the 99% use case would be pd.datetime(tz=tz)

This seems reasonable, but then im not clear on why physical_type is in the base class.

> but could then use pyarrow's date implementation to back pd.date() by default

Doesn't this then give the user mixed-and-matched propagation behavior that we agreed was unwanted?

Member Author

> I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.

Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string(), pa.large_string(), object, etc...The "datetime" type could be physically implemented as a pa.timestamp or the existing pandas extension datetime. Right now a decimal would only be backed by either a pa.decimal128 or pa.decimal256

> If i have a non-pyarrow pd.datetime Series and i do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then i have mixed semantics.

In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp or our own datetime type, and a date can only be implemented as a pa.date32, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type

> BTW im nitpicking a bunch, but im generally positive on the central idea here.

The feedback is very helpful and its good to flesh this conversation out so no worries. Much appreciated

Member

While I certainly understand the idea of a physical type, I am not sure such a simple attribute will fit for all cases. For example for our own masked Int64Dtype, what's the physical type? np.int64? (but that's only for the data array, while there is also a mask array)
And is this physical type then always either a numpy type or pyarrow type?

Member Author (@WillAyd, Apr 28, 2024)

I would consider the physical type in that case to be the pd.core.arrays.integer.Int64Dtype but maybe we should add terminology to this PDEP?

In the scope of this document a logical type has no control over the layout of a data buffer or mask; it simply serves as a conceptual placeholder for the idea of a type. The physical type, by contrast, does have at least some control over how its data buffer and mask are laid out.
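
(Restating that distinction as a purely illustrative sketch; the class, attribute, and default below are mine, not the PDEP's.)

```python
import pyarrow as pa

class LogicalString:
    """Conceptual placeholder for the 'string' logical type; no control over buffers or masks."""

    def __init__(self, physical=pa.large_string()):
        # The physical type does control layout: pa.string(), pa.large_string(),
        # a NumPy object array, a masked extension array, ...
        self._physical = physical

    @property
    def physical_type(self):
        # Read-only: reports which concrete implementation backs this logical type.
        return self._physical
```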

Member

> I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.

> Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string(), pa.large_string(), object, etc...The "datetime" type could be physically implemented as a pa.timestamp or the existing pandas extension datetime. Right now a decimal would only be backed by either a pa.decimal128 or pa.decimal256

The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, pd.DatetimeTZDtype(unit="ns", tz="US/Pacific") would also be helpful.

> If i have a non-pyarrow pd.datetime Series and i do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then i have mixed semantics.

> In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp or our own datetime type, and a date can only be implemented as a pa.date32, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type

You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.

Member Author

The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, pd.DatetimeTZDtype(unit="ns", tz="US/Pacific") would also be helpful.

I think it should be as simple as:

>>> ser = pd.Series(dtype=pd.datetime(unit="ns", tz="US/Pacific"))
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")

Or

>>> ser = pd.Series(dtype=pd.DatetimeDtype(unit="ns", tz="US/Pacific"))
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")

> You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.

I think interoperating between pyarrow / numpy / our own extension types behind the scenes is perfectly fine. It would be problematic to have those implementation details spill out into the end user space though.

When a user says "I want a datetime type" they should just get something that works like a datetime type. When they then say ser.dt.date() they should get back something that works like a date. Having logical types works as a contract with the end user to promise we will deliver that. Whether we use NumPy, pyarrow, or something else underneath to do that is no concern of the end user

...

@property
def missing_value_marker(self) -> pd.NA|np.nan:
Member

in most places we call this "na_value". Is there a reason to use a different name here? is pd.NA|np.nan here just for the example's sake or is that a restriction you want to impose? What about None? Could a 3rd party EA author do something else?

Member Author

No - happy to change the name. I would rather not expand the scope beyond pd.NA or np.nan. I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable. With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.

Member

> I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable.

This is one of the main comments I was going to give: IMO this should not be configurable, even though it is true we live currently in this non-ideal world of multiple missing sentinels.
And it might also make things more complex / dependent on other discussions, but I think for designing those new logical types we should somewhat assume we will in parallel also do a PDEP on converging on a single NA value for the missing sentinel (it will actually make things simpler for the design in this PDEP, it's just a bit of a mess to discuss / implement all that on the short term because of the interdependencies).

Even if on the short term we have logical types that have either NA or NaN or NaT, I don't think that it should be something that the user can specify like in your example below (pd.string(missing_value_marker=np.nan)). I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker). So we can still have this attribute (i.e. the current na_value already does this), but it's just there to query the information, not for the user to set.

> With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.

If we want to move to those new logical types, we will anyway have a period where we have both the current types and the new types. So if we in parallel decide on a NA PDEP, it's also an option that those new logical types just all use NA. Whether that is feasible of course depends on the timeline of both developments (like ideally the string logical dtype for 3.0 would also already have used NA, but it does not because otherwise we could not yet introduce it by default right now)
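
(A tiny sketch of the "query it, don't set it" idea being argued for here; every name below is hypothetical.)

```python
import numpy as np
import pandas as pd

class LogicalStringDtype:
    def __init__(self, storage="pyarrow"):
        self._storage = storage

    @property
    def na_value(self):
        # Read-only: the missing value marker follows from the physical storage/backend;
        # it is not a constructor argument the user picks.
        return np.nan if self._storage == "pyarrow_numpy" else pd.NA
```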

Member Author

> I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker).

I agree with this comment, but we have already implemented this in a way where we disconnect the "backend" from the NA marker with types like "pyarrow_numpy". I agree users ideally won't be touching this field (maybe it can be a read-only property?).

If we removed this altogether I think we would lose the ability to support a transition period for dtypes like pyarrow_numpy and the python_numpy string type you are working on, but I may be missing the bigger picture on that.

I would definitely support another PDEP to unify our missing value handling, but would be wary of trying to tackle two large PDEPs at once. With this field being a part of the logical type proposal, I think it helps us support all of the existing types we have already implemented and give us better flexibility to control / deprecate the various NA markers we allow today

Member

Yes, you're certainly right that allowing both makes a transition easier, as is what we are doing with the string dtype. And depending on the timelines of both changes (logical types / NA), it might indeed be needed for other dtypes as well.

So I think my main point is that we shouldn't show this as a "parameter" of the dtype construction, but just as a property of a certain dtype instance.

Essentially what we did for the string dtype was to call this a different "backend" (although multiple string dtype instances with different null value might both be using pyarrow or numpy, but the "backend" argument maps to the concrete EA Array class that is being used, and those are actually different), so it is tied to a specific implementation, and not user-specified. And although this is very ugly with the different strange backend names, I think this is a more future-proof option if we think longer term the NA value should not be user configurable.

It might be nicer / clearer to show this in the StringDtype repr as a separate property instead of merging it into the backend name, e.g. StringDtype(storage="python", na_value=NA) / StringDtype(storage="python", na_value=NaN) instead of StringDtype(storage="python") / StringDtype(storage="python_numpy") (because what does "python_numpy" mean to a user?).
But at the same time I also wouldn't want to embed too much the idea that na_value is user configurable, it's just that for the transition period we need a way to be able to specify both versions of the dtype.
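
(For reference, the three spellings being contrasted in this exchange look like this in pandas 2.x:)

```python
import pandas as pd

pd.Series(["a", "b"], dtype="string[python]")         # python storage, pd.NA semantics
pd.Series(["a", "b"], dtype="string[pyarrow]")        # pyarrow storage, pd.NA semantics
pd.Series(["a", "b"], dtype="string[pyarrow_numpy]")  # pyarrow storage, np.nan/NumPy semantics
```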

Contributor

I think that na_value is not for users of the pandas library, but for developers of the pandas library. I think it should be read-only, and possibly a "hidden" property (prefix with an underscore)

Member Author

I definitely support both of what @jorisvandenbossche and @Dr-Irv are saying here. I think where I am getting tripped up is how we allow users today to swap over to this paradigm.

If someone had a code base where they do ser = pd.Series(..., dtype="string[pyarrow_numpy]") and elsewhere maybe just ser = pd.Series(..., dtype="string[pyarrow]"), how do we think they can port to this new design? Or do we not think that is valuable?

To ensure we maintain all of the current functionality of our existing type system(s), a base type structure would need to look something like:

```python
class BaseType:
    ...
```
Member

would this then get mixed into ExtensionDtype? if not, what is this used for?

Member Author

I suppose it depends on whether we as a team want to go down the path of offering constructors like pd.string() or pd.StringDtype(); with the former approach this would not mix into the ExtensionDtype, since that would be closer to a physical type. If we went with the latter approach, we would be repurposing / clarifying the scope of the pandas extension type system to be totally logical

Member Author

As I think through it more it may make sense for these to be separate; the existing ExtensionDtype refers to a structure that pandas wants to contain a particular set of values. For extension types it is possible for there not to be a logical type representation in pandas

Maybe this should be renamed to BaseLogicalType to clarify.

Member

Then I'm unclear on what we're doing with it, and whether EA authors are supposed to implement something for it.

Member Author

That's a good point - for third party EAs we would not be able to offer a logical type, since we don't know what the EA is. Let's see what other teammates think

Member Author

> What is the .dtype that users interact with? That's the logical type?

Yea that's where my head is at - essentially they would see / interact with a subclass of the BaseType shown here

> For external extension dtypes, I think it will be very hard to extend this concept of logical types to multiple external libraries, like your example of two independent external libraries implementing the same logical type.

Yea this is very tricky. I don't know what the best answer is but I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things. But this PDEP would need to clarify further how users would use Extension types knowing there is no equivalent logical type for them (or maybe we need to have them register themselves as a logical type?)

Member

> I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things

I still don't fully follow why they are totally different things. I know our own types are messy (with the mixture of numpy, numpy-based extension dtypes and pyarrow extension dtypes), but for the external extension dtypes that I know, those are almost all to be considered as "logical" types (eg geometries, tensors, ..)

Member Author

Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these

Member

> Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these

That's definitely true that in that sense an EA is also a physical type. But at the same time, in the idea that a user generally specifies the logical type, all external EAs are in practice also a logical type.

So I understand that those are different concepts, but I think in practice the external EA serves both purposes, and so what I don't fully see is 1) how it would be possible for external EAs to distinguish those concepts, and 2) what the value would be of doing so.

Member Author

Your point 1) would definitely need to be developed further. Off the top of my head, maybe an extension array would need to define an attribute like logical_type going forward. If the logical type is "foo", we would let that be constructed using pd.foo or pd.FooDtype.

I think the value of 2) is that we could in theory also allow a third party EA to make its way into our logical type system. There is no shortage of string implementations one could imagine; maybe there is some value in having an EA register yet another "string" logical type?
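
(A hypothetical sketch of that registration idea; the `logical_type` attribute does not exist today and everything below is speculative.)

```python
import pandas as pd

class MyStringDtype(pd.api.extensions.ExtensionDtype):
    """Third-party EA dtype advertising which logical type it implements."""

    name = "my_string"
    logical_type = "string"  # hypothetical hook: pd.string() / pd.StringDtype() could resolve to this EA

    @property
    def type(self):
        return str

    @classmethod
    def construct_array_type(cls):
        # A real implementation would return the third-party ExtensionArray class.
        raise NotImplementedError
```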

- Signed Integer
- Unsigned Integer
- Floating Point
- Fixed Point
Member

does this correspond to pyarrow's decimal128?

Member Author

Either decimal128 or decimal256

Member

Of course those names don't have to match exactly with the eventual APIs, but I would personally just call this "Decimal" (which maps to both Python and Arrow)

Member Author

Sounds good to me

Contributor

If you call it "Decimal", is there possible confusion with the python standard library decimal ?

Member Author

Hmm, even if there is I'm not sure that would matter. Do you think it's that different today from the distinction between a Python built-in int versus an "int64"?

Contributor

yes, because the python decimal type lets you specify any number of digits for the precision, as well as rounding.

So I don't know what the correspondence between pyarrow's decimal128 and decimal256 is. Just concerned that if you name this "decimal", there is potential confusion for users.

Member Author

Yea, the arrow types also have a user-configurable precision / scale. The decimal128 variant offers up to 38 digits of precision and decimal256 goes up to 76.

Taking a step back from Python/Arrow I think the problem you are hinting at is not new. Databases like postgres also offer Decimal types with different internal representations. So there might be some translation / edge cases to handle but generally I think calling it Decimal as a logical type signals the intent of the type
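
(For context, both Arrow widths are already reachable through ``pd.ArrowDtype`` and express the same logical "decimal" idea:)

```python
from decimal import Decimal
import pandas as pd
import pyarrow as pa

ser128 = pd.Series([Decimal("1.25")], dtype=pd.ArrowDtype(pa.decimal128(38, 10)))  # up to 38 digits
ser256 = pd.Series([Decimal("1.25")], dtype=pd.ArrowDtype(pa.decimal256(76, 10)))  # up to 76 digits
```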

Member

One can argue the same for "integer", which for Python is of unlimited size, while our ints are bound by the max of int64. In the case of the decimal stdlib module there are a few more differences, but regardless of that, if we would add a decimal / fixed point dtype to pandas in the future, I would personally still call it "decimal" (also because of Python using that name, regardless of some subtle differences).


## String Type Arguments

This PDEP proposes that we maintain only a small subset of string arguments that can be used to construct logical types. Those string arguments are:
Member

off the top of my head, this would mean deprecating M8[unit], m8[unit], timedelta64[unit], period[freq], interval, interval[subtype, closed], category. (the first three of which are supported by np.dtype(...)). Just checking that these are all intentional.

Member Author

I definitely would want to deprecate using the NumPy types since those would be circumventing the logical type system. I am -0.5 on continuing to allow timedelta64 because that also refers directly to the extension array implementation name rather than a logical type.

Period seems logical to continue supporting. I don't think I've ever constructed an interval that way before so I'll withhold an opinion on that for now, but generally these are offered for legacy support and not the ideal way of creating types going forward

Member

> I definitely would want to deprecate using the NumPy types

See also some of the discussion in #58243.

But IMO it's a very hard step to actually deprecate specifying numpy dtypes, given how widespread this usage is. And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.
In any case, if we would deprecate that, it would need to go very slowly, and we should only do that a while after introducing the new pandas types.

Member Author

> And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.

The benefit would be having some consistency in the type constructors we allow, and it would decouple users from thinking of their types as having to be NumPy.

I agree it would be a very long deprecation, but on the flip side I don't think a user really believes they are subscribing to any particular set of features when they ask for an np.int64 data type. They really just want a 64 bit integer.

Especially with integer types being zero copy, there is very little benefit to asking for an np.int64, and I would argue that it is even counterproductive with the possibility of missing values being introduced.

Member

I'm leaning towards thinking deprecating these would cause a lot of churn for users without much clear benefit. To the extent that you want to encourage users to use different patterns, I'm fine with that.

Member Author (@WillAyd, Apr 28, 2024)

Allowing users to construct physical types like np.int64 alongside logical types like pd.int64 I think defeats the purpose of adding logical types though; in such a case we are just adding more type construction options, which is a growing problem in our current state that this PDEP is trying to solve

There also is a clear benefit to pd.int64 over np.int64, namely the former has flexibility to better handle missing values whereas the latter locks you into numpy semantics

Member (@jorisvandenbossche, Apr 29, 2024)

> Allowing users to construct physical types like np.int64 alongside logical types like pd.int64 I think defeats the purpose of adding logical types though;

In my mind, we wouldn't allow them to create a Series with a "physical type", we would just accept np.int64 as input for the dtype argument, and specifying np.int64 or pd.int64() would give the exact same result: a Series with whatever is the default int64 dtype at that point (right now that is the numpy based int64, in the future maybe a numpy masked int64 or pyarrow int64).

It's certainly true that this doesn't help with the "too many ways" to specify a dtype, and this PDEP would add yet another way, but at least a way that is consistent across all dtypes, and which ensures there is a clear mapping of whathever random input to the resulting dtype.

For example, np.int64 is actually not even a dtype but a scalar constructor (np.dtype("int64") is the dtype), but conveniently accepted as a dtype argument in numpy. Just like the python types are also accepted, in addition to strings.
So with the logical types, you can say that (for now) all of np.int64, np.dtype("int64"), "int64", int, "int" all map to the same as pd.int64(), and longer term we could discuss reducing the number of aliases being accepted.

(I didn't do that with this comment .. ;), but personally I think we should maybe better focus first on the actual (logical) dtypes (the new part in this PDEP), and we can still later consider which aliases should be allowed or not. Short term we need to keep them all working anyway)

EDIT: numpy has very extensive documentation about which values are exactly accepted as dtype descriptors/aliases at https://numpy.org/devdocs/reference/arrays.dtypes.html#specifying-and-constructing-data-types, we should probably do something similar for pandas at some point
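
(To ground the alias point, a small check that already holds today for the unambiguous spellings:)

```python
import numpy as np
import pandas as pd

for spec in (np.int64, np.dtype("int64"), "int64"):
    assert pd.Series([1, 2, 3], dtype=spec).dtype == np.dtype("int64")
```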

Member Author

I think that's fair. Providing compat is going to be a fact of life for this. Ultimately I don't think it would be that hard for a user to swap out np.int64 for the logical constructor (in such an instance it would just be find/replace), but yea, I agree that can be something cleaned up over time. It's not critical to the execution of this PDEP.


## Bridging Type Systems

An interesting question arises when a user constructs two logical types with differing physical types. If one is backed by NumPy and the other is backed by pyarrow, what should happen?
Member

"constructs" here seems weird. do you mean when two objects with different storages interact? i.e. ser_pyarrow + ser_numpy? If so, I agree with the ordering below, and note that it can be governed by __pandas_priority__.

Member

I agree with the ordering as well, but I would be supportive of just raising as well

| string | N/A | pd.StringDtype() | pd.ArrowDtype(pa.string()) |
| datetime | N/A | ??? | pd.ArrowDtype(pa.timestamp("us")) |

It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).
Member

xref #24662

- Period
- Binary
- String
- Dictionary
Member

Categorical instead? Because "dictionary" in Arrow terms is rather the physical type, not logical (like the string column could be dict encoded to save space, but still represent strings). While categorical (like pandas does it now) has specific semantics (i.e. fixed set of allowed values)

Member Author

I don't have a strong preference, but Dictionary has the advantage of being more generic. With Categorical as a logical type I still think we'd end up offering a separate Dictionary logical type, which may or may not be a bad thing

Member

What would be the point of a Dictionary logical type in addition to Categorical? When would one use that?

Suppose for the case of having dict encoded string values, either you just want to see those as strings, and then this is the String logical type, or either they define a strict set of categories and then this is the Categorical logical type.

Member Author

I think a logical Dictionary type could offer support for things like Python dictionaries or JSON structures that would seem an odd fit through the lens of a Categorical type

Member

Ah, OK, we are talking about completely different things then ;) I was assuming you were using "dictionary" in the Arrow sense, but so you are using it in the python sense (in Arrow that is called a map type).

I am fine with in the future having a map/dictionary type (I also don't think this list necessarily needs to be exhaustive, we can also focus on those types for which we have somewhat support nowadays).
But then I do think that "Categorical" is missing in the list above (and now using the term as our existing CategoricalDtype)

Member Author

Ah OK makes sense. Yea I agree let's add Categorical separately then

Member

FWIW I also thought this referred to a Categorical-like dtype.

There is a JSONDtype in the extension tests (I'm guessing you implemented it long ago) that IIRC has a ton of xfails. I'd suggest getting that closer to fully-working before committing ourselves to first-class support for it.

Member Author

To be clear I don't think we want to offer a first class logical JSON type, but a user can construct that from a combination of the list / dictionary types

Member

Sidenote, I think in other systems JSON is typically physically stored as a string (so that's a bit different from what the current test JSONDtype does)

@jbrockmendel (Member)

Some big picture thoughts before I call it a day:

  1. Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.

  2. The "users shouldn't care what backend they have" theme really only works when behaviors are identical, which is not the case in most of the relevant cases (xref DEPR: rename 'dtype_backend' #58214). Are there cases other than series.dt.date where this comes up? If not, drop that example to avoid getting bogged down.

  3. It would help to clarify the relationship between what you have in mind and ExtensionDtype.

  4. The physical_type and missing_value_marker keywords/attributes seem more distracting than useful at this stage.

  5. IIUC the main point of the proposal is that pd.int64(family="pyarrow|numpy|masked") is clearer than pd.ArrowDtype(pa.int64()) | pd.Int64Dtype() | np.dtype(np.int64).
    a) The easy part of this would be mapping the pd.int64(...) to an existing dtype.
    b) The hard part would be everything else
    i) does obj.dtype give an "old" dtype or something new? what about series.array? Does this entail another abstraction layer?
    ii) efficient/idiomatic checks for e.g. pd.int64?
    iii) Depending on how this relates to ExtensionDtype, does construct_array_type need to become a non-classmethod?
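
(On the "efficient/idiomatic checks" question in item 5, the closest existing spelling is the ``pd.api.types`` helpers, which already span the three storages:)

```python
import numpy as np
import pandas as pd
import pyarrow as pa
from pandas.api.types import is_integer_dtype

for dtype in (np.dtype("int64"), pd.Int64Dtype(), pd.ArrowDtype(pa.int64())):
    assert is_integer_dtype(dtype)
```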

@WillAyd (Member Author) commented Apr 29, 2024

Cool thanks again for the thorough feedback. Responding to each question...

> 1. Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.

Noted and yea I was thinking this as well as I was drafting this. Let's see what other team members think though - it is useful to clarify how the different "backends" should interact, and we have gotten that as a user question before - see #58312

2. The "users shouldn't care what backend they have" theme really only works when behaviors are identical, which is not the case in most of the relevant cases

This is a point we disagree on. I think users necessarily have to get bogged down in the details of the different backends today because of how prevalent they are in our API, but that is not a value added activity for end users. I'm thinking of a database where I go in and just declare columns as INTEGER or DOUBLE PRECISION. I imagine that databases over time have managed different implementations of those types but I am not aware of one where you as an end user would be responsible for picking that

> 3. It would help to clarify the relationship between what you have in mind and ExtensionDtype.

Definitely - noted for next revision. But at a high level I am of the opinion that logical types have nothing to do with data buffers, null masks, or algorithm implementations. Continuing with the database analogy, a logical query plan would say things like "Aggregate these types" or "Join these types". A physical query plan by contrast would say "use this algorithm for summation" or "use this algorithm for equality". An extension type acts more like the latter case

> 4. The physical_type and missing_value_marker keywords/attributes seem more distracting than useful at this stage.

I certainly dislike these too, but I'm not sure how our current type system could be ported to a logical type system without them. With our many string implementations being the main motivating case, I think this metadata is required for a logical type instance to expose so other internal methods know what they are dealing with. I would definitely at least love for these to be private, but it comes down to how much control we want to give users to pick "string" versus "string[pyarrow]" versus "string[pyarrow_python]" and whatever other iterations they can use today

> 5. IIUC the main point of the proposal is that pd.int64(family="pyarrow|numpy|masked") is clearer than pd.ArrowDtype(pa.int64()) | pd.Int64Dtype() | np.dtype(np.int64)

To be clear I would sincerely hope a user specifying the backend is an exceptional case; that is only allowed because our current type system allows / quasi-expects it. The logical type is a stepping stone to untangle user requirements from implementation details

@jbrockmendel (Member)

> This is a point we disagree on. I think users necessarily have to get bogged down in the details of the different backends today because of how prevalent they are in our API, but that is not a value added activity for end users. I'm thinking of a database where I go in and just declare columns as INTEGER or DOUBLE PRECISION. I imagine that databases over time have managed different implementations of those types but I am not aware of one where you as an end user would be responsible for picking that

In the database example do the different implementations have different behaviors? It strikes me as ~~insane~~ silly to say users don't need to know/think/care about things that affect behavior.

@jorisvandenbossche (Member) commented Apr 29, 2024

> 1. Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.

> Noted and yea I was thinking this as well as I was drafting this. Let's see what other team members think though - it is useful to clarify how the different "backends" should interact, and we have gotten that as a user question before - see #58312

I agree there are a few parts of the Bridging section that are probably not needed and can be trimmed (e.g. the fact about following the C standard rules on casting / conversion), but on the other hand the text will also have to be expanded significantly in other areas. For example, all the questions Brock listed in the last item in his comment above (#58455 (comment)) will need answering. Same for capturing a summary of some of the discussion points we have been talking about (we already generated a lot of comments in two days, we can't expect everyone to read all that to have a good idea of the discussion).

> In the database example do the different implementations have different behaviors? It strikes me as ~~insane~~ silly to say users don't need to know/think/care about things that affect behavior.

Setting the different null handling aside for a moment (because we know that is a big difference right now, but one we might want to do something about), I think that most reports about different behaviour should be considered as bug reports / limitations of the current implementation, and not inherent differences.
Some examples that have been referenced in the discussion (#58321, #58307, #58315, #53154) all seem to fit in that bucket to me.

(of course there will always be some differences, like small differences because of numeric stability in floating point algorithms, or different capitalization of the German ß (apache/arrow#34599), and it will always be a somewhat subjective line between implementation details versus clear and intentional behavioural differences. But the list of items linked above are all clear things that could be fixed on our side to ensure we provide a consistent behaviour, not inherent differences. It's our choice whether we want to provide a consistent interface or not)

pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(missing_value_marker=np.nan)
pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(physical_type=pa.string_view())
```
Note that the class-based approach would reuse existing classes like ``pd.StringDtype`` but change their purpose, whereas the factory function would more explicitly be a new approach. This is an area that requires more discussion amongst the team.
Contributor

Given that this is a dtype, I'd prefer having Dtype in the name. That favors the class based approach. The factory function approach could be used as well, but then I'd prefer pd.stringDtype() over pd.string() to make it clear that it's a dtype.

Member Author

Cool - I think @jorisvandenbossche has the same preference. So happy to start focusing on that pattern more.

The only concern I have is that we are repurposing something that has long existed. pd.Int64Dtype() gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing

Member

I don't have a strong preference, but do think it will make the discussion easier to focus on just one variant.

Member Author

Alright I'll stick with that in the next revision. We can always revisit if that doesn't work out for whatever reason. Thanks for the feedback all

Member

To be honest, my personal preference would be to go with the shorter functional constructors (pd.string(), pd.int64(), etc).
It's easier to use, less boilerplate, generally more similar to other libraries (this was my impression, but looking up a few examples, that is not necessarily true), and ties less into the exact Dtype class (in case two variants of one logical type want to use a different dtype class, although I am not sure this is necessarily something we want, though)

But agreed that for now, I think you can just use one throughout the document to explain everything, and then I would add one section on this topic mentioning some options and why which choice was made (and we can still further bikeshed on the exact syntax, but that feels like an actual bikeshed separate from the core ideas of the PDEP)

(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like pd.dtypes.string(). I seem to remember this also came up when I discussed topic in Basel last year)

> The only concern I have is that we are repurposing something that has long existed. pd.Int64Dtype() gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing

Well, we still need to discuss what those logical type constructors would actually return ;) In my mind, it could continue to return our extension types (as I wouldn't necessarily distinguish user-facing extension types and logical types), and then pd.Int64Dtype() would not necessarily have to change its return value (except that it might gain an additional keyword to e.g. get the pyarrow-backed integer dtype).

Contributor

(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like pd.dtypes.string(). I seem to remember this also came up when I discussed topic in Basel last year)

That works for me. I'm not a big fan of continually adding things to the top level pandas namespace.


The ``BaseType`` proposed above also has a property for the ``missing_value_marker``. Operations that use two logical types with different missing value markers should raise, as there is no clear way to prioritize between the various sentinels.

## PDEP-11 History
Contributor

Suggested change: "## PDEP-11 History" -> "## PDEP-13 History"

```python
pd.Series(dtype=pd.list(value_type=pd.string()))
```

## Bridging Type Systems
Contributor

Do you need to address whether people can convert between two physical types that back the same logical type?

Below, you discuss how operations between two objects with different physical types but the same logical type prioritize the result. But what if users want to convert from one physical type to another?

Member Author

Great callout - I assume you have in mind something like adding a logical int + float.

I think the same rules should still apply, i.e. if one is physically a pyarrow int and the other a numpy float, you should end up with a logical float type backed by a physical pyarrow float.
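
(Roughly how that mixing plays out with today's building blocks; the exact result dtype can vary by pandas version:)

```python
import numpy as np
import pandas as pd
import pyarrow as pa

ints_arrow = pd.Series([1, 2, 3], dtype=pd.ArrowDtype(pa.int64()))
floats_numpy = pd.Series([0.5, 1.5, 2.5], dtype=np.float64)

result = ints_arrow + floats_numpy
result.dtype  # double[pyarrow]: the Arrow-backed operand wins, matching the priority proposed here
```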

Contributor

Or if I have a numpy backed string and I want to convert to a pyarrow backed string.

Member Author

Those would both be under the logical string type, so if you tried to concatenate them together you would get a pyarrow string per the rules outlined here. Does that make sense?

Contributor

No, I'm saying that let's say you know that you have a Series of logical type string, and you also know that it is backed by numpy string, and you want to convert that Series to have a physical type of arrow. Don't you need a public API that lets you convert from one physical backing to another?
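
(For context, the closest existing public spelling of such a conversion is ``astype`` between the storage-specific aliases:)

```python
import pandas as pd

ser = pd.Series(["a", "b"], dtype="string[python]")
ser_arrow = ser.astype("string[pyarrow]")  # same logical string data, different physical backing
```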

Member Author

> Second is that you want an Index with timestamp[pyarrow] dtype to be (or just behave like) a DatetimeIndex.

Yea if we move to a place where the DatetimeIndex is backed by the logical datetime type, then it becomes an implementation detail for us to control how the two different physical types need to fulfill the requirements of a logical datetime type. I think since the pyarrow / NumPy datetime types are zero-copy save for NaT handling the internal development for that should be a lot less burdensome than having a separate DatetimeIndex[pyarrow]

Member Author (@WillAyd, Apr 29, 2024)

> So you would want to implement a new date dtype with numpy semantics?

No - there would be no need at all for that. In this particular case, if the logical requirement was to go from a datetime -> date, and physically our array happened to be backed by a NumPy array, we internally would go NumPy datetime -> pa.timestamp -> pa.date, with the pa.date being exposed as a pd.date type; if the array was already backed by pyarrow we could skip the first step

> This can get tricky if/when PDEP-10 is reverted.

I think even if PDEP-10 got reverted it would limit the logical types we offer (we'd essentially scrap date, list, dictionary, struct, etc...anything where only pyarrow implements something) until pyarrow becomes a hard dependency again
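
(The conversion path described above, spelled with today's public API as a rough sketch; assumes pandas 2.x with pyarrow installed:)

```python
import pandas as pd

ser_np = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))  # NumPy datetime64[ns]
ser_pa = ser_np.convert_dtypes(dtype_backend="pyarrow")           # -> timestamp[ns][pyarrow]
dates = ser_pa.dt.date                                            # -> date32[day][pyarrow]
```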

Member

> No - there would be no need at all for that [...]

We're definitely talking past each other, because AFAICT the suggestion here directly contradicts discussion in #58315. Let's put a pin in this until the call on Wednesday and see if we can sort it out then.

Member

Summarizing this part of the discussion from the call: the communication failure was in 1) what "mixed semantics" I was concerned about and 2) what Will meant by not needing to implement anything new.

Consider

dti = pd.date_range("2016-01-01", periods=3)
ser = pd.Series(dti)
ser.iloc[1] = pd.NaT
df = ser.to_frame("datetime")
df["date"] = df["datetime"].dt.date

Now suppose that df["date"] is backed (perhaps indirectly) by a pyarrow array. When I say the columns have "mixed semantics", I am referring to the propagation of nulls in comparisons

mask1 = df["date"] == dti[0].date()
mask2 = df["date"] != dti[0].date()

In this scenario mask1 and mask2 are both nullable and their second entry is pd.NA. Using these masks for indexing gives surprising results (raising ATM, different if the Ice Cream Agreement goes into effect) to users expecting numpy semantics. A user who has not opted in to null propagation has some columns that behave this way and some that do not. Avoiding that scenario is why we implemented "string[pyarrow_numpy]" to have arrow storage with numpy semantics.

OK, all of that is what I had failed to communicate. What Will meant when he said (paraphrasing) "we wouldn't need to implement another dtype" is that this doesn't exclude a thin wrapper to retain numpy propagation semantics with pyarrow storage.

@WillAyd can you confirm that the above is a valid summary of what we got sorted out at the end of the call?

Member Author (@WillAyd, May 2, 2024)

Yea that's correct and a nice summary. This is why the BaseType in the PDEP has the missing_value_marker / na_value - it explicitly decouples the NA handling from the physical type. It specifically allows the logical type to cover the functionality of string[pyarrow_numpy] as you described above, and as we talked about on the call theoretically also allows you to have a logical date type which uses pyarrow behind the scenes but can present pd.NaT as the na_value to the end user

I think this PDEP could help achieve more consistent NA handling over time since it decouples the user expectations from a "backend", but actually propagating one NA marker consistently is left to a separate PDEP.

@mroeschke added the PDEP (pandas enhancement proposal) label, Apr 29, 2024

It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).

The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).
Member

Suggested change
The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).
The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though pandas claims no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).


``na_marker`` is expected to be read-only (see next section). Advanced users that have a particular need for a storage type may be able to construct the data type via ``pd.StringDtype(data_manager=np)`` to assert NumPy-managed storage. While the PDEP allows constructing in this fashion, operations against that data make no guarantee that they will respect the storage backend and are free to convert to whichever storage the internals of pandas consider optimal (Arrow will typically be preferred).

### Missing Value Handling
Member Author

@jbrockmendel wanted to ping this off you. When we talked I said I would stick with np.nan as the missing value marker, but since we are reusing the pd.StringDtype() in this version of the proposal I don't think we should move that away from pd.NA

It's not the default, but maybe this pd.na_marker="legacy" serves as a good middle ground?

Member

if there's a global option, it should go through pd.set_option/get_option

Member

why data_manager instead of storage?

Member Author

The only reason I wanted to include the word data is because we are strictly referring to the data buffer and nothing else. I don't have a strong preference on the terminology of this though

| Logical Type | Default Missing Value | "Legacy" Missing Value |
|----------------|-----------------------|------------------------|
| pd.IntXXType() | pd.NA | np.nan |
| pd.FloatXXType() | pd.NA | np.nan |
| pd.StringDtype() | pd.NA | np.nan |
| pd.DatetimeType() | pd.NA | pd.NaT |
Member

Period, timedelta64

Member Author

Sure I can make this a fully exhaustive list. For the Legacy missing value on BinaryDtype, ListDtype, StructDtype and MapDtype do you think np.nan or pd.NA makes more sense?


Missing value handling is a tricky area as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear.

Because this PDEP proposes reuse of the existing pandas extension type system, the default missing value marker will consistently be ``pd.NA``. However, to help with backwards compatibility for users that heavily rely on the equality semantics of np.nan, an option of ``pd.na_marker = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be:
Member

i feel like you're not engaging with what joris and i tried to explain about "you need to do it all at once" on Wednesday's call.

Member Author

Sorry missed this before - what do you think is missing in this section?

Member

This section dramatically expands the scope of this PDEP, to encompass a lot of what would be PDEP-NA. If you want to make PDEP-NA, great, go for it. Until then, this one should either a) be tightly scoped so as to not depend on PDEP-NA, or b) be explicitly "this is on the back-burner until after PDEP-NA". I prefer the first option, but you do you.

Member Author

Is your objection just to the default of pd.NA or to the more general approach of allowing na_marker to allow different nullability semantics while still using the same logical type? I am indifferent if we felt the need to start with the "legacy" approach of mixed sentinels, but I think the latter na_marker is a critical piece to execute this PDEP

Member

Is your objection just to the default of pd.NA

Yes, that is a deal-breaker for me until we have a dedicated PDEP about it.

Member Author

Do you think the na_marker="legacy" makes sense? Curious to know if you think the NA marker for logical types that have no NumPy equivalent (ex: ListDtype) should still be np.nan or pd.NA

Member

Bikeshedding naming aside, I'd be fine with that as long as it is the default until we do an all-at-once switchover following PDEP-NA.

@bionicles

bionicles commented Jun 11, 2024

I recognize this is more about the underlying primitive field types than collection types; however, I just feel like the collections are arguably also primitive, and this issue has bugged me forever:

To avoid needing to store critical reminders about array axis names and bounds in comments instead of the type hints, how could we include array shape hints (int bounds), and optional array axis names (str) or ideally axis types (convenient newtype int bounds, great for language server to catch shape bugs)?

I've found it quite helpful to separate the axis names from the bounds, but in the case of typed axes, the axis name lives in the type of the bound, so the type checker can also do broadcast alignment. It's a bit tricky in some languages but probably a breeze in Python, and the only reason we're not doing this is that we don't know what we're missing.

IMHO we want the n-dimensional array shapes to be heterogeneous lists of axis-bound pointer newtypes, because then the type system can better calculate the output shapes of our n-dimensional array operations. As the world moves rapidly into higher-dimensional data engineering, it just seems reasonable to ensure pandas has great n-dimensional array shape types, that is all.

@WillAyd
Member Author

WillAyd commented Jun 11, 2024

Hey @bionicles thanks for providing some feedback. I don't quite follow what you are looking for though - is there maybe a small example you can provide to clarify?

@bionicles

bionicles commented Jun 12, 2024

Absolutely, great idea, here is some code

A: demonstrates how array shapes are defined by the data passed in, and bugs can be unclear, and the axes have no names
[screenshot omitted]

Then we could hack out a basic data structure to hold the names and bounds for the axes in the shapes of the arrays.

from typing import Tuple, Optional
from dataclasses import dataclass
import pytest

@dataclass
class Axis:
    "one step along the path of indices to an element within a tensor"
    name: str = "B"
    bound: Optional[int] = 16

    def __post_init__(self):
        if self.bound is not None and self.bound < 0:
            raise ValueError("Axis size must be non-negative")

def test_axis_nonneg():
    with pytest.raises(ValueError):
        Axis(bound=-1)

# We want to write the shapes of the arrays in the type system
Shape = Tuple[Axis, ...]

# Example:
# Rows = Axis("R")
# Columns = Axis("C") # oops, this is a value
# MatrixShape = Tuple[Rows, Columns] # TypeError: Tuple[t0, t1, ...]: each t must be a type. Got Axis(name='R', bound=16).

rows = Axis("R")
columns = Axis("C")
matrix_shape = (rows, columns)

def str_from_shape(shape: Shape) -> str:
    if not shape:
        return "()"
    names = []
    bounds = []
    for a in shape:
        names.append(a.name)
        bounds.append(a.bound)
    return f"Shape({names=}, {bounds=})"

def shape_from_characters_and_bounds(characters: str, bounds: Tuple[Optional[int], ...]) -> Shape:
    "build a shape quickly"
    characters = tuple(characters)
    characters_and_bounds = zip(characters, bounds)
    return tuple(Axis(name=c, bound=b) for c, b in characters_and_bounds)

def shape_from_shapes(a: Shape, b: Shape) -> Shape:
    c = {}
    for axis_in_a in a:
        c[axis_in_a.name] = axis_in_a.bound
    for axis_in_b in b:
        # remember the linear algebra and numpy broadcasting rules
        # if either is 1 or none, any value is OK, take largest
        # however if they are both scalars greater than 1, (m, n)
        # then m == n must hold for compatibility
        if axis_in_b.name not in c:
            c[axis_in_b.name] = axis_in_b.bound
        else:
            lhs_bound = c[axis_in_b.name]
            rhs_bound = axis_in_b.bound
            if (
                (lhs_bound is not None and lhs_bound > 1) and
                (rhs_bound is not None and rhs_bound > 1) and
                (lhs_bound != rhs_bound)
            ):
                message = f"Axis '{axis_in_b.name}' Bounds ({lhs_bound}, {rhs_bound}) Incompatible: \n- lhs={str_from_shape(a)}\n- rhs={str_from_shape(b)}\n"
                raise ValueError(message)
            else:
                if lhs_bound is None and rhs_bound is None:
                    c[axis_in_b.name] = None
                elif lhs_bound is None and rhs_bound is not None:
                    c[axis_in_b.name] = rhs_bound
                elif lhs_bound is not None and rhs_bound is None:
                    c[axis_in_b.name] = lhs_bound
                else:
                    # both are scalars, take largest
                    c[axis_in_b.name] = max(lhs_bound, rhs_bound)
    c_shape = tuple(Axis(name=k, bound=v) for k, v in c.items())
    return c_shape

def compatibility_from_shapes(a: Shape, b: Shape) -> bool:
    try:
        _c = shape_from_shapes(a, b)
        return True
    except ValueError:
        return False

B: demonstrates how clarifying the shape of the arrays can make similar bugs easier to understand
[screenshot omitted]

https://colab.research.google.com/drive/1OVkRtwv747jXPI11kfKCGGFHWWtRs3ki?usp=sharing

You could also transpose by axis names instead of their order, which is way more flexible because you could swap axes into desired positions by names regardless of their current order
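A rough sketch of what name-based transposition could look like with plain numpy arrays (transpose_by_name is a hypothetical helper, not an existing numpy or pandas API; the axis names are supplied by the caller):

import numpy as np


def transpose_by_name(arr, current, desired):
    # Reorder arr's axes so they appear in the order given by the desired names.
    order = tuple(current.index(name) for name in desired)
    return arr.transpose(order)


images = np.zeros((16, 28, 28, 1))  # axes: ("B", "H", "W", "C")
chw = transpose_by_name(images, ("B", "H", "W", "C"), ("B", "C", "H", "W"))
assert chw.shape == (16, 1, 28, 28)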

This is along the lines of what I'm looking for; however, it requires features of newer versions of Python and would screw up pandas backwards compatibility
https://taoa.io/posts/Shape-typing-numpy-with-pyright-and-variadic-generics

Anyway, then we could have a column that's type hinted to be a specific shape of array, with axes labeled, i bet it looks really cool

"Array[u8, (B, H, W, C), (16, 28, 28, 1)]" is wayyy more clear than "array[int]" and that clarity will help type checkers help us avoid / understand / fix shape bugs which are prevalent in python AI code

@Dr-Irv
Contributor

Dr-Irv commented Jun 12, 2024

"Array[u8, (B, H, W, C), (16, 28, 28, 1)]" is wayyy more clear than "array[int]" and that clarity will help type checkers help us avoid / understand / fix shape bugs which are prevalent in python AI code

From a static type checking perspective, this is really difficult (and maybe impossible) to do, even to track the dtypes within a DataFrame. In pandas-stubs, we do a best effort to track the dtype of a Series in a static way, but a DataFrame can consist of multiple Series, of very different dtype. So you'd need support from the python typing system to allow a variadic type, and even then, as computations are done on a DataFrame, it's hard to track the resulting changes in dtype that can occur.

You're asking for doing something similar with tracking the shape of a DataFrame. I don't think that would be possible. Consider a merge operation that merged 2 DataFrames using an inner join. There's no way to know, from a static typing perspective, the resulting number of rows that come from the merge, because the number of rows is dynamic - it depends on the underlying data.

So I don't think that what you are asking for is possible from a static typing perspective.

@WillAyd
Member Author

WillAyd commented Jun 12, 2024

Outside of @Dr-Irv comments I also think what you are asking for is not quite in the scope of this PDEP. I would suggest opening a separate issue for discussion about supporting that though

@jbrockmendel
Member

jbrockmendel commented Aug 7, 2024 via email

@rhshadrach
Member

rhshadrach commented Aug 8, 2024

users don't have to know/care about which backend they have and 2) we won't put effort into making behaviors match across backends. Choose one or the other.

These aren't mutually exclusive options, ...

If users don't know which backend they have, and the backends differ in behavior in albeit minor aspects, users will see these differences. Differences in behavior on the same input will be reported as bugs. 0.001% of 10 million PyPI downloads per day is 100.

I think the only viable way forward is to treat these as mutually exclusive.

@WillAyd
Member Author

WillAyd commented Aug 8, 2024

I think the only viable way forward is to treat these as mutually exclusive.

How would you see us moving back to exclusive backends? Wouldn't that mean that the string data type would have to go back to being an object data type in the "numpy" system, or did you have something else in mind?

@rhshadrach
Member

How would you see us moving back to exclusive backends?

Not sure what is meant by "exclusive backends". Do you mean where the user knows what backend they are using?

Wouldn't that mean that the string data type would have to go back to being an object data type in the "numpy" system, or did you have something else in mind?

Not if we embrace the new string dtype.

@bionicles

For ML/probability, "float" and "int" are underdetermined in the sense that they're not clear on how big those need to be. Making it easier to type hint data types and harder to accidentally screw up numeric precisions could be a solid goal for this PDEP. By assigning the various precisions of the same category of data type (e.g. "float", "int", "uint") to a shared logical base, we could make it possible to type hint for a category.

One struggle I have is type hinting datatype arguments, as sometimes we need to configure the data types more dynamically, and Python generics are not quite there yet.

For example, I pass two kinds of float dtype, one being precise to hold big numbers and another being imprecise to save memory

import numpy as np
import ml_dtypes

DEFAULT_IMPRECISE_FLOAT = ml_dtypes.bfloat16
DEFAULT_PRECISE_FLOAT = np.float32

def eval_from_input_and_model(
    input: Input,
    model: Model,
    imprecise_float_dtype: np.floating = DEFAULT_IMPRECISE_FLOAT,
    precise_float_dtype: np.floating = DEFAULT_PRECISE_FLOAT,
    # ^^^^^^^ logical dtypes can be precision-agnostic
) -> Eval:
    ...

This causes type checker false positives with the dtype type hints because np.floating is np.floating[unknown] while np.float32 is np.floating[_32Bit]

To make something that works with Python type hints / type checking today, it may make sense to make a dtypes(Enum) of every supported dtype. The values of the enum can all be instances of the same pd.DataType dataclass, with various fields and methods to convert to whatever backend is desired.

class dtypes(Enum):
    BOOLEAN = DataType(in_python=bool, in_np=np.dtype(bool), in_pa=pa.bool_())
    F32 = DataType(in_python=float, in_np=np.float32, in_pa=pa.float32())
    F64 = DataType(in_python=float, in_np=np.float64, in_pa=pa.float64())
    I32 = DataType(in_python=int, in_np=np.int32, in_pa=pa.int32())
    I64 = DataType(in_python=int, in_np=np.int64, in_pa=pa.int64())
    # ... and so on for all precisions of float/int/uint, for strings, temporal stuff, etc.

Benefit of this would be: we could always type hint our data types using a pd.dtypes.XYZ variant, and we could keep this "rosetta stone" of dtype conversion in one place, and it would give us a nice place to affix the methods to convert / display / check dtypes.

A downside is, all the dtypes are in one giant enum, so it's hard to write

thingy: pd.Float = MY_DEFAULT_FLOAT

One possible workaround for that downside could be to make Literals for the Enum variant.name

Float = Literal["F64", "F32", ...]

The Literal["UTC", "America/New_York", ...] works to enable explicit type hinting for time zones, I've dealt with that a bunch this week. Sometimes these generics lead us into using typing.get_origin which is a weird part of the type system and not always great, so what I usually do is make a set of valid values alongside the literal and a typeguard function.

We could try to experiment with an "Enum of Enums" approach where the dtypes enum has sub-enums, one per category:

class dtypes(Enum):
    BOOLEAN = DataType(in_python=bool, in_np=np.dtype(bool), in_pa=pa.bool_())
    FLOAT = Float  # Float (shown below) would need to be defined before this class

class Float(Enum):
    F32 = DataType(in_python=float, in_np=np.float32, in_pa=pa.float32())
    F64 = DataType(in_python=float, in_np=np.float64, in_pa=pa.float64())

class Int(Enum):
    I32 = DataType(in_python=int, in_np=np.int32, in_pa=pa.int32())
    I64 = DataType(in_python=int, in_np=np.int64, in_pa=pa.int64())

That's a bit weird, and I've never done it, so I don't know if it even works, but it would have a benefit of grouping the dtypes by logical category and could possibly work better with isinstance and issubclass than Literal.

Another downside to the Enum approach might be how to fit EXTENSION dtypes into pd.dtypes if it's an enum; we'd need a way to bail out, and typing.Protocol could be good for that, with a TypeGuard.

To get this done faster, we could focus on a bijection between arrow and numpy dtypes. If we line up their dtype systems and count all the distinctions, that's a good list of edge cases, as mentioned above. Arrow has from_numpy_dtype and to_pandas_dtype and whichever backend could convert all the dtypes into its necessary format. Since the Pandas codebase initially evolved around numpy dtypes, and re-inventing the dtype wheel is a "bottomless pit task," it may make sense to stick to a numpy-centric public interface even to work with Arrow backends.
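The round-trip helpers mentioned here do already exist in pyarrow; a quick illustration:

import numpy as np
import pyarrow as pa

# numpy -> arrow
arrow_type = pa.from_numpy_dtype(np.dtype("int64"))
assert arrow_type == pa.int64()

# arrow -> numpy/pandas
numpy_type = pa.int64().to_pandas_dtype()
assert np.dtype(numpy_type) == np.dtype("int64")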

I can't claim to know all the edge cases and issues, but I think this PDEP is a good idea, and one way to make the implementation work could be to make an enum with dataclass values (I didn't even know this was possible until recently). Then we could make a Literal for each logical category of dtype with the variant names in that category.

Edge cases:

  • how do we make sure temporal types and time zones are correct
  • how do we handle differences between numpy and arrow
  • how do we type hint extension dtypes

Sorry for long comment, hope this helps guide us, my bad if these issues are already addressed

@bionicles

bionicles commented Aug 11, 2024

FWIW, enum of enums does work, although the need to call .value is a bit weird

[screenshots omitted]

@WillAyd
Member Author

WillAyd commented Aug 11, 2024

@bionicles thanks for your input. In the PDEP text there is a link to the following image - I think there is some conceptual overlap between being able to lay out a type hierarchy and solving your typing needs:

[image: logical type hierarchy diagram from the PDEP]

The details of how we actually need to implement that for type checkers might be a little too granular for this PDEP (maybe, maybe not), but hopefully this type of categorization is in line with what you are after

@jorisvandenbossche
Member

I don't want to dismiss the type hinting topic (it's interesting and it would be nice if the PDEP could also improve the type hinting situation, although I will have to read and understand it in more detail to say anything constructive about it), but I would suggest moving that to a separate issue.

(while this is the main PR for the PDEP, it's perfectly fine to split certain sub-topics into its own issue to have a more focused discussion, as this PR issue is becoming very lengthy)

@jorisvandenbossche
Member

@WillAyd I have re-read the discussion of the last days above multiple times now, and I am not fully following (and as Brock mentioned, some sentences seem to contradict each other a bit).

I don't think we should put a lot of time into trying to make all the backends work the same; the goal of this PDEP is to really stop caring about all the subtle differences and just have one way of doing things.
...

The problem with the PDEP is that it is currently trying to have it both ways: 1) users don't have to know/care about which backend they have, and 2) we won't put effort into making behaviors match across backends. Choose one or the other.

These aren't mutually exclusive options, and this PDEP solves the issue of the latter.

It's not entirely clear to me what you mean exactly by "just have one way of doing things", but I agree with the others that this does sound like two opposite options for which we need to make a choice (i.e. "mutually exclusive"):

Either we say "let's accept that there are significant differences between backends, and we don't put effort in making behaviours consistent". But at that point users will care about those.
And we should also respect the choice of backend then (because it impacts behaviour), while at some point you seemed to indicate that we would not do that for derived/computed columns ("a user-facing option to select storage during construction only, but nothing about our implementation will respect that during computation").

Or we say: "most users shouldn't have to care about which backend is being used, because that is an implementation detail that (except for some details) has not much directly visible (behavioural) impact". But at that point we will have to put effort in making the behaviour consistent across backends.
(and of course what exactly is considered as a "detail" is very much open for discussion ..)

Yes to be clear we aren't matching anything across backends.
...
I think we all agree it is not value added to try and support "numpy behavior" versus "arrow behavior" versus "pandas extension type behavior," but seems like we are coming at it from different angles. For this PDEP, the assertion is that we try and "solve" that today by having up to 3 different types of every type, with undefined rules on what operations against those types will return. This PDEP clarifies all of that and gets us back to just one type

I quoted this a bit out of context, but this again seems a bit contradictory? If we do not want to support different behaviours, i.e. support both "numpy" behaviour and "pyarrow behaviour" at the same time (with which I agree), then that does mean we need to match behaviour (and put effort in making them match) between a backend based on numpy and based on pyarrow, no?
I fully agree that the current situation with multiple flavours of the same dtype with unclear rules when which type is being used is not great, and ideally we have "just one type" for users to interact with. But IMO that does mean that this "one type" should behave somewhat consistently if it has multiple implementations, otherwise this will IMO just be as confusing (or even more so) as the current situation.

[@mroeschke] Just to throw out my 2c opinion, I would hope with this PDEP that "we won't put effort into making behaviors match across backends."

Personally, I think I am firmly in the camp of "let's do put effort in making behaviours consistent".
And to use Will's concrete example of our current StringDtype:

#58455 (comment) I think it's also worth taking a step back and considering that our String implementation from PDEP-14 (and even preceding it) is exactly what this PDEP is proposing

For our string dtypes (within the NaN-variants and within the NA-variants), we do put effort in making the behaviour consistent between the "python" and "pyarrow" storage. In the ArrowStringArray implementation, we have various places for the string methods where we "patch up" the result of a pyarrow compute kernel to make it match behaviour with the python implementation, or we do implement certain operations with a work-around because pyarrow doesn't support something (e.g. the any/all reductions).

So I think this is a concrete example of where we clearly do put effort in making the backends behave consistently. For the people that argued otherwise, do you think we should also stop doing that for the string dtype, or just not follow that pattern for other dtypes?

@WillAyd
Member Author

WillAyd commented Aug 12, 2024

Thanks @jorisvandenbossche for the detailed reply. Some quick comments that hopefully clarify:

It's not entirely clear to me what you mean exactly by "just have one way of doing things", but I agree with the others that this does sound like two opposite options for which we need to make a choice (i.e. "mutually exclusive"):

Mainly that users don't need to make choices based off of how our implementation decides to return different types. Like many of the issues labelled "pyarrow dtype retention" and "Dtype conversions", we have a lot of gaps in the API where we force a user to understand that "this method returns a pyarrow backed boolean data type because we think it can be more performant, but this method returns a NumPy bool because our extension array interface declared it as such." Implementation details like this don't need to be leaked into the API; we can just always return a logical type

I quoted this a bit out of context, but this again seems a bit contradictory? If we do not want to support different behaviours, i.e. support both "numpy" behaviour and "pyarrow behaviour" at the same time (with which I agree), then that does mean we need to match behaviour (and put effort in making them match) between a backend based on numpy and based on pyarrow, no?

The point of this quote within its context was to say that we don't need to try and reimplement NumPy's signed integer overflow handling (both pre NEP-50 and post NEP-50), nor does the use of INT64_MIN as the NaT sentinel in our own implementation of temporal data prevent us from having logical temporal types. It is totally acceptable to have some "undefined behavior," akin to how the C language does not 100% standardize every compiler, and maybe it would help to list those out as part of this PDEP. But generally edge cases shouldn't force us to abandon an entire system

It also applies to the datetime example that I've quoted a few times in this document. In today's world a call to pd.Series.dt.date with a datetime64 returns an object data type, although with a pyarrow timestamp it returns a pyarrow date. This doesn't mean that we all of a sudden now need to implement a "date" type of our own that pd.Series.dt.date can dispatch to instead of object so that both the extension array and pyarrow types "work the same"; instead we just pick the behaviors from the different type systems that we want to use to compose our own
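For reference, a small reproduction of the dt.date difference described above (assuming pyarrow is installed; results noted in comments):

import pandas as pd
import pyarrow as pa

ser_np = pd.Series(pd.date_range("2024-01-01", periods=2))
ser_pa = ser_np.astype(pd.ArrowDtype(pa.timestamp("ns")))

print(ser_np.dt.date.dtype)  # object
print(ser_pa.dt.date.dtype)  # a pyarrow date32 type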

For our string dtypes (within the NaN-variants and within the NA-variants), we do put effort in making the behaviour consistent between the "python" and "pyarrow" storage.

Yes this does go back to the "one way of doing things" comment; I think the way that we are going about strings is great, although strings are even way more extreme than any of the other data types. Just a quick rundown of other types:

  • Integral types are zero-copy
  • Float types are zero-copy, except for NaN behavior that PDEP-16 will standardize
  • Booleans are NOT zero-copy, but have a limited scope
  • datetime64 reserves INT64_MIN as a sentinel, pa.timestamp does not

My expectation is that converting all of those to logical types would, in sum, be far less effort than what is going into the string type

Hope that helps clarify but let me know what may still be unclear

@WillAyd
Member Author

WillAyd commented Aug 12, 2024

Also with the string example, our behaviors will "match" insofar as there aren't differences between utf8proc and the Python stdlib. We definitely aren't going to try and bridge those gaps, but that hasn't prevented us from creating that logical type.

@jorisvandenbossche
Member

Thanks, that clarifies. I think part of the confusion is that you said multiple times something along the line of "I think we should not put a lot of time into trying to make all the backends work the same", while I would personally interpret your current stance (but please correct if I am misinterpreting!) as "we should try to make backends work the same to the extent possible", but then the details and discussion are of course about that last italic part .., i.e. what is an acceptable difference and what not, what is possible to make consistent and what not.

Because my interpretation is that you also think that (ideally) NA handling is the same across backends? That's clearly something we need to "put time into" if we want to make that happen. But I agree that the capitalization of German Eszett is not easily possible to make consistent (and overflow in operations is an area where we disagree, I think).
I had already started writing a follow-up post going in more detail on this exact topic, so will finish that comment as I think it is still relevant and useful to go more in detail into concrete possible differences.

@WillAyd
Member Author

WillAyd commented Aug 12, 2024

"we should try to make backends work the same to the extent possible", but then the details and discussion are of course about that last italic part .., i.e. what is an acceptable difference and what not, what is possible to make consistent and what not.

Thanks @jorisvandenbossche . Your wording is much better than mine

@WillAyd
Member Author

WillAyd commented Aug 12, 2024

Because my interpretation is that you also think that (ideally) NA handling is the same across backends?

I think missing values can be handled consistently, although it's up to you as an end user which NA "semantics" you want to use. Taking the following example from today's world:

>>> values = ["foo", "bar", None]
>>> pd.Series(values).str.startswith("f")
0     True
1    False
2     None
dtype: object
>>> pd.Series(values, dtype="string").str.startswith("f")
0     True
1    False
2     <NA>
dtype: boolean

With this PDEP you would get the logical data type back in the first example, based on the rules in the Missing Value Handling section, although it's worth questioning what that missing value should be for boolean types (the current PDEP has np.nan). The equality and logical semantics of the missing value indicator are still allowed to vary, if that's what you as an end user are after

@jorisvandenbossche
Member

I think missing values can be handled consistently, although its up to you as an end user which NA "semantics" you want to use. Taking the following example from today's world

I know that is today's world, but I would personally still argue that we should not keep allowing the user to choose the NA semantics in the future, but limit this to one set of behaviours. But yes that is PDEP-16, and it's a bit difficult to keep those separate.
Anyway, in today's world, I agree this is a difference between backends that the user can choose (i.e. essentially how we are doing the StringDtype(.., na_value=np.nan|pd.NA) now), but then we still need to ensure that the NaN semantics or NA semantics are consistent across the dtypes (within their type of semantics), and that derived operations also stay within their type.

@jorisvandenbossche
Member

jorisvandenbossche commented Aug 12, 2024

(sorry, long post coming)

I would hope with this PDEP that "we won't put effort into making behaviors match across backends."

Personally, I think I am firmly in the camp of "let's do put effort in making behaviours consistent".

Thinking a bit further on this, this raises two questions: what do we exactly understand with "matching behaviour", and why then have multiple backends if they can't differ?

Matching behaviour across backends

Above, there was some discussion about what we exactly understand under those "subtle" or "significant" differences that we would either want to allow or not allow. It's good to be concrete. For example, quoting Richard:

I think this is part of the "users generally don't care about the subtle differences". Working with pandas in a production setting, I care a great deal. Our code at work has to deal in particular with a lot of NA values,

First, the missing value behaviour is not a "subtle" difference at all (it's actually so significant that there is a separate PDEP for it ;)). So I would say that most aspects related to NA handling (the exact sentinel to use, the propagation behaviour, how it's treated in indexing, whether they sort first or last, etc) are all significant behaviour differences, that indeed users do care about (and if we would allow that to differ, something that users should be aware of).

But I would personally argue that this is not something that we should allow to differ between backends (hence PDEP-16).
I know it's a bit difficult to un-tangle those discussions, but in an ideal case we would roll out this PDEP at the same time as the default NA sentinel, and then this PDEP shouldn't have to care about different NAs for the various dtypes. But regardless of that, I think that for the discussion of this PDEP we should assume that NA behaviour will be consistent across dtypes (either because we accepted PDEP-16, or because we keep both NaN and NA behaviour, but then we should still ensure the behaviour is consistent across all NaN-dtypes and across all NA-dtypes, and ensure users don't easily mix those).

Brock gave some other examples:

pyarrow durations can hold int64_min while np.timedelta64s can't. on the flip side, you can multiply numpy ones by floats, but not pyarrow ones

The first one (support NaT or not) indeed depends a bit on PDEP-16, because if we would decide to not distinguish NA and NaT (the current proposal), then most of the behaviour differences (a certain operation returning NaT or not) go away (except for the tiny corner case of being able to store the smallest possible timedelta or not).
But the second one (multiply timedelta with floats) is something we can perfectly support for pyarrow as well, if we want (multiply the underlying integers and cast the resulting float back to int, which I assume is what numpy is doing as well?)

pyarrow/numpy numeric dtypes handle overflows differently.

To clarify, pyarrow provides numeric operations in both a version that checks for overflow and one that does not, and we chose to use the checked version. So if we want to match behaviour with numpy, we can easily change our pyarrow dtypes to do so.
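For concreteness, pyarrow exposes both flavours as separate compute kernels, e.g. pc.add versus pc.add_checked (behaviour noted in comments):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([2**62], type=pa.int64())

print(pc.add(arr, arr))      # wraps around silently on int64 overflow
# pc.add_checked(arr, arr)   # raises an overflow error instead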

Now, personally I prefer the behaviour that raises an error for overflow, but again if we want, we can also implement that for our numpy-based dtypes. Of course that comes with a performance penalty (and so maybe it should be possible to turn it off), but that's the case for the pyarrow dtypes now as well.
For example, specifically for overflow in construction/casting, I proposed to stop doing the silent overflow a while ago (#45588).


Summarizing the currently raised examples of behaviour differences:

I would personally call none of those "subtle", but see those all as significant differences (and it is for this kind of differences that I personally argue that we should ensure this is consistent across backends).

And referring to my older comment at #58455 (comment), I also think many currently known or reported differences with the pyarrow backend (certain type specific methods not yet working, something not working when set as the index, ...) should be considered as "bugs" or "not yet implemented".

For the other side of the spectrum, the kind of differences that I personally would consider as edge cases that are acceptable, are for example the capitalization of the German Eszett or the fact that Timedelta('-106752 days +00:12:43.145224193') cannot be stored.


What's the point of multiple backends

If they should behave exactly the same, what's then the point of having multiple backends? (as that could certainly be a reason that one would want a different backend, for a specific behavioural feature)

One reason of course is the ability to have an optional dependency, but this gets us into the PDEP-10/15 discussion of making pyarrow required or not ... Especially when pyarrow would be a required dependency, one could question the need to have multiple backends for several of the default dtypes. For the string dtype there might be sufficient difference in performance characteristics (eg the object-dtype based one is more efficient when mutating or sorting) that it might be worth to keep both options. But for example for the numeric (int/float) dtypes, if a possible numpy and pyarrow backend behaves the same and is more or less on par for performance, why maintain both? (and based on performance, I don't think it is necessarily a given that we would then pick pyarrow above numpy)

I think an important reason for the backends is to let us allow more easily to experiment with a new implementation, before making something the default. For example, assume for a moment we already had a StringDtype with python storage for a very long time. When starting the faster pyarrow-based implementation, we could have added it as an optional backend, ask people to turn it on through an option to test it (without otherwise having to change their code), and then after some time switch the default when ready.
But for something like this, the different backends should behave roughly the same (or only differ in something that we explicitly want to make the default behaviour). And right now ArrowDtype is essentially this vehicle we use to experiment with using pyarrow under the hood. But because it is a completely different dtype, it is less easy to switch to for testing (e.g. right now, besides IO methods, you would need to change how you specify dtypes to make your code actually use ArrowDtype). Integrating (a subset of) ArrowDtype in default logical dtypes would make it a lot smoother to test this more broadly and potentially switch the default for specific dtypes.

Another way to look at those backends, rather than as a "dtype" backend, is to consider them as a "compute" backend or engine.
For the string dtype there is of course a significant difference in how the data is stored (hence we use storage there and not backend, at the moment). But for many other types we support by default, especially numeric dtypes, the memory that is being stored is quite similar (especially if we would adopt the masked array approach in the future, and even more if we would consider using bitmaps (xref Will's PoC #59410)). And the main difference between the dtypes is which library is used for certain computations.

Similarly like we use numexpr and bottleneck (if installed, and controlled by an option) for certain operations when deemed beneficial, pyarrow could also be treated similarly for certain data types and used for certain operations, instead of being tied to the dtype instance (and thus tied to a specific column).

Anyway, this is maybe getting a bit far from the current discussion, but for the actual PDEP, I don't think we necessarily need to answer all the above questions (e.g. which backend for which dtype? keep multiple backends long term? etc). I would personally say something along the lines of 1) we provide a logical dtype for all our current dtypes (with a consistent interface, etc), and 2) they all have (obviously) at least one implementation (either using numpy, or using pyarrow, or a mixture where it makes sense), but if there are multiple implementations (i.e. parametrizations of the same dtype class), the intent is that they behave the same ("to the extent possible").

@WillAyd
Member Author

WillAyd commented Aug 12, 2024

I know it's a bit difficult to un-tangle those discussions, but in an ideal case we would roll out this PDEP at the same time as the default NA sentinel, and then this PDEP shouldn't have to care about different NAs for the various dtypes.

I'm a little bit torn on this. On one hand, it would definitely make for easier internal development if these were bundled together, but on the other hand I think there is some advantage to delivering separately, i.e. the logical types here could be a good stepping stone to achieve PDEP-16. For users that are very concerned about missing value handling, using the "legacy" mode (or whatever we want to call it) from the Missing Value Handling section of this PDEP might help start the untangling process of types with missing value implementations.

But the second one (multiply timedelta with floats) is something we can perfectly support for pyarrow as well, if we want (multiply the underlying integers and cast the resulting float back to int, which I assume is what numpy is doing as well?)

I think what we might be overlooking with this example though is, assuming we really prefer the non-pyarrow behavior, there isn't really that much of a need to make pyarrow change at all. We can prefer the existing timedelta64 storage / computational behavior and just worry about efficient copy to pa.duration if / when it is ever needed, and when it makes sense

Now, personally I prefer the behaviour that raises an error for overflow, but again if we want, we can also implement that for our numpy-based dtypes. Of course that comes with a performance penalty (and so maybe it should be possible to turn it off), but that's the case for the pyarrow dtypes now as well.

This is somewhat of a technical note but we can also punt on that behavior for some time, at least for arithmetic (conversions maybe need some more review). C23 introduces checked arithmetic implemented via <stdckdint.h>, so I would expect that as compilers better support the standard, the performance overhead will be almost negligible, with many libraries converging on the standard. On the flip side, you never know how long it may take all compilers to adopt C23 (MSVC...)

Summarizing the currently raised examples of behaviour differences:

Thanks for this list. I think the PDEP can have a section dedicated to each of these behaviors. Happy to try and draft that, but am open to any content anyone provides up front

What's the point of multiple backends

I think this is a great section, but I'd also add that we stand to gain a lot from the composition of these different providers. We have a ton of custom calendar functionality in pandas using NumPy that I don't see us reimplementing with pyarrow types. On the flip side, pyarrow has some really cool types that probably won't be implemented in NumPy (aggregate types, decimal, etc...). The main tenet of this PDEP is that we can strategically offer the strengths of types from the different libraries in a way that is consistent and meaningful to the end user

@Dr-Irv
Contributor

Dr-Irv commented Aug 12, 2024

they all have (obviously) at least one implementation (either using numpy, or using pyarrow, or a mixture where it makes sense), but if there are multiple implementations (i.e. parametrizations of the same dtype class), the intent is that they behave the same ("to the extent possible").

I've read all of this discussion, and am not sure how I feel about the behavior issue. My main reason is that while there is a summary of major behavior differences given by @jorisvandenbossche at #58455 (comment) , it seems from the discussion there are some minor behavior differences that would need to be better understood.

However, in the future, whether the behavior is entirely consistent between backends, or the behaviors are the same "to the extent possible", documentation will be required to indicate those behavior choices. Even if the behaviors were 100% consistent across backends, if those behaviors differed between how things would work in pandas as compared to non-pandas code using python, numpy or pyarrow, we'd still need that documentation.

So maybe there should be a discussion post somewhere (or a new piece of documentation) that explicitly documents the differences in behavior between the different backends that exist today, and then we can see where the line could be drawn in terms of what behaviors would be consistent and what behaviors would not be.

@WillAyd
Member Author

WillAyd commented Aug 12, 2024

Does it help if we say that this PDEP only concerns itself with how types are stored and not necessarily how they are computed? I think from the storage perspective it is very easy to technically validate what differences there are and are not.

With computations, you run into questions like how is overflow handled, why does pandas use Kahan summation while NumPy uses pairwise summation while pyarrow uses XXX, why does pandas default to ddof=1 for std while NumPy uses 0?
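The ddof point, for example, is easy to see (values shown approximately in comments):

import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]

print(np.std(data))           # ddof=0 -> ~1.118
print(pd.Series(data).std())  # ddof=1 -> ~1.291
print(np.std(data, ddof=1))   # matches the pandas default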

I don't think anything needs to change with our computation engine as part of this. Long term, yes it might make more sense to use a different computation engine that scales out of core or can be integrated into a lazy evaluation context, but that is also a different PDEP

@jbrockmendel
Member

I've been largely critical, here are some bits that I'm +1 on:

  1. General statement "we will attempt to make 'big' behaviors match across backends and document exceptions to that rule"
    a) ser.dt.date returning a date dtype (which would require implementing a DateDtype with nan semantics unless+until PDEP-16 is implemented)
    b) Index(pyarrow_timestamp_dtype) -> DatetimeIndex, analogous for TimedeltaIndex (xref BUG: .loc Indexing with pyarrow backed DatetimeIndex does not allow string comparison #58307)
  2. Move towards using pd.FooDtype(storage_or_maybe_backend=..., na_value=..., foo_specific_kwargs=...) as the recommended/standardized API for specifying dtypes.
    a) consolidate StringArray, StringArrayNumpySemantics, ArrowStringArray, ArrowStringArrayNumpySemantics, and ArrowExtensionArray[ArrowDtype[various_pyarrow_strings]] down to just StringArray. The first four are fairly straightforward to do. The latter would require either deprecating allowing them in the constructor or mapping it to StringArray/StringDtype via __new__.
    i) Since the proliferation of strings are the biggest motivating pain point, I'd be +1 on moving forward with this part independent of the rest of the PDEP.
    b) Parts of this (e.g. whether to retain an na_value keyword) depend on PDEP-16. I lean towards agreeing with Joris on rolling it out at once; otherwise we would have an na_value keyword and then deprecate it almost immediately.

@jorisvandenbossche
Member

jorisvandenbossche commented Aug 12, 2024

I know it's a bit difficult to un-tangle those discussions, but in an ideal case we would roll out this PDEP at the same time as the default NA sentinel, and then this PDEP shouldn't have to care about different NAs for the various dtypes.

[Will] #58455 (comment) I'm a little bit torn on this. One one hand, it would definitely make for easier internal development if these were bundled together, but on the other hand I think there is some advantage to delivering separately, i.e. the logical types here could be a good stepping stone to achieve PDEP-16.

There are certainly pros and cons for each option, but maybe this is something we can actually table for later when things get more concrete, as I don't think it is critical for this PDEP whether each dtype will have support for different NA semantics. The dtypes that are currently already extension dtypes and that we would keep will have to support both for at least a transition period (like we now do for StringDtype(.., na_value=np.nan|pd.NA), we could do the same for CategoricalDtype or PeriodDtype etc). But for dtypes where the default is not yet an extension dtype, eg np.int64, I think we can decide later (depending on PDEP-16 on missing values) if it is worth it or needed for a transition period to add na_value to Int64Dtype() or not.

Summarizing the currently raised examples of behaviour differences:

[Will] #58455 (comment) Thanks for this list. I think the PDEP can have a section dedicated to each of these behaviors. Happy to try and draft that, but am open to any content anyone provides up front

I am not sure if the PDEP text should have a section on each of those; that maybe goes too much into the details?
What I think would be good is if we can at least agree at a conceptual level (for example, something like "if a dtype class has multiple backends, then the intent is that it behaves consistently across backends, to the extent possible (and with documented corner cases)", depending on where the consensus ends up), and agree on some examples of behaviours that are definitely not considered as corner cases and some examples that could be, without necessarily going in detail on each example.

EDIT: I think this broadly matches with the first point ("general statement") of Brock's comment just above.

What's the point of multiple backends

[Will] #58455 (comment) I think this is a great section, but I'd also add that we stand to gain a lot from the composition of these different providers. We have a ton of custom calendar functionality in pandas using NumPy that I don't see us reimplementing with pyarrow types. On the flip side, pyarrow has some really cool types that probably won't be implemented in NumPy (aggregate types, decimal, etc...). The main tenet of this PDEP is that we can strategically offer the strengths of types from the different libraries in a way that is consistent and meaningful to the end user

Yes, definitely. That's also why I included "or a mixture [of numpy and pyarrow] where it makes sense" (and mentioned the notion of considering it as a compute engine).
Maybe the main point is that this PDEP does not propose that each dtype should have multiple backends. Some will have, but some could be purely numpy or purely pyarrow or use both. To be clear, just re-reading the relevant section of the current text now, it does not give the impression of a dtype always having multiple backends. The metadata listing has "storage: Either "numpy" or "pyarrow". Describes the library used to create the data buffer" (which actually speaks about storage and not backend)

[Will] #58455 (comment) Does it help if we say that this PDEP only concerns itself with how types are stored and not necessarily how they are computed? I think from the storage perspective it is very easy to technically validate what differences there are and are not.

I think the PDEP should at least state something about behaviour (i.e. not necessarily which library is used, but the result of the computation), as that is the most relevant aspect for users.
(actually the storage is more an implementation detail, so you could also say that for that reason the PDEP should not really concern itself about such implementation detail, i.e. each dtype can decide for its own how it stores the data and whether this is controllable)

Not necessarily in detail (as we can't answer every possible question), but as I mentioned above at least in general terms together with some examples.
And to one of the examples you give, I think the default ddof argument should definitely be the same for different backends when calling ser.std() (assuming each of the underlying compute libraries support specifying that keyword, as is the case right now with numpy and pyarrow).

[Brock] #58455 (comment) 2. i) Since the proliferation of strings are the biggest motivating pain point, I'd be +1 on moving forward with this part independent of the rest of the PDEP.

Not too important, but I wouldn't necessarily say that strings are the biggest motivator. For the dtype constructor StringDtype(..), with PDEP-14 (string dtype), I think we are actually there (except for ArrowDtype, I know, see below), and for the various ...StringArray... classes, we can indeed see how to better consolidate those, as you already started doing (but I would also say that this PDEP should not concern itself too much about the Array classes (like impose implementation details about whether there should also be a single array class or if there can be multiple array classes tied to one dtype class), but focus on the dtype interface).
At that point, the current (post 3.0) situation for strings (StringDtype() and ArrowDtype(pa.string())) is rather similar, or even slightly better, than for example for integers (np.dtype("int64"), pd.Int64Dtype(), pd.ArrowDtype(pa.int64())). So personally, I want this PDEP as much for non-string dtypes.

>>> pd.Series(["x"], dtype=pd.StringDtype("pyarrow", na_value=pd.NA)).value_counts().dtype
int64[pyarrow]
>>> pd.Series(["x"], dtype=pd.StringDtype("python", na_value=np.nan)).value_counts().dtype
Int64Dtype()
Member

Suggested change
Int64Dtype()
dtype('int64')

* physical_type: Can expose the physical type being used. As an example, StringDtype could return pa.string_view
* na_value: Either pd.NA, np.nan, or pd.NaT.

While these attributes are exposed as construction arguments to end users, users are highly discouraged from trying to control them directly. Put explicitly, this PDEP allows a user to specify ``pd.XXXDtype(storage="numpy")`` to request a NumPy-backed array, if possible. While pandas may respect that during construction, operations against that data make no guarantee that the storage backend will be persisted through, giving pandas the freedom to convert to whichever storage is internally optimal (Arrow will typically be preferred).
Member

Should each dtype necessarily expose each of those metadata items as a construction argument? If there is only one valid option (eg a dtype that is implemented only using numpy or pyarrow for storage, or that has only a single physical type), it seems unnecessary to expose this as an argument?

To implement this PDEP, we expect all of the logical types to have at least the following metadata:

* storage: Either "numpy" or "pyarrow". Describes the library used to create the data buffer
* physical_type: Can expose the physical type being used. As an example, StringDtype could return pa.string_view
Member

While it's true that there are multiple dtypes where this could be relevant, to Brock's point: it will not be relevant for all dtypes. So maybe make this an optional metadata item? (or maybe just give an example of how the logical dtype can hide implementation details)

@jorisvandenbossche
Member

[Brock] #58455 (comment) I've been largely critical, here are some bits that I'm +1 on:

1. General statement "we will attempt to make 'big' behaviors match across backends and document exceptions to that rule"
...
2. Move towards using pd.FooDtype(storage_or_maybe_backend=..., na_value=..., foo_specific_kwargs=...) as the recommended/standardized API for specifying dtypes.
...

FWIW, I think for me these are (or should be) the two main points of the PDEP (and the text could mostly focus on those two items, going into more detail on each).
