Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dtype for tz-naive DatetimeArray and TimedeltaArray #24662

Open
TomAugspurger opened this issue Jan 7, 2019 · 12 comments
Open

dtype for tz-naive DatetimeArray and TimedeltaArray #24662

TomAugspurger opened this issue Jan 7, 2019 · 12 comments
Labels
Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint Timedelta Timedelta data type

Comments

@TomAugspurger
Copy link
Contributor

Right now, Datetime.Array.dtype can be either np.dtype('M8[ns]) or a DatetimeTZDtype, depending on whether the values are tz-naive or tz-aware. This means that while DatetimeArray[tz-naive] is an instance of ExtensionArray, it doesn't satisfy the minimum ExtensionArray API, which requires that array.dtype be an ExtensionDtype.

In [4]: pd.date_range('2000', periods=4)._data.dtype
Out[4]: dtype('<M8[ns]')

In [5]: pd.date_range('2000', periods=4, tz="CET")._data.dtype
Out[5]: datetime64[ns, CET]

The causes some type-unsoundness for places that are supposed to return an ExtensionArray. The two most prominent being pd.array and Series.array. As an example, following isn't necessarily safe code

def f(ser: Series) -> Callable: 
    return ser.array.dtype.construct_array_type

that will fail for tz-naive datetime data, because its .dtype is a NumPy dtype.


Proposal:

  1. Make a DatetimeDtype (or allow DatetimeTZDtype to have tz=None). Make a TimedeltaArray.dtype
  2. Ensure that DatetimeArray.dtype and TimedeltaArray.dtype is always an ExtensionDtype
  3. Wrap Series.dtype and DatetimeIndex.dtype to continue returning the NumPy dtype

The last step is to avoid breaking code relying on Series[tz-naive].dtype being a NumPy dtype.

@TomAugspurger TomAugspurger added Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Timedelta Timedelta data type ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 7, 2019
@TomAugspurger TomAugspurger changed the title dtype for tz-naive DatetimeARray and TimedeltaArray dtype for tz-naive DatetimeArray and TimedeltaArray Jan 7, 2019
@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 8, 2019

@jorisvandenbossche would you expect the following to be placed in an ExtensionBlock (backed by a TimedeltaArra), or a TimedeltaBlock (backend by an ndarray)?

pd.Series(pd.array(['1H', '2H'], dtype=pd.core.dtypes.dtypes.TimedeltaDtype()))

For now, I think this should continue to be a TimedeltaBlock.

@jorisvandenbossche
Copy link
Member

Yes, that is what I expect as well (so that for those arrays a special case is made to construct a different kind of block). Because we should internally have only one way to store timedelta data.

@TomAugspurger
Copy link
Contributor Author

I abandoned the PR #24674. There are a few issues around DatetimeArray.astype #24674 (comment)

@jbrockmendel
Copy link
Member

leaning towards "dont bother"

@jbrockmendel
Copy link
Member

On further consideration, I think this might be useful for code de-duplication. In particular a bunch of methods that are currently implemented on both Timestamp and DatetimeArray could instead be implemented on DatetimeDtype with a signature that takes int64_t*, which think could then be shared by Timestamp/DTA. Similar with Timedelta/TDA.

@jorisvandenbossche
Copy link
Member

What methods are you thinking about?

@jbrockmendel
Copy link
Member

What methods are you thinking about?

The clearest example is .normalize. Pretty much anything that goes through tslibs.fields.

@jorisvandenbossche
Copy link
Member

We're speaking about a dtype, right?
Since both Timestamp or DatetimeArray don't inherit from a dtype (for DatetimeArray it's an attribute), how does implementing methods on the dtype help sharing?

@jbrockmendel
Copy link
Member

Since both Timestamp or DatetimeArray don't inherit from a dtype (for DatetimeArray it's an attribute), how does implementing methods on the dtype help sharing?

The method on the dtype would take int64_t*, and so would be called from DTA/Timestamp with (ex-cython pointer semantics) self.asi8 and self.value, respectively.

@jorisvandenbossche
Copy link
Member

But if it is just a shared implementation that both Timestamp and DatetimeArray call, this shared implementation can also just be a function that takes int64_t* living somewhere in tslibs?

To be clear: if there are more algos that can be shared between both, that sounds good, I am just trying to understand what the dtype (this issue) has to do with it.

@jbrockmendel
Copy link
Member

can also just be a function that takes int64_t* living somewhere in tslibs?

Both options work. ATM I'm leaning towards an class-based (i.e. dtype-based) approach being cleaner.

@tswast
Copy link
Contributor

tswast commented Mar 25, 2023

A slight argument in favor of a class-based approach: pyarrow to_pandas only allows the types_mapper to return Pandas ExtensionDtype objects that implement __from_arrow__.

@jbrockmendel jbrockmendel added the PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint label Oct 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants