
PERF: Deprecate casting of index of dates to DatetimeIndex #23598


Closed
TomAugspurger opened this issue Nov 9, 2018 · 5 comments · Fixed by #36697
Labels
Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Comments

@TomAugspurger
Contributor

In [5]: index = pd.Index([pd.Timestamp('2001'), pd.Timestamp('2002')], dtype=object)

In [6]: pd.Series(1, index=index).index
Out[6]: DatetimeIndex(['2001-01-01', '2002-01-01'], dtype='datetime64[ns]', freq=None)

This was the root cause of #23591. Why are we doing that?

Note that this doesn't affect the case of index=[pd.Timestamp(...), pd.Timestamp(...)], as that would have previously been converted to a DatetimeIndex. It seems to be only when you have an object-dtype Index of datetimes.
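A minimal sketch of the distinction described above, assuming only public pandas constructors. The variable names are illustrative, and whether the Series constructor keeps or casts the object-dtype index depends on the installed pandas version, so the snippet reports the result rather than asserting a particular dtype.

```python
import pandas as pd

stamps = [pd.Timestamp("2001"), pd.Timestamp("2002")]

# A plain list of Timestamps is inferred as a DatetimeIndex by pd.Index.
inferred = pd.Index(stamps)

# Asking for dtype=object keeps the Timestamps boxed in an object-dtype Index.
boxed = pd.Index(stamps, dtype=object)

# The behaviour under discussion: what does the Series constructor do
# with an object-dtype Index of Timestamps? (Version dependent.)
result_index = pd.Series(1, index=boxed).index

# An explicit, opt-in conversion for callers who do want a DatetimeIndex.
explicit = pd.DatetimeIndex(boxed)

print("inferred from list:", type(inferred).__name__, inferred.dtype)
print("boxed:             ", type(boxed).__name__, boxed.dtype)
print("Series(...).index: ", type(result_index).__name__, result_index.dtype)
print("explicit:          ", type(explicit).__name__, explicit.dtype)
```

If the third line of output shows a DatetimeIndex, the installed version still performs the implicit cast discussed here.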

I'm going through our test cases that hit this now.

@TomAugspurger
Contributor Author

We do hit this when concatenating two Series whose DatetimeIndexes have different timezones (at this point we have an object-dtype Index with different tzs). But that raises anyway.

It seems like tz-naive + tz-aware hits this, and actually goes through:

In [5]: a = pd.date_range('2000', periods=1, tz='US/Eastern')

In [6]: b = pd.date_range('2000', periods=1)

In [7]: pd.concat([pd.Series(1, a), pd.Series(2, b)])
> /Users/taugspurger/sandbox/pandas/pandas/core/series.py(355)_set_axis()
-> try:
(Pdb) c
Out[7]:
2000-01-01 00:00:00-05:00    1
2000-01-01 00:00:00          2
dtype: int64

But, presumably that's another opportunity for improving perf? We can fix this earlier in the concat by building a DatetimeIndex rather than an Index of Timestamps.

@TomAugspurger added the Datetime (Datetime data dtype) and Performance (Memory or execution speed performance) labels Nov 9, 2018
@TomAugspurger modified the milestones: 0.24.0, Contributions Welcome Nov 9, 2018
@TomAugspurger
Contributor Author

Ignore that last bit about tz-aware and tz-naive. That returns an object-dtype index.

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 9, 2018

Ohhhhhhh the dtype of the index depends on the order of arguments passed to concat.

In [2]: a = pd.date_range('2000', periods=1, tz='US/Eastern')

In [3]: b = pd.date_range('2000', periods=1)

In [4]: pd.concat([pd.Series(1, a), pd.Series(2, b)]).index
> /Users/taugspurger/sandbox/pandas/pandas/core/series.py(355)_set_axis()
-> try:
(Pdb) c
Out[4]: Index([2000-01-01 00:00:00-05:00, 2000-01-01 00:00:00], dtype='object')

In [5]: pd.concat([pd.Series(1, b), pd.Series(2, a)]).index
> /Users/taugspurger/sandbox/pandas/pandas/core/series.py(355)_set_axis()
-> try:
(Pdb) c
Out[5]: DatetimeIndex(['1999-12-31 19:00:00-05:00', '2000-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

#23598 is related here (that's for union, this uses append)
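To check whether a given pandas install still shows this order dependence, something like the following could be used. It is only a sketch: `concat_index` is a hypothetical helper name, and the resulting dtypes are printed rather than assumed, since they may differ from the session above on other versions.

```python
import pandas as pd

aware = pd.date_range("2000", periods=1, tz="US/Eastern")
naive = pd.date_range("2000", periods=1)


def concat_index(first, second):
    """Concatenate two one-element Series and return the resulting index."""
    return pd.concat(
        [pd.Series(1, index=first), pd.Series(2, index=second)]
    ).index


aware_first = concat_index(aware, naive)
naive_first = concat_index(naive, aware)

# Compare dtype names as strings so the check works for both NumPy
# and pandas extension dtypes.
order_dependent = str(aware_first.dtype) != str(naive_first.dtype)

print("tz-aware first:", type(aware_first).__name__, aware_first.dtype)
print("tz-naive first:", type(naive_first).__name__, naive_first.dtype)
print("dtype depends on argument order:", order_dependent)
```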

@jorisvandenbossche
Member

For the original issue, I agree that we should preserve the object dtype in the Series constructor.

@jbrockmendel
Member

I think this is the main reason for Index.is_all_dates, so this would also allow #27744

@jbrockmendel mentioned this issue Sep 28, 2020
@jreback modified the milestones: Contributions Welcome, 1.2 Sep 30, 2020

4 participants