Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Default to stdlib timezone objects instead of pytz #34916

Closed
Tracked by #44823
jbrockmendel opened this issue Jun 21, 2020 · 25 comments · Fixed by #59089
Closed
Tracked by #44823

API: Default to stdlib timezone objects instead of pytz #34916

jbrockmendel opened this issue Jun 21, 2020 · 25 comments · Fixed by #59089
Labels
API Design Timezones Timezone data dtype
Milestone

Comments

@jbrockmendel
Copy link
Member

i.e. when we get pd.Timestamp.now("UTC") we should return a Timestamp with datetime.timezone.utc rather than pytz.UTC. Similarly when we parse an ISO8601 datetime we should use tzinfo of timezone(timedelta(seconds=val*60)) instead of pytz.FixedOffset(val)

This isn't that hard to implement, but doing it breaks a couple dozen tests where we are currently checking for pytz objects. This would technically be an API change, so putting it up for discussion before implementing it.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 21, 2020
@jreback
Copy link
Contributor

jreback commented Jun 21, 2020

i don’t think we easily change this as it’s a failrly large api change

we would need to have an option for this instead

@mroeschke
Copy link
Member

Might be a reasonable version 2.0 break. But +1 to moving the default timezone to the stdlib version

@mroeschke mroeschke added API - Consistency Internal Consistency of API/Behavior Timezones Timezone data dtype API Design and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member API - Consistency Internal Consistency of API/Behavior labels Jun 21, 2020
@jorisvandenbossche
Copy link
Member

Since there is also the discussion of using zoneinfo (in Python 3.9 stdblib, backport https://github.com/pganssle/zoneinfo), which is also a breaking change, those should potentially be considered together?

(zoneinfo actually also has a UTC timezone)

@mroeschke
Copy link
Member

+1 on incorporating zoneinfo as well. timezone object from the current stdlib otherwise are pretty bare.

@jbrockmendel
Copy link
Member Author

Thoughts on doing this without deprecation in 2.0?

@mroeschke
Copy link
Member

mroeschke commented Dec 28, 2021

Thoughts on doing this without deprecation in 2.0?

+1. Might be even better if we make ZoneInfo the default timezone object though if 2.0 still supports < 3.9 backports.zoneinfo would need to be taken as a dependency I think.

@nrontsis
Copy link

Is this still scheduled for pandas 2.0?

@jbrockmendel
Copy link
Member Author

Under discussion, not scheduled.

@mroeschke mroeschke modified the milestones: 2.0, 3.0 Feb 8, 2023
@Zac-HD
Copy link
Contributor

Zac-HD commented Mar 30, 2023

Here's the related conversation from Django, which led to the deprecation of pytz in Django 4.0 and plans for removal in 5.0. Unfortunately, Pandas is starting to fall behind the rest of the Python ecosystem on this one 😥

@MarcoGorelli
Copy link
Member

For UTC, it's already not using pytz:

In [2]: pd.Timestamp.now("UTC").tzinfo
Out[2]: datetime.timezone.utc

The rest didn't happen in time, but it can happen for 3.0 (which should come out in a years' time, since releases are expected to be yearly now). TBH I'd be OK with the switch happening in a minor release as well, but we can discuss that

Unfortunately, Pandas is starting to fall behind the rest of the Python ecosystem on this one

Um, thanks for the encouragement, mate? Not sure what you're trying to achieve here @Zac-HD

@mroeschke
Copy link
Member

We are mainly waiting to drop 3.8 sometime in the next few months so that we can jump straight to ZoneInfo without having to take a temporary dependency on backports.zoneinfo

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jun 15, 2023

@jbrockmendel did you already have something in progress for removing pytz?

I'd be all for just removing pytz in the next minor release without warning - it's wrong beyond 2038, so this could just be considered a bug fix:

# The following two don't match
In [33]: datetime(2038, 6, 1, tzinfo=ZoneInfo('Europe/Paris')).astimezone(timezone.utc)
Out[33]: datetime.datetime(2038, 5, 31, 22, 0, tzinfo=datetime.timezone.utc)

In [34]: pd.Timestamp('2038-06-01').tz_localize('Europe/Paris').tz_convert('UTC')
Out[34]: Timestamp('2038-05-31 23:00:00+0000', tz='UTC')

# But for 2037, they do
In [35]: datetime(2037, 6, 1, tzinfo=ZoneInfo('Europe/Paris')).astimezone(timezone.utc)
Out[35]: datetime.datetime(2037, 5, 31, 22, 0, tzinfo=datetime.timezone.utc)

In [36]: pd.Timestamp('2037-06-01').tz_localize('Europe/Paris').tz_convert('UTC')
Out[36]: Timestamp('2037-05-31 22:00:00+0000', tz='UTC')

@jbrockmendel
Copy link
Member Author

did you already have something in progress for removing pytz?

Nope.

I'd be all for just removing pytz in the next minor release without warning - it's wrong beyond 2038, so this could just be considered a bug fix:

Hadn't realized that, but not too surprised. I'd be OK with removing it sooner than planned, but won't lose sleep if we wait until 3.0.

@jorisvandenbossche
Copy link
Member

Paul Ganssle has written a package to help deprecate pytz usage (given that the timezone objects have a different interface, switching to zoneinfo can also have breaking changes for people accessing our .tz object and using it directly): https://pytz-deprecation-shim.readthedocs.io/en/latest/

So another option that we could do on the short term is already start using this shim for pytz when zoneinfo is available, as a way to properly deprecate the usage of pytz. And then this could actually pave the way to dropping pytz as the default (or dropping support altogether), switching to zoneinfo in 3.0.

(this was also brought up in the django discussion mentioned above (https://groups.google.com/g/django-developers/c/PtIyadoC-fI), although they did not end up using it out of the box (only recommending as transitional fallback to use manually), as far as I understand from a quick read of it)

@jbrockmendel
Copy link
Member Author

Is the suggestion to disallow pytz (raise?), detect it and swap in non-pytz, or just not default to it when we get a tz=somestr? The latter seems by far the easiest change on our end.

@mroeschke
Copy link
Member

mroeschke commented Jun 16, 2023

Is the suggestion to disallow pytz (raise?), detect it and swap in non-pytz, or just not default to it when we get a tz=somestr?

I was under the impression it would be the latter; we would still support pytz timezones but just change our default (and pytz would become an optional dependency)

@jorisvandenbossche
Copy link
Member

I think it could make sense to eventually also drop support for pytz entirely, but that certainly doesn't have to happen initially, only changing the default.

Adapting the migration plan that @pganssle initially proposed in the above mentioned django thread for our situation, possible stages could look like:

  1. (already done) Drop the requirement that time zone support is pytz-specific, i.e. make it so that users already have the option to use zoneinfo and datetime.timezone.
  2. Document that users should use zoneinfo, even if it's not the default.
  3. Convert all default uses of pytz to using a shim that supports pytz's interface, but raises warnings whenever something pytz-specific is used.
  4. Remove the shims, making zoneinfo the default for all options, but maintaining support for user-supplied pytz zones.
  5. Deprecate support for user-supplied pytz zones, and eventually remove support for pytz entirely.

Step 1 is already done (#37654, PR #46425), and step 2 is possible but is less convenient in pandas' case (in most cases users either get a timezone automatically from importing some data (and here we have to choose a default tzinfo impl to use), or specify the timezone with a string in something like tz_localize("Europe/Brussels"). Although we could recommend using tz_localize(zoneinfo.ZoneInfo("Europe/Brussels")), that doesn't give that much benefits for most users).

So I think the current questions we have to answer are mostly: do we first want to do step 3 (the shim for deprecating pytz specific features), or do we directly go for step 4 (switching the default)? And in either case, when do we do this? (already in 2.1, or only in 3.0?)

And whether we also want to do step 5 (fully deprecating/removing pytz support) is something we can still decide later. There is actually also another issue about deprecating pytz: #46463


Personally, I don't think that many users will access the .tz attribute on a pandas Series or DatetimeIndex or Timestamp, and use this tzinfo object directly by calling methods on this (this is what would be impacted by switching to zoneinfo, because pytz has some different behaviour and methods compared to stdlib timezone objects), and probably most usage is through (vectorized) operations using pandas functions, which aren't affected.
But still, this is a breaking change, so I personally think we should not do that for 2.1, but only for 3.0. If we leave the actual switch for 3.0, that would leave the room to already use the shim to actually raise warnings so users can update to avoid breakage in 3.0. Now, strictly speaking, starting to use this pytz-shim (on top of zoneinfo) also has some corner cases where this changes behaviour (eg regarding arithmetic (https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html#handling-datetime-arithmetic-normalizing-datetimes), but I think this is probably fine to change in 2.1.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 16, 2023

  1. (already done) Drop the requirement that time zone support is pytz-specific, i.e. make it so that users already have the option to use zoneinfo and datetime.timezone.

One other aspect is also the question if we think our current zoneinfo support is ready to be used by default, as this comes with quite a performance impact:

In [4]: ts = pd.date_range("2012-01-01", freq="T", periods=1_000_000, tz="Europe/Brussels").tz_localize(None)

In [5]: %timeit ts.tz_localize("Europe/Brussels", ambiguous="infer")
288 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit ts.tz_localize(zoneinfo.ZoneInfo("Europe/Brussels"), ambiguous="infer")
5.79 s ± 352 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(for this example, around a 20x slowdown for converting naive to aware timestamps)

I am not familiar enough with our timezone conversion code in cython to have a good idea how easy it would be to improve this. But I assume this slowdown is mostly because with zoneinfo, we actually use zoneinfo to do the conversion? (which means converting each value in a DatetimeArray to a datetime.datetime object, use zoneinfo to get the offset / localize, and convert back to our internal int representation, so even though implemented in cython, this is a for loop with python calls. While with pytz we don't actually use pytz element-by-element, but get the offsets and transitions info from the pytz timezone, and then do the conversions ourselves in an efficient cython loop). @jbrockmendel is that a somewhat correct assumption?

@jbrockmendel
Copy link
Member Author

While with pytz we don't actually use pytz element-by-element, but get the offsets and transitions info from the pytz timezone, and then do the conversions ourselves in an efficient cython loop). @jbrockmendel is that a somewhat correct assumption?

Yes. We do the same with dateutil. zoneinfo intentionally doesn't expose them (and IIUC in the future dateutil won't either), so we'd have to get them from somewhere else if we want to avoid making the python calls.

IIRC (big "if") when implementing zoneinfo support I was surprised at how good the performance was compared to the pytz/dateutil cases (not better, just not as bad as I had expected). I think ambiguous="infer" was implemented later, so this poorly-performing case wouldn't have factored in when I formed that "better perf than I expected" impression.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jun 16, 2023

I was under the impression it would be the latter; we would still support pytz timezones but just change our default (and pytz would become an optional dependency)

maybe temporarily, but longer term (say, 3.0) I'm suggesting removing pytz completely

@jorisvandenbossche
Copy link
Member

I think ambiguous="infer" was implemented later, so this poorly-performing case wouldn't have factored in when I formed that "better perf than I expected" impression.

I see a quite similar factor in performance difference for the default of ambiguous="raise" (after filtering out the nightly hours in the timeseries ts above):

In [5]: ts2 = ts[(ts.hour > 6) & (ts.hour < 24)]

In [6]: %timeit ts2.tz_localize("Europe/Brussels", ambiguous="raise")
196 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit ts2.tz_localize(zoneinfo.ZoneInfo("Europe/Brussels"), ambiguous="raise")
3.86 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@MarcoGorelli
Copy link
Member

thanks for looking into this - that looks like a bit of a blocker, will see what can be done

@pganssle
Copy link
Contributor

FWIW my recommendation for pandas has long been that it actually makes sense to have a de novo implementation of time zone handling that is natively vectorized. It's not that hard to implement, particularly since you can mostly crib my implementations.

The only issue is where to get the data, but you can just take it from the same place(s) that zoneinfo does. Some of that is publicly exposed, but I can try to work on exposing methods for getting the exact resource that zoneinfo would use for any given key.

@jbrockmendel
Copy link
Member Author

The only issue is where to get the data, but you can just take it from the same place(s) that zoneinfo does

I've been hesitant to go down this road because I don't want to be in the business of finding and parsing system timezone files. Is there a path where we can re-use the existing implementation in zoneinfo or something else?

@S0LERA
Copy link

S0LERA commented Jan 17, 2024

Any updates on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Timezones Timezone data dtype
Projects
None yet
9 participants