Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: support zoneinfo tzinfo objects #37654

Closed
jbrockmendel opened this issue Nov 5, 2020 · 22 comments · Fixed by #46425
Closed

ENH: support zoneinfo tzinfo objects #37654

jbrockmendel opened this issue Nov 5, 2020 · 22 comments · Fixed by #46425
Assignees
Labels
Enhancement Timezones Timezone data dtype

Comments

@jbrockmendel
Copy link
Member

ATM we support pytz, dateutil, and datetime.timezone tzinfos. Now that it is part of the stdlib py39, we should also support zoneinfo tzinfos.

@jbrockmendel jbrockmendel added Enhancement Timezones Timezone data dtype labels Nov 5, 2020
@AlexKirko
Copy link
Member

AlexKirko commented Nov 6, 2020

Congratulations to @pganssle for passing the PEP for this. Implementing the support in Timestamp and covering it in tests is going to be a PITA, I think, but it definitely needs to be done.

@pganssle
Copy link
Contributor

pganssle commented Nov 6, 2020

Just pointing out - you shouldn't have to "support" it, and you wouldn't have to do a single thing if y'all had listened to me for the past 5 years when I said you shouldn't be digging into the internals of your time zone providers.

Whatever work is necessary for you to use the public tzinfo interface is worth it and long overdue.

@AlexKirko
Copy link
Member

Yeah, I might be wrong, but I also don't think the current approach of querying the internals is going to work well here. I don't think there is any decent way forward except ditch that and move to only using public interfaces.

@jbrockmendel
Copy link
Member Author

Whatever work is necessary for you to use the public tzinfo interface is worth it and long overdue.

There's a lot of upside to this. The question is if we can do it without a major perf loss.

@pganssle
Copy link
Contributor

pganssle commented Nov 6, 2020

There's a lot of upside to this. The question is if we can do it without a major perf loss.

I guess it doesn't matter? It's harder to maintain what you have now than it would be to re-implement time zone handling from scratch in a performant way (also something I've suggested in the past as a workable solution).

Won't be long before people start complaining about the deprecation warnings from pytz, and I plan to revamp dateutil's internals to be a lot more like zoneinfo's, so you're about to start hitting issues with that route as well. I will not be issuing deprecation warnings for internal refactoring like I did last time. Pinning dateutil would be problematic since the current version doesn't work with "slim" binaries, which are becoming more commonly deployed.

@mroeschke
Copy link
Member

mroeschke commented Nov 6, 2020

It would be nice if we could coincide the long overdue timezone refactor with defaulting to the standard library timezones objects instead of pytz: #34916

@pganssle
Copy link
Contributor

pganssle commented Nov 6, 2020

@mroeschke I would keep those separate. There are backwards-compatibility concerns with removing pytz as the default that aren't present with just adding support for using the public tzinfo interface.

Best to do the refactor first, then drop pytz as the default.

@AlexKirko
Copy link
Member

@jorisvandenbossche said in #34916 that refactoring Timestamp to handle all tzinfo implementations through a public interface would be a breaking change (and, necessarily, a 2.0 thing). Haven't dug deep into the Timestamp implementation in a while, but it's not obvious to me. Could someone comment?

@jbrockmendel
Copy link
Member Author

said in #34916 that refactoring Timestamp to handle all tzinfo implementations through a public interface would be a breaking change (and, necessarily, a 2.0 thing). Haven't dug deep into the Timestamp implementation in a while, but it's not obvious to me. Could someone comment?

The breaking change in #34916 is about what type of tzinfo objects we use when presented with a string like "US/Eastern" or "UTC". ATM we default to pytz; i opended #34916 because i think we should default to stdlib/zoneinfo instead. This issue is about supporting zoneinfo objects, e.g. ATM:

>>> tz = zoneinfo.ZoneInfo("US/Pacific")
>>> ts = pd.Timestamp.now(tz)
>>> ts.tz_localize(None)
Traceback (most recent call last):
  File "pandas/_libs/tslibs/timezones.pyx", line 246, in pandas._libs.tslibs.timezones.get_dst_info
    num = int(get_utcoffset(tz, None).total_seconds()) * 1_000_000_000
AttributeError: 'NoneType' object has no attribute 'total_seconds'
Exception ignored in: 'pandas._libs.tslibs.tzconversion.tz_convert_from_utc_single'
Traceback (most recent call last):
  File "pandas/_libs/tslibs/timezones.pyx", line 246, in pandas._libs.tslibs.timezones.get_dst_info
    num = int(get_utcoffset(tz, None).total_seconds()) * 1_000_000_000
AttributeError: 'NoneType' object has no attribute 'total_seconds'
Timestamp('1970-01-01 00:00:00')

Most or all of the relevant code is going to be in tslibs.timezones and tslibs.tzconversion.

We should be able to support zoneinfo using only the public API, then can measure perf and evaluate options.

@AlexKirko
Copy link
Member

AlexKirko commented Nov 8, 2020

@jbrockmendel
Thanks for the clarification! I'm a bit swamped with a job search at the moment, but I should be able to play around with this a bit in about two or three weeks. If someone else decides to tackle this before then, I'll be happy to help with reviewing.

UPDATE: Timestamp is too tightly coupled with timezone tslibs code to refactor the latter without the former (at least, I think so after a few attempts to do that). I'll try to tackle both at the same time and post a PR if my solution works, but it will probably take a couple weeks. Will post here anyway.

@AlexKirko AlexKirko self-assigned this Feb 8, 2021
@AlexKirko
Copy link
Member

Apologies for the long absence: I had some health issues. Nothing too serious, but productivity was shot to hell for months. Now that I'm feeling better and should fully recover, I'm self-assigning this. Hope to have something for review in two weeks.

@pganssle
Copy link
Contributor

FYI, I suspect that this will become more important as time goes on. I am relatively close to finishing the integration of zoneinfo's logic into dateutil.tz, which will scramble the dateutil implementation details that pandas is relying on. There's a good chance that pytz will adopt zoneinfo under the hood as well, which again will break pandas.

The sooner pandas gets out a version that works with any time zone provider, the less likely it will be that people will have mismatched versions.

@GF-Huang
Copy link

GF-Huang commented Oct 9, 2021

Progress?

image

image

@jreback
Copy link
Contributor

jreback commented Oct 10, 2021

@GF-Huang you or anyone in the community are welcome to contribute to the 3000+ open issues

@pganssle
Copy link
Contributor

@jreback To be fair, this is not exactly a trivial issue for a new contributor to address (I have spent many hours trying and I still don't have anything presentable), and it's also a seriously critical issue.

In the not-too-distant future pandas will break when dateutil 3.0 is released. When pytz adopts zoneinfo, pandas itself will break.

Obviously this is a volunteer project and no one owes anyone patches, but it's worth considering that a big part of the problem with getting this working comes from the undocumented and complicated way that pandas' internal time zone handling logic works. I've tried asking in the gitter for more details about the intended semantics of some of the relevant functions and no one answered (hopefully this doesn't mean that no one knows...).

Of course, I don't know what that translates to, but if there's any way to raise the profile of this issue among the people with the relevant knowledge to fix it, that would be great.

@jreback
Copy link
Contributor

jreback commented Oct 10, 2021

@pganssle i appreciate that's it's non trivial but we have very limited dedicated resources - we need someone to step up to do this

@erfannariman
Copy link
Member

@jreback To be fair, this is not exactly a trivial issue for a new contributor to address (I have spent many hours trying and I still don't have anything presentable), and it's also a seriously critical issue.

In the not-too-distant future pandas will break when dateutil 3.0 is released. When pytz adopts zoneinfo, pandas itself will break.

Obviously this is a volunteer project and no one owes anyone patches, but it's worth considering that a big part of the problem with getting this working comes from the undocumented and complicated way that pandas' internal time zone handling logic works. I've tried asking in the gitter for more details about the intended semantics of some of the relevant functions and no one answered (hopefully this doesn't mean that no one knows...).

Of course, I don't know what that translates to, but if there's any way to raise the profile of this issue among the people with the relevant knowledge to fix it, that would be great.

I think it would be useful as well to post a summary of the issues you ran into in this topic, since not all devs and contributors are on gitter. Maybe someone can pick it up from there or give useful insights to make progress.

@jreback
Copy link
Contributor

jreback commented Oct 10, 2021

cc @pandas-dev/pandas-core

@jbrockmendel
Copy link
Member Author

I'm planning to address this, likely as part of the same push that implements non-nano support. 4 things I need/want to get done first: #43930 (WIP), #41493 (want to get this in for 1.4 so deprecation can be enforced in 2.0; this is proving hard), using cython3.0 (it will include a cython implementation of the stdlib time module that i intend to make use of in the non-nano implementation), securing funding for the non-nano push.

So best case unless someone gets to it before I do is Nov or Dec.

@pganssle
Copy link
Contributor

@jbrockmendel If / when you do get around to it, feel free to send me an e-mail or something, I'd be happy to coordinate on this and share what I've done so far.

@jorisvandenbossche
Copy link
Member

For reference, the questions that @pganssle mentioned he asked on gitter were:

I dunno if anyone's around who understand the time zone code, but I've been trying to put together a patch to prevent pandas from breaking when it encounters a zoneinfo (and from breaking when dateutil and eventually pytz are updated with new implementations), and I'm a bit confused about the various time zone conversions. For example, this function:

def tz_localize_to_utc(ndarray[int64_t] vals, tzinfo tz, object ambiguous=None,

It says that it localizes a "naïve datetime", but it takes a an array of integers.

Is the idea that a "naïve" datetime is represented under the hood as an offset from 1970-01-01T00:00 in its own local time? It's a weird and confusing choice, but treating it as an actual epoch timestamp seems to be giving the wrong answer.

Not fully sure what you mean with "in its own local time" (since it's not know what the local time zone is), but the values for naive timestamps can be interpreted "as if" the timezone was UTC (so indeed an offset from 1970-01-01T00:00), but disregarding the timezone. For example, the naive datetime "January 1st 1970, 00h00" would be encoded as timestamp value 0.

I'm also not entirely sure what context that function is used in.
Ideally I'd refactor the whole thing so that that function no longer exists, but that may be a bit ambitious for my immediate goals.

@jbrockmendel
Copy link
Member Author

def tz_localize_to_utc(ndarray[int64_t] vals, tzinfo tz, object ambiguous=None,

Big picture: this is a vectorized analogue to a pytz tzinfo's .localize.

It says that it localizes a "naïve datetime", but it takes a an array of integers.

This is half "docstrings need improvement" and half "cython doesnt allow ndarray[datetime64]"

Is the idea that a "naïve" datetime is represented under the hood as an offset from 1970-01-01T00:00 in its own local time?

Not exactly. It's treated as an offset from 1970-01-01T00:00 that is also naive, so "its own local time" isn't relevant/meaningful.

I'm also not entirely sure what context that function is used in.

The DatetimeArray/DatetimeIndex tz_localize, which is used to convert from naive to aware.

Ideally I'd refactor the whole thing so that that function no longer exists, but that may be a bit ambitious for my immediate goals.

Yah let's punt on this for now. Longer term, the more logic we can outsource to dateutil/zoneinfo/whatever the better.

dbkhout added a commit to dbkhout/pywaterinfo that referenced this issue Jan 20, 2022
By converting input_datetime (string or datetime object) to a string before applying pandas' to_datetime, compatibility issues between pandas and zoneinfo are circumvented (pandas-dev/pandas#37654).
dbkhout added a commit to dbkhout/pywaterinfo that referenced this issue Jan 20, 2022
By converting input_datetime (string or datetime object) to a string before applying pandas' to_datetime, compatibility issues between pandas and zoneinfo are circumvented (pandas-dev/pandas#37654).

+ fix typos and remove duplicate code
Fix typos

Remove duplicate code
dbkhout added a commit to dbkhout/pywaterinfo that referenced this issue Jan 20, 2022
By converting input_datetime (string or datetime object) to a string before applying pandas' to_datetime, compatibility issues between pandas and zoneinfo are circumvented (pandas-dev/pandas#37654).

+ fix typos and remove duplicate code
dbkhout added a commit to dbkhout/pywaterinfo that referenced this issue Jan 21, 2022
By converting input_datetime (string or datetime object) to a string before applying pandas' to_datetime, compatibility issues between pandas and zoneinfo are circumvented (pandas-dev/pandas#37654).

+ fix typos and remove duplicate code
dbkhout added a commit to dbkhout/pywaterinfo that referenced this issue Jan 21, 2022
By converting input_datetime (string or datetime object) to a string before applying pandas' to_datetime, compatibility issues between pandas and zoneinfo are circumvented (pandas-dev/pandas#37654).

+ fix typos and remove duplicate code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Timezones Timezone data dtype
Projects
None yet
8 participants