API: should only Area/Location time-zone-identifiers (other than UTC) be allowed? #53250

Closed
MarcoGorelli opened this issue May 16, 2023 · 9 comments
Labels
API Design Timezones Timezone data dtype

Comments

@MarcoGorelli
Member

Currently, it's possible to set all kinds of time zones, such as:

In [7]: to_datetime(['2020-01-01']).tz_localize('+01:00')
Out[7]: DatetimeIndex(['2020-01-01 00:00:00+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None)

In [10]: to_datetime(['2020-01-01']).tz_localize('CET')
Out[10]: DatetimeIndex(['2020-01-01 00:00:00+01:00'], dtype='datetime64[ns, CET]', freq=None)

In [9]: to_datetime(['2020-01-01']).tz_localize('Cuba')
Out[9]: DatetimeIndex(['2020-01-01 00:00:00-05:00'], dtype='datetime64[ns, Cuba]', freq=None)

In https://en.wikipedia.org/wiki/List_of_tz_database_time_zones, it's recommended that people use an 'Area/Location' time zone identifier instead - e.g. 'Africa/Lagos' instead of the first, 'Europe/Paris' instead of the second, and 'America/Havana' instead of the third.

Passing tz-identifiers that aren't in 'Area/Location' format exposes people to common misconceptions and traps about time zones. For example, even though Greenwich is in London, London is not on GMT all year round: it only observes GMT for about half the year.
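
As a rough illustration of the London/GMT trap, here is a minimal sketch using only the standard library's `zoneinfo` (it assumes a system tz database is available; the variable names are just for illustration):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

london = ZoneInfo("Europe/London")

# Same wall-clock time on a winter day and a summer day in London.
winter = datetime(2020, 1, 1, 12, 0, tzinfo=london)
summer = datetime(2020, 7, 1, 12, 0, tzinfo=london)

winter_offset = winter.utcoffset()  # zero offset: London is on GMT
summer_offset = summer.utcoffset()  # one hour ahead: London is on BST

print(winter.tzname(), winter_offset)
print(summer.tzname(), summer_offset)
```

Someone who labels London data as 'GMT' would be off by an hour for every summer timestamp, whereas 'Europe/London' tracks the change.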

In https://en.wikipedia.org/wiki/List_of_tz_database_time_zones, for every single tz-identifier which isn't in the 'Area/Location' format, there's a link to one which is, suggesting to use that one instead.

Would it be safe to make such a restriction?

cc @mroeschke @jbrockmendel @pganssle @rebecca-palmer (sorry for the pings, would really value your input here if possible!)


This would go hand-in-hand with #50887. What we'd get to in the end would be:

Current behaviour (pandas 2.0.1):

In [11]: to_datetime(['2020-01-01 00:00+01:00'])
Out[11]: DatetimeIndex(['2020-01-01 00:00:00+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None)

In [12]: to_datetime(['2020-01-01 00:00+01:00']).tz_convert('+02:00')
Out[12]: DatetimeIndex(['2020-01-01 01:00:00+02:00'], dtype='datetime64[ns, UTC+02:00]', freq=None)

New behaviour (pandas 3.x):

In [11]: to_datetime(['2020-01-01 00:00+01:00'])
Out[11]: DatetimeIndex(['2019-12-31 23:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

In [12]: to_datetime(['2020-01-01 00:00+01:00']).tz_convert('+02:00')
UnknownTimeZoneError: 'Please use Area/Location time-zone-identifier, see https://en.wikipedia.org/wiki/List_of_tz_database_time_zones'

In [13]: to_datetime(['2020-01-01 00:00+01:00']).tz_convert('Europe/Athens')
Out[13]: DatetimeIndex(['2020-01-01 01:00:00+02:00'], dtype='datetime64[ns, Europe/Athens]', freq=None)
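
For concreteness, the kind of check being proposed could look roughly like the sketch below. `is_area_location` is a hypothetical helper, not an existing pandas function, and it is purely syntactic: it would still let through slash-containing names that are not really locations (e.g. 'Etc/GMT+1'), so a real implementation would likely need more than this.

```python
def is_area_location(tz_name: str) -> bool:
    """Hypothetical check: accept 'UTC' or an 'Area/Location' style name."""
    if tz_name == "UTC":
        return True
    area, sep, location = tz_name.partition("/")
    # Require a slash with non-empty text on both sides of it.
    return bool(sep) and bool(area) and bool(location)


for name in ["Europe/Paris", "America/Havana", "UTC", "+01:00", "CET", "Cuba"]:
    print(name, is_area_location(name))
```
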
@jbrockmendel
Member

The New [11] has UTC instead of UTC+1. Is that intentional? I'm skeptical of that.

Why disallow +02:00? Is that ambiguous?

The relevant code paths just dispatch to pytz.timezone, and in 3.0 will dispatch to zoneinfo.ZoneInfo. This seems benign.

@MarcoGorelli
Member Author

> The New [11] has UTC instead of UTC+1. Is that intentional?

yup - in #50887 (comment) we discussed converting offset-aware strings to UTC and getting rid of the utc argument, so it would be like pyarrow

'+01:00' isn't ambiguous, but it's not really a meaningful time zone - it doesn't handle whether DST is present, or historical changes. People might put '+01:00' because they know that their time zone is '+01:00', but then forget that their country observes DST, whereas if they'd put 'Europe/Paris' then that would've been handled for them
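
A small sketch of that difference, using the standard library rather than pandas (assumes a system tz database is available): a fixed '+01:00' offset and 'Europe/Paris' agree in winter but disagree in summer.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

fixed = timezone(timedelta(hours=1))  # a plain '+01:00' offset, no DST rules
paris = ZoneInfo("Europe/Paris")      # carries DST and historical rules

# One instant in July, expressed in UTC.
instant = datetime(2020, 7, 1, 10, 0, tzinfo=timezone.utc)

fixed_local = instant.astimezone(fixed)  # wall clock under the fixed offset
paris_local = instant.astimezone(paris)  # wall clock actually used in Paris

print(fixed_local.isoformat())
print(paris_local.isoformat())
```

In July the fixed offset gives 11:00 while Paris is at 12:00, because Paris is on UTC+02:00 during DST.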

@jbrockmendel
Copy link
Member

Thanks for clarifying. Should I comment on the +01:00 case on the linked thread to keep discussion here focused?

@MarcoGorelli
Member Author

perhaps let's keep that here, and leave the other thread to the topic of disallowing parsing multiple offsets into an Object Index/Series

@jbrockmendel
Member

OK. So the suggestion is to make to_datetime(['2020-01-01 00:00+01:00']) convert to UTC. Would Timestamp('2020-01-01 00:00+01:00') do the same?

@MarcoGorelli
Member Author

that's right

hmm I wouldn't have thought Timestamp would need to change, as that's just a single element

@mroeschke
Member

I am not sure how I feel about this.

  1. Wouldn't it be more explicit for the user to coerce to UTC if desired instead of doing this on the user's behalf?
  2. The result removes the user's ability to know the local hour/minute of the timestamp, which could be useful information about the input
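
The second point can be seen with plain `datetime` objects (a standard-library sketch, not pandas code): converting an offset-aware value to UTC keeps the instant but drops the original offset.

```python
from datetime import datetime, timezone

parsed = datetime.fromisoformat("2020-01-01 00:00+01:00")
as_utc = parsed.astimezone(timezone.utc)

print(parsed.isoformat())  # 2020-01-01T00:00:00+01:00
print(as_utc.isoformat())  # 2019-12-31T23:00:00+00:00

# Both values denote the same instant, but from as_utc alone there is
# no way to tell that the input was written with a +01:00 offset.
same_instant = parsed == as_utc
print(same_instant)
```
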

I agree that Area/Location timezones are more useful for subsequent features, but anecdotally I feel UTC offset tz lurk more in the wild

@jbrockmendel
Member

In general having the scalar behavior match the non-scalar behavior is pretty important. We'd need a compelling reason to have them differ.

@MarcoGorelli
Member Author

ok sure, makes sense - closing then, thanks for discussing
