Skip to content

handle DST appropriately in Timestamp.replace #18618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jan 5, 2018

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Dec 4, 2017

ts = pd.Timestamp('2017-12-03 16:03:24', tz='US/Eastern')
# Timestamp('2017-12-03 16:03:24-0500', tz='US/Eastern')

ts2 = ts.replace(month=6)           # <-- across DST boundary
# master --> Timestamp('2017-06-03 16:03:24-0500', tz='US/Eastern')
# PR     --> Timestamp('2017-06-03 16:03:24-0400', tz='US/Eastern')

ts3 = ts2.tzinfo.normalize(ts2)
# master --> Timestamp('2017-06-03 18:03:24-0400', tz='US/Eastern')
# PR     --> Timestamp('2017-06-03 16:03:24-0400', tz='US/Eastern')

assert ts3 == ts2
assert ts3.hour == ts.hour

Both assertions fail under master, pass under this PR.

The whatsnew entry is not super-informative. Any ideas for a one-sentence description of this fix?

xref #7825 (I think)

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@gfyoung gfyoung added the Timezones Timezone data dtype label Dec 4, 2017
@gfyoung
Copy link
Member

gfyoung commented Dec 4, 2017

closes #18319, closes #7825 (I think)

Any reason why you're not sure about #7825. If you can run the example code without breaking, that would be considered sufficient to close.

@@ -248,4 +248,4 @@ Other
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
- Adding a ``Period`` object to a ``datetime`` or ``Timestamp`` object will now correctly raise a ``TypeError`` (:issue:`17983`)
-
- :func:`Timestamp.replace` will now handle Daylight Savings transitions gracefully (:issue:`18319`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "will no longer crash when handling DST transitions" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crash is a not an appropriate term for user notes.

@codecov
Copy link

codecov bot commented Dec 4, 2017

Codecov Report

Merging #18618 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18618      +/-   ##
==========================================
- Coverage   91.46%   91.45%   -0.02%     
==========================================
  Files         157      157              
  Lines       51449    51449              
==========================================
- Hits        47060    47051       -9     
- Misses       4389     4398       +9
Flag Coverage Δ
#multiple 89.32% <ø> (ø) ⬆️
#single 40.6% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e56195...cfb760c. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 4, 2017

Codecov Report

Merging #18618 into master will decrease coverage by 0.05%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18618      +/-   ##
==========================================
- Coverage   91.57%   91.51%   -0.06%     
==========================================
  Files         150      148       -2     
  Lines       48937    48804     -133     
==========================================
- Hits        44815    44664     -151     
- Misses       4122     4140      +18
Flag Coverage Δ
#multiple 89.89% <ø> (-0.05%) ⬇️
#single 41.63% <ø> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_converter.py 65.22% <0%> (-1.74%) ⬇️
pandas/core/computation/expressions.py 93.27% <0%> (-1.69%) ⬇️
pandas/util/testing.py 84.41% <0%> (-0.54%) ⬇️
pandas/io/pytables.py 92.45% <0%> (-0.4%) ⬇️
pandas/core/indexes/datetimes.py 95.17% <0%> (-0.28%) ⬇️
pandas/core/indexing.py 92.62% <0%> (-0.19%) ⬇️
pandas/tseries/frequencies.py 93.9% <0%> (-0.15%) ⬇️
pandas/core/generic.py 95.9% <0%> (-0.05%) ⬇️
pandas/core/panel.py 96.83% <0%> (-0.02%) ⬇️
pandas/core/indexes/base.py 96.45% <0%> (ø) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2dac793...dced361. Read the comment docs.

@@ -248,4 +248,4 @@ Other
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
- Adding a ``Period`` object to a ``datetime`` or ``Timestamp`` object will now correctly raise a ``TypeError`` (:issue:`17983`)
-
- :func:`Timestamp.replace` will now handle Daylight Savings transitions gracefully (:issue:`18319`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crash is a not an appropriate term for user notes.

@@ -1136,6 +1136,26 @@ def test_timestamp(self):
dt = ts.to_pydatetime()
assert dt.timestamp() == ts.timestamp()

def test_replace(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can't be the right place, we have lots of replace tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in test_timestamp... I'll take another look.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like its in test_timezones.

ts_input = datetime(dts.year, dts.month, dts.day, dts.hour, dts.min,
dts.sec, dts.us, tzinfo=_tzinfo)
if _tzinfo is not None and treat_tz_as_pytz(_tzinfo):
# be careful about DST transition, #18319
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make an informative comments.

why woudn't you always localize?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tzinfo.localize changing the tzinfo object is specific to pytz isn't it?

@jreback jreback added the Datetime Datetime data dtype label Dec 4, 2017
@jreback
Copy link
Contributor

jreback commented Dec 4, 2017

lots of tests in #7825, can you add tests / narrow down what is and is not fixed.

@jbrockmendel
Copy link
Member Author

Any reason why you're not sure about #7825. If you can run the example code without breaking, that would be considered sufficient to close.

@gfyoung as jreback mentioned, that issue covered a lot of ground. The crash there can be ignored, not a pandas issue. The OP there gave an example of Timestamp.replace that was incorrect, but did not give the expected output.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2017

@ischwabacher can you have a look

@@ -1229,6 +1229,26 @@ def f():
dt = Timestamp('2013-11-03 01:59:59.999999-0400', tz='US/Eastern')
assert dt.tz_localize(None) == dt.replace(tzinfo=None)

def test_replace_across_dst(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you tests with dateutil as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is pretty specific to pytz API

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is specific to the pytz API because the problem is specific to the pytz API, so you can't write tests that currently fail with dateutil, but you can, in fact, write this test to be agnostic as to whether the test takes a pytz or dateutil zone.

Rather than using localize and normalize, express the "expected" datetimes as UTC datetimes and use .astimezone. pytz and dateutil support astimezone (and, in fact, pytz.normalize essentially just uses .astimezone under the hood). Since pandas provides their own API layer on top of datetime, it is also agnostic to the timezone provider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using localize and normalize, express the "expected" datetimes as UTC datetimes and use .astimezone

Is this a suggestion for this test or more generally?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a suggestion for this test or more generally?

In this test at least, since with tests you can always engineer your datetime literals such that you never have to use anything except astimezone.

It's also just a useful piece of information to have, that astimezone is one of the few timezone functions where pytz behaves more or less nicely. You only need normalize semantics with pytz for things like replace or "calendar" arithmetic, where you want to change the naive portion of the datetime to a specific value and then change the time zone offset to match. If you want "absolute" arithmetic (where you want to go forward a certain number of hours or seconds or something), then the_operation(dt.astimezone(UTC)).astimezone(dt.tzinfo) will always give the right answer.

In this case you know what the correct answer should be, in absolute time, so you can just declare your initial variable as the correct answer in UTC and convert that to the time zone you care about. If you do it generically like that, you are insulated from any weird quirks of pytz's interface, and you get dateutil support for free (so you can parametrize this test using pytz and dateutil zones).

Of course, this method will fail if you try to construct an imaginary time, since there is no mapping between UTC and imaginary times.

@@ -1229,6 +1229,26 @@ def f():
dt = Timestamp('2013-11-03 01:59:59.999999-0400', tz='US/Eastern')
assert dt.tz_localize(None) == dt.replace(tzinfo=None)

def test_replace_across_dst(self):
# GH#18319
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you provide a 1-liner about what this is testing (I know the title is informative, but a little color)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@jreback
Copy link
Contributor

jreback commented Dec 6, 2017

needs a rebase. also pls add all of the examples from #7825 ; if something doesn't work, then you can open new issue for that part.

@jreback
Copy link
Contributor

jreback commented Dec 6, 2017

further there are a number of timezone issues that this (and/or other fixes) might have resolved. should investigate. anyone want to take this on? @gfyoung @chris-b1 ?

@jbrockmendel
Copy link
Member Author

needs a rebase.

Will do.

also pls add all of the examples from #7825 ; if something doesn't work, then you can open new issue for that part.

#7825 is all over the place (I probably shouldn't have referenced it above). The example in the OP I think is handled, but the poster didn't specify the expected behavior. I think the thing to do is keep that open and ask the OP there to verify if the new behavior closes that issue.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to include some tests on the original issue #7825 if they are not working xfail them.

@@ -320,5 +320,10 @@ Categorical
Other
^^^^^

- Improved error message when attempting to use a Python keyword as an identifier in a numexpr query (:issue:`18221`)
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have rebase issues here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. Not a big deal, but for changes limited to whatsnew is skipping the CI a viable option? The CI turnaround time is close to a full day and that file gets touched a lot.

@jbrockmendel
Copy link
Member Author

I would like to include some tests on the original issue #7825 if they are not working xfail them.

I'll take another look, but based on the last two looks I concluded that this is not actionable until the author there provides the expected output for the example cases.

Simpler to treat #18319 as the "original issue".

@jbrockmendel
Copy link
Member Author

There's a follow-up to this waiting in the wings that fixes a couple of related bugs.

@pganssle
Copy link
Contributor

I think this needs at least two additional tests - 1. a test for when datetime.replace creates an ambiguous time and 2. a test for when datetime.replace creates an imaginary time.

I think that the way it's done right now, pytz and dateutil zones will do different things in these situations.

@jbrockmendel
Copy link
Member Author

I think this needs at least two additional tests - 1. a test for when datetime.replace creates an ambiguous time and 2. a test for when datetime.replace creates an imaginary time.

Good idea, will do.

@jreback
Copy link
Contributor

jreback commented Dec 15, 2017

I bet this will fix #18785 as well.

@jreback
Copy link
Contributor

jreback commented Dec 15, 2017

I added an example for #18785

@jbrockmendel
Copy link
Member Author

I think this needs at least two additional tests - 1. a test for when datetime.replace creates an ambiguous time and 2. a test for when datetime.replace creates an imaginary time.

I think that the way it's done right now, pytz and dateutil zones will do different things in these situations.

I've added these tests and you're right, these do different things. 1AM on 2014-11-05 is ambiguous in US/Eastern:

>>> ts_dateutil = pd.Timestamp('2014-11-05 01:00:00', tz=dateutil.tz.gettz('US/Eastern'))
>>> ts_pytz = pd.Timestamp('2014-11-05 01:00:00', tz=pytz.timezone('US/Eastern'))
>>> ts_dateutil == ts_pytz
True
>>> res_dateutil = ts_dateutil.replace(day=2)
>>> res_dateutil = Timestamp('2014-11-02 01:00:00-0400', tz='dateutil//usr/share/zoneinfo/US/Eastern')
>>> res_pytz = ts_pytz.replace(day=2)
>>> res_pytz
Timestamp('2014-11-02 01:00:00-0500', tz='US/Eastern')

So the pytz version resolves to the later of the two possibilities while the dateutil version resolves to the earlier. It isn't clear to me that there is an unambiguous "right" thing to do in this case, so as far as pandas is concerned just deferring to pytz/dateutil may be the way to go. Thoughts?

As for imaginary times, both pytz and dateutil versions (both in the PR and in master) fail to raise when ts.replace() results in a non-existent time. My preference would be to address this in a separate issue/PR.

@pganssle
Copy link
Contributor

So the pytz version resolves to the later of the two possibilities while the dateutil version resolves to the earlier. It isn't clear to me that there is an unambiguous "right" thing to do in this case, so as far as pandas is concerned just deferring to pytz/dateutil may be the way to go. Thoughts?

There is indeed an unambiguous right way to do things. The Python documentation before 3.6 suggested that ambiguous times should always resolve to the standard time (the later time). As of Python 3.6, unfortunately this is inverted because the default value of fold is 0 and fold=1 is interpreted as the second time. So dateutil is correct in this case, but you are welcome to defer this as "pytz's problem".

As for imaginary times, both pytz and dateutil versions (both in the PR and in master) fail to raise when ts.replace() results in a non-existent time. My preference would be to address this in a separate issue/PR.

No time zone hooks are called on replace or on any constructor. If you want to retain the semantic meaning of replace, then do nothing. If you want replace to always produce a valid date, it's necessary to choose a default behavior.

To me, it seems like this PR is mostly about making pytz act like dateutil, since dateutil implements the pythonic tzinfo interface and pytz uses its own interface. If you think of it this way, then you should basically just be supporting PEP 495 fold for ambiguous times (not difficult to backport when you're already wrapping datetime internally) and allowing replace to create imaginary times (though if you want to enforce real datetimes in a separate PR, or add some mechanism to normalize imaginary datetimes, that's reasonable).

@jbrockmendel jbrockmendel mentioned this pull request Dec 17, 2017
59 tasks

# Preliminary sanity-check
assert ts_aware == ts_aware.tzinfo.normalize(ts_aware)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to parametrize this across dateutil & pytz, so you need to move this to TestTimeZoneSupportPytz and use self.tz(...)

@jreback
Copy link
Contributor

jreback commented Jan 4, 2018

did you cover @pganssle requests for tests?

@jbrockmendel
Copy link
Member Author

did you cover pganssle requests for test?

@pganssle correct me if I'm wrong but I think the consensus was that there are some lower-level changes that might avoid this class of problem longer-term, but it was OK to consider that out of scope for this PR. (Possibly what you had in mind for #18595?).

@pganssle
Copy link
Contributor

pganssle commented Jan 4, 2018

@jreback While #18595 and this issue share the same root cause - namely that pytz uses a different mechanism for assigning time zones than what is expected in the datetime library - the solutions are different. #18595 is just about having a canonical time zone object for a given series, whereas this issue is about the fact that after every operation with a datetime with attached pytz zone, the zone offsets need to be re-normalized.

To put it another way, if you think of pytz as having a "parent" zone (the one you get by calling pytz.timezone) and "child" zones (the ones that are attached to the datetime by .localize and .normalize, #18595 is about having a standard way to retrieve the "parent" zone, while this PR is about retrieving the correct "child" zone for a given datetime.

@pganssle
Copy link
Contributor

pganssle commented Jan 4, 2018

This is a sort of "out there" idea, but there is one (possibly not particularly performant) thing you could do that would unify the interface between pytz and dateutil (and all other zone providers), and also solve #18595, which is to write a wrapper around pytz that just has the standard tzinfo interface. A rough sketch (eliding some of the details like how to handle ambiguous and imaginary times, and how to handle str and repr, etc):

from datetime import tzinfo
from datetime import datetime, timedelta

import functools

import pytz

def _localize(f):
    @functools.wraps(f)
    def inner_func(self, dt):
        return f(self, self._zone.localize(dt.replace(tzinfo=None)))

    return inner_func

class pytz_zonewrapper(tzinfo):
    def __init__(self, pytz_zone):
        self._zone = pytz.timezone(str(pytz_zone))

    @_localize
    def utcoffset(self, dt):
        return dt.utcoffset()

    @_localize
    def tzname(self, dt):
        return dt.tzname()

    @_localize
    def dst(self, dt):
        return dt.dst()

I would say for now go with @jbrockmendel's approach, but I would think that it might be nice to remove all the pytz-specific code cluttering up time zone handling.

Another option might be to just internally start using dateutil zones. I don't think it's guaranteed, but from my testing tz.gettz(str(pytz_zone)) will always give you the appropriate, valid dateutil zone. I believe that pytz is still considerably faster than dateutil in many areas, which I'm hoping to take care of at some point soon but haven't gotten around to yet.

@jreback jreback added this to the 0.23.0 milestone Jan 5, 2018
@jreback jreback merged commit c0e3767 into pandas-dev:master Jan 5, 2018
@jreback
Copy link
Contributor

jreback commented Jan 5, 2018

thanks, I left #7825 open until we resolve whether this closes it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timestamp.replace should handle DST transition gracefully
4 participants