
BUG: Adjust time values with Period objects in Series.dt.end_time #18952


Merged (9 commits), Jul 31, 2018

Conversation

@reidy-p (Contributor) commented Dec 27, 2017

@@ -645,7 +645,8 @@ def start_time(self):

@property
def end_time(self):
return self.to_timestamp(how='end')
data = (self + 1).start_time.values.astype(int) - 1
Contributor

this is reaching into the internals too much
is to_timestamp working? (is it tested?)
the fix should be there

@reidy-p (Contributor Author) commented Dec 29, 2017

Ok I've just pushed a new commit where I tried to make the changes in to_timestamp rather than in end_time.

It seems to work as expected but is causing a few tests to fail because the tests expected the time part to be 00:00:00. For example:

In [1]: pi = pd.PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='M')
In [2]: exp = pd.DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'])

In [3]: pi.astype('datetime64[ns]', how='end')[0]
Out[3]: 2011-01-31 23:59:59.999999999

In [4]: exp[0]
Out[4]: 2011-01-31 00:00:00

Previously these were equal but now [3] has a time of 23:59:59.999999999 and [4] has a time of 00:00:00.
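The adjustment being discussed can be sketched with the public pandas API (an illustrative snippet, not the PR's internal implementation): the end of a period is the start of the next period minus one nanosecond.

```python
import pandas as pd

pi = pd.PeriodIndex(['2011-01', '2011-02', '2011-03'], freq='M')

# end of each period == start of the following period minus one nanosecond
end = (pi + 1).to_timestamp(how='start') - pd.Timedelta(1, 'ns')
```

In pandas versions containing this fix, `pi.end_time` should return the same values directly.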

Period('2016-01-01 00:01:00', freq='M')],
[Period('2016-01-01 00:00:00', freq='S'),
Period('2016-01-01 00:00:01', freq='S')]
])
Contributor

test should be with series tests

@codecov (bot) commented Dec 29, 2017

Codecov Report

Merging #18952 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #18952      +/-   ##
==========================================
+ Coverage   92.07%   92.07%   +<.01%     
==========================================
  Files         170      170              
  Lines       50688    50693       +5     
==========================================
+ Hits        46671    46676       +5     
  Misses       4017     4017
Flag Coverage Δ
#multiple 90.48% <100%> (ø) ⬆️
#single 42.3% <11.11%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/datetimes.py 95.44% <ø> (-0.03%) ⬇️
pandas/tseries/offsets.py 97.15% <100%> (ø) ⬆️
pandas/core/indexes/period.py 93.5% <100%> (+0.1%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c272c52...a6716ca.

@@ -294,7 +294,7 @@ Conversion
- Bug in :meth:`DatetimeIndex.astype` when converting between timezone aware dtypes, and converting from timezone aware to naive (:issue:`18951`)
- Bug in :class:`FY5253` where ``datetime`` addition and subtraction incremented incorrectly for dates on the year-end but not normalized to midnight (:issue:`18854`)
- Bug in :class:`DatetimeIndex` where adding or subtracting an array-like of ``DateOffset`` objects either raised (``np.array``, ``pd.Index``) or broadcast incorrectly (``pd.Series``) (:issue:`18849`)

- Bug in :func:`Series.dt.end_time` where time values in ``Period`` objects were not adjusted (:issue:`17157`)
Contributor

and PeriodIndex.end_time

Contributor

we will need a sub-section showing this I think.

Contributor Author

And if we're making the change in to_timestamp(how='end') instead of in dt.end_time (which then calls to_timestamp(how='end')), then I will also include to_timestamp(how='end') in the sub-section.

new_data._values[indexer] += 1
new_data = period.periodarr_to_dt64arr(new_data._values, base)
# subtract one nanosecond
new_data[indexer] -= 1
Contributor

I think you can just construct the DTI, then subtract pd.Timedelta(1, 'ns')

note that the display is not correct here

In [23]: p = pd.Period('20170101', 'D')

In [24]: p
Out[24]: Period('2017-01-01', 'D')

In [25]: pi = pd.Index([p])

In [26]: pi
Out[26]: PeriodIndex(['2017-01-01'], dtype='period[D]', freq='D')

In [27]: (pi + 1).to_timestamp(how='start') - pd.Timedelta(1, 'ns')
Out[27]: DatetimeIndex(['2017-01-01'], dtype='datetime64[ns]', freq=None)

In [28]: ((pi + 1).to_timestamp(how='start') - pd.Timedelta(1, 'ns')).values
Out[28]: array(['2017-01-01T23:59:59.999999999'], dtype='datetime64[ns]')

Contributor Author

So should the DatetimeIndex show the time value like below when to_timestamp(how='end') is called?

In [3]: pi.to_timestamp(how='end')
Out[3]: DatetimeIndex(['2017-01-01 23:59:59.999999999'], dtype='datetime64[ns]', freq=None)

Contributor

yes, this is a separate bug I think.


@reidy-p (Contributor Author) commented Jan 6, 2018

Ok I made some updates.

There are still a couple of tests failing that I need to spend some more time on. Some tests in tests/test_resample.py are failing because of problems with asfreq('B', 'ffill') or asfreq('M', 'ffill') but other offset aliases seem to be ok. For example, result and expected are no longer equal:

In [2]: ts = pd.Series([1.1, 1.2, 1.3, 1.4, 1.5], index=pd.period_range('1/1/1990', freq='W-MON', periods=5))
In [3]: result = ts.resample('B', convention='end').ffill()
In [4]: result
Out[4]:
1990-01-01    1.1
1990-01-02    1.1
1990-01-03    1.1
1990-01-04    1.1
1990-01-05    1.1
1990-01-08    1.2
1990-01-09    1.2
1990-01-10    1.2
1990-01-11    1.2
1990-01-12    1.2
1990-01-15    1.3
1990-01-16    1.3
1990-01-17    1.3
1990-01-18    1.3
1990-01-19    1.3
1990-01-22    1.4
1990-01-23    1.4
1990-01-24    1.4
1990-01-25    1.4
1990-01-26    1.4
1990-01-29    1.5
Freq: B, dtype: float64

In [5]: expected = result.to_timestamp('B', how='end').asfreq('B', 'ffill').to_period()
In [6]: expected
Out[6]:
/Users/paul/Desktop/pandas-reidy-p/pandas/core/indexes/datetimes.py:581: 
UserWarning: Discarding nonzero nanoseconds in conversion
  index = _generate_regular_range(start, end, periods, offset)

1990-01-01    NaN
1990-01-02    1.1
1990-01-03    1.1
1990-01-04    1.1
1990-01-05    1.1
1990-01-08    1.1
1990-01-09    1.2
1990-01-10    1.2
1990-01-11    1.2
1990-01-12    1.2
1990-01-15    1.2
1990-01-16    1.3
1990-01-17    1.3
1990-01-18    1.3
1990-01-19    1.3
1990-01-22    1.3
1990-01-23    1.4
1990-01-24    1.4
1990-01-25    1.4
1990-01-26    1.4
1990-01-29    1.4
Freq: B, dtype: float64

I think the reason the two results above are not equal is that asfreq creates a new DatetimeIndex which then later calls to_pydatetime() to convert the Timestamps to native Python datetime objects. This leads to a loss of nanosecond precision as shown in the warning. I don't think this loss of precision occurs for resample.
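The precision loss described here can be reproduced directly on a single Timestamp (a minimal sketch; the value is arbitrary):

```python
import pandas as pd

ts = pd.Timestamp('1990-01-08 23:59:59.999999999')

# datetime.datetime only has microsecond resolution, so the trailing
# 999 nanoseconds are dropped; pandas emits a UserWarning here
dt = ts.to_pydatetime()
```

Converting the resulting datetime back into a Timestamp no longer compares equal to the original, which is exactly the kind of off-by-one-period drift seen in the asfreq output above.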

@reidy-p reidy-p force-pushed the period_end_time branch 2 times, most recently from b5d9981 to ee4d1a9 Compare January 6, 2018 14:52
@pep8speaks (bot) commented Mar 12, 2018

Hello @reidy-p! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 30, 2018 at 20:08 Hours UTC

@reidy-p reidy-p force-pushed the period_end_time branch 2 times, most recently from d33c48d to c399eae Compare March 12, 2018 23:25
@@ -674,6 +674,15 @@ def to_timestamp(self, freq=None, how='start'):
"""
how = _validate_end_alias(how)

end = how == 'E'
if end:
if freq == 'B':
Contributor Author

For some reason freq='B' had to be handled differently because it was causing a lot of resample tests to fail when handled in the same way as all the other freq values. Not sure if this is a bug or if there is a better way of handling this.

@reidy-p reidy-p force-pushed the period_end_time branch 2 times, most recently from e5836b3 to 0ddcac0 Compare March 13, 2018 22:14
.. _whatsnew_0230.api_breaking.end_time:

Time values in ``dt.end_time`` and ``to_timestamp(how='end')``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contributor

needs to be the same length as the text

Time values in ``dt.end_time`` and ``to_timestamp(how='end')``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The time values in ``Period`` and ``PeriodIndex`` objects are now adjusted
Contributor

add :class: for these

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The time values in ``Period`` and ``PeriodIndex`` objects are now adjusted
appropriately when calling :attr:`Series.dt.end_time`, :attr:`Period.end_time`,
Contributor

can you say what was adjusted? IOW in words what a user would want to know about this.


In [2]: p = pd.Period('2017-01-01', 'D')
In [3]: pi = pd.PeriodIndex([p])

Contributor

I don't think you need to show all of these, maybe [6] and [5] are enough (e.g. a scalar and array one), same for below

Contributor

can you address this

@@ -1195,6 +1196,10 @@ cdef class _Period(object):
freq = self._maybe_convert_freq(freq)
how = _validate_end_alias(how)

Contributor

do we have an assert on how? (e.g. it has to be S/E)? if not can you add one

Contributor Author

This is done in _validate_end_alias I think:

def _validate_end_alias(how):
    how_dict = {'S': 'S', 'E': 'E',
                'START': 'S', 'FINISH': 'E',
                'BEGIN': 'S', 'END': 'E'}
    how = how_dict.get(str(how).upper())
    if how not in set(['S', 'E']):
        raise ValueError('How must be one of S or E')
    return how
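As a standalone check of that alias mapping (a copy of the helper quoted above, renamed to mark it as illustrative rather than the pandas-internal function):

```python
def validate_end_alias(how):
    # map the long-form aliases onto the two canonical codes
    how_dict = {'S': 'S', 'E': 'E',
                'START': 'S', 'FINISH': 'E',
                'BEGIN': 'S', 'END': 'E'}
    how = how_dict.get(str(how).upper())
    if how not in {'S', 'E'}:
        raise ValueError('How must be one of S or E')
    return how

validate_end_alias('end')    # returns 'E'
validate_end_alias('Begin')  # returns 'S'
```

Anything not in the mapping (including None) falls through to the ValueError, so the how argument is effectively asserted on every call.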

@@ -366,6 +366,17 @@ def test_periods_number_check(self):
with pytest.raises(ValueError):
period_range('2011-1-1', '2012-1-1', 'B')

def test_start_time(self):
Contributor

can you add the issue number reference

periods=periods, offset=offset)

dates = list(xdr)
# utc = len(dates) > 0 and dates[0].tzinfo is not None
data = tools.to_datetime(dates)

# Add back in the lost nanoseconds
if isinstance(start, Timestamp) and isinstance(end, Timestamp):
Contributor

seems suspect that you need to do this here, what is the reason?

@reidy-p (Contributor Author) commented Mar 17, 2018

The reason for this was that changing to_timestamp(how='end') to return a time of 23:59:59.999999999 caused problems with asfreq, because asfreq creates a new DatetimeIndex which then later calls to_pydatetime() to convert the Timestamps to native Python datetime objects. This leads to a loss of nanosecond precision.

I think there might be an easier way to fix this and I have pushed a new commit to see if it works.

if freq == 'B':
adjust = Timedelta(1, 'D') - Timedelta(1, 'ns')
return self.to_timestamp(how='start') + adjust
else:
Contributor

hmm, the B branch looks suspect; I would think that the bottom expression is correct: ``(self + 1)``....

Member

Just speculating: is the issue with e.g. Period('2018-03-16', freq='B').end_time coming back as Timestamp('2018-03-18 23:59:59.999999999')? That's going to need special handling, yah.

Contributor Author

The problem is that there are a couple of tests in tests/test_resample.py that use freq='B' that fail. For example, result and expected are different here:

In [2]: rng = pd.period_range('1/1/1990', '12/31/1995', freq='W-FRI')
In [3]: ts = pd.Series(np.random.randn(len(rng)), index=rng) 
In [4]: result = ts.resample('B', convention='end').ffill()
In [5]: result.head()
1990-01-05    0.248627                                                     
1990-01-08    0.248627 
1990-01-09    0.248627                                                                         
1990-01-10    0.248627  
1990-01-11    0.248627            
Freq: B, dtype: float64    

In [6]: expected = result.to_timestamp('B', how='end')
In [7]: expected = expected.asfreq('B', 'ffill').to_period()
In [8]: expected.head()
Out[8]:
1990-01-08    0.248627 
1990-01-09    0.248627  
1990-01-10    0.248627                       
1990-01-11    0.248627                          
1990-01-12    0.248627                                  
Freq: B, dtype: float64   

But when I use:

if freq == 'B':
    adjust = Timedelta(1, 'D') - Timedelta(1, 'ns')
    return self.to_timestamp(how='start') + adjust
else:                                                    
    adjust = Timedelta(1, 'ns')                               
    return (self + 1).to_timestamp(how='start') - adjust   

it fixes the problem but I'm not sure why.

Contributor

ok, this works because we need to rollforward (the + adjust) to make sure we land on a B date.

can you add a nice comment here (about what is going on / why)

Contributor

actually see my comment below. this might work if you add an offset (a Day), which handles the freq move, rather than a Timedelta.

Contributor Author

Do you mean that I should try replacing Timedelta(1, 'D') with Day() to get:

if freq == 'B':
    adjust = Day() - Timedelta(1, 'ns')
    return self.to_timestamp(how='start') + adjust
else:                                                    
    adjust = Timedelta(1, 'ns')                               
    return (self + 1).to_timestamp(how='start') - adjust

If so, it doesn't seem to affect anything (all tests pass using either Timedelta(1, 'D') or Day()).

Another option I considered was to try to condense the if/else in the above code into one case:

adjust = frequencies.to_offset(freq) - Timedelta(1, 'ns')                               
return self.to_timestamp(how='start') - adjust

but it didn't work because some offsets don't support subtraction of a Timedelta, I think.
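The reason B needs its own branch can be seen by comparing the two adjustments on a single business-day period (a sketch using the public API; the date is chosen for illustration):

```python
import pandas as pd

p = pd.Period('2018-03-16', freq='B')  # a Friday

# general rule: start of the next period minus one nanosecond.
# the next B period starts Monday 2018-03-19, so this lands on a Sunday
general = (p + 1).to_timestamp(how='start') - pd.Timedelta(1, 'ns')

# B-specific rule: start of the same period plus (one day minus one nanosecond),
# which stays on the Friday itself
b_rule = p.to_timestamp(how='start') + (pd.Timedelta(1, 'D') - pd.Timedelta(1, 'ns'))
```

general falls on 2018-03-18, a non-business day, while b_rule keeps the end timestamp inside the Friday itself, which is consistent with the rollforward explanation in the thread.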

new_data = period.periodarr_to_dt64arr(new_data._ndarray_values, base)
end = how == 'E'
if end:
indexer = np.where(new_data.notnull())
Contributor

this is just ~self._isnan

indexer = np.where(new_data.notnull())
# move forward one period
new_data._values[indexer] += 1
ndarray_vals = new_data._ndarray_values
Contributor

I think this should be pushed to periodarr_to_dt64arr, IOW it should take a how=S/E arg

@jreback (Contributor) commented Mar 13, 2018

cc @jbrockmendel if you'd have a look esp at the B arithmetic ops

@reidy-p reidy-p force-pushed the period_end_time branch 10 times, most recently from 59aaccd to f9e9fbd Compare March 17, 2018 22:53
@@ -535,6 +535,44 @@ Returning a ``Series`` allows one to control the exact return structure and colu

.. _whatsnew_0230.api_breaking.build_changes:

Contributor

the ref above needs to move




@@ -1385,7 +1385,7 @@ def _end_apply_index(self, dtindex):
roll = self.n

base = (base_period + roll).to_timestamp(how='end')
return base + off
return base + off + Timedelta(1, 'ns') - Timedelta(1, 'D')
Contributor

hmm, might be better to actually add an offset here (e.g. a Day) which handles freq moves.

Member

Day() and Timedelta(days=1) should be equivalent here, no? I'd rather use Timedelta, since Day is a less well-known beast.

Are there scenarios where a DST transition is traversed?

Contributor

I am thinking this will correctly handle the B freqs. It IS possible DST is traversed, but we prob don't have tests for this (though maybe)


@reidy-p reidy-p force-pushed the period_end_time branch 2 times, most recently from 9899a46 to 7e491a7 Compare July 14, 2018 21:33
@@ -1235,11 +1235,9 @@ def _generate_regular_range(cls, start, end, periods, freq):
tz = None
if isinstance(start, Timestamp):
tz = start.tz
start = start.to_pydatetime()
Contributor Author

This conversion via to_pydatetime seemed to be causing several tests in tests/test_resample.py to fail because nanosecond precision was lost. When I removed it, the problem was fixed and no other tests seemed to break locally.
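With the conversion removed, nanosecond components survive range generation; a quick check (a sketch that assumes the post-fix behavior of date_range):

```python
import pandas as pd

# a Monday start carrying a nanosecond component
start = pd.Timestamp('1990-01-01 23:59:59.999999999')
idx = pd.date_range(start, periods=3, freq='B')

# every generated business-day timestamp keeps the .999999999 fraction
```

Before the fix, the round trip through to_pydatetime inside _generate_regular_range silently truncated the trailing 999 nanoseconds.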

@jreback (Contributor) left a comment

lgtm. minor doc comment. @jbrockmendel hows this look?


.. ipython:: python

p = pd.Period('2017-01-01', 'D')
Contributor

maybe add a comment or two on the current behavior and what is changing (above the line where you are executing the code)

@jreback jreback added this to the 0.24.0 milestone Jul 29, 2018
@jbrockmendel (Member)

LGTM

@jreback jreback merged commit f76a3f3 into pandas-dev:master Jul 31, 2018
@jreback (Contributor) commented Jul 31, 2018

thanks @reidy-p nice patch!


Successfully merging this pull request may close these issues.

BUG: pd.Series.dt.end_time when values are pd.Period objects are producing different results
5 participants