Skip to content

BUG: Passing ambiguous ndarray[datetime64[ns]] to DatetimeIndex constructor can cause ValueError with wrong offset #5152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jtratner opened this issue Oct 8, 2013 · 7 comments · Fixed by #14853
Labels
Bug Frequency DateOffsets

Comments

@jtratner
Copy link
Contributor

jtratner commented Oct 8, 2013

if you give infer_freq 5 consecutive weekdays, it'll come back with 'D' as its inferred frequency. But if your actual frequency is BDay, then, when DatetimeIndex checks that the frequency matches, 'B' != 'D'. (note that verify_integrity=False skips this). This leads to a more general issue about infer_freq with ambiguous cases. I think it makes the most sense to move these sorts of checks to a method on offset that takes a frequence and an Index or ndarray, and determines whether it is compatible.

This matters because you can hit some edge cases when you pass freq and also datetime64[ns] to the DatetimeIndex constructor and more generally because comparing freqstr is probably not the best way to go about checking whether a frequency matches.

Default implementation could be:

def is_compatible(freqstr, arr=None):
    return freqstr == self.freqstr

and then bday could do something like (and this is totally psuedocode)

def is_compatible(freqstr, arr=None):
    if freqstr == self.freqstr: return True
    if arr is not None and len(arr) <= 5:
        if freqstr == 'D': # or other compatibles that ensure it's consecutive
            return all(is_weekday(date) for date in arr)

This gets more complicated with multiplied offsets, but I think it's worth considering.

from datetime import datetime
import pandas as pd
dates = [datetime(2013, 10, 7), datetime(2013, 10, 8), datetime(2013, 10, 9)]
ind = pd.DatetimeIndex(dates, freq=pd.tseries.frequencies.BDay())
ind2 = pd.DatetimeIndex(ind.values, freq=pd.tseries.frequencies.BDay(),
                       verify_integrity=False)

ind3 = pd.DatetimeIndex(ind.values, freq=pd.tseries.frequencies.BDay())

produces this Traceback:

Traceback (most recent call last):
  File "test2.py", line 8, in <module>
    ind3 = pd.DatetimeIndex(ind.values, freq=pd.tseries.frequencies.BDay())
  File "../pandas/tseries/index.py", line 280, in __new__
    raise ValueError('Dates do not conform to passed '
ValueError: Dates do not conform to passed frequency

cc @cancan101 - this is what we need to deal with in adding your offsets. I believe that every other offset can be returned from infer_freq, so these offsets would be different and therefore could never pass integrity checks. So either we'd need to change infer_freq and/or define some kind of is_compatible method that intelligently covers all the ways in which the frequency could be something different than its freqstr.

@cancan101
Copy link
Contributor

Pretty simple test case:

In [88]: infer_freq([datetime(2013, 10, 7), datetime(2013, 10, 8), datetime(2013, 10, 9)])
Out[88]: 'D'

which I agree is incorrect since a frequency of 'BD' cannot be ruled out.

@cancan101
Copy link
Contributor

@wesm ?

@jtratner
Copy link
Contributor Author

jtratner commented Oct 8, 2013

no that's not incorrect - it's reasonable and valid as the freq for the sequence, right? It might be ambiguous, but all that matters is inferring some frequency. So either it should return a list of possibilities, or we pass the buck to offsets to handle the ambiguity.

@cancan101
Copy link
Contributor

Okay. Agreed, "incorrect" is not the right word. That being said, what exactly is the spec for infer_freq? There is not quite a total ordering of frequencies by specificity, but in general, should infer_freq return the most specific or the more general frequency?

@jreback
Copy link
Contributor

jreback commented Oct 8, 2013

@cancan101 it obviously is the most general, it is used quite extensively interally to lazily evaluate frequency when its not already assigned

@wesm
Copy link
Member

wesm commented Oct 16, 2013

Perhaps infer_freq should have an option not to guess when there is any potential for ambiguity. This might be the default behavior, then. I don't think anyone will shed tears if a length-3 (or 5) array case like you describe gets inferred as no frequency with the change.

@jtratner
Copy link
Contributor Author

@wesm or at the very least could skip frequency inference when small.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Frequency DateOffsets
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants