BUG: infer_freq has stateful behavior #55794

Open
3 tasks done
mdeceglie opened this issue Nov 1, 2023 · 2 comments
Labels
Bug · Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@mdeceglie

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime
import pandas as pd
tz = 'America/Santiago'
start_date = datetime.datetime(2018, 8, 10, 0, 0, 0)
end_date = datetime.datetime(2018, 8, 14, 23, 0, 0)
freq = 'H'
times = pd.date_range(start=start_date, end=end_date, freq=freq)
times = times.tz_localize(tz=tz, ambiguous='infer',
                          nonexistent='shift_forward')
print(pd.infer_freq(times[:10]))  # H
pd.infer_freq(times)  # call on the full index; the result is not needed
print(pd.infer_freq(times[:10]))  # None on 2.1.2, expected H

Issue Description

Initially, infer_freq on the first 10 items of the index returns H. After infer_freq has been called on the full index, the same call on the first 10 items returns None. The expected behavior (H both times) was confirmed in version 2.0.3.

Expected Behavior

Return H in both instances of pd.infer_freq(times[:10]) in the example.

Installed Versions

INSTALLED VERSIONS

commit : a60ad39
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Oct 4 23:56:02 PDT 2023; root:xnu-8020.240.18.704.15~1/RELEASE_ARM64_T6000
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.2
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None

@kandersolar
Contributor

Perhaps this was introduced in #51738? A call to times._engine.clear_mapping() seems to fix things:

import pandas as pd

times = pd.date_range(start="2018-08-11 20:00", end="2018-08-12 04:00", freq="H")
times = times.tz_localize(tz="America/Santiago", ambiguous='infer',
                          nonexistent='shift_forward')

print(pd.infer_freq(times[:3]))  # H
pd.infer_freq(times)
print(pd.infer_freq(times[:3]))  # None
times._engine.clear_mapping()
print(pd.infer_freq(times[:3]))  # H

Here is a related example:

times = pd.date_range(start="2018-08-11 20:00", end="2018-08-12 04:00", freq="H")
times = times.tz_localize(tz="America/Santiago", ambiguous='infer', nonexistent='shift_forward')

print(times[:3]._is_unique)  # True
times._is_unique
print(times[:3]._is_unique)  # False
times._engine.clear_mapping()
print(times[:3]._is_unique)  # True

The times DatetimeIndex in these examples contains equivalent (duplicate) times. The tested slice does not, but it incorrectly inherits the cached determination of non-uniqueness from its parent. Perhaps a suitable fix is for slices not to inherit unique and need_unique_check from the parent index?
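
To illustrate why inheriting a cached uniqueness flag is unsound, here is a toy model in plain Python. It is not pandas code: ToyEngine, sliced and inherit_cache are hypothetical names made up for this sketch, and the real engine internals may differ. It only shows that a parent with duplicates can have a duplicate-free slice, so a cached "not unique" result cannot safely be passed down.

```python
class ToyEngine:
    """Hypothetical stand-in for an index engine that caches uniqueness."""

    def __init__(self, values):
        self.values = list(values)
        self._unique = None  # None means "not computed yet"

    @property
    def is_unique(self):
        # Compute lazily and cache, loosely mimicking a cached engine property.
        if self._unique is None:
            self._unique = len(set(self.values)) == len(self.values)
        return self._unique

    def sliced(self, slc, inherit_cache):
        child = ToyEngine(self.values[slc])
        if inherit_cache:
            # Models the suspected bug: the child reuses the parent's answer.
            child._unique = self._unique
        return child


parent = ToyEngine(["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-03"])
parent.is_unique  # populate the parent's cache (False: last value repeats)

buggy = parent.sliced(slice(0, 3), inherit_cache=True)
fixed = parent.sliced(slice(0, 3), inherit_cache=False)

print(buggy.is_unique)  # False, a stale answer copied from the parent
print(fixed.is_unique)  # True, recomputed from the slice's own values
```

In this model the inherit_cache=False path corresponds to the suggestion above of not inheriting the flags at all.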

@kandersolar
Contributor

A slightly more minimal reproducer:

import pandas as pd

# last datetime is a duplicate
times = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-03'])

print(pd.infer_freq(times[:3]))  # D
pd.infer_freq(times)
print(pd.infer_freq(times[:3]))  # None
times._engine.clear_mapping()
print(pd.infer_freq(times[:3]))  # D
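
Until this is fixed, one possible workaround that avoids the private times._engine attribute is to rebuild the slice as a new index before inferring the frequency. The sketch below assumes that an index built from a list of Timestamps gets its own fresh cache. infer_freq_fresh is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def infer_freq_fresh(index):
    """Hypothetical helper: infer the frequency on a rebuilt copy of `index`.

    Building a new DatetimeIndex from the individual Timestamps is assumed
    to give it a cache that is independent of the original parent index.
    """
    fresh = pd.DatetimeIndex(list(index))
    return pd.infer_freq(fresh)


# last datetime is a duplicate
times = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-03'])

pd.infer_freq(times)  # call on the full index first, as in the reproducer
result = infer_freq_fresh(times[:3])
print(result)  # D
```

Going through a Python list is slow for large indexes, so this is only meant for small inputs or for debugging.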
