BUG: With cache, to_datetime() returns pd.NaT for inputs that produce duplicated values #42259

Closed
3 tasks done
zyc09 opened this issue Jun 26, 2021 · 3 comments · Fixed by #42261
Labels: Bug, Datetime (Datetime data dtype), Regression (Functionality that used to work in a prior pandas version)

Comments

@zyc09
Contributor

zyc09 commented Jun 26, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from pandas import NaT, to_datetime, Series, Timestamp
from pandas._testing import assert_series_equal

input_ser = Series([None] + [NaT] * 50 + ["2012 July 26", Timestamp("2012-07-26")], dtype="object")
expected_ser = Series([NaT] * 51 + [Timestamp("2012-07-26"), Timestamp("2012-07-26")], dtype="datetime64[ns]")
result = to_datetime(input_ser)
assert_series_equal(result, expected_ser)
# AssertionError: numpy array are different

# The actual result is [NaT] * 51 + [Timestamp("2012-07-26"), NaT],
# i.e. the assertion below passes:
# assert_series_equal(result, Series([NaT] * 51 + [Timestamp("2012-07-26"), NaT], dtype="datetime64[ns]"))

Problem description

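When the cache is used, to_datetime currently returns NaT for some valid inputs instead of parsing them. This happens when distinct inputs parse to the same value, and it is caused by a flaw in the deduplication step that was added to fix GH#39882 and GH#35888.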
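
One way a deduplication step can go wrong in this manner is sketched below. This is a simplified illustration and not the pandas internals: the names `cache`, `dedup_by_value` and `dedup_by_key` are hypothetical. It assumes a lookup table keyed by the raw input and shows that deduplicating on the parsed values drops a key that is still needed, whereas deduplicating on the keys does not.

```python
import pandas as pd

# Two different raw inputs (taken from the report) that parse to the same datetime.
raw = ["2012 July 26", pd.Timestamp("2012-07-26")]
parsed = [pd.Timestamp("2012-07-26"), pd.Timestamp("2012-07-26")]

# Hypothetical cache: raw input -> parsed value.
cache = pd.Series(parsed, index=pd.Index(raw, dtype=object))

# Deduplicating on the VALUES removes the second key entirely.
dedup_by_value = cache.drop_duplicates()

# Deduplicating on the KEYS keeps one entry per distinct raw input.
dedup_by_key = cache[~cache.index.duplicated()]

lookup = pd.Series(raw, dtype=object)
mapped_by_value = lookup.map(dedup_by_value)  # second input has no cache entry
mapped_by_key = lookup.map(dedup_by_key)      # both inputs are found

print(mapped_by_value.isna().tolist())
print(mapped_by_key.isna().tolist())
```

A key that is missing from the table maps to a missing value, which matches the trailing NaT seen in the code sample.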

Expected Output

to_datetime should parse every value correctly.
E.g. it should return Series([NaT] * 51 + [Timestamp("2012-07-26"), Timestamp("2012-07-26")], dtype="datetime64[ns]") in the example above.
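
A possible workaround on affected versions, assuming the fault is confined to the caching path as the title suggests, is to disable the cache through the `cache` parameter of to_datetime. The sketch below reuses the input from the code sample. Note that disabling the cache can be slower on inputs with many repeated values.

```python
from pandas import NaT, Series, Timestamp, to_datetime

# Same input as in the code sample above.
input_ser = Series(
    [None] + [NaT] * 50 + ["2012 July 26", Timestamp("2012-07-26")],
    dtype="object",
)

# cache=False parses every element directly instead of going through
# a lookup table of unique values.
result = to_datetime(input_ser, cache=False)

# Show the last two elements, which are the ones affected by the bug.
print(result.tail(2))
```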

Output of pd.show_versions()

pandas : 1.4.0.dev0+108.gfa6b96e128
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 57.0.0
Cython : 0.29.23
pytest : 6.2.4
hypothesis : 6.14.0
sphinx : 4.0.2
blosc : 1.10.4
feather : None
xlsxwriter : 1.4.3
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.05.0
fastparquet : 0.6.3
gcsfs : 2021.05.0
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : 2021.05.0
scipy : 1.7.0
sqlalchemy : 1.4.19
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

zyc09 added the Bug and Needs Triage labels Jun 26, 2021
@zyc09
Contributor Author

zyc09 commented Jun 26, 2021

take

jreback added the Datetime label and removed the Needs Triage label Jul 1, 2021
jreback added this to the 1.4 milestone Jul 1, 2021
@simonjayhawkins
Member

The code sample worked on 1.2.5, so relabelling as regression and changing the milestone to 1.3.1.

Will backport #42261 and move the release note.

simonjayhawkins modified the milestones: 1.4, 1.3.1 Jul 22, 2021
simonjayhawkins added the Regression label Jul 22, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 22, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 22, 2021
@simonjayhawkins
Member

relabelling as regression

first bad commit: [54bd5cd] Bug in to_datetime raising ValueError with None and NaT and more than 50 elements (#41006) cc @phofl
