
BUG: Do not fail when parsing pydatetime objects in pd.to_datetime #49298


Closed
1 of 3 tasks
sv1990 opened this issue Oct 25, 2022 · 4 comments · Fixed by #49893
Labels
Bug Datetime Datetime data dtype

Comments

@sv1990

sv1990 commented Oct 25, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When reading excel files I often get date columns looking like

import pandas as pd
from datetime import datetime

s = pd.Series(["01/02/01", datetime(2001, 2, 2)])

When trying to parse those columns using

pd.to_datetime(s, format="%d/%m/%y")

the Exception ValueError("time data '2001-02-02 00:00:00' does not match format '%d/%m/%y' (match)") is raised.

The origin of this issue is that the already parsed datetime is converted to a string in isoformat and then an attempt to parse it in the given format is made.
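The failure mode described above can be reproduced with the standard library alone. This is only a sketch of the mechanism, assuming that the text form of the datetime is what reaches the format parser; it does not call any pandas internals.

```python
from datetime import datetime

fmt = "%d/%m/%y"
already_parsed = datetime(2001, 2, 2)

# str() on a datetime gives the ISO-like text that appears in the error message.
as_text = str(already_parsed)

# Parsing that text with the day/month/year format cannot succeed.
try:
    datetime.strptime(as_text, fmt)
    failed = False
except ValueError:
    failed = True
```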

Using pd.to_datetime without a format leads to wrong results since the format is ambiguous.

Feature Description

pd.to_datetime should either get an option to handle datetime.datetime objects differently or do so by default.

Alternative Solutions

The only alternative solution I can currently think of is a raw loop that checks the type of each element individually.
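A minimal sketch of that raw loop, written against the standard library only. The helper name `parse_mixed` is hypothetical and not part of pandas; the resulting list of datetime objects could then be handed to pd.to_datetime or used to build a Series.

```python
from datetime import datetime


def parse_mixed(values, fmt):
    """Parse strings with fmt and pass datetime objects through unchanged."""
    parsed = []
    for value in values:
        if isinstance(value, datetime):
            # Already a datetime: nothing to parse.
            parsed.append(value)
        else:
            parsed.append(datetime.strptime(value, fmt))
    return parsed


values = ["01/02/01", datetime(2001, 2, 2)]
result = parse_mixed(values, "%d/%m/%y")
```

This sketch does not handle missing values such as None or NaN, which real Excel columns often contain.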

Additional Context

No response

@sv1990 sv1990 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 25, 2022
@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Oct 25, 2022
@MarcoGorelli
Member

Hi @sv1990

Thanks for your report. It might make sense to skip datetime objects; investigations and pull requests would be welcome.

Otherwise dayfirst should work here:

>>> pd.to_datetime(s, dayfirst=True)
0   2001-02-01
1   2001-02-02
dtype: datetime64[ns]

@MarcoGorelli MarcoGorelli removed the Needs Triage Issue that has not been reviewed by a pandas team member label Oct 25, 2022
@aaossa
Contributor

aaossa commented Oct 25, 2022

Seems like the exception comes from convert_listlike:

convert_listlike = partial(
    _convert_listlike_datetimes,
    tz=tz,
    unit=unit,
    dayfirst=dayfirst,
    yearfirst=yearfirst,
    errors=errors,
    exact=exact,
    infer_datetime_format=infer_datetime_format,
)

and then from _to_datetime_with_format:

if format is not None:
    res = _to_datetime_with_format(
        arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
    )
    if res is not None:
        return res

which uses _array_strptime_with_fallback as fallback:

# fallback
res = _array_strptime_with_fallback(
    arg, name, tz, fmt, exact, errors, infer_datetime_format
)

, a wrapper around array_strptime:

try:
    result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors)

From there, pandas/_libs/tslibs/strptime.pyx seems the correct place to skip datetime objects (this has not been tested yet) or directly construct an appropriate object again. I'll give it a try.

@aaossa
Contributor

aaossa commented Oct 25, 2022

Ok, this change seems to work:

diff --git a/pandas/_libs/tslibs/strptime.pyx b/pandas/_libs/tslibs/strptime.pyx
index 6287c2fbc5..20452bf75e 100644
--- a/pandas/_libs/tslibs/strptime.pyx
+++ b/pandas/_libs/tslibs/strptime.pyx
@@ -2,6 +2,7 @@
 """
 from cpython.datetime cimport (
     date,
+    datetime,
     tzinfo,
 )

@@ -129,6 +130,19 @@ def array_strptime(ndarray[object] values, str fmt, bint exact=True, errors='rai
             if val in nat_strings:
                 iresult[i] = NPY_NAT
                 continue
+        elif isinstance(val, datetime):
+            dts.year = val.year
+            dts.month = val.month
+            dts.day = val.day
+            dts.hour = val.hour
+            dts.min = val.minute
+            dts.sec = val.second
+            dts.us = val.microsecond
+            dts.ps = 0  # Not enough precision in datetime objects (https://github.com/python/cpython/issues/59648)
+
+            iresult[i] = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
+            result_timezone[i] = val.tzname()
+            continue
         else:
             if checknull_with_nat_and_na(val):
                 iresult[i] = NPY_NAT

I just replicated the processing applied in the same function a couple of lines below:

dts.year = year
dts.month = month
dts.day = day
dts.hour = hour
dts.min = minute
dts.sec = second
dts.us = us
dts.ps = ns * 1000
iresult[i] = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
try:
    check_dts_bounds(&dts)
except ValueError:
    if is_coerce:
        iresult[i] = NPY_NAT
        continue
    raise
result_timezone[i] = timezone

I'll prepare a PR with proper tests, if that's OK, so we can move the discussion about the implementation there.

@MarcoGorelli
Member

MarcoGorelli commented Oct 26, 2022

Thanks for looking into this - if it's the same, can it be factored out into a function?
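As a rough illustration of what such a factored-out helper computes, here is a pure-Python analogue of the field-by-field conversion in the diff. The name `naive_datetime_to_ns` is hypothetical, the sketch assumes timezone-naive input, and it is not the Cython code that pandas uses.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def naive_datetime_to_ns(val):
    """Return nanoseconds since the Unix epoch for a naive datetime."""
    # Integer division of timedeltas avoids float rounding. datetime objects
    # only carry microsecond precision, so the last three digits are zero.
    microseconds = (val - EPOCH) // timedelta(microseconds=1)
    return microseconds * 1000


ns = naive_datetime_to_ns(datetime(2001, 2, 2))
```

Bounds checking and timezone handling, which the Cython code deals with separately, are left out of this sketch.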

@aaossa aaossa moved this to In Progress in OSS contributions Nov 17, 2022
@MarcoGorelli MarcoGorelli changed the title ENH: Do not attempt to parse datetime.datetime objects in pd.to_datetime BUG: Do not attempt to parse datetime.datetime objects in pd.to_datetime Nov 24, 2022
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this issue Nov 24, 2022
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this issue Nov 24, 2022
@MarcoGorelli MarcoGorelli changed the title BUG: Do not attempt to parse datetime.datetime objects in pd.to_datetime BUG: Do not fail when parsing pydatetime objects in pd.to_datetime Nov 24, 2022
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this issue Nov 28, 2022
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this issue Nov 30, 2022
Repository owner moved this from In Progress to Done in OSS contributions Dec 1, 2022