to_datetime() throws ValueError: Cannot pass a tz argument when parsing strings with timezone information. #32792

Closed
empz opened this issue Mar 17, 2020 · 8 comments · Fixed by #32984
Labels
Performance (memory or execution speed performance), Timezones (timezone data dtype)
Milestone

Comments

@empz

empz commented Mar 17, 2020

I think pandas should support passing %z in the format together with utc=True. In my opinion, these are two separate things: the format tells pandas how to parse the datetime string, while utc=True just tells it to return the dates in UTC, no matter which timezone they were in originally.

Here's a repl that shows the issue: https://repl.it/@eparizzi/Pandas-todatetime-in-UTC-with-format

If you replace that simple CSV with a large 50K-row time-series CSV, the call to to_datetime without the format takes more than 20 seconds. By contrast, passing the format without utc=True takes less than 2 seconds. Unfortunately, that doesn't work properly when there are multiple timezones in the column: pandas can't set a proper dtype in that case.

So why can't we have a way to specify the format, including the timezone, and also specify that we want everything as datetime64[ns, UTC]?

I've already read through issue #25571, but I still think this deserves a discussion.

import pandas as pd

# I know the format, I want to use it so that Pandas to_datetime() runs faster.
DATETIME_FORMAT = '%m/%d/%Y %H:%M:%S.%f%z'

try:
  data = ['10/11/2018 00:00:00.045-07:00',
  '10/11/2018 01:00:00.045-07:00',
  '10/11/2018 01:00:00.045-08:00',
  '10/11/2018 02:00:00.045-08:00',
  '10/11/2018 04:00:00.045-07:00',
  '10/11/2018 05:00:00.045-07:00']

  df = pd.DataFrame(data, columns=["Timestamp"])

  # This raises "ValueError: Cannot pass a tz argument when parsing strings with timezone information."
  df.Timestamp = pd.to_datetime(df.Timestamp, format=DATETIME_FORMAT, utc=True)

except ValueError as valueError:
  # I don't see why a %z in the format is incompatible with utc=True.
  # The %z tells pandas that it needs to handle timezone offsets;
  # utc=True should then just convert everything to UTC.
  print(f"ERROR: {str(valueError)}")
  print("...why not?")

  # This works, but it's A LOT slower when parsing a lot of rows.
  df.Timestamp = pd.to_datetime(df.Timestamp, infer_datetime_format=True, utc=True)

finally:
  print(df)
  print(df.Timestamp.dtype)

  # Expected output:
  # 
  #                          Timestamp
  # 0 2018-10-11 07:00:00.045000+00:00
  # 1 2018-10-11 08:00:00.045000+00:00
  # 2 2018-10-11 09:00:00.045000+00:00
  # 3 2018-10-11 10:00:00.045000+00:00
  # 4 2018-10-11 11:00:00.045000+00:00
  # 5 2018-10-11 12:00:00.045000+00:00
  # datetime64[ns, UTC]
@mroeschke
Member

Could you provide a minimal, copy pastable example in this issue: https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@empz
Author

empz commented Mar 18, 2020

Could you provide a minimal, copy pastable example in this issue: https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

What do you mean? There's a repl link on the issue.
https://repl.it/@eparizzi/Pandas-todatetime-in-UTC-with-format

@jreback
Contributor

jreback commented Mar 18, 2020

pls copy paste code to the top of the issue as instructed in the template

@empz
Author

empz commented Mar 18, 2020

There you go.

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 18, 2020

Distilled from the link / larger example:

# I know the format, I want to use it so that Pandas to_datetime() runs faster.
DATETIME_FORMAT = '%m/%d/%Y %H:%M:%S.%f%z'

pd.to_datetime(["10/11/2018 00:00:00.045-07:00", "10/11/2018 01:00:00.045-07:00"], 
               format=DATETIME_FORMAT, utc=True)

@eparizzi although the repl site looks really nice (certainly for a bit more complex example that might require an additional dependency or data), we still prefer a short copy-pastable example here when possible. Like the example I put above. That makes it easier to deal with a lot of issue reports.
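
For readers arriving later: this issue is marked as fixed by #32984. The following is a minimal sketch of the same call on mixed offsets, assuming a pandas release that includes that fix; on older releases it raises the ValueError from the title. The variable names `data` and `parsed` are illustrative only.

```python
import pandas as pd

DATETIME_FORMAT = "%m/%d/%Y %H:%M:%S.%f%z"

# Two rows with different UTC offsets, taken from the original report.
data = [
    "10/11/2018 00:00:00.045-07:00",
    "10/11/2018 01:00:00.045-08:00",
]

# Assumption: the installed pandas includes the fix from #32984.
# The format handles the parsing, utc=True converts the result to UTC.
parsed = pd.to_datetime(data, format=DATETIME_FORMAT, utc=True)

print(parsed)
print(parsed.tz)
```

Both values end up in a single timezone-aware index, so the mixed offsets no longer prevent a proper datetime dtype.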

@empz
Author

empz commented Mar 18, 2020

Understood.

I wanted to give more detail because, in the issue I linked, this seems to be treated as expected behavior, not a bug. But I still think we should have a way to parse timezone-aware strings to UTC by passing the format, which is much faster than letting pandas infer it.

@mroeschke
Member

I guess that's a fair case for allowing utc=True with %z in the format string if it's a matter of performance.

The documentation will need to state clearly that the returned timestamps will be localized to the timezone parsed from %z and converted to UTC afterwards.
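
The two-step semantics described here (localize to the parsed offset, then convert to UTC) can be sketched with the standard library alone. This is only an illustration of the intended behavior, not how pandas implements it; `parse_to_utc` is a hypothetical helper, and the sketch assumes Python 3.7 or later, where `%z` accepts offsets written with a colon.

```python
from datetime import datetime, timezone

DATETIME_FORMAT = "%m/%d/%Y %H:%M:%S.%f%z"


def parse_to_utc(text: str) -> datetime:
    """Hypothetical helper: parse with %z, then convert to UTC."""
    # Step 1: strptime attaches the offset parsed by %z as tzinfo.
    localized = datetime.strptime(text, DATETIME_FORMAT)
    # Step 2: astimezone shifts the wall-clock time to UTC.
    return localized.astimezone(timezone.utc)


result = parse_to_utc("10/11/2018 01:00:00.045-08:00")
print(result.isoformat())  # 2018-10-11T09:00:00.045000+00:00
```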

@mroeschke added the Performance and Timezones labels on Mar 18, 2020
@empz
Author

empz commented Mar 18, 2020

I thought that was the case already. At least, that's what I understand from the current docs for the utc argument.

utc : bool, default None
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).

(from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime)
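
As a small illustration of the quoted sentence, here is a sketch of utc=True applied to tz-aware datetime.datetime objects rather than strings. The fixed offsets and the names `aware` and `as_utc` are made up for the example.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Fixed offsets chosen for illustration (UTC-07:00 and UTC-08:00).
minus_seven = timezone(timedelta(hours=-7))
minus_eight = timezone(timedelta(hours=-8))

aware = [
    datetime(2018, 10, 11, 0, 0, tzinfo=minus_seven),
    datetime(2018, 10, 11, 1, 0, tzinfo=minus_eight),
]

# utc=True converts the tz-aware objects to a UTC DatetimeIndex.
as_utc = pd.to_datetime(aware, utc=True)

print(as_utc)
```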
