tricky timestamp conversion #25571

Closed
randomgambit opened this issue Mar 6, 2019 · 20 comments
Labels
Datetime Datetime data dtype Needs Info Clarification about behavior needed to assess issue

Comments

@randomgambit

Hello there, it's me, the bug hunter, again :)

I have this massive 200-million-row dataset, and I encountered some very annoying behavior. I wonder if this is a bug.

I load my csv using

mylog = pd.read_csv('/mydata.csv',
                    names = ['mydatetime',  'var2', 'var3', 'var4'],
                    dtype = {'mydatetime' : str},
                    skiprows = 1)

and the datetime column really looks like regular (tz-aware) timestamps:

mylog.mydatetime.head()
Out[22]: 
0    2019-03-03T20:58:38.000-0500
1    2019-03-03T20:58:38.000-0500
2    2019-03-03T20:58:38.000-0500
3    2019-03-03T20:58:38.000-0500
4    2019-03-03T20:58:38.000-0500
Name: mydatetime, dtype: object

Now, I take extra care in converting these strings into proper timestamps:

mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'] ,errors = 'coerce', format = '%Y-%m-%dT%H:%M:%S.%f%z', infer_datetime_format = True, cache = True)

That takes a looong time to process, but seems OK. The output is

mylog.mydatetime.head()
Out[23]: 
0    2019-03-03 20:58:38-05:00
1    2019-03-03 20:58:38-05:00
2    2019-03-03 20:58:38-05:00
3    2019-03-03 20:58:38-05:00
4    2019-03-03 20:58:38-05:00
Name: mydatetime, dtype: object

What is puzzling is that so far I thought I had full control of my dtypes. However, running the simple

mylog['myday'] = pd.to_datetime(mylog['mydatetime'].dt.date, errors = 'coerce')

  File "pandas/_libs/tslib.pyx", line 537, in pandas._libs.tslib.array_to_datetime

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

The only way I was able to go past this error was by running

mylog['myday'] = pd.to_datetime(mylog['mydatetime'].apply(lambda x: x.date()))

Is this a bug? Before upgrading to 0.24.1 I was not getting the tz error above. What do you think? I can't share the data, but I am happy to try some things to help you out!

Thanks!

@WillAyd
Member

WillAyd commented Mar 6, 2019

It doesn't look like your first usage of to_datetime is actually casting to datetime. A minimal code sample to reproduce would, as always, be very beneficial.

FYI, you shouldn't need to pass format and infer_datetime_format together

@WillAyd WillAyd added Datetime Datetime data dtype Needs Info Clarification about behavior needed to assess issue labels Mar 6, 2019
@randomgambit
Author

@WillAyd thanks! do you actually mean I should NOT use format and infer_datetime_format together? What might happen is that there are some malformed lines in the data. However, using errors = 'coerce' should protect me from that, so I am still a bit puzzled...

@WillAyd
Member

WillAyd commented Mar 6, 2019

The documentation says infer_datetime_format is only used when format is None, so it should essentially be a no-op here. The errors argument should just be setting invalid input to NaT, which should still give you the desired dtype instead of the object you are getting.

In any case, as mentioned, a minimal reproducible example would be very helpful
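A minimal sketch of the behavior WillAyd describes, on made-up sample rows in the issue's format (the sample strings are invented, not the issue's actual data):

```python
import pandas as pd

# Invented sample: one malformed entry, all offsets identical (-0500)
s = pd.Series(["2019-03-03T20:58:38.000-0500",
               "not a timestamp",
               "2019-03-03T20:58:39.000-0500"])

parsed = pd.to_datetime(s, errors="coerce",
                        format="%Y-%m-%dT%H:%M:%S.%f%z")

# The malformed row becomes NaT, but with a single offset the result
# should still be a tz-aware datetime64 dtype, not object
print(parsed.dtype)
print(parsed.isna().sum())
```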

@randomgambit
Author

@WillAyd yes, but the issue is that obviously I cannot share the large data. Is there any pandas function I can try in order to understand what's going on? Can we identify the possibly bogus timestamps?

thanks!!

@WillAyd
Member

WillAyd commented Mar 6, 2019

Tough to say. Maybe don't coerce errors and see where it raises - could indicate where the issue arises
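One way to locate offending rows without bisecting 200M lines by hand is to coerce and then use the resulting NaT mask to pull out the original strings. A sketch on invented sample data (only the column format is from the issue):

```python
import pandas as pd

# Invented sample with one malformed timestamp
raw = pd.Series(["2019-03-03T20:58:38.000-0500",
                 "2019-03-03T20:5X:38.000-0500",   # malformed minutes field
                 "2019-03-03T20:58:40.000-0500"])

# Coerce, then use the NaT mask to recover the original bad strings
parsed = pd.to_datetime(raw, errors="coerce",
                        format="%Y-%m-%dT%H:%M:%S.%f%z")
bad = raw[parsed.isna()]
print(bad)
```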

@mroeschke
Member

We have some issues where a combination of errors and infer_datetime_format and format can return odd results. #25143 #25512

Nonetheless, @WillAyd's suggestion is good to see which date is not converting correctly and subsequently why that combination of arguments is returning that result.

@randomgambit
Author

thanks guys, trying that shortly and reporting back ASAP :)

@randomgambit
Author

@mroeschke the problem is that I don't see an easy fix to #25143. Maybe just get rid of infer_datetime_format?

@mroeschke
Member

infer_datetime_format would probably be a tough sell to remove as it can provide a performance boost in parsing if all the dates conform to that format.

I imagine the error handling around this argument is not entirely robust.

@randomgambit
Author

randomgambit commented Mar 7, 2019

OK @WillAyd @mroeschke this is getting even weirder. I reloaded my data, this time cleaning the (string) timestamps (I noticed some extra spaces and quotes in them):

mylog['mydatetime'] = mylog['mydatetime'].str.strip().str.replace('"', '')

and I was able to run pd.to_datetime() without the errors argument! Unfortunately, the output is a mere object.

mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'] , format = '%Y-%m-%dT%H:%M:%S.%f%z',cache = True)

mylog.mydatetime.head()
Out[17]: 
0    2019-03-03 20:58:38-05:00
1    2019-03-03 20:58:38-05:00
2    2019-03-03 20:58:38-05:00
3    2019-03-03 20:58:38-05:00
4    2019-03-03 20:58:38-05:00
Name: mydatetime, dtype: object

and using
mylog['myday'] = pd.to_datetime(mylog['mydatetime'].dt.date)
will still return
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

This is very annoying... is there a workaround to make sure the processing is correct? At the end of the day I think this specific ISO format seems to be causing a few errors at the moment...

@randomgambit
Author

the puzzling part is that it is not raising any errors, yet the output is not datetime.

@mroeschke
Member

So I think object dtype might actually be the expected behavior here, if mylog['mydatetime'] contained multiple UTC offsets and the parsing returned multiple fixed timezones (i.e. a single datetime64[ns, tz] dtype cannot represent more than one of these)

How many distinct tz offsets are there in mylog['mydatetime']?

@randomgambit
Author

@mroeschke HA! that is an interesting idea. Normally I should only have one tz offset, but with 200m rows who knows. How can I tabulate the offsets to check that your theory is correct?

@mroeschke
Member

Might be a little slow, but mylog['mydatetime'].apply(lambda x: x.tzinfo).unique()
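A variant of that suggestion which also counts how many rows carry each offset, not just which offsets exist. The sample Timestamps are invented; the two offsets mirror the -05:00/-04:00 pair in the issue:

```python
import pandas as pd

# Invented object-dtype series of tz-aware Timestamps with two offsets
s = pd.Series([pd.Timestamp("2019-03-03 20:58:38-05:00"),
               pd.Timestamp("2019-03-03 21:58:38-05:00"),
               pd.Timestamp("2019-03-10 12:00:00-04:00")])

# Tabulate how many rows carry each UTC offset
counts = s.apply(lambda x: x.strftime("%z")).value_counts()
print(counts)
```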

@randomgambit
Author

Hmm... I get
mylog['mydatetime'].apply(lambda x: x.tzinfo).unique()
Out[20]: array([pytz.FixedOffset(-300), pytz.FixedOffset(-240)], dtype=object)

does that mean there is just one timestamp (as expected)?

@mroeschke
Member

This means that the object dtype is expected. Since your string data contained more than one timezone offset, it's not possible to cast this data to one datetime64[ns, tz] dtype since there are multiple tzs in your data.

See https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets
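A sketch of why pandas falls back to object dtype here, built directly from datetime objects; the two offsets match the FixedOffset(-300) and FixedOffset(-240) found above (the sample values themselves are invented):

```python
from datetime import datetime, timedelta, timezone
import pandas as pd

est = timezone(timedelta(hours=-5))   # matches pytz.FixedOffset(-300)
edt = timezone(timedelta(hours=-4))   # matches pytz.FixedOffset(-240)

s = pd.Series([datetime(2019, 3, 3, 20, 58, 38, tzinfo=est),
               datetime(2019, 3, 10, 12, 0, 0, tzinfo=edt)])

# A datetime64[ns, tz] dtype can hold only one timezone, so with two
# different fixed offsets pandas stores the values as object dtype
print(s.dtype)
```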

@randomgambit
Author

randomgambit commented Mar 8, 2019

@mroeschke we're close but it's not over yet... looking at the link you sent, I was hopeful that utc=True would be the magic solution, but unfortunately I now get another error... I think that was supposed to work according to the docs. What do you think?

Below I reloaded the data so they are strings again.


mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'] , format = '%Y-%m-%dT%H:%M:%S.%f%z',\
     cache = True, utc = True)
ValueError: Cannot pass a tz argument when parsing strings with timezone information.

Thanks!

@mroeschke
Member

When I addressed this timezone parsing, my rationale was that if %z or %Z were passed, the user would want to preserve these timezones, so this error was intentional.

For your use case, if you leave out the format argument and keep utc=True, you should get your dates in UTC with datetime64[ns, UTC] dtype
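A sketch of that suggestion on invented mixed-offset strings in the issue's format:

```python
import pandas as pd

s = pd.Series(["2019-03-03T20:58:38.000-0500",
               "2019-03-10T12:00:00.000-0400"])

# No format argument, utc=True: every row is converted to UTC,
# yielding a single tz-aware datetime64 dtype instead of object
parsed = pd.to_datetime(s, utc=True)
print(parsed.dtype)     # datetime64[ns, UTC]
print(parsed.iloc[0])   # 2019-03-04 01:58:38+00:00
```

Once the column has a proper datetime64 dtype, operations like .dt.date no longer hit the tz-aware ValueError from earlier in the thread.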

@mroeschke
Member

I think the behavior in this issue is expected and is more of a usage question. Closing.

@ryq99

ryq99 commented May 29, 2020

Even though the issue has been closed, just want to point out that if the tz-aware format bothers you and gives you the above error, you can simply convert it to naive format with pd.Series.dt.tz_localize(None), which removes the offset and keeps the wall-clock datetime unchanged. Then everything is good to go.
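A sketch of that workaround on a single-offset sample (the strings are invented; with a single offset the parse yields a tz-aware dtype, which tz_localize(None) then strips):

```python
import pandas as pd

s = pd.Series(["2019-03-03T20:58:38.000-0500",
               "2019-03-03T20:58:39.000-0500"])

aware = pd.to_datetime(s, format="%Y-%m-%dT%H:%M:%S.%f%z")
# tz_localize(None) drops the offset but keeps the wall-clock time
naive = aware.dt.tz_localize(None)
print(naive.dtype)      # datetime64[ns]
print(naive.iloc[0])    # 2019-03-03 20:58:38
```

Note this keeps the local wall-clock time as written in the file; if you need a single consistent timeline across offsets, convert with utc=True first as suggested above.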
