ENH: support for reading and writing datetimes with timezones #253
Conversation
Thanks a lot for looking into this! Added some comments / suggestions
pyogrio/geopandas.py
res = pd.to_datetime(ser, format="ISO8601")
if res.dtype != "object":
    res = res.dt.as_unit("ms")
We might need to try to detect whether there are timezones (e.g. check the first few strings to see if they contain a "+" or "-"), because if so we have to pass utc=True to avoid getting object dtype:
In [20]: pd.to_datetime(["2021/01/02 01:02:03+01:00", "2021/01/02 01:02:03+01:00"], format="ISO8601")
Out[20]: DatetimeIndex(['2021-01-02 01:02:03+01:00', '2021-01-02 01:02:03+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None)
In [21]: pd.to_datetime(["2021/01/02 01:02:03+01:00", "2021/01/02 01:02:03+02:00"], format="ISO8601")
Out[21]: Index([2021-01-02 01:02:03+01:00, 2021-01-02 01:02:03+02:00], dtype='object')
(with longer timeseries, it is very normal to have mixed offsets for the same timezone, i.e. values with and without DST)
That's a good point. I checked what we did in geopandas for the fiona IO and did the same thing here to be consistent: read without utc=True, and if the result is still object dtype, read again with utc=True and see if that makes a difference.
I could also look at checking for "+" and "-" as you suggest; I'd imagine that's faster, but maybe there are some edge cases?
pyogrio/tests/test_geopandas_io.py
)  # TODO this shouldn't be called datetime as string in the pandas layer,
# should it even be accessible?
Unless we have some users asking for it, I wouldn't expose it in read_dataframe (so only in raw.read), just to keep read_dataframe simpler.
pyogrio/geopandas.py
".*parsing datetimes with mixed time zones will raise.*",
FutureWarning,
)
res = pd.to_datetime(ser, **datetime_kwargs)
I think we still might want to add a try/except around this, for when this starts raising an error in a future pandas version (in that case we should also fall back to the utc=True path below).
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Apologies for the very slow review here; I am slowly trying to get up to speed and looking for other alternatives to adding dtype info to the
Thanks @m-richards for all your work on this!
I'll readily admit I don't work with datetimes very much, so the details around that are a bit beyond my experience. However, your approach here seems like a reasonable way of getting around the limitations of datetime handling in numpy vs pandas. Adding dtypes to meta seems reasonable, as does ignoring it in raw.write().
It might be a good idea to add some notes about this to introduction.md, to guide users on when to use datetime_as_string, and otherwise on how the returned values depend on the version of pandas used (if I follow correctly).
pyogrio/geopandas.py
@@ -355,8 +401,24 @@ def write_dataframe(
    # TODO: may need to fill in pd.NA, etc
    field_data = []
    field_mask = []
    # dict[str, np.array(datetime.datetime)] special case for dt-tz fields
    timezone_cols_metadata = {}
- timezone_cols_metadata = {}
+ gdal_tz_offsets = {}
Slight preference to prefix params specifically intended to pass things down to GDAL with gdal_, and since the only thing being stored here is the GDAL offset rather than other metadata, this rename seems a bit clearer (provided I follow correctly).
Yes, you do follow correctly; the new name makes more sense. I think the name I had predates me passing down only the offset.
pyogrio/tests/test_geopandas_io.py
    expected = pd.to_datetime(raw_expected, format="ISO8601").as_unit("ms")
else:
    expected = pd.to_datetime(raw_expected)
expected = pd.Series(expected, name="col")
- expected = pd.Series(expected, name="col")
+ expected = pd.Series(expected, name="datetime_col")
Thanks for the detailed review, I hope I have addressed most of the suggestions now.
Happy to do this, but I just want to double-check the structure; I am not super familiar with the pyogrio documentation. On return types, happy to add this, but I saw there is already somewhat of a discussion of this at
Looking good!
I would personally say it's good to keep this. And the section in "known_issues" about the datetime gotchas is already a nice and quite complete section, I think. Maybe we could just move it to "introduction", as it's now actually no longer a known "issue" with pyogrio, but a general limitation of reading/writing datetimes because of GDAL's data model.
Apologies for the poor direction. Leaving
Thanks for the updates @m-richards! Getting really close. It looks like there are a few uncommitted suggestions, but once those are in this looks good to me.
I think they're all done now; I don't know how I missed them before.
Thanks @m-richards! 🚀
This is an initial proof of concept for reading and writing datetimes with timezones, by reading and writing strings from GDAL and parsing the strings on the pandas side, very directly following the discussion and suggestions in #123.
I'm trying not to affect the numpy read API, as this functionality doesn't make sense at the numpy layer, which doesn't support timezones. There is something missing here: the numpy API should still be able to read datetimes with timezones and convert them correctly to UTC (currently timezones are ignored completely), but I think that can be a separate PR (perhaps it should even come before this one).
There's also probably a fair bit of work needed to handle driver support for this; I noticed issues locally when trying to write geopackages (using WSL with an old version of GDAL). I had a quick look at the tests in fiona, hoping for a simple list of supported formats, but it's quite complex there. I figure that's a secondary issue after the implementation is settled, though.
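On the numpy-layer gap mentioned above: since numpy's datetime64 has no timezone concept, the conversion the numpy API would need is roughly "normalize to UTC, then drop the tz". A minimal pandas sketch of that normalization, using hypothetical sample values:

```python
import pandas as pd

# Mixed-offset input, parsed to a single UTC-aware dtype first.
ser = pd.to_datetime(
    ["2021-01-02 01:02:03+01:00", "2021-07-02 01:02:03+02:00"], utc=True
)

# Drop the timezone after converting, leaving naive UTC values that
# fit numpy's tz-less datetime64 dtype.
naive_utc = ser.tz_convert("UTC").tz_localize(None)
```

The resulting values are wall-clock UTC instants, which is what "convert them correctly to UTC" in the description would mean, as opposed to the current behavior of stripping the offset and keeping the local wall-clock time.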