ENH: support for reading and writing datetimes with timezones #253

Merged 55 commits on Oct 20, 2023

Commits (55):
e075091  minimal working pandas layer without timezones (m-richards, May 7, 2023)
3df7936  implement datetime_as_string toggle to get numpy layer working (m-richards, May 7, 2023)
d68b473  make tests pass (m-richards, May 8, 2023)
9aa5a8c  add tests showing existing behaviour no tz (m-richards, May 8, 2023)
1a2af4d  working read (m-richards, May 8, 2023)
fbd2898  commit my test file (m-richards, May 9, 2023)
127d0a7  actually fix tests with read working (m-richards, May 10, 2023)
016778a  good enough wip progress for now (m-richards, May 21, 2023)
faa0631  make these failures easier to read (m-richards, May 21, 2023)
a8c200e  fix for non tz (m-richards, May 21, 2023)
6047375  fix some tests (m-richards, May 22, 2023)
6061563  run pre commit (m-richards, May 22, 2023)
3ba42cf  maybe old pandas, can't reproduce locally (m-richards, May 23, 2023)
d983140  try and find something pandas 1.5 also happy with (m-richards, May 23, 2023)
e9993bd  lint (m-richards, May 23, 2023)
b6ca5cf  simple answer (m-richards, May 23, 2023)
05cc1cf  cleanup (m-richards, May 25, 2023)
a78a76c  wip, use strings to make multi timezones round trip (m-richards, Jun 3, 2023)
b681656  use tmp path fixture (m-richards, Jun 3, 2023)
3426fdc  cleanups (m-richards, Jun 3, 2023)
bb6fd4e  try cleanup datetime parsing (m-richards, Jun 3, 2023)
87419ac  more cleanup, realise we can get dt resolution (m-richards, Jun 3, 2023)
fc78bd9  more careful pandas 1.5 compat (m-richards, Jun 3, 2023)
5fab348  delete line (m-richards, Jun 3, 2023)
26c403a  replace write support with working datetime object solution (m-richards, Aug 8, 2023)
ebdb71b  fixes (m-richards, Aug 8, 2023)
f46e716  rewrite datetime reading to handle mixed offset to utc (m-richards, Aug 8, 2023)
44686f9  fix nat handling for datetime as string (m-richards, Aug 8, 2023)
6b946f5  don't expose datetime_as_string in pandas layer (m-richards, Aug 8, 2023)
ec16ed3  incorrect variable in 1.5.3 compat (m-richards, Aug 8, 2023)
da0639a  CLN: tidy up pandas 2.0 compat (m-richards, Aug 9, 2023)
85a67c2  suggested alternative implementation (m-richards, Sep 24, 2023)
d96d67e  code review suggestion (m-richards, Sep 24, 2023)
3eb70dc  Update pyogrio/tests/test_geopandas_io.py (m-richards, Sep 24, 2023)
c37c1ed  Merge remote-tracking branch 'upstream/main' into matt/timezones_redo (m-richards, Sep 28, 2023)
4064f25  Merge branches 'matt/timezones_redo' and 'matt/timezones_redo' of git… (m-richards, Sep 28, 2023)
3df12c0  time tests and suggestions (m-richards, Sep 28, 2023)
8fd30a5  remove breakpoint (m-richards, Sep 28, 2023)
55293c0  catch warning (m-richards, Sep 30, 2023)
8040c21  really need to fix my local gdal (m-richards, Sep 30, 2023)
fccc8fb  fix fix (m-richards, Sep 30, 2023)
200cc1d  Apply suggestions from code review (m-richards, Sep 30, 2023)
ebfc01c  add suggested exception handling (m-richards, Sep 30, 2023)
c8c186a  move pandas compat to _compat (m-richards, Oct 7, 2023)
95030c0  address review comments (m-richards, Oct 7, 2023)
c5c272b  Merge remote-tracking branch 'upstream/main' into matt/timezones_redo (m-richards, Oct 7, 2023)
086e52e  update known issues (m-richards, Oct 7, 2023)
2b2dd5f  reword (m-richards, Oct 7, 2023)
2167d0f  move documentation (m-richards, Oct 17, 2023)
ab0fbf6  rename field as suggested (m-richards, Oct 17, 2023)
e3f4d6a  Merge remote-tracking branch 'upstream/main' into matt/timezones_redo (m-richards, Oct 17, 2023)
0f02115  final missing gdal tz offset change (m-richards, Oct 17, 2023)
52a922d  Update pyogrio/tests/test_geopandas_io.py (m-richards, Oct 17, 2023)
7c99e51  Apply suggestions from code review (m-richards, Oct 17, 2023)
a5f5f9d  add changelog entry (brendan-ward, Oct 20, 2023)
2 changes: 1 addition & 1 deletion .github/workflows/tests-conda.yml
@@ -66,4 +66,4 @@ jobs:

- name: Test
run: |
pytest -v -r s pyogrio/tests
pytest -v --color=yes -r s pyogrio/tests
85 changes: 59 additions & 26 deletions pyogrio/_io.pyx
@@ -599,7 +599,8 @@ cdef process_fields(
object field_data_view,
object field_indexes,
object field_ogr_types,
encoding
encoding,
bint datetime_as_string
):
cdef int j
cdef int success
@@ -631,7 +632,7 @@
else:
data[i] = np.nan

elif field_type in ( OFTDate, OFTDateTime):
elif field_type in ( OFTDate, OFTDateTime) and not datetime_as_string:
data[i] = np.datetime64('NaT')

else:
@@ -657,22 +658,27 @@
data[i] = bin_value[:ret_length]

elif field_type == OFTDateTime or field_type == OFTDate:
success = OGR_F_GetFieldAsDateTimeEx(
ogr_feature, field_index, &year, &month, &day, &hour, &minute, &fsecond, &timezone)

if datetime_as_string:
# defer datetime parsing to user/ pandas layer
data[i] = get_string(OGR_F_GetFieldAsString(ogr_feature, field_index), encoding=encoding)
else:
success = OGR_F_GetFieldAsDateTimeEx(
ogr_feature, field_index, &year, &month, &day, &hour, &minute, &fsecond, &timezone)

ms, ss = math.modf(fsecond)
second = int(ss)
# fsecond has millisecond accuracy
microsecond = round(ms * 1000) * 1000
ms, ss = math.modf(fsecond)
second = int(ss)
# fsecond has millisecond accuracy
microsecond = round(ms * 1000) * 1000

if not success:
data[i] = np.datetime64('NaT')
if not success:
data[i] = np.datetime64('NaT')

elif field_type == OFTDate:
data[i] = datetime.date(year, month, day).isoformat()
elif field_type == OFTDate:
data[i] = datetime.date(year, month, day).isoformat()

elif field_type == OFTDateTime:
data[i] = datetime.datetime(year, month, day, hour, minute, second, microsecond).isoformat()
elif field_type == OFTDateTime:
data[i] = datetime.datetime(year, month, day, hour, minute, second, microsecond).isoformat()


@cython.boundscheck(False) # Deactivate bounds checking
@@ -685,7 +691,8 @@ cdef get_features(
uint8_t force_2d,
int skip_features,
int num_features,
uint8_t return_fids
uint8_t return_fids,
bint datetime_as_string
):

cdef OGRFeatureH ogr_feature = NULL
@@ -718,7 +725,9 @@

field_data = [
np.empty(shape=(num_features, ),
dtype=fields[field_index,3]) for field_index in range(n_fields)
dtype = ("object" if datetime_as_string and
fields[field_index,3].startswith("datetime") else fields[field_index,3])
) for field_index in range(n_fields)
]

field_data_view = [field_data[field_index][:] for field_index in range(n_fields)]
@@ -758,7 +767,7 @@

process_fields(
ogr_feature, i, n_fields, field_data, field_data_view,
field_indexes, field_ogr_types, encoding
field_indexes, field_ogr_types, encoding, datetime_as_string
)
i += 1
finally:
@@ -788,7 +797,8 @@ cdef get_features_by_fid(
object[:,:] fields,
encoding,
uint8_t read_geometry,
uint8_t force_2d
uint8_t force_2d,
bint datetime_as_string
):

cdef OGRFeatureH ogr_feature = NULL
@@ -811,10 +821,11 @@
n_fields = fields.shape[0]
field_indexes = fields[:,0]
field_ogr_types = fields[:,1]

field_data = [
np.empty(shape=(count, ),
dtype=fields[field_index,3]) for field_index in range(n_fields)
dtype=("object" if datetime_as_string and fields[field_index,3].startswith("datetime")
else fields[field_index,3]))
for field_index in range(n_fields)
]

field_data_view = [field_data[field_index][:] for field_index in range(n_fields)]
@@ -837,7 +848,7 @@

process_fields(
ogr_feature, i, n_fields, field_data, field_data_view,
field_indexes, field_ogr_types, encoding
field_indexes, field_ogr_types, encoding, datetime_as_string
)
finally:
if ogr_feature != NULL:
@@ -939,7 +950,9 @@ def ogr_read(
object fids=None,
str sql=None,
str sql_dialect=None,
int return_fids=False):
int return_fids=False,
bint datetime_as_string=False
):

cdef int err = 0
cdef const char *path_c = NULL
@@ -1022,6 +1035,7 @@ def ogr_read(
encoding,
read_geometry=read_geometry and geometry_type is not None,
force_2d=force_2d,
datetime_as_string=datetime_as_string
)

# bypass reading fids since these should match fids used for read
@@ -1051,13 +1065,15 @@
force_2d=force_2d,
skip_features=skip_features,
num_features=num_features,
return_fids=return_fids
return_fids=return_fids,
datetime_as_string=datetime_as_string
)

meta = {
'crs': crs,
'encoding': encoding,
'fields': fields[:,2], # return only names
'dtypes':fields[:,3],
'geometry_type': geometry_type,
}

@@ -1468,12 +1484,22 @@ cdef infer_field_types(list dtypes):
return field_types


FIFTEEN_MINUTE_DELTA = datetime.timedelta(minutes=15)

cdef int timezone_to_gdal_offset(tz_as_datetime):
"""Convert to GDAL timezone offset representation.

https://gdal.org/development/rfc/rfc56_millisecond_precision.html#core-changes
"""
return tz_as_datetime.utcoffset() / FIFTEEN_MINUTE_DELTA + 100
Member Author commented:

tz_as_datetime.utcoffset() is permitted to return None; perhaps it makes sense to check for that explicitly. I believe that, given the way this is supplied via to_pydatetime(), that shouldn't happen, but there could be edge cases I'm not aware of.
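
A minimal sketch of such an explicit check (hypothetical; per GDAL RFC 56, 0 means unknown timezone, 100 means UTC, and each unit above or below 100 is 15 minutes of offset):

    import datetime

    FIFTEEN_MINUTE_DELTA = datetime.timedelta(minutes=15)

    def timezone_to_gdal_offset(tz_as_datetime):
        """Convert a datetime's UTC offset to GDAL's RFC 56 timezone flag."""
        offset = tz_as_datetime.utcoffset()
        if offset is None:
            # naive datetime: fall back to GDAL's "unknown timezone" flag
            return 0
        # 100 == UTC; one unit per 15 minutes of offset
        return int(offset / FIFTEEN_MINUTE_DELTA) + 100

For example, UTC+02:00 maps to 108 and UTC-05:00 to 80.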


# TODO: set geometry and field data as memory views?
def ogr_write(
str path, str layer, str driver, geometry, fields, field_data, field_mask,
str crs, str geometry_type, str encoding, object dataset_kwargs,
object layer_kwargs, bint promote_to_multi=False, bint nan_as_null=True,
bint append=False, dataset_metadata=None, layer_metadata=None
bint append=False, dataset_metadata=None, layer_metadata=None,
timezone_cols_metadata=None
):
cdef const char *path_c = NULL
cdef const char *layer_c = NULL
@@ -1526,6 +1552,9 @@ def ogr_write(
if not layer:
layer = os.path.splitext(os.path.split(path)[1])[0]

if timezone_cols_metadata is None:
timezone_cols_metadata = {}


# if shapefile, GeoJSON, or FlatGeobuf, always delete first
# for other types, check if we can create layers
@@ -1796,8 +1825,12 @@
if np.isnat(field_value):
OGR_F_SetFieldNull(ogr_feature, field_idx)
else:
# TODO: add support for timezones
datetime = field_value.astype("datetime64[ms]").item()
tz_array = timezone_cols_metadata.get(fields[field_idx], None)
if tz_array is None:
gdal_tz = 0
else:
gdal_tz = timezone_to_gdal_offset(tz_array[i])
OGR_F_SetFieldDateTimeEx(
ogr_feature,
field_idx,
@@ -1807,7 +1840,7 @@
datetime.hour,
datetime.minute,
datetime.second + datetime.microsecond / 10**6,
0
gdal_tz
)

else:
54 changes: 52 additions & 2 deletions pyogrio/geopandas.py
@@ -2,6 +2,16 @@
from pyogrio.raw import DRIVERS_NO_MIXED_SINGLE_MULTI, DRIVERS_NO_MIXED_DIMENSIONS
from pyogrio.raw import detect_driver, read, read_arrow, write
from pyogrio.errors import DataSourceError
from packaging.version import Version


try:
import pandas

PANDAS_GE_20 = Version(pandas.__version__) >= Version("2.0.0")

except ImportError:
PANDAS_GE_20 = None


def _stringify_path(path):
@@ -19,6 +29,26 @@ def _stringify_path(path):
return path


def _try_parse_datetime(ser):
import pandas as pd # only called when pandas is known to be installed

if PANDAS_GE_20:
datetime_kwargs = dict(format="ISO8601", errors="ignore")
else:
datetime_kwargs = dict(yearfirst=True)
res = pd.to_datetime(ser, **datetime_kwargs)
Member commented:

This first attempt will already raise a warning with the latest pandas, so we will need to catch that warning:

FutureWarning: In a future version of pandas, parsing datetimes with mixed time zones will raise a warning unless utc=True. Please specify utc=True to opt in to the new behaviour and silence this warning. To create a Series with mixed offsets and object dtype, please use apply and datetime.datetime.strptime

(I have to check whether the warning is actually wrong about the future behaviour; I would assume the deprecation warns about a future error, in which case we should already start catching that too.)

Member Author replied:

Good catch, I noticed this a while back in the geopandas equivalent test, so we should fix it there too.

Based on the discussion in pandas-dev/pandas#50887 and pandas-dev/pandas#54014, it does seem the warning is supposed to read "will raise an error" rather than "will raise a warning".
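
A hedged sketch of catching that warning in _try_parse_datetime (assumes pandas >= 2.0 for format="ISO8601"; the message filter is inferred from the warning text quoted above):

    import warnings

    import pandas as pd

    def _try_parse_datetime_quiet(ser):
        # hypothetical variant that silences the mixed-timezone FutureWarning
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message=".*parsing datetimes with mixed time zones.*",
                category=FutureWarning,
            )
            res = pd.to_datetime(ser, format="ISO8601", errors="ignore")
        if res.dtype == "object":
            # mixed offsets could not be unified; re-parse as UTC instead
            res = pd.to_datetime(ser, utc=True, format="ISO8601", errors="ignore")
        return res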

# if object dtype, try parse as utc instead
if res.dtype == "object":
res = pd.to_datetime(ser, utc=True, **datetime_kwargs)

if res.dtype != "object":
if PANDAS_GE_20:
res = res.dt.as_unit("ms")
else:
res = res.dt.round(freq="ms")
return res


def read_dataframe(
path_or_buffer,
/,
@@ -146,6 +176,8 @@ def read_dataframe(
path_or_buffer = _stringify_path(path_or_buffer)

read_func = read_arrow if use_arrow else read
if not use_arrow:
kwargs["datetime_as_string"] = True
result = read_func(
path_or_buffer,
layer=layer,
@@ -182,8 +214,10 @@
index = pd.Index(index, name="fid")
else:
index = None

df = pd.DataFrame(data, columns=columns, index=index)
for dtype, c in zip(meta["dtypes"], df.columns):
if dtype.startswith("datetime"):
df[c] = _try_parse_datetime(df[c])

if geometry is None or not read_geometry:
return df
@@ -326,8 +360,23 @@ def write_dataframe(
# TODO: may need to fill in pd.NA, etc
field_data = []
field_mask = []
# dict[str, np.array(datetime.datetime)] special case for dt-tz fields
timezone_cols_metadata = {}
Member commented:
Suggested change:
-    timezone_cols_metadata = {}
+    gdal_tz_offsets = {}

Slight preference to prefix params specifically intended to pass things down to GDAL as gdal_, and since the only thing being stored here is the GDAL offset rather than other metadata, this rename seems a bit more clear (provided I follow correctly).

Member Author replied:

Yes, you follow correctly; the new name makes more sense. I think the old name predates passing down only the offset.

for name in fields:
col = df[name].values
ser = df[name]
col = ser.values
if isinstance(ser.dtype, pd.DatetimeTZDtype):
# Deal with datetimes with timezones by passing down timezone separately
# pass down naive datetime
col = ser.dt.tz_localize(None).values
# pandas only supports a single offset per column
# access via array since we want a numpy array not a series
# (only care about the utc offset, not actually the date)
# but ser.array.timetz won't have valid utc offset for pytz time zones
# (per https://docs.python.org/3/library/datetime.html#datetime.time.utcoffset) # noqa
Member Author commented:

This is probably a bit verbose right now. The basic idea is that I can't find a way to produce a series/numpy array of timezone offsets in a vectorised way through pandas.

The best I could do is ser.array.to_pydatetime(), which uses the cython function ints_to_pydatetime. Note that DatetimeArray is marked as experimental and doesn't actually document any methods on the website, which isn't great. The equivalent Series / DatetimeIndex method is another option, but it throws unavoidable UserWarnings about a change in behaviour.

This seems wasteful in terms of duplication (basically the same datetime information twice), but it seemed better to pass this down and compute the offsets in cython (I haven't profiled it either way).

Member commented:

A potentially faster alternative:

In [20]: arr = pd.date_range("2012-10-25 09:00", periods=5, freq="D", tz="Europe/Brussels")

In [21]: arr.tz_localize(None) - arr.tz_convert("UTC").tz_localize(None)
Out[21]: 
TimedeltaIndex(['0 days 02:00:00', '0 days 02:00:00', '0 days 02:00:00',
                '0 days 01:00:00', '0 days 01:00:00'],
               dtype='timedelta64[ns]', freq=None)

Although it has some more steps, it's vectorized and avoids creating python datetime.datetime objects, and based on a quick test is much faster. And I think it should give the same result?

m-richards (Member Author) replied on Sep 24, 2023:

Thanks, this looks much nicer! We can also do the conversion into the gdal representation directly rather than in cython. I've pushed a commit which should implement this - tested locally and it looks like it works, but my local setup doesn't have the best version of GDAL to test against. I will do some performance comparisons but haven't got to that yet.
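
For reference, a sketch of that idea (hypothetical variable names; computes the offsets vectorised and converts them straight to the RFC 56 representation):

    import pandas as pd

    ser = pd.Series(
        pd.date_range("2012-10-25 09:00", periods=5, freq="D", tz="Europe/Brussels")
    )
    naive = ser.dt.tz_localize(None)                     # wall-clock times
    utc = ser.dt.tz_convert("UTC").dt.tz_localize(None)  # same instants in UTC
    utc_offset = naive - utc                             # timedelta64 per row
    # RFC 56: 100 == UTC, one unit per 15 minutes of offset
    gdal_tz_offsets = (utc_offset // pd.Timedelta(minutes=15) + 100).to_numpy()
    # array([108, 108, 108, 104, 104]) across the DST change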

timezone_cols_metadata[name] = ser.array.to_pydatetime()
else:
col = ser.values
if isinstance(col, pd.api.extensions.ExtensionArray):
from pandas.arrays import IntegerArray, FloatingArray, BooleanArray

@@ -427,5 +476,6 @@ def write_dataframe(
metadata=metadata,
dataset_options=dataset_options,
layer_options=layer_options,
timezone_cols_metadata=timezone_cols_metadata,
**kwargs,
)
9 changes: 9 additions & 0 deletions pyogrio/raw.py
@@ -53,6 +53,7 @@ def read(
sql=None,
sql_dialect=None,
return_fids=False,
datetime_as_string=False,
**kwargs,
):
"""Read OGR data source into numpy arrays.
@@ -108,6 +109,10 @@
number of features using FIDs is also driver specific.
return_fids : bool, optional (default: False)
If True, will return the FIDs of the feature that were read.
datetime_as_string : bool, optional (default: False)
If True, will return datetime dtypes as detected by GDAL as a string
array, instead of a datetime64 array (used to extract timezone info).

**kwargs
Additional driver-specific dataset open options passed to OGR. Invalid
options will trigger a warning.
@@ -150,6 +155,7 @@
sql_dialect=sql_dialect,
return_fids=return_fids,
dataset_kwargs=dataset_kwargs,
datetime_as_string=datetime_as_string,
)
finally:
if buffer is not None:
@@ -385,8 +391,10 @@ def write(
metadata=None,
dataset_options=None,
layer_options=None,
timezone_cols_metadata=None,
**kwargs,
):
kwargs.pop("dtypes", None)
if geometry_type is None:
raise ValueError("geometry_type must be provided")

@@ -471,4 +479,5 @@
layer_metadata=layer_metadata,
dataset_kwargs=dataset_kwargs,
layer_kwargs=layer_kwargs,
timezone_cols_metadata=timezone_cols_metadata,
)
5 changes: 5 additions & 0 deletions pyogrio/tests/conftest.py
@@ -97,3 +97,8 @@ def test_ogr_types_list():
@pytest.fixture(scope="session")
def test_datetime():
return _data_dir / "test_datetime.geojson"


@pytest.fixture(scope="session")
def test_datetime_tz():
return _data_dir / "test_datetime_tz.geojson"
7 changes: 7 additions & 0 deletions pyogrio/tests/fixtures/test_datetime_tz.geojson
@@ -0,0 +1,7 @@
{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "col": "2020-01-01T09:00:00.123-05:00" }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } },
{ "type": "Feature", "properties": { "col": "2020-01-01T10:00:00-05:00" }, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } }
]
}
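
As a quick sanity check, reading this fixture through the pandas layer should now give a tz-aware column (a sketch; the exact dtype depends on the pandas version):

    from pyogrio import read_dataframe

    df = read_dataframe("pyogrio/tests/fixtures/test_datetime_tz.geojson")
    # both rows share the -05:00 offset, so pandas keeps a single tz-aware
    # dtype, e.g. datetime64[ms, UTC-05:00] on pandas >= 2.0
    print(df["col"].dtype)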