Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrowStream interface: support for timezones #8460

Closed
jorisvandenbossche opened this issue Sep 23, 2023 · 10 comments · Fixed by #8461
Closed

ArrowStream interface: support for timezones #8460

jorisvandenbossche opened this issue Sep 23, 2023 · 10 comments · Fixed by #8461
Assignees

Comments

@jorisvandenbossche
Copy link
Contributor

jorisvandenbossche commented Sep 23, 2023

Expected behavior and actual behavior.

Assume a geojson file test_datetime_tz.geojson with content:

{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "col": "2020-01-01T09:00:00-05:00" }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } },
{ "type": "Feature", "properties": { "col": "2020-01-01T10:00:00-05:00" }, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } }
]
}

and reading this with the ArrowStream interface:

>>> from osgeo import ogr
>>> ds = ogr.Open("test_datetime_tz.geojson")
>>> lyr = ds.GetLayer(0)
>>> stream = lyr.GetArrowStreamAsPyArrow()
>>> list(stream)[0]
<pyarrow.lib.StructArray object at 0x7fedaa11dc00>
-- is_valid: all not null
-- child 0 type: int64
  [
    0,
    1
  ]
-- child 1 type: timestamp[ms]
  [
    2020-01-01 09:00:00.000,
    2020-01-01 10:00:00.000
  ]
-- child 2 type: binary
  [
    0101000000000000000000F03F000000000000F03F,
    010100000000000000000000400000000000000040
  ]

We get tz-naive timestamp values in the "local" time (wall time), not the UTC-equivalent values (after applying the offset). While the "traditional" field based access with OGR_F_GetFieldAsDateTimeEx will give you timezone information.

It's not necessarily clear what would be the best behaviour here. GDAL doesn't actually store or know the time "zone" (like Europe/Brussels), only an offset per value (which is not necessarily constant for the full column). And thus it cannot return an Arrow timestamp type with a zone as tz.
It could return a timestamp with a fixed offset (but that's only possible if all offsets are the same, which you only know when parsing the last value, and so this would require a first pass to check this), or return timestamp[ms, tz=UTC] type (this doesn't preserve the wall time, but it does preserve the actual point in time). Although I can imagine that it depends on the use case whether you would prefer getting wall-time tz-naive timestamps or tz-aware UTC timestamps.

GDAL version and provenance

GDAL 3.7.2 installed with conda forge

@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Sep 23, 2023

Probably not directly related, but I also noticed that the equivalent GPKG file (oogr2ogr test_datetime_tz.gpkg test_datetime_tz.geojson) gives a warning when reading with Arrow:

Warning 1: Non-conformant content for record 1 in column col, 2020-01-01T09:00:00.000-05:00, successfully parsed

@jratike80
Copy link
Collaborator

jratike80 commented Sep 23, 2023

Datetime in GeoPackage must be in UTC by the standard, I guess that's the reason for the warning.

ISO-8601 date/time string in the form YYYY-MM-DDTHH:MM:SS.SSSZ with T separator character and Z suffix for coordinated universal time (UTC) encoded in either UTF-8 or UTF-16. See TEXT.

Have a look at the DATETIME_FORMAT in https://gdal.org/drivers/vector/gpkg.html#dataset-creation-options. Could there be something re-usable?

@jorisvandenbossche
Copy link
Contributor Author

OK, that's interesting. But in this case, it's GDAL itself that created the file, so if the spec requires UTC, shouldn't GDAL write that? (by converting any other offset to UTC) Or is it expected that the user knows to convert first to UTC before writing to GPKG?

@rouault
Copy link
Member

rouault commented Sep 23, 2023

But in this case, it's GDAL itself that created the file, so if the spec requires UTC, shouldn't GDAL write that? (by converting any other offset to UTC) Or is it expected that the user knows to convert first to UTC before writing to GPKG?

GDAL kind of uses an informal extension to write non-UTC date times, if the user provides them as such. If a user wants to produce a fully compliant file, he can pass DATETIME_FORMAT=UTC as a creation option and GDAL will do the conversion. Anyway that's not directly related to that issue, which I'm working on

@rouault rouault self-assigned this Sep 23, 2023
@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Sep 23, 2023

@jratike80 I see your added second paragraph now. And indeed, the DATETIME_FORMAT param clearly mentions that it defaults to writing datetimes with timezone, while that is not spec-compliant. And you can use DATETIME_FORMAT="UTC" to get a spec compliant one. Thanks for pointing that out, that can resolve the warning I got for GPKG.

EDIT: and I too slow with typing, in the meantime Even answered more or less the same ;)

@rouault
Copy link
Member

rouault commented Sep 23, 2023

with the fix in #8461, I now get:

-- is_valid: all not null
-- child 0 type: int64
  [
    0,
    1
  ]
-- child 1 type: timestamp[ms, tz=-05:00]
  [
    2020-01-01 09:00:00.000,
    2020-01-01 10:00:00.000
  ]
-- child 2 type: binary
  [
    0101000000000000000000F03F000000000000F03F,
    010100000000000000000000400000000000000040
  ]

GeoJSON is the only driver that does a full scan of the file to establish the set of fields on opening, hence the timezone can be known.
For GeoPackage, the timezone in the schema will be advertized as UTC, and conversions will be done if necessary
For other drivers, the current behaviour will be preserved: unknown time zone in the schema, and value of the feature written unmodified
EDIT: (except Arrow and Parquet), for which this comes for free

@jratike80
Copy link
Collaborator

the DATETIME_FORMAT param clearly mentions that it defaults to writing datetimes with timezone

I would rather say that it defaults to writing datetimes as they appear in the source data. I think that if the default was automatic conversion into UTC, then it would make many users confused and unhappy because if the time is around midnight then the date can change by the same.

rouault added a commit to rouault/gdal that referenced this issue Sep 23, 2023
rouault added a commit to rouault/gdal that referenced this issue Sep 23, 2023
rouault added a commit to rouault/gdal that referenced this issue Sep 23, 2023
@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Sep 24, 2023

@rouault thanks for that! The PR looks good.

For other drivers, the current behaviour will be preserved: unknown time zone in the schema, and value of the feature written unmodified

In general, additional timezone support in the arrowstream output (as you did now for the mentioned formats) is something that would need to be added per driver? (although I don't know if there are many others that do support timezone in the format, Shapefile don't even support datetimes AFAIK)

EDIT: (except Arrow and Parquet), for which this comes for free

Although I assume (from reading the code, didn't try it out, so might be wrong) that a timestamp field in such a file that would eg have a timezone of "Europe/Brussels" will be set to OGR_TZFLAG_UNKNOWN in the internal data model? And thus come out as the original value as stored.
But since that stored value for Arrow IPC and Parquet file formats is always UTC, it might be more correct to set the TZFLAG to UTC if the timezone is set but not a fixed offset?

@jratike80
Copy link
Collaborator

I think that I now understand what @jorisvandenbossche means with the timezones. Obviously they are the entries in the tz database https://en.wikipedia.org/wiki/List_of_tz_database_time_zones. I had been thinking that the offsets like +02:00 mean the same as timezones but that would have been too easy.

If GDAL would support the tz timezones, doesn't it mean that it should support them properly, so that the Daylight saving times would are also handled correctly? The offset to UTC in Europe/Brussels times can be either +01:00 or +02:00 depending on the day of year. I do not know but I fear that they depend also on the year, because the rules for DST may have changed.

It may be too late to change anything, but isn't it so, that GDAL does not know anything about the tz timezones? And the best that GDAL can do is to deal with the UTC Offsets? So maybe it would be more exact to use that term in the documentation as well:
"For other drivers, the current behaviour will be preserved: unknown UTC Offset in the schema"

rouault added a commit to rouault/gdal that referenced this issue Sep 24, 2023
@rouault
Copy link
Member

rouault commented Sep 24, 2023

In general, additional timezone support in the arrowstream output (as you did now for the mentioned formats) is something that would need to be added per driver?

yes, drivers can either set OGRFieldDefn::SetTZFlag(), or override manually the TIMESTAMP GetArrowStream() option. I've also declare it as public option

Although I assume (from reading the code, didn't try it out, so might be wrong) that a timestamp field in such a file that would eg have a timezone of "Europe/Brussels" will be set to OGR_TZFLAG_UNKNOWN in the internal data model? And thus come out as the original value as stored. But since that stored value for Arrow IPC and Parquet file formats is always UTC, it might be more correct to set the TZFLAG to UTC if the timezone is set but not a fixed offset?

I've just done that in a new commit

GDAL does not know anything about the tz timezones

Not more than (fixed) offsets to UTC. It has no knowledged of things like Europe/Brussels and DST.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants