IsAdjustedToUtc flag not used while reading DeltaTable #1598

Closed
ion-elgreco opened this issue Aug 22, 2023 · 2 comments · Fixed by #2236
Labels
bug Something isn't working

Comments

@ion-elgreco
Collaborator

ion-elgreco commented Aug 22, 2023

Environment

Delta-rs version: 0.10.1

Binding: Python

Environment:

  • Cloud provider: Azure
  • OS: Linux
  • Other:

Bug

What happened:
When you write a PyArrow table with UTC-timezone datetimes to a Delta table, the timezone information gets removed from the final table schema. See the example below:

What you expected to happen:

The UTC information should be maintained by reading the IsAdjustedToUTC flag in the Parquet file and passing it back into the schema when reading to Arrow. The information is there, because when you read the partition file directly with Polars, the column is read as UTC:

print(pl.read_parquet('/home/<redacted>/polars/py-polars/test/0-f7fe1e96-4adc-425a-b75c-80d4fdb1336c-0.parquet'))

shape: (1, 1)
┌─────────────────────────┐
│ datetime                │
│ ---                     │
│ datetime[μs, UTC]       │
╞═════════════════════════╡
│ 2010-01-01 00:00:00 UTC │
└─────────────────────────┘
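For readers unfamiliar with the flag: Parquet stores a timestamp as an integer offset from the Unix epoch, and the logical type's isAdjustedToUTC field says whether that integer is an instant in UTC or a timezone-less local time. The sketch below is a standard-library illustration of what honoring the flag means for a reader. It is not delta-rs or Polars code, and `decode_timestamp_micros` is a hypothetical helper introduced only for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_timestamp_micros(value: int, is_adjusted_to_utc: bool) -> datetime:
    """Decode microseconds since the epoch (hypothetical helper).

    If the flag is set, the value is an instant and we return a UTC-aware
    datetime. Otherwise we return a naive datetime with the same wall-clock
    fields and no timezone attached.
    """
    instant = EPOCH_UTC + timedelta(microseconds=value)
    if is_adjusted_to_utc:
        return instant
    return instant.replace(tzinfo=None)


# The value from the issue, 2010-01-01 00:00:00 UTC, as stored microseconds.
stored = (datetime(2010, 1, 1, tzinfo=timezone.utc) - EPOCH_UTC) // timedelta(microseconds=1)

aware = decode_timestamp_micros(stored, is_adjusted_to_utc=True)
naive = decode_timestamp_micros(stored, is_adjusted_to_utc=False)

print(aware.isoformat())  # 2010-01-01T00:00:00+00:00
print(naive.isoformat())  # 2010-01-01T00:00:00
```

Ignoring the flag corresponds to always taking the second branch, which matches the symptom reported here: the stored values are unchanged, but the schema comes back without a timezone.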

How to reproduce it:

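import polars as pl
from deltalake import DeltaTable

# Build a one-row frame with a UTC-aware, microsecond-precision datetime column.
df = pl.select(pl.datetime(2010, 1, 1, time_unit="us", time_zone="UTC"))

print(df.to_arrow().schema)
# datetime: timestamp[us, tz=UTC]

df.write_delta('test')

# Read the table back: the schema no longer carries the timezone.
dt = DeltaTable('test')

print(dt.schema().to_pyarrow())
# datetime: timestamp[us]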

More details:

@ion-elgreco ion-elgreco added the bug Something isn't working label Aug 22, 2023
@ion-elgreco ion-elgreco changed the title UTC timezone while writing not kept in schema IsAdjustedToUtc flag not used while while reading DeltaTable Aug 22, 2023
@ldacey
Contributor

ldacey commented Oct 12, 2023

Ah, just ran into this myself. I thought UTC timestamps were supported so I did:

utc_tz = cs.datetime(time_zone="UTC")
non_utc_tz = cs.datetime() - utc_tz

df = df.with_columns(non_utc_tz.cast(pl.Datetime("us")), utc_tz.cast(pl.Datetime("us", "UTC")), cs.categorical().cast(pl.Utf8))

That adjusted my "ns" timestamp columns, and I was hoping to leave the UTC timestamp intact. Generally I save data with both a naive version (localized to whatever time zone the client is using, with the TZ info stripped) and a UTC version.

My understanding is that at least UTC timestamps are supported, but just not with the delta-rs library currently?

@ion-elgreco
Collaborator Author

ion-elgreco commented Oct 12, 2023

So, write support is there: it's writing the Parquet files properly with UTC timestamps. But the flag is not reused in the schema when reading. I can try to take a look at the issue soon, but I want to work on some other stuff first.

Also, Polars for now always casts to a non-UTC timezone, until there is a fix in delta-rs.

ion-elgreco added a commit that referenced this issue Mar 5, 2024
# Description

- This addresses all our timestamp inconsistencies, where we were
reading Primitive::timestamp as a datetime without UTC, and now we can
properly write datetimes with no timezone as columns to
Primitive::timestampNtz.
- addressing small bug where checkConstraints feature was not set in
writerFeatures when you are on table writer version 7.
- bumping default protocol to 3,7
- Made the pyarrow writer and reader more flexible so we can write/read
a 3,7 table as long as it has the supported features there.
- Properly parses timestamps with UTC into pyarrow timestamps with UTC
- Added configkey translation to tablefeature inside the Create
Operation

# Related Issue(s)
- closes #1598
- closes #1019
- closes #1777
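
The type mapping described in the commit message can be sketched as follows. This is a standard-library illustration under stated assumptions, not the delta-rs implementation: the names `DELTA_TO_ARROW`, `arrow_type_for`, and `delta_type_for` are hypothetical, the Arrow types are represented only by the strings PyArrow prints for them, and `timestamp` / `timestamp_ntz` are assumed to be the Delta schema names for the two primitives.

```python
from datetime import datetime, timezone

# Hypothetical lookup table: Delta primitive name -> Arrow type (as printed).
DELTA_TO_ARROW = {
    "timestamp": "timestamp[us, tz=UTC]",  # instant, adjusted to UTC
    "timestamp_ntz": "timestamp[us]",      # wall-clock time, no timezone
}


def arrow_type_for(delta_primitive: str) -> str:
    """Read side: pick the Arrow type for a Delta timestamp primitive."""
    return DELTA_TO_ARROW[delta_primitive]


def delta_type_for(value: datetime) -> str:
    """Write side: aware datetimes map to timestamp, naive ones to timestamp_ntz."""
    return "timestamp" if value.tzinfo is not None else "timestamp_ntz"


aware_type = arrow_type_for(delta_type_for(datetime(2010, 1, 1, tzinfo=timezone.utc)))
naive_type = arrow_type_for(delta_type_for(datetime(2010, 1, 1)))

print(aware_type)  # timestamp[us, tz=UTC]
print(naive_type)  # timestamp[us]
```

With a mapping like this, the round trip from the reproduction above would come back as `timestamp[us, tz=UTC]` instead of `timestamp[us]`.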