
Timestamp precision in the schema of the DeltaTable of the transaction log. #643

Open
fvaleye opened this issue Apr 7, 2021 · 6 comments
Labels: enhancement

@fvaleye (Contributor) commented Apr 7, 2021

Hello!

Coming from the Delta-RS community, I have several questions regarding the timestamp type in the DeltaTable schema as serialized in the transaction log.

Context
The transaction protocol's schema serialization format specifies the following precision for the timestamp type:

timestamp: Microsecond precision timestamp without a timezone.

This means that Spark uses a microsecond-precision timestamp here, interpreted in the local or a given timezone. But when Spark writes timestamp values out to non-text data sources like Parquet through Delta, the values are just instants (like a timestamp in UTC) with no timezone information.

Taking that into account, if we look at the configuration "spark.sql.parquet.outputTimestampType" here, we see that the default output timestamp type is ParquetOutputTimestampType.INT96. With this default, timestamps are written with nanosecond precision in .parquet files. It could also be changed to ParquetOutputTimestampType.INT64 with TIMESTAMP_MICROS or with TIMESTAMP_MILLIS.
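
A minimal PySpark sketch of switching this configuration before writing, assuming a local SparkSession with delta-spark available (the app name and table path are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ts-precision").getOrCreate()

# Default is "INT96", which encodes nanoseconds in the .parquet files.
# The INT64-based alternatives are "TIMESTAMP_MICROS" and "TIMESTAMP_MILLIS".
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

df = spark.sql("SELECT current_timestamp() AS ts")
df.write.format("delta").save("/tmp/ts_table")  # hypothetical path
```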

Use-case
When I apply a transaction log schema to a DeltaTable (using timestamp with microsecond precision here), I get a mismatch between the timestamp precision given by the protocol schema and the real one:

  1. The timestamp type referenced by the transaction log has microsecond precision
  2. The timestamp type written in the .parquet files has nanosecond precision, because it uses the default outputTimestampType (but it could be microseconds or milliseconds depending on the configuration)
  3. The schema cannot be applied to the .parquet files, because I get a mismatched-precision error on a timestamp column (see the sketch after this list)
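
A small sketch of how the mismatch surfaces when inspecting a data file directly, assuming pyarrow is installed (the file name is hypothetical):

```python
import pyarrow.parquet as pq

# Open one data file of the Delta table (hypothetical name).
pf = pq.ParquetFile("/tmp/ts_table/part-00000.parquet")

# With the INT96 default, pyarrow decodes the column as timestamp[ns],
# while the transaction log schema only says "timestamp" (microseconds
# per the protocol), so applying the log schema fails on precision.
print(pf.schema_arrow)
```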

Questions

  1. Why is the timestamp precision not written along with the timestamp type inside the schema of the transaction log?
    It would let us get the DeltaTable timestamp precision when reading the DeltaTable without the Spark dependency (see the sketch after these questions).

  2. Does it mean that the microsecond timestamp precision in Spark/Delta is for internal processing only?
    In other words, must the schema of the parquet files be read directly from the .parquet files rather than from the DeltaTable transaction protocol?

  3. If we change the default timestamp precision to nanoseconds here for applying the schema to .parquet files, it will work only for the default spark.sql.parquet.outputTimestampType configuration, not for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS ones, right?
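
For reference, a sketch of reading the log-level schema without Spark via the deltalake package (the delta-rs Python binding); the table path is hypothetical and the exact API may differ between versions:

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/ts_table")  # hypothetical path

# The log-level type is just "timestamp" with no precision attached,
# so delta-rs has to pick one precision when mapping it to Arrow.
print(dt.schema())
```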

Thank you for your help!

@hntd187 commented Oct 15, 2023

Did the loop on this ever get closed? I've run into this a few times when adding Parquet files to Delta tables, because the timestamps are written with different configurations.

@alippai commented Nov 2, 2023

Since Parquet 2.6 has a great int64 timestamp nanos type, could Delta standardize on top of that? Java also has nanosecond precision.
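
For illustration, a sketch with pyarrow showing that Parquet format version 2.6 can store an INT64 nanosecond timestamp (the file path is hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Two epoch values interpreted as nanosecond timestamps.
table = pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))})

# version="2.6" enables the INT64 TIMESTAMP(NANOS) logical type
# instead of truncating or falling back to INT96.
pq.write_table(table, "/tmp/nanos.parquet", version="2.6")
print(pq.ParquetFile("/tmp/nanos.parquet").schema)
```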

@alippai commented May 7, 2024

Iceberg is adding nanosecond type too: apache/iceberg#8683

@ion-elgreco commented
@alippai that's great! Unfortunately for Delta we are bound by what the delta protocol states :(

@alippai commented Aug 4, 2024

@ion-elgreco how can we extend the Delta protocol? I thought this was the correct issue/repo for that.

@ion-elgreco commented

> @ion-elgreco how can we extend the Delta protocol? I thought this was the correct issue/repo for that.

It's the correct repo, but it needs to get accepted into the protocol first.
