
Timestamp precision in the schema of the DeltaTable of the transaction log. #643

Open
fvaleye opened this issue Apr 7, 2021 · 6 comments
Labels: enhancement

@fvaleye (Contributor) commented Apr 7, 2021

Hello!

Coming from the Delta-RS community, I have several questions regarding the timestamp type in the DeltaTable schema as serialized in the transaction log.

Context
The transaction protocol's schema serialization format specifies the following precision for the timestamp type:

timestamp: Microsecond precision timestamp without a timezone.

This means that Spark uses a microsecond-precision timestamp here, interpreted in the local or a given timezone. But when Spark writes timestamp values out to non-text data sources like Parquet through Delta, the values are just instants (like a timestamp in UTC) with no timezone information.

Taking that into account, if we look at the configuration "spark.sql.parquet.outputTimestampType" here, we see that the default output timestamp type is ParquetOutputTimestampType.INT96. With this default, timestamps are written with nanosecond precision in .parquet files. It could also be changed to ParquetOutputTimestampType.INT64 with TIMESTAMP_MICROS or with TIMESTAMP_MILLIS.
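
A minimal PySpark sketch of switching this configuration before writing, assuming a local SparkSession with delta-spark available (the app name and table path are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ts-precision").getOrCreate()

# Default is "INT96", which encodes nanoseconds in the .parquet files.
# The INT64-based alternatives are "TIMESTAMP_MICROS" and "TIMESTAMP_MILLIS".
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

df = spark.sql("SELECT current_timestamp() AS ts")
df.write.format("delta").save("/tmp/ts_table")  # hypothetical path
```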

Use-case
When I apply a transaction log schema to a DeltaTable (using timestamp with microsecond precision here), I get a mismatch between the timestamp precision given by the protocol schema and the real one:

  1. The timestamp type referenced by the transaction log has microsecond precision
  2. The timestamp type written in the .parquet files has nanosecond precision, because it uses the default outputTimestampType (but it could be microseconds or milliseconds depending on the configuration)
  3. The schema cannot be applied to the .parquet files, because I get a mismatched-precision error on a timestamp column (see the sketch after this list)
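
A small sketch of how the mismatch surfaces when inspecting a data file directly, assuming pyarrow is installed (the file name is hypothetical):

```python
import pyarrow.parquet as pq

# Open one data file of the Delta table (hypothetical name).
pf = pq.ParquetFile("/tmp/ts_table/part-00000.parquet")

# With the INT96 default, pyarrow decodes the column as timestamp[ns],
# while the transaction log schema only says "timestamp" (microseconds
# per the protocol), so applying the log schema fails on precision.
print(pf.schema_arrow)
```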

Questions

  1. Why is the timestamp precision not written along with the timestamp type inside the schema of the transaction log?
    It would let us get the DeltaTable timestamp precision when reading the DeltaTable without the Spark dependency (see the sketch after these questions).

  2. Does it mean that the microsecond timestamp precision in Spark/Delta is for internal processing only?
    In other words, must the schema of the parquet files be read directly from the .parquet files rather than from the DeltaTable transaction protocol?

  3. If we change the default timestamp precision to nanoseconds here for applying the schema to .parquet files, it will work only for the default spark.sql.parquet.outputTimestampType configuration, not for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS ones, right?
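
For reference, a sketch of reading the log-level schema without Spark via the deltalake package (the delta-rs Python binding); the table path is hypothetical and the exact API may differ between versions:

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/ts_table")  # hypothetical path

# The log-level type is just "timestamp" with no precision attached,
# so delta-rs has to pick one precision when mapping it to Arrow.
print(dt.schema())
```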

Thank you for your help!

@hntd187 commented Oct 15, 2023

Did the loop on this ever get closed? I've run into this a few times when adding Parquet files to Delta tables, because the timestamps are written with different configurations.

@alippai commented Nov 2, 2023

Since Parquet 2.6 has a great int64 timestamp nanos type, could Delta standardize on top of that? Java also has nanosecond precision.
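
For illustration, a sketch with pyarrow showing that Parquet format version 2.6 can store an INT64 nanosecond timestamp (the file path is hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Two epoch values interpreted as nanosecond timestamps.
table = pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))})

# version="2.6" enables the INT64 TIMESTAMP(NANOS) logical type
# instead of truncating or falling back to INT96.
pq.write_table(table, "/tmp/nanos.parquet", version="2.6")
print(pq.ParquetFile("/tmp/nanos.parquet").schema)
```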

@alippai commented May 7, 2024

Iceberg is adding nanosecond type too: apache/iceberg#8683

@ion-elgreco commented
@alippai that's great! Unfortunately for Delta we are bound by what the delta protocol states :(

@alippai commented Aug 4, 2024

@ion-elgreco how can we extend the Delta protocol? I thought this was the correct issue/repo for that.

@ion-elgreco commented

> @ion-elgreco how can we extend the Delta protocol? I thought this was the correct issue/repo for that.

It's the correct repo, but it needs to get accepted into the protocol first.
