Cannot optimize Spark written table with timestamp (INT96) column #1286

Closed
rtyler opened this issue Apr 13, 2023 · 3 comments
Assignees: rtyler
Labels: binding/rust (Issues for the Rust crate), bug (Something isn't working)

rtyler commented Apr 13, 2023

Environment

Delta-rs version: any

Binding: rust

Environment:

  • Cloud provider:
  • OS:
  • Other:

Bug

Basically the Delta protocol says that timestamps are supposed to be microsecond precision. By default, however, Apache Spark stores timestamps in Parquet as INT96 😱 and they're read back by the parquet crate as nanosecond-precision timestamps.

What happened:

This manifests when optimizing: the schema defined in Delta expects Timestamp(microsecond), but when the Parquet files are read back, their Arrow schema has Timestamp(nanosecond).
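For illustration only (not part of the original report), a minimal sketch of the mismatch using pyarrow; the file path and column name here are placeholders:

```python
# Illustrative sketch: inspect a Spark-written Parquet file (INT96 timestamps)
# with pyarrow. The path and the "ts" column name are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

spark_table = pq.read_table("part-00000-spark.snappy.parquet")
print(spark_table.schema.field("ts").type)  # timestamp[ns] -- INT96 is read back as nanoseconds
print(pa.timestamp("us"))                   # timestamp[us] -- the precision the Delta schema expects
```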

What you expected to happen:

How to reproduce it:

Painfully 😆

More details:

delta-io/delta#643

@rtyler rtyler added the bug Something isn't working label Apr 13, 2023
@rtyler rtyler self-assigned this Apr 13, 2023
@rtyler rtyler added the binding/rust Issues for the Rust crate label Apr 13, 2023
rtyler added a commit to rtyler/delta-rs that referenced this issue Apr 14, 2023
This test currently fails because the RecordBatchWriter doesn't like the
difference between Timestamps:

---- writer::record_batch::tests::test_write_batch_with_timestamps stdout ----
thread 'writer::record_batch::tests::test_write_batch_with_timestamps' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("column types must match schema types, expected Timestamp(Microsecond, None) but found Timestamp(Nanosecond, None) at column index 1")', rust/src/writer/record_batch.rs:507:101
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
rtyler added a commit to buoyant-data/oxbow that referenced this issue May 6, 2023
This currently fails because a parquet file's schema is not delta compatible
somehow:

thread 'test_all_tables' panicked at 'Failed to convert the schema for creating the table: SchemaError("Invalid data type for Delta Lake: Timestamp(Nanosecond, None)")', /usr/home/tyler/source/github/buoyant-data/oxbow/src/lib.rs:118:10

I have a hunch that this might be similar to delta-io/delta-rs#1286

Nugget2000 commented Aug 10, 2023

To reproduce this in Python:

deltalake==0.10.1
python==3.10.6

import pandas as pd
import pyarrow.dataset as ds
from datetime import datetime
from deltalake.writer import write_deltalake, DeltaTable

data = {'RowDateTime': [datetime.now()]}

df = pd.DataFrame.from_dict(data)

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=True,
)

# write twice so there is something to optimize
write_deltalake(table_or_uri="bug_delta", data=df, file_options=write_options, mode="append")
write_deltalake(table_or_uri="bug_delta", data=df, file_options=write_options, mode="append")

dt = DeltaTable("bug_delta")
dt.optimize.compact() # <-- this is where we crash
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)

Error:
self.table._table.compact_optimize(
_internal.DeltaError: Data does not match the schema or partitions of the table: Unexpected Arrow schema:
got: Field { name: "RowDateTime", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
expected: Field { name: "RowDateTime", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
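One possible workaround (a sketch, not something prescribed in this thread) is to cast nanosecond timestamp columns down to microseconds before writing, instead of (or in addition to) dropping use_deprecated_int96_timestamps; the table path below is a placeholder:

```python
# Workaround sketch (assumed, not from this thread): cast timestamp[ns]
# columns to timestamp[us] before handing the data to write_deltalake.
import pandas as pd
import pyarrow as pa
from datetime import datetime
from deltalake import write_deltalake

df = pd.DataFrame({"RowDateTime": [datetime.now()]})
table = pa.Table.from_pandas(df)  # pandas datetimes arrive as timestamp[ns]

us_schema = pa.schema(
    [
        pa.field(f.name, pa.timestamp("us"), f.nullable)
        if pa.types.is_timestamp(f.type)
        else f
        for f in table.schema
    ]
)
# safe=False would be needed on cast() if values carried sub-microsecond precision
write_deltalake("bug_delta_us", table.cast(us_schema), mode="append")
```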

@IamJeffG

I see the same error in a similar but not exactly the same scenario: when the timestamps stored in parquet are time-zone-adjusted:

pa.timestamp("s", tz="UTC")

In other words, this can happen even when both Timestamps are Microseconds.

You will see the error:

DeltaError: Data does not match the schema or partitions of the table: Unexpected Arrow schema:
got: Field { name: "datetime", data_type: Timestamp(Microsecond, Some("UTC")), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
expected: Field { name: "datetime", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
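A minimal sketch of that timezone variant (assumed, not the reporter's actual code; the path and column name are placeholders): the column is already microsecond precision but carries a UTC timezone, while the Delta table schema maps to a timezone-naive timestamp.

```python
# Hypothetical repro of the timezone variant: the data is
# Timestamp(Microsecond, Some("UTC")) while the table expects
# Timestamp(Microsecond, None). On affected versions the same
# mismatch error is raised.
import pyarrow as pa
from datetime import datetime, timezone
from deltalake import DeltaTable, write_deltalake

tz_table = pa.table(
    {"datetime": pa.array([datetime.now(timezone.utc)], type=pa.timestamp("us", tz="UTC"))}
)
write_deltalake("bug_tz_delta", tz_table, mode="append")
write_deltalake("bug_tz_delta", tz_table, mode="append")
DeltaTable("bug_tz_delta").optimize.compact()  # on affected versions, raises the mismatch above
```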


ion-elgreco commented Aug 19, 2024

@rtyler this is resolved now that we always cast the read data to the correct schema of the table in optimize
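For context, a rough pyarrow analogue of that behaviour (the actual fix lives in the Rust optimize code, not in Python): data read from existing files is cast to the table's expected Arrow schema before the compacted file is written.

```python
# Conceptual sketch only, not the real Rust implementation: casting the
# read data to the table's schema reconciles nanosecond/timezone differences.
import pyarrow as pa

table_schema = pa.schema([pa.field("RowDateTime", pa.timestamp("us"))])
read_data = pa.table({"RowDateTime": pa.array([0], type=pa.timestamp("ns"))})
compacted_input = read_data.cast(table_schema)  # timestamp[ns] -> timestamp[us]
```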
