Cannot optimize Spark written table with timestamp (INT96) column #1286

Closed
rtyler opened this issue Apr 13, 2023 · 3 comments
Assignees: rtyler
Labels: binding/rust (Issues for the Rust crate), bug (Something isn't working)

rtyler commented Apr 13, 2023

Environment

Delta-rs version: any

Binding: rust

Environment:

  • Cloud provider:
  • OS:
  • Other:

Bug

Basically the Delta protocol says that timestamps are supposed to be microsecond precision. By default, however, Apache Spark stores timestamps in Parquet as INT96 😱 and they're read back by the parquet crate as nanosecond-precision timestamps.

What happened:

This manifests when optimizing: the schema defined in Delta expects Timestamp(microsecond), but when the Parquet files are read back, their Arrow schema has Timestamp(nanosecond).
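For illustration only (not part of the original report), a minimal sketch of the mismatch using pyarrow; the file path and column name here are placeholders:

```python
# Illustrative sketch: inspect a Spark-written Parquet file (INT96 timestamps)
# with pyarrow. The path and the "ts" column name are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

spark_table = pq.read_table("part-00000-spark.snappy.parquet")
print(spark_table.schema.field("ts").type)  # timestamp[ns] -- INT96 is read back as nanoseconds
print(pa.timestamp("us"))                   # timestamp[us] -- the precision the Delta schema expects
```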

What you expected to happen:

How to reproduce it:

Painfully 😆

More details:

delta-io/delta#643

@rtyler rtyler added the bug Something isn't working label Apr 13, 2023
@rtyler rtyler self-assigned this Apr 13, 2023
@rtyler rtyler added the binding/rust Issues for the Rust crate label Apr 13, 2023
rtyler added a commit to rtyler/delta-rs that referenced this issue Apr 14, 2023
This test currently fails because the RecordBatchWriter doesn't like the
difference between Timestamps:

---- writer::record_batch::tests::test_write_batch_with_timestamps stdout ----
thread 'writer::record_batch::tests::test_write_batch_with_timestamps' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("column types must match schema types, expected Timestamp(Microsecond, None) but found Timestamp(Nanosecond, None) at column index 1")', rust/src/writer/record_batch.rs:507:101
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
rtyler added a commit to buoyant-data/oxbow that referenced this issue May 6, 2023
This currently fails because a parquet file's schema is not delta compatible
somehow:

thread 'test_all_tables' panicked at 'Failed to convert the schema for creating the table: SchemaError("Invalid data type for Delta Lake: Timestamp(Nanosecond, None)")', /usr/home/tyler/source/github/buoyant-data/oxbow/src/lib.rs:118:10

I have a hunch that this might be similar to delta-io/delta-rs#1286

Nugget2000 commented Aug 10, 2023

To reproduce this in Python:

deltalake==0.10.1
python==3.10.6

import pandas as pd
import pyarrow.dataset as ds
from datetime import datetime
from deltalake.writer import write_deltalake, DeltaTable

data = {'RowDateTime': [datetime.now()]}

df = pd.DataFrame.from_dict(data)

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=True,
)

# write twice so there is something to optimize
write_deltalake(table_or_uri="bug_delta", data=df, file_options=write_options, mode="append")
write_deltalake(table_or_uri="bug_delta", data=df, file_options=write_options, mode="append")

dt = DeltaTable("bug_delta")
dt.optimize.compact() # <-- this is where we crash
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)

Error:
self.table._table.compact_optimize(
_internal.DeltaError: Data does not match the schema or partitions of the table: Unexpected Arrow schema:
got: Field { name: "RowDateTime", data_type: Timestamp(Nanosecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
expected: Field { name: "RowDateTime", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
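One possible workaround (a sketch, not something prescribed in this thread) is to cast nanosecond timestamp columns down to microseconds before writing, instead of (or in addition to) dropping use_deprecated_int96_timestamps; the table path below is a placeholder:

```python
# Workaround sketch (assumed, not from this thread): cast timestamp[ns]
# columns to timestamp[us] before handing the data to write_deltalake.
import pandas as pd
import pyarrow as pa
from datetime import datetime
from deltalake import write_deltalake

df = pd.DataFrame({"RowDateTime": [datetime.now()]})
table = pa.Table.from_pandas(df)  # pandas datetimes arrive as timestamp[ns]

us_schema = pa.schema(
    [
        pa.field(f.name, pa.timestamp("us"), f.nullable)
        if pa.types.is_timestamp(f.type)
        else f
        for f in table.schema
    ]
)
# safe=False would be needed on cast() if values carried sub-microsecond precision
write_deltalake("bug_delta_us", table.cast(us_schema), mode="append")
```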

@IamJeffG

I see the same error in a similar but not exactly the same scenario: when the timestamps stored in parquet are time-zone-adjusted:

pa.timestamp("s", tz="UTC")

In other words, this can happen even when both Timestamps are Microseconds.

You will see the error:

DeltaError: Data does not match the schema or partitions of the table: Unexpected Arrow schema:
got: Field { name: "datetime", data_type: Timestamp(Microsecond, Some("UTC")), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
expected: Field { name: "datetime", data_type: Timestamp(Microsecond, None), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
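A minimal sketch of that timezone variant (assumed, not the reporter's actual code; the path and column name are placeholders): the column is already microsecond precision but carries a UTC timezone, while the Delta table schema maps to a timezone-naive timestamp.

```python
# Hypothetical repro of the timezone variant: the data is
# Timestamp(Microsecond, Some("UTC")) while the table expects
# Timestamp(Microsecond, None). On affected versions the same
# mismatch error is raised.
import pyarrow as pa
from datetime import datetime, timezone
from deltalake import DeltaTable, write_deltalake

tz_table = pa.table(
    {"datetime": pa.array([datetime.now(timezone.utc)], type=pa.timestamp("us", tz="UTC"))}
)
write_deltalake("bug_tz_delta", tz_table, mode="append")
write_deltalake("bug_tz_delta", tz_table, mode="append")
DeltaTable("bug_tz_delta").optimize.compact()  # on affected versions, raises the mismatch above
```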


ion-elgreco commented Aug 19, 2024

@rtyler this is resolved now that we always cast the read data to the correct schema of the table in optimize
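For context, a rough pyarrow analogue of that behaviour (the actual fix lives in the Rust optimize code, not in Python): data read from existing files is cast to the table's expected Arrow schema before the compacted file is written.

```python
# Conceptual sketch only, not the real Rust implementation: casting the
# read data to the table's schema reconciles nanosecond/timezone differences.
import pyarrow as pa

table_schema = pa.schema([pa.field("RowDateTime", pa.timestamp("us"))])
read_data = pa.table({"RowDateTime": pa.array([0], type=pa.timestamp("ns"))})
compacted_input = read_data.cast(table_schema)  # timestamp[ns] -> timestamp[us]
```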
