Cannot optimize Spark written table with timestamp (INT96) column
#1286
Comments
This test currently fails because the RecordBatchWriter doesn't like the difference between Timestamps:

---- writer::record_batch::tests::test_write_batch_with_timestamps stdout ----
thread 'writer::record_batch::tests::test_write_batch_with_timestamps' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("column types must match schema types, expected Timestamp(Microsecond, None) but found Timestamp(Nanosecond, None) at column index 1")', rust/src/writer/record_batch.rs:507:101
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This currently fails because a parquet file's schema is not delta compatible somehow:

thread 'test_all_tables' panicked at 'Failed to convert the schema for creating the table: SchemaError("Invalid data type for Delta Lake: Timestamp(Nanosecond, None)")', /usr/home/tyler/source/github/buoyant-data/oxbow/src/lib.rs:118:10

I have a hunch that this might be similar to delta-io/delta-rs#1286
To reproduce this in Python with deltalake==0.10.1:
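The original reproduction script and error output are not preserved above; a minimal sketch of the scenario, assuming a hypothetical local path to a table whose parquet files were written by Spark with INT96 timestamps, would look roughly like this:

```python
# Hypothetical sketch, not the original reproduction script.
# Assumes /tmp/spark_ts_table is a Delta table whose parquet files were written
# by Spark with INT96 timestamps (read back by Arrow as nanosecond precision).
from deltalake import DeltaTable

dt = DeltaTable("/tmp/spark_ts_table")

# With deltalake==0.10.1 this raised the schema mismatch described above
# ("expected Timestamp(Microsecond, None) but found Timestamp(Nanosecond, None)");
# newer releases expose the same operation as dt.optimize.compact().
dt.optimize()
```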
I see the same error in a similar but not exactly the same scenario: when the timestamps stored in parquet are time-zone-adjusted, e.g. pa.timestamp("s", tz="UTC"). In other words, this can happen even when both timestamps are microsecond precision. You will see the same kind of schema-mismatch error.
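A parquet file with such a column can be produced directly with pyarrow; this is only an illustrative sketch (hypothetical path and values), not the original writer:

```python
# Sketch of the variant scenario: a second-precision, time-zone-aware (UTC)
# timestamp column written to parquet. Path and values are hypothetical.
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ts": pa.array(
        [datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc)],
        type=pa.timestamp("s", tz="UTC"),
    ),
})

# Parquet has no second-precision timestamp unit, so pyarrow stores this with a
# finer unit; per the comment above, the UTC annotation still differs from the
# Timestamp(Microsecond, None) the Delta table schema expects.
pq.write_table(table, "/tmp/tz_seconds.parquet")
```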
@rtyler this is resolved now that we always cast the read data to the correct schema of the table in optimize.
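The actual fix is in delta-rs's Rust optimize path; conceptually it amounts to casting each batch read from parquet to the table's declared schema before rewriting it, along the lines of this pyarrow sketch (illustration of the idea only, not the delta-rs code):

```python
# Illustration only; the real fix lives in the Rust optimize implementation.
import pyarrow as pa

# The schema the Delta table declares (microsecond timestamps).
table_schema = pa.schema([pa.field("ts", pa.timestamp("us"))])

# A batch as it might come back from a Spark-written parquet file (nanoseconds).
read_batch = pa.table({"ts": pa.array([1_000_000, 2_000_000], type=pa.timestamp("ns"))})

# Cast the data that was read to the table's schema before writing it back out.
fixed = read_batch.cast(table_schema)
assert fixed.schema == table_schema
```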
Environment
Delta-rs version: any
Binding: rust
Environment:
Bug
Basically the protocol says that timestamps are supposed to be microsecond precision. By default however Apache Spark stores timestamps in parquet with INT96 😱 and they're read back by the parquet crate as nanosecond precision timestamps.
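On the write side, the parquet encoding Spark uses is governed by a session configuration; a hedged PySpark sketch (hypothetical path and data, not part of the original report) of writing microsecond timestamps instead of INT96:

```python
# Hypothetical PySpark sketch: spark.sql.parquet.outputTimestampType controls how
# Spark encodes timestamps in parquet files (the default is INT96).
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = (
    SparkSession.builder
    # Standard delta-spark configuration; the delta-spark package must be installed.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Write TIMESTAMP_MICROS instead of the default INT96 encoding.
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

df = spark.range(3).withColumn("ts", current_timestamp())
df.write.format("delta").save("/tmp/ts_table")
```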
What happened:
This manifests when optimizing because the schema defined in Delta expects `Timestamp(microsecond)`, but reading the parquet files, their Arrow schema has `Timestamp(nanosecond)`.
What you expected to happen:
How to reproduce it:
Painfully 😆
More details:
delta-io/delta#643