Cannot write delta table from pandas DataFrame with Timestamp column #685
Comments
That's a good point. Delta Lake has a limited set of types that it supports. We don't do any automatic casting now, but perhaps we should create a mapping so that most Pandas and Arrow types can be automatically converted for writes.
FYI, microsecond precision is the only timestamp precision that Delta Lake supports.
Thanks! Out of curiosity, does Delta Lake support per-column timezones (like a pandas DataFrame or arrow Table)?
It does not support types with stored timezones, but that's at the row level: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types For per-column timezones, nothing is specified in the protocol, but it might just be undocumented. Theoretically, timezone info could be stored in the column metadata, so it's technically feasible. It's just a question of what other readers recognize. I'll look into this.
I am facing the same issue |
# Description
As described in #686, some pandas datatypes are not converted to a format that is compatible with Delta Lake. This handles the case of timestamps, which are stored with `ns` resolution in pandas. Here, if a schema is not provided, we convert the timestamps to `us` resolution. We also update `python/tests/test_writer.py::test_write_pandas` to reflect this change.

# Related Issue(s)
#685

Co-authored-by: Will Jones <willjones127@gmail.com>
Environment
Delta-rs version: 0.5.8
Binding: Python
Environment:
Bug
What happened: The python delta lake writer cannot handle pandas DataFrames with Timestamp columns natively because of the nanosecond precision:
yields the following traceback:
The same issue happens if I do
write_deltalake("test_with_timestamp", pyarrow.Table.from_pandas(df))
What you expected to happen:
I would have expected this to work out of the box with pandas, without having a priori knowledge of which columns contain timestamps and having to create a new arrow table with downcasted values. Note that the call to
pyarrow.dataset.write_dataset()
is not the issue.