Cannot write delta table from pandas DataFrame with Timestamp column #685

Closed
erinov1 opened this issue Jul 11, 2022 · 6 comments
Labels
bug Something isn't working

Comments

erinov1 commented Jul 11, 2022

Environment

Delta-rs version: 0.5.8

Binding: Python

Environment:

  • OS: OSX
  • Other: Arrow 8.0.0

Bug

What happened: The Python Delta Lake writer cannot natively handle pandas DataFrames with Timestamp columns because of their default nanosecond precision:

import pandas as pd
from deltalake.writer import write_deltalake

df = pd.DataFrame({"timestamp": [pd.Timestamp('2022-01-01')]})
write_deltalake("test_with_timestamp", df)

yields the following traceback:

---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
Input In [403], in <module>
      1 from deltalake.writer import write_deltalake
      3 df = pd.DataFrame({"timestamp": [pd.Timestamp('2022-01-01')]})
----> 4 write_deltalake('test_with_timestamp', df)

File ~/Library/Caches/pypoetry/virtualenvs/test-GfuZs_x0-py3.8/lib/python3.8/site-packages/deltalake/writer.py:226, in write_deltalake(table_or_uri, data, schema, partition_by, filesystem, mode, file_options, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, overwrite_schema)
    208 ds.write_dataset(
    209     data,
    210     base_dir=table_uri,
   (...)
    222     max_rows_per_group=max_rows_per_group,
    223 )
    225 if table is None:
--> 226     _write_new_deltalake(  # type: ignore[call-arg]
    227         table_uri,
    228         schema,
    229         add_actions,
    230         mode,
    231         partition_by or [],
    232         name,
    233         description,
    234         configuration,
    235     )
    236 else:
    237     table._table.create_write_transaction(
    238         add_actions,
    239         mode,
    240         partition_by or [],
    241         schema,
    242     )

ArrowException: Schema error: Invalid data type for Delta Lake: Timestamp(Nanosecond, None)

The same issue happens if I do write_deltalake("test_with_timestamp", pyarrow.Table.from_pandas(df))

What you expected to happen:

I would have expected this to work out of the box with pandas, without having a priori knowledge of which columns contain timestamps and having to create a new arrow table with downcasted values. Note that the call to pyarrow.dataset.write_dataset() is not the issue.
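For reference, a minimal sketch of that downcasting workaround (cast nanosecond timestamp columns to microseconds before writing; the table path and tz handling here are just illustrative):

import pandas as pd
import pyarrow as pa
from deltalake.writer import write_deltalake

df = pd.DataFrame({"timestamp": [pd.Timestamp("2022-01-01")]})

# Convert to Arrow, then downcast every nanosecond timestamp column to
# microsecond precision, which is the precision Delta Lake supports.
table = pa.Table.from_pandas(df)
target_schema = pa.schema(
    [
        pa.field(f.name, pa.timestamp("us", tz=f.type.tz))
        if pa.types.is_timestamp(f.type) and f.type.unit == "ns"
        else f
        for f in table.schema
    ]
)
# Note: pass safe=False to cast() if values carry sub-microsecond components.
write_deltalake("test_with_timestamp", table.cast(target_schema))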

erinov1 added the bug label Jul 11, 2022
wjones127 (Collaborator)

> I would have expected this to work out of the box with pandas, without having a priori knowledge of which columns contain timestamps and having to create a new arrow table with downcasted values.

That's a good point. Delta Lake has a limited set of types that it supports. We don't do any automatic casting now, but perhaps we should create a mapping so that most Pandas and Arrow types can be automatically converted for writes.
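A rough sketch of what such a mapping could look like (entirely hypothetical; the helper name and the set of covered types are assumptions, not an existing deltalake API):

import pyarrow as pa

def delta_compatible_schema(schema: pa.Schema) -> pa.Schema:
    """Rewrite an Arrow schema so unsupported types map to Delta-friendly ones."""
    fields = []
    for field in schema:
        t = field.type
        if pa.types.is_timestamp(t) and t.unit != "us":
            # Delta Lake only stores microsecond-precision timestamps.
            t = pa.timestamp("us", tz=t.tz)
        elif pa.types.is_large_string(t):
            t = pa.string()
        elif pa.types.is_large_binary(t):
            t = pa.binary()
        fields.append(field.with_type(t))
    return pa.schema(fields)

# e.g. table.cast(delta_compatible_schema(table.schema)) before writing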

wjones127 (Collaborator)

FYI, microsecond is the only timestamp precision supported by Delta Lake.

wjones127 (Collaborator)

@erinov1 Thanks for reporting. I've created a new feature request out of this, and we'll track progress there. #686

erinov1 commented Jul 12, 2022

Thanks! Out of curiosity, does Delta Lake support per-column timezones (like a pandas DataFrame or arrow Table)?

wjones127 (Collaborator)

It does not support types that store timezones, though that would be at the row level: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types

For per-column, that's not specified in the protocol, but it might just be undocumented. Theoretically, timezone info could be stored in the column metadata, so it's technically feasible. Just a question of what is recognized by other readers. I'll look into this.
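In the meantime, one way to handle timezone-aware pandas columns (an assumption on my part, not something the protocol prescribes) is to normalize them to UTC and drop the timezone before writing:

import pandas as pd

df = pd.DataFrame(
    {"ts": pd.to_datetime(["2022-01-01 12:00"]).tz_localize("US/Eastern")}
)
# Convert to UTC and drop the timezone so the stored values are unambiguous.
df["ts"] = df["ts"].dt.tz_convert("UTC").dt.tz_localize(None)

The resulting column is still nanosecond precision, so the microsecond downcast shown earlier in this thread would still apply before writing.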

vedpd11 commented Nov 15, 2022

I am facing the same issue

wjones127 added a commit that referenced this issue Dec 1, 2022
# Description
As described in #686, some pandas data types are not converted to a format
that is compatible with Delta Lake. This handles the case of timestamps,
which are stored with `ns` resolution in pandas. Here, if a schema is not
provided, we convert the timestamps to `us` resolution.

We also update `python/tests/test_writer.py::test_write_pandas` to
reflect this change.
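With that change, the original reproduction from this issue should work without manual casting (a sketch of the expected behavior, not the exact test added in the PR):

import pandas as pd
from deltalake.writer import write_deltalake

df = pd.DataFrame({"timestamp": [pd.Timestamp("2022-01-01")]})
# Nanosecond timestamps are coerced to microseconds when no schema is given.
write_deltalake("test_with_timestamp", df)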

# Related Issue(s)
#685

Co-authored-by: Will Jones <willjones127@gmail.com>