Cannot write delta table from pandas DataFrame with Timestamp column #685

Closed
erinov1 opened this issue Jul 11, 2022 · 6 comments
Labels
bug Something isn't working

Comments

erinov1 commented Jul 11, 2022

Environment

Delta-rs version: 0.5.8

Binding: Python

Environment:

  • OS: OSX
  • Other: Arrow 8.0.0

Bug

What happened: The Python Delta Lake writer cannot natively handle pandas DataFrames with Timestamp columns because of their default nanosecond precision:

import pandas as pd
from deltalake.writer import write_deltalake

df = pd.DataFrame({"timestamp": [pd.Timestamp('2022-01-01')]})
write_deltalake("test_with_timestamp", df)

yields the following traceback:

---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
Input In [403], in <module>
      1 from deltalake.writer import write_deltalake
      3 df = pd.DataFrame({"timestamp": [pd.Timestamp('2022-01-01')]})
----> 4 write_deltalake('test_with_timestamp', df)

File ~/Library/Caches/pypoetry/virtualenvs/test-GfuZs_x0-py3.8/lib/python3.8/site-packages/deltalake/writer.py:226, in write_deltalake(table_or_uri, data, schema, partition_by, filesystem, mode, file_options, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, overwrite_schema)
    208 ds.write_dataset(
    209     data,
    210     base_dir=table_uri,
   (...)
    222     max_rows_per_group=max_rows_per_group,
    223 )
    225 if table is None:
--> 226     _write_new_deltalake(  # type: ignore[call-arg]
    227         table_uri,
    228         schema,
    229         add_actions,
    230         mode,
    231         partition_by or [],
    232         name,
    233         description,
    234         configuration,
    235     )
    236 else:
    237     table._table.create_write_transaction(
    238         add_actions,
    239         mode,
    240         partition_by or [],
    241         schema,
    242     )

ArrowException: Schema error: Invalid data type for Delta Lake: Timestamp(Nanosecond, None)

The same issue happens if I do write_deltalake("test_with_timestamp", pyarrow.Table.from_pandas(df))

What you expected to happen:

I would have expected this to work out of the box with pandas, without having a priori knowledge of which columns contain timestamps and having to create a new arrow table with downcasted values. Note that the call to pyarrow.dataset.write_dataset() is not the issue.
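For reference, a minimal sketch of that downcasting workaround (cast nanosecond timestamp columns to microseconds before writing; the table path and tz handling here are just illustrative):

import pandas as pd
import pyarrow as pa
from deltalake.writer import write_deltalake

df = pd.DataFrame({"timestamp": [pd.Timestamp("2022-01-01")]})

# Convert to Arrow, then downcast every nanosecond timestamp column to
# microsecond precision, which is the precision Delta Lake supports.
table = pa.Table.from_pandas(df)
target_schema = pa.schema(
    [
        pa.field(f.name, pa.timestamp("us", tz=f.type.tz))
        if pa.types.is_timestamp(f.type) and f.type.unit == "ns"
        else f
        for f in table.schema
    ]
)
# Note: pass safe=False to cast() if values carry sub-microsecond components.
write_deltalake("test_with_timestamp", table.cast(target_schema))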

erinov1 added the bug label Jul 11, 2022
wjones127 (Collaborator)

> I would have expected this to work out of the box with pandas, without having a priori knowledge of which columns contain timestamps and having to create a new arrow table with downcasted values.

That's a good point. Delta Lake has a limited set of types that it supports. We don't do any automatic casting now, but perhaps we should create a mapping so that most Pandas and Arrow types can be automatically converted for writes.
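A rough sketch of what such a mapping could look like (entirely hypothetical; the helper name and the set of covered types are assumptions, not an existing deltalake API):

import pyarrow as pa

def delta_compatible_schema(schema: pa.Schema) -> pa.Schema:
    """Rewrite an Arrow schema so unsupported types map to Delta-friendly ones."""
    fields = []
    for field in schema:
        t = field.type
        if pa.types.is_timestamp(t) and t.unit != "us":
            # Delta Lake only stores microsecond-precision timestamps.
            t = pa.timestamp("us", tz=t.tz)
        elif pa.types.is_large_string(t):
            t = pa.string()
        elif pa.types.is_large_binary(t):
            t = pa.binary()
        fields.append(field.with_type(t))
    return pa.schema(fields)

# e.g. table.cast(delta_compatible_schema(table.schema)) before writing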

wjones127 (Collaborator)

FYI, microsecond is the only timestamp precision supported by Delta Lake.

wjones127 (Collaborator)

@erinov1 Thanks for reporting. I've created a new feature request out of this, and we'll track progress there. #686

erinov1 commented Jul 12, 2022

Thanks! Out of curiosity, does Delta Lake support per-column timezones (like a pandas DataFrame or arrow Table)?

wjones127 (Collaborator)

It does not support types that store timezones, though that would be at the row level: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types

For per-column, that's not specified in the protocol, but it might just be undocumented. Theoretically, timezone info could be stored in the column metadata, so it's technically feasible. Just a question of what is recognized by other readers. I'll look into this.
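In the meantime, one way to handle timezone-aware pandas columns (an assumption on my part, not something the protocol prescribes) is to normalize them to UTC and drop the timezone before writing:

import pandas as pd

df = pd.DataFrame(
    {"ts": pd.to_datetime(["2022-01-01 12:00"]).tz_localize("US/Eastern")}
)
# Convert to UTC and drop the timezone so the stored values are unambiguous.
df["ts"] = df["ts"].dt.tz_convert("UTC").dt.tz_localize(None)

The resulting column is still nanosecond precision, so the microsecond downcast shown earlier in this thread would still apply before writing.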

vedpd11 commented Nov 15, 2022

I am facing the same issue

wjones127 added a commit that referenced this issue Dec 1, 2022
# Description
As described in #686, some pandas data types are not converted to a format
that is compatible with Delta Lake. This handles the case of timestamps,
which are stored with `ns` resolution in pandas. Here, if a schema is not
provided, we convert the timestamps to `us` resolution.

We also update `python/tests/test_writer.py::test_write_pandas` to
reflect this change.
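With that change, the original reproduction from this issue should work without manual casting (a sketch of the expected behavior, not the exact test added in the PR):

import pandas as pd
from deltalake.writer import write_deltalake

df = pd.DataFrame({"timestamp": [pd.Timestamp("2022-01-01")]})
# Nanosecond timestamps are coerced to microseconds when no schema is given.
write_deltalake("test_with_timestamp", df)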

# Related Issue(s)
#685

Co-authored-by: Will Jones <willjones127@gmail.com>