
fix(python): make sure we always write microsecond precision timestamps #1467

Closed
wjones127 opened this issue Jun 15, 2023 · 3 comments · Fixed by #1820
Labels
binding/python Issues for the Python package enhancement New feature or request

Comments

@wjones127
Collaborator

Description

Right now I think we do by accident. But PyArrow will change its default to nanoseconds, so we should specify this explicitly.

See: apache/arrow#35746

Use Case

Related Issue(s)

@wjones127 wjones127 added enhancement New feature or request binding/python Issues for the Python package labels Jun 15, 2023
@anjakefala

The goal is to get apache/arrow#35746 merged in by 13.0.0!

@ion-elgreco
Collaborator

ion-elgreco commented Jul 29, 2023

@wjones127 I can give this a go; I ran into this issue while writing with Polars. It should be straightforward and similar to the delta_arrow_schema_from_pandas logic, right?

@ion-elgreco
Collaborator

@wjones127 I was thinking this could work: main...ion-elgreco:delta-rs:fix/cast-timestamp-always-to-us-precision

But after checking the PyArrow docs and testing what a RecordBatchReader does, it apparently doesn't cast batches when you pass a different schema. It throws an error that the schema passed is not the same as the one it expected, so the approach only works on pa.Table.

Do you have any idea how to cast/read a RecordBatch to a new schema?

Also, one thing that won't be caught: fields or structs that contain nested datetimes with a different precision. That may also be the case for the method where the precision is fixed for pd.DataFrame.

ion-elgreco added a commit that referenced this issue Nov 24, 2023
…iter/merge (#1820)

# Description
This ports some functionality that @stinodego and I had worked on in
Polars, where we converted a pyarrow schema to a compatible Delta
schema. It converts the following:

- uint -> int
- timestamp(any timeunit) -> timestamp(us) 

I adjusted the functionality to also convert large types to normal
types when necessary, which is still needed in MERGE as a workaround
for #1753.

Additional things I've added:

- Schema conversion for every input in write_deltalake/merge
- Add Pandas dataframe conversion
- Add Pandas dataframe as input in merge


# Related Issue(s)
- closes #686
- closes #1467

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
ion-elgreco added a commit to ion-elgreco/delta-rs that referenced this issue Nov 25, 2023