Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): add pyarrow to delta compatible schema conversion in writer/merge #1820

Merged

Conversation

ion-elgreco
Copy link
Collaborator

@ion-elgreco ion-elgreco commented Nov 7, 2023

Description

This ports some functionality that @stinodego and I had worked on in Polars. Where we converted a pyarrow schema to a compatible delta schema. It converts the following:

  • uint -> int
  • timestamp(any timeunit) -> timestamp(us)

I adjusted the functionality to do schema conversion from large to normal when necessary, which is still needed in MERGE as workaround #1753.

Additional things I've added:

  • Schema conversion for every input in write_deltalake/merge
  • Add Pandas dataframe conversion
  • Add Pandas dataframe as input in merge

Related Issue(s)

@github-actions github-actions bot added the binding/python Issues for the Python package label Nov 7, 2023
@ion-elgreco ion-elgreco changed the title feat: add pyarrow to delta compatible schema conversion in writer and feat: add pyarrow to delta compatible schema conversion in writer/merge Nov 7, 2023
@ion-elgreco ion-elgreco changed the title feat: add pyarrow to delta compatible schema conversion in writer/merge feat(python): add pyarrow to delta compatible schema conversion in writer/merge Nov 7, 2023
@ion-elgreco ion-elgreco force-pushed the feat/add_schema_conversion_helpers branch from 3b9eebc to 2302148 Compare November 7, 2023 21:59
@ion-elgreco ion-elgreco marked this pull request as ready for review November 7, 2023 21:59
@ion-elgreco
Copy link
Collaborator Author

With 843c1d8 it's robust against schema's that have mixed Large and normal types. The schema will either be down casted to normal types or everything to large types, depending on user input.

@ion-elgreco
Copy link
Collaborator Author

@wjones127 do you know how we can cast schemas onto a pyarrow dataset or recordbatchreader without materializing? replace_schema on Dataset doesn't work apparently.

Also I am not sure if it's even wrong to not materialize since we have the validate_batch to check the invariants at the end

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looking good. Just a few minor changes and it's ready to ship :)

python/deltalake/schema.py Outdated Show resolved Hide resolved
python/deltalake/schema.py Outdated Show resolved Hide resolved
python/tests/test_schema.py Show resolved Hide resolved
python/tests/test_writer.py Outdated Show resolved Hide resolved
python/tests/test_writer.py Outdated Show resolved Hide resolved
python/tests/test_writer.py Outdated Show resolved Hide resolved
@ion-elgreco
Copy link
Collaborator Author

@wjones127 I've applied all the requested changes

Copy link
Collaborator

@roeap roeap Nov 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this file live at the same level as the regular license and be included via the pyproject.toml?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I would propose:

  1. Move the license to python/licenses/polars_license.txt and reference that path in code comment.
  2. Create a file python/licenses/readme.md and include a list of which license apply to which code.
  3. Add MIT license to pyproject.toml

I think it's possible we will borrow other polars code in the futures, so nice to name the file polars_license.txt.

Does that sound good @ion-elgreco ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just realized I commented on a completely unrelated file - luckily the question seems to have been understood regardless 😆.

Personally, I like @wjones127's suggestion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wjones127 can you take a look at point 3, I wasn't sure on how I could add another license file in the toml for it to work with maturin.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ready to merge this? 😀

@ion-elgreco ion-elgreco force-pushed the feat/add_schema_conversion_helpers branch 3 times, most recently from 83cbe65 to 3a31fc7 Compare November 20, 2023 06:42
@ion-elgreco ion-elgreco force-pushed the feat/add_schema_conversion_helpers branch from 3a31fc7 to 06931fa Compare November 20, 2023 22:18
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ion-elgreco ion-elgreco force-pushed the feat/add_schema_conversion_helpers branch from 3356048 to e2beb38 Compare November 24, 2023 19:18
@ion-elgreco
Copy link
Collaborator Author

@roeap had to force push :) can you approve again?

@ion-elgreco ion-elgreco dismissed wjones127’s stale review November 24, 2023 20:23

Changes are resolved

@ion-elgreco ion-elgreco merged commit fa6c513 into delta-io:main Nov 24, 2023
24 checks passed
ion-elgreco added a commit to ion-elgreco/delta-rs that referenced this pull request Nov 25, 2023
…iter/merge (delta-io#1820)

This ports some functionality that @stinodego and I had worked on in
Polars. Where we converted a pyarrow schema to a compatible delta
schema. It converts the following:

- uint -> int
- timestamp(any timeunit) -> timestamp(us)

I adjusted the functionality to do schema conversion from large to
normal when necessary, which is still needed in MERGE as workaround
delta-io#1753.

Additional things I've added:

- Schema conversion for every input in write_deltalake/merge
- Add Pandas dataframe conversion
- Add Pandas dataframe as input in merge

- closes delta-io#686
- closes delta-io#1467

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package
Projects
None yet
3 participants