feat(python): add pyarrow to delta compatible schema conversion in writer/merge #1820
3b9eebc to 2302148
With 843c1d8 it's robust against schemas that mix large and normal types. Depending on user input, the schema is either downcast entirely to normal types or converted entirely to large types.
1e7e9b1 to 3a1a120
@wjones127 do you know how we can cast schemas onto a pyarrow dataset or RecordBatchReader without materializing? Also, I'm not sure it's even wrong to not materialize, since we have validate_batch to check the invariants at the end.
e370ecb to 8e076eb
ea25d41 to 850a05a
This is looking good. Just a few minor changes and it's ready to ship :)
0e29368 to 08cc31e
@wjones127 I've applied all the requested changes
python/deltalake/writer.py
Outdated
Should this file live at the same level as the regular license and be included via the pyproject.toml?
Here's what I would propose:
- Move the license to python/licenses/polars_license.txt and reference that path in a code comment.
- Create a file python/licenses/readme.md and include a list of which licenses apply to which code.
- Add the MIT license to pyproject.toml.
I think it's possible we will borrow other Polars code in the future, so it's nice to name the file polars_license.txt.
Does that sound good @ion-elgreco?
just realized I commented on a completely unrelated file - luckily the question seems to have been understood regardless 😆.
Personally, I like @wjones127's suggestion.
@wjones127 can you take a look at point 3? I wasn't sure how to add another license file in the toml so that it works with maturin.
Are we ready to merge this? 😀
83cbe65 to 3a31fc7
3a31fc7 to 06931fa
👍
Co-authored-by: Will Jones <willjones127@gmail.com>
3356048 to e2beb38
@roeap had to force push :) can you approve again?
…iter/merge (delta-io#1820)

This ports some functionality that @stinodego and I had worked on in Polars. Where we converted a pyarrow schema to a compatible delta schema. It converts the following:
- uint -> int
- timestamp(any timeunit) -> timestamp(us)

I adjusted the functionality to do schema conversion from large to normal when necessary, which is still needed in MERGE as workaround delta-io#1753.

Additional things I've added:
- Schema conversion for every input in write_deltalake/merge
- Add Pandas dataframe conversion
- Add Pandas dataframe as input in merge

- closes delta-io#686
- closes delta-io#1467

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
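Description
This ports some functionality that @stinodego and I had worked on in Polars, where we converted a pyarrow schema to a compatible delta schema. It converts the following:
- uint -> int
- timestamp(any timeunit) -> timestamp(us)
I adjusted the functionality to do schema conversion from large to normal when necessary, which is still needed in MERGE as a workaround for #1753.
Additional things I've added:
- Schema conversion for every input in write_deltalake/merge
- Add Pandas dataframe conversion
- Add Pandas dataframe as input in merge
Related Issue(s)
- closes #686
- closes #1467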