-
Notifications
You must be signed in to change notification settings - Fork 420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet2 implementation backed by parquet2 feature gate #465
Conversation
@ritchie46 let me know what you think about this approach. The core library is now fully decoupled from the arrow-rs crate and only depends on parquet2 for checkpoint parsing. As a consumer, i.e. polars, you should be able to use it with If you are ok with this design, we can collaborate on the |
I don't understand this library enough yet to fully qualify this. But if anything comes during polars integration, I hope I can make suggestions. As I said, kudos for being able to feature gate such a core dependency. |
Sounds good @ritchie46 , I will complete the parquet parsing support for map and list this weekend. But the branch I have here right now should be enough to unblock ploars integration. |
decouple core from arrow
@houqp, any updates on this? |
@andrei-ionescu I have implemented all the data types other than map and nested list, so it's very close to be complete. However, my time is limited now, so progress will be slow. Anyone if welcome to collaborate on this branch to push this over the finish line :) |
alright, this branch is now feature complete ;) now it's time to catch up to latest delta-rs main branch and arrow2/parquet2 releases. |
8d4f220
to
a5579b4
Compare
@wjones127 @roeap ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only had some very minor comments. While not an expert on parquet reading, I felt the tests should cover that quite nicely!
The one thing I was wondering is the impl
's for the actions, or the action.rs file in general. It seems we have alot of feature flags there now, and maybe it would be cleaner to have the parquet specific stuff also moved into its own mod. Maybe adopt something ile the trait based approach in the new parquet2 implementation?
As this also seems like a step towards supporting polars
, we will face the question of having good integration points to support various backends / engines or however one wnats to call it :).
great work!
"metaData" => deserialize_metadata_column_page, | ||
"protocol" => deserialize_protocol_column_page, | ||
"commitInfo" => deserialize_commit_info_column_page, | ||
"cdc" => deserialize_cdc_column_page, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Good idea, I have moved all those code into a I chatted this with Andy and Jorge at the Data+AI summit a couple of months ago, the long term goal is to develop an Arrow trait that allows users to switch between arrow-rs and arrow2 in different projects including datafusion and delta-rs. This will also open up the possibility for a 3rd GPU based arrow implementation ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
decouple core from arrow
Description
WIP parquet2 implementation. The goal of this PR is to implement full read support leveraging parquet2. Write support is out of the scope and should be added as follow up PRs. Arrow2 integration is also out of the scope and should be added through follow up PR.
Currently all read tests are passing:
A quick benchmark shows more than 50% performance boost on checkpoint deserialization. Only tested with a very tiny checkpoint from the golden dataset, I would expect the performance gap would be bigger for larger real world tables.
Todo:
Related Issue(s)
blocks #310
Documentation