parquet2 implementation backed by parquet2 feature gate #465

Merged on Aug 30, 2022 (15 commits)


houqp (Member) commented Oct 17, 2021

decouple core from arrow

Description

WIP parquet2 implementation. The goal of this PR is to implement full read support using parquet2. Write support and arrow2 integration are out of scope and should be added in follow-up PRs.

Currently all read tests are passing:

cargo test --no-default-features --features=arrow2,parquet2 

A quick benchmark shows a performance boost of more than 50% on checkpoint deserialization. This was only tested with a very tiny checkpoint from the golden dataset; I would expect the gap to be bigger for larger real-world tables.

Todo:

  • clean up duplicated code
  • support parsing map type
  • support parsing list type
  • benchmark

Related Issue(s)

blocks #310

Documentation

houqp (Member Author) commented Oct 17, 2021

@ritchie46 let me know what you think about this approach. The core library is now fully decoupled from the arrow-rs crate and only depends on parquet2 for checkpoint parsing.

As a consumer (e.g. polars), you should be able to use it with --no-default-features --features=parquet2. The arrow integration only provides schema conversion between the delta table schema and the arrow schema, which is not very useful for polars. You might be better off just using the schema from the raw parquet file for now, see #441.

If you are ok with this design, we can collaborate on the qp_arrow2 branch to finish up the PoC.
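As a rough illustration of the feature gating described above, here is a minimal sketch of how a crate can pick a checkpoint reader at compile time. The helper name `checkpoint_backend` and the `"default"` label are assumptions made for this example and are not taken from delta-rs; only the `parquet2` feature name comes from this thread.

```rust
/// Hypothetical helper (not part of delta-rs): reports which checkpoint
/// reader this build was compiled with.
pub fn checkpoint_backend() -> &'static str {
    // `cfg!` expands to a compile-time boolean, so both branches are
    // type-checked but only one can ever run in a given build.
    if cfg!(feature = "parquet2") {
        "parquet2"
    } else {
        "default"
    }
}
```

Building with --no-default-features --features=parquet2 would make the first branch the active one; without that feature the helper falls through to the second.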

ritchie46 commented

> If you are ok with this design, we can collaborate on the qp_arrow2 branch to finish up the PoC.

I don't understand this library well enough yet to fully evaluate this. But if anything comes up during the polars integration, I hope I can make suggestions. As I said, kudos for being able to feature gate such a core dependency.

houqp (Member Author) commented Oct 19, 2021

Sounds good @ritchie46, I will complete the parquet parsing support for map and list this weekend. But the branch I have here right now should be enough to unblock the polars integration.

andrei-ionescu (Contributor) commented

@houqp, any updates on this?

houqp (Member Author) commented Jul 17, 2022

@andrei-ionescu I have implemented all the data types other than map and nested list, so it's very close to being complete. However, my time is limited now, so progress will be slow. Anyone is welcome to collaborate on this branch to push this over the finish line :)

@houqp houqp dismissed a stale review via a5ebfc9 August 22, 2022 01:39
houqp (Member Author) commented Aug 22, 2022

Alright, this branch is now feature complete ;) Now it's time to catch up to the latest delta-rs main branch and arrow2/parquet2 releases.

@houqp houqp force-pushed the qp_arrow2 branch 4 times, most recently from 8d4f220 to a5579b4 Compare August 29, 2022 00:09
@houqp houqp marked this pull request as ready for review August 29, 2022 00:43
@houqp houqp changed the title WIP: parquet2 implementation backed by parquet2 feature gate parquet2 implementation backed by parquet2 feature gate Aug 29, 2022
rust/tests/dataframe_test.rs (outdated, resolved)
houqp (Member Author) commented Aug 29, 2022

@wjones127 @roeap ready for review.

@houqp houqp enabled auto-merge (squash) August 29, 2022 07:11
roeap (Collaborator) left a comment

Only had some very minor comments. While not an expert on parquet reading, I felt the tests should cover that quite nicely!

The one thing I was wondering about is the impls for the actions, or the action.rs file in general. It seems we have a lot of feature flags there now, and maybe it would be cleaner to have the parquet specific stuff also moved into its own mod. Maybe adopt something like the trait based approach in the new parquet2 implementation?

As this also seems like a step towards supporting polars, we will face the question of having good integration points to support various backends / engines or however one wants to call it :).

great work!

rust/src/action/parquet2_read/dictionary/binary.rs (outdated, resolved)
rust/src/action/parquet2_read/map.rs (outdated, resolved)
"metaData" => deserialize_metadata_column_page,
"protocol" => deserialize_protocol_column_page,
"commitInfo" => deserialize_commit_info_column_page,
"cdc" => deserialize_cdc_column_page,
Collaborator commented

❤️
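The match arms quoted in this review thread dispatch on the name of a top-level checkpoint column. A minimal, self-contained sketch of that dispatch pattern follows. The `deserialize_*_column_page` names come from the snippet above, but their signatures, the `ActionRow` type, the `PageDeserializer` alias, and the `deserializer_for` helper are all assumptions made for illustration; the real functions in the PR operate on parquet2 pages and almost certainly look different.

```rust
/// Assumed, simplified stand-in for a partially decoded action.
#[derive(Debug, Default, PartialEq)]
pub struct ActionRow {
    pub kind: Option<&'static str>,
}

/// Assumed signature: each deserializer fills in one kind of action.
pub type PageDeserializer = fn(&mut ActionRow);

fn deserialize_metadata_column_page(row: &mut ActionRow) {
    row.kind = Some("metaData");
}

fn deserialize_protocol_column_page(row: &mut ActionRow) {
    row.kind = Some("protocol");
}

fn deserialize_commit_info_column_page(row: &mut ActionRow) {
    row.kind = Some("commitInfo");
}

fn deserialize_cdc_column_page(row: &mut ActionRow) {
    row.kind = Some("cdc");
}

/// Hypothetical helper: map a top-level column name to its deserializer,
/// returning `None` for columns this sketch does not know about.
pub fn deserializer_for(field: &str) -> Option<PageDeserializer> {
    let f: PageDeserializer = match field {
        "metaData" => deserialize_metadata_column_page,
        "protocol" => deserialize_protocol_column_page,
        "commitInfo" => deserialize_commit_info_column_page,
        "cdc" => deserialize_cdc_column_page,
        _ => return None,
    };
    Some(f)
}
```

Keeping the lookup in a single `match` means a new action type needs only one new arm and one new function.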

rust/tests/dataframe_test.rs (outdated, resolved)
@houqp houqp requested a review from roeap August 30, 2022 04:11
@houqp houqp disabled auto-merge August 30, 2022 05:02
@houqp houqp enabled auto-merge (squash) August 30, 2022 05:02
houqp (Member Author) commented Aug 30, 2022

> maybe it would be cleaner to have the parquet specific stuff also moved into its own mod

Good idea, I have moved all that code into a parquet_read mod to keep action lean.

I chatted about this with Andy and Jorge at the Data+AI Summit a couple of months ago. The long term goal is to develop an Arrow trait that allows users to switch between arrow-rs and arrow2 in different projects, including datafusion and delta-rs. This will also open up the possibility of a third, GPU-based arrow implementation ;)
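To make the idea of a switchable Arrow trait a bit more concrete, here is a purely hypothetical sketch. Nothing in this thread specifies what such a trait would look like, so `ArrowBackend`, `ArrowRs`, `Arrow2`, `load_schema`, and the toy schema types are all invented for illustration and do not use the real arrow-rs or arrow2 crates.

```rust
/// Hypothetical abstraction over an Arrow implementation.
pub trait ArrowBackend {
    /// Schema type owned by the concrete implementation.
    type Schema;

    /// Human-readable backend name.
    fn name() -> &'static str;

    /// Build a backend schema from (column name, type name) pairs.
    fn schema_from_fields(fields: &[(&str, &str)]) -> Self::Schema;
}

/// Stand-in for an arrow-rs backed implementation (assumption).
pub struct ArrowRs;

/// Stand-in for an arrow2 backed implementation (assumption).
pub struct Arrow2;

impl ArrowBackend for ArrowRs {
    type Schema = Vec<String>;

    fn name() -> &'static str {
        "arrow-rs"
    }

    fn schema_from_fields(fields: &[(&str, &str)]) -> Self::Schema {
        fields.iter().map(|(n, t)| format!("{n}: {t}")).collect()
    }
}

impl ArrowBackend for Arrow2 {
    type Schema = Vec<(String, String)>;

    fn name() -> &'static str {
        "arrow2"
    }

    fn schema_from_fields(fields: &[(&str, &str)]) -> Self::Schema {
        fields
            .iter()
            .map(|(n, t)| (n.to_string(), t.to_string()))
            .collect()
    }
}

/// Generic code names only the trait, so callers choose the backend.
pub fn load_schema<B: ArrowBackend>(fields: &[(&str, &str)]) -> (&'static str, B::Schema) {
    (B::name(), B::schema_from_fields(fields))
}
```

A caller would pick the backend with a type parameter, for example `load_schema::<Arrow2>(&fields)`. Under this design a third implementation would only need another `impl ArrowBackend`.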

roeap (Collaborator) left a comment

LGTM!

@houqp houqp merged commit 63798fd into main Aug 30, 2022
@houqp houqp deleted the qp_arrow2 branch August 30, 2022 06:11

4 participants