Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added offset index based parquet reading support #4844

Merged
merged 13 commits into from
Nov 23, 2023

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Nov 15, 2023

Currently, while reading a particular row from a parquet file,

  • We first check if the row is in any of the pages read so far.
  • If not, we keep on extending one page till we reach the required page containing the row. Extending the page means computing the number of rows and setting up some internal data structures for each page.
    All of this happens inside VariablePageSizeColumnChunkPageStore.

Some of the above operations can be optimized using OffsetIndex data structure.
OffsetIndex is an optional field in Parquet which gives us three things for each page in a parquet file:

  1. Byte offset
  2. Compressed page size
  3. Index of first row (<- Most useful for this PR)

So for locating a page containing a row, we can directly binary search inside the offset index and don't need to extend pages up to that page. Doing this helps speed up sparse reads and improves the parallelization.

This PR makes the change and also fixes a bug in writing offset index for array and vector columns. Because of this bug fix, we can not use the old code to read the parquet files generated with the new code. So the PR is backwards compatible but not forwards compatible.

Also, as part of this PR, we are deleting the FixedPageSizeColumnChunkPageStore because to detect if page sizes are fixed, we need offset index. So fixed page size support is added in the new OffsetIndexBasedColumnChunkPageStore.

Benchmarking results: https://docs.google.com/document/d/19AIp6UpHTWpbmxo0ldkBYyuRKOIhlGLP2NHU26jOeOM/edit?usp=sharing

Closes #4717, #4718
Related to #4879

@malhotrashivam malhotrashivam added the parquet Related to the Parquet integration label Nov 15, 2023
@malhotrashivam malhotrashivam added this to the November 2023 milestone Nov 15, 2023
@malhotrashivam malhotrashivam self-assigned this Nov 15, 2023
@malhotrashivam malhotrashivam added feature request New feature or request NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. labels Nov 15, 2023
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not my full review. Just took a quick look at the tests for now. I'll wait until after Ryan's review to dig down deeper.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly high-level review pass. Be sure that we keep enterprise in the loop regarding forwards compatibility issue, and look for more external validation of our new approach to writing and interpreting offset indexes.

Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take another pass soon; need to take a break.

Comment on lines +95 to +96
@NotNull final ToPage<ATTR, ?> toPage,
@NotNull final ColumnDefinition<?> columnDefinition) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically a "breaking" change; another reason I don't like how our implementation is structured right now. Not your issue, but just noting. #4850

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't expect anyone to use our parquet libraries at this level but us. Java does not make it easy to specify that.

devinrsmith
devinrsmith previously approved these changes Nov 22, 2023
@malhotrashivam malhotrashivam merged commit 2c5ecbd into deephaven:main Nov 23, 2023
12 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature request New feature or request NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. parquet Related to the Parquet integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Properly use fixed page size reader for parquet
3 participants