Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refine parquet documentation on types and metadata #5786

Merged
merged 5 commits into from
May 20, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 19, 2024

Which issue does this PR close?

Rationale for this change

In the context of apache/datafusion#10453 I have spent a while trying to use the parquet crate's documentation, and I found it quite confusing.

My real goal is to update the statistics documentation, however, when I opened the docs pretending I just wanted to use the crate (not check out its code and know how it worked) there wasn't an obvious way I was even going to find the statsistics.

Thus I propose cleaning up some of these docs to start.

What changes are included in this PR?

  1. Improve crate lib page to also list parquet metadata and statistics access as an API
  2. Improve documentation on metadata to better explain how they are related to each other
  3. Improve documentation on type structures to try and explain what they do better

Note this is far from perfect, but I think it is significantly better than what we have in terms of navigating the crate. I'll keep working on the docs as follow on PRs

Are there any user-facing changes?

Better docs

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 19, 2024
@alamb alamb changed the title Minor: Refine parquet documentation on types and metadata Refine parquet documentation on types and metadata May 19, 2024
///
/// * File level metadata: [`FileMetaData`]
/// * Row Group level metadata: [`RowGroupMetaData`]
/// * (Optional) "Page Index": [`ParquetColumnIndex`]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't column index here? Page Index includes column index and offset index?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the page index is the collective term for the column index and the offset index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I have filed a PR in apache/parquet-format#245 to try and clarify this (fun fact -- the actual spec / format do not use the term "page index" anywhere except the file name of the document that describes the ColumnIndex)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've commented on that PR also, I think you are conflating the feature name "page index" with the name of one of the two types of index. This should be called column index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify in 7ac688e

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL clearly I was confused about exaclty what was meant by "page index" -- but now next time will have documentation to refer to

@@ -1,3 +1,4 @@
//! See [`crate::file`] for easier to use APIs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is autogenerated, can we update regen.sh to append this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a895f7c

Note that there were more differences after running regen.sh but I reverted the other differences locally

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor nits

@tustvold tustvold merged commit c6b3eaa into apache:master May 20, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants