-
Notifications
You must be signed in to change notification settings - Fork 789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refine parquet documentation on types and metadata #5786
Conversation
parquet/src/file/metadata.rs
Outdated
/// | ||
/// * File level metadata: [`FileMetaData`] | ||
/// * Row Group level metadata: [`RowGroupMetaData`] | ||
/// * (Optional) "Page Index": [`ParquetColumnIndex`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't column index here? Page Index includes column index and offset index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, the page index is the collective term for the column index and the offset index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I have filed a PR in apache/parquet-format#245 to try and clarify this (fun fact -- the actual spec / format do not use the term "page index" anywhere except the file name of the document that describes the ColumnIndex)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've commented on that PR also, I think you are conflating the feature name "page index" with the name of one of the two types of index. This should be called column index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clarify in 7ac688e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOL clearly I was confused about exaclty what was meant by "page index" -- but now next time will have documentation to refer to
@@ -1,3 +1,4 @@ | |||
//! See [`crate::file`] for easier to use APIs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is autogenerated, can we update regen.sh
to append this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a895f7c
Note that there were more differences after running regen.sh but I reverted the other differences locally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor nits
Which issue does this PR close?
Rationale for this change
In the context of apache/datafusion#10453 I have spent a while trying to use the parquet crate's documentation, and I found it quite confusing.
My real goal is to update the statistics documentation, however, when I opened the docs pretending I just wanted to use the crate (not check out its code and know how it worked) there wasn't an obvious way I was even going to find the statsistics.
Thus I propose cleaning up some of these docs to start.
What changes are included in this PR?
Note this is far from perfect, but I think it is significantly better than what we have in terms of navigating the crate. I'll keep working on the docs as follow on PRs
Are there any user-facing changes?
Better docs