Minor: Update doc strings about Page Index / Column Index #3625
parquet/src/file/metadata.rs
@@ -50,7 +50,25 @@ use crate::schema::types::{
Type as SchemaType,
};

/// [`Index`] page level for each row group of each column.
I found these two typedefs especially confusing, which is why I propose expanding the doc strings and adding examples.
@@ -25,8 +27,17 @@ use crate::format::{ColumnIndex, OffsetIndex, PageLocation};
use std::io::{Cursor, Read};
use thrift::protocol::{TCompactInputProtocol, TSerializable};

/// Read on row group's all columns indexes and change into [`Index`]
/// If not the format not available return an empty vector.
/// Reads per-column [`Index`] for all columns of a row group by
If I may rant a little, I find the parquet spec's use of the terms Column Index and Page Index to refer to overlapping parts of this feature very confusing. Like the column index feature is made up of page indexes, maybe? Blah
/// `row_group_number`.
///
/// For example `column_index[2][3]` holds the [`Index`] for the fourth
/// column in the third row group of the parquet file.
I read how ArrowReaderBuilder populates columns_indexes and so this looks correct. 👍
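To make the `[row_group_number][column_number]` nesting from the doc comment concrete, here is a minimal sketch. It does not use the parquet crate: `ToyIndex`, `ColumnIndexByRowGroup`, `column_index_for` and `toy_column_index` are hypothetical names invented for illustration, and the real [`Index`] carries per-page statistics rather than the two fields shown here.

```rust
// Hypothetical stand-in for parquet's `Index`; it only records where it
// lives so that the nesting order can be checked.
#[derive(Debug, Clone, PartialEq)]
struct ToyIndex {
    row_group: usize,
    column: usize,
}

// Assumed shape, following the doc comment: the outer Vec is indexed by
// row group, the inner Vec by column.
type ColumnIndexByRowGroup = Vec<Vec<ToyIndex>>;

/// Looks up the entry for `column_number` within `row_group_number`.
/// Returns `None` instead of panicking when either position is out of range.
fn column_index_for(
    column_index: &ColumnIndexByRowGroup,
    row_group_number: usize,
    column_number: usize,
) -> Option<&ToyIndex> {
    column_index.get(row_group_number)?.get(column_number)
}

/// Builds a toy "file" with 3 row groups of 4 columns each.
fn toy_column_index() -> ColumnIndexByRowGroup {
    (0..3usize)
        .map(|rg| {
            (0..4usize)
                .map(|col| ToyIndex { row_group: rg, column: col })
                .collect()
        })
        .collect()
}
```

With this toy data, `toy_column_index()[2][3]` is the entry for the fourth column of the third row group, matching the example in the doc comment, while `column_index_for(&ci, 3, 0)` returns `None` because there is no fourth row group.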
Thanks @viirya
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
LGTM, it's a developer-friendly improvement 👍
/// the [`PageLocation`] corresponding to page `page_number` of column
/// `column_number` of row group `row_group_number`.
///
/// For example `offset_index[2][3][4]` holds the [`PageLocation`] for
Nice write up! 👍
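The three-level `[row_group_number][column_number][page_number]` lookup can be sketched the same way. This is an illustration under stated assumptions, not the crate's code: the `PageLocation` struct below is a local stand-in with the three fields from the Parquet Thrift definition, and `OffsetIndexByRowGroup`, `page_location_for` and `toy_offset_index` are hypothetical names.

```rust
// Local stand-in for the Thrift-generated `PageLocation`; the values
// stored in it below are made up for illustration.
#[derive(Debug, Clone, PartialEq)]
struct PageLocation {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}

// Assumed shape, following the doc comment:
// row group -> column -> page.
type OffsetIndexByRowGroup = Vec<Vec<Vec<PageLocation>>>;

/// Looks up one page's location, returning `None` if any of the three
/// positions is out of range.
fn page_location_for(
    offset_index: &OffsetIndexByRowGroup,
    row_group_number: usize,
    column_number: usize,
    page_number: usize,
) -> Option<&PageLocation> {
    offset_index
        .get(row_group_number)?
        .get(column_number)?
        .get(page_number)
}

/// Builds a toy "file": 3 row groups, 4 columns each, 5 pages per column.
/// The fake offset encodes the position as rg * 100 + col * 10 + page.
fn toy_offset_index() -> OffsetIndexByRowGroup {
    (0..3i64)
        .map(|rg| {
            (0..4i64)
                .map(|col| {
                    (0..5i64)
                        .map(|page| PageLocation {
                            offset: rg * 100 + col * 10 + page,
                            compressed_page_size: 1,
                            first_row_index: page * 1000,
                        })
                        .collect()
                })
                .collect()
        })
        .collect()
}
```

In this toy layout, `toy_offset_index()[2][3][4]` is the fifth page of the fourth column of the third row group, so its fake `offset` is 234.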
pub physical_type: Type,
/// The indexes, one item per page
pub indexes: Vec<PageIndex<T>>,
/// the order
/// If the min/max elements are ordered, and if so in which
This is the correct description.
Nice @alamb
Benchmark runs are scheduled for baseline = 9c95533 and contender = f78a9be. f78a9be is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
N/A
Rationale for this change
While working on a bug downstream in DataFusion (apache/datafusion#5104) I found myself often confused about what the Vec<Vec<Vec<..>>> and various other structures in parquet represented. I spent a while reading the code, so I figured I would encode this learning into some more documentation to help my future self and hopefully other readers.
What changes are included in this PR?
Doc comments about various ColumnIndex / Page Index structures.
Are there any user-facing changes?
docstrings