Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Update doc strings about Page Index / Column Index #3625

Merged
merged 5 commits into from
Jan 31, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 29, 2023

Which issue does this PR close?

N/A

Rationale for this change

While working on a bug downstream in DataFusion apache/datafusion#5104 I found myself often confused about what the Vec<Vec<Vec<..>>> and other various structured represented in parquet.

I spent a while reading the code so I figured I would encode this learning into some more documentation to help my future sef and hopefully other readers

What changes are included in this PR?

Doc comments about various ColumnIndex / Page Index structures.

Are there any user-facing changes?

docstrings

@alamb alamb requested a review from Ted-Jiang January 29, 2023 11:57
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 29, 2023
@@ -50,7 +50,25 @@ use crate::schema::types::{
Type as SchemaType,
};

/// [`Index`] page level for each row group of each column.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found these two typedefs especially confusing which is why I propose expanding doc strings and add examples

@@ -25,8 +27,17 @@ use crate::format::{ColumnIndex, OffsetIndex, PageLocation};
use std::io::{Cursor, Read};
use thrift::protocol::{TCompactInputProtocol, TSerializable};

/// Read on row group's all columns indexes and change into [`Index`]
/// If not the format not available return an empty vector.
/// Reads per-column [`Index`] for all columns of a row group by
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I may rant a little, the use of the terms Column Index and Page Index by the parquet spec to refer to overlapping parts of this feature I find very confusing. Like the column index feature is made up of page indexes, maybe? Blah

@alamb alamb marked this pull request as ready for review January 29, 2023 12:02
@alamb alamb changed the title Minor: Update doc strings about what Page Index / Column Index Minor: Update doc strings about Page Index / Column Index Jan 29, 2023
/// `row_group_number`.
///
/// For example `column_index[2][3]` holds the [`Index`] for the forth
/// column in the third row group of the parquet file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read how ArrowReaderBuilder populates columns_indexes and so this looks correct. 👍

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, It's developer-friendly improvement 👍

/// the [`PageLocation`] corresponding to page `page_number` of column
/// `column_number`of row group `row_group_number`.
///
/// For example `offset_index[2][3][4]` holds the [`PageLocation`] for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice write up! 👍

pub physical_type: Type,
/// The indexes, one item per page
pub indexes: Vec<PageIndex<T>>,
/// the order
/// If the min/max elements are ordered, and if so in which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the correct description.

@Dandandan Dandandan merged commit f78a9be into apache:master Jan 31, 2023
@Dandandan
Copy link
Contributor

Nice @alamb

@alamb alamb deleted the alamb/parquet_docs branch January 31, 2023 14:08
@ursabot
Copy link

ursabot commented Jan 31, 2023

Benchmark runs are scheduled for baseline = 9c95533 and contender = f78a9be. f78a9be is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants