Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sperate get_next_page_header from get_next_page in PageReader #1834

Closed
Ted-Jiang opened this issue Jun 10, 2022 · 3 comments
Closed

Sperate get_next_page_header from get_next_page in PageReader #1834

Ted-Jiang opened this issue Jun 10, 2022 · 3 comments
Labels
parquet Changes to the parquet crate

Comments

@Ted-Jiang
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

pub trait PageReader: Iterator<Item = Result<Page>> + Send {
    /// Gets the next page in the column chunk associated with this reader.
    /// Returns `None` if there are no pages left.
    fn get_next_page(&mut self) -> Result<Option<Page>>;
}

change to

pub trait PageReader: Iterator<Item = Result<Page>> + Send {
    /// Gets the next page in the column chunk associated with this reader.
    /// Returns `None` if there are no pages left.
    fn get_next_page(&mut self) -> Result<Option<Page>>;

    /// Gets the next page in the column chunk associated with this reader.
    /// Returns `None` if there are no pages left.
    fn get_next_page_header(&mut self) -> Result<Option<PageHeader>>;
}

i want to add skip_page in PageReader (for page index) need check page header before call skip_page (
can not skip DictionaryPage IndexPage),
but now read header is in get_next_page, so extract it to a new func.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
put skip_page in get_next_page

Additional context
Add any other context or screenshots about the feature request here.

@Ted-Jiang Ted-Jiang added the enhancement Any new improvement worthy of a entry in the changelog label Jun 10, 2022
@Ted-Jiang
Copy link
Member Author

@tustvold PTAL

@tustvold
Copy link
Contributor

tustvold commented Jun 10, 2022

I could be missing something, but I was under the impression the intention of the ColumnIndex, as opposed to using page-level statistics, was precisely to avoid needing to read any page data for those being skipped? Is this information not encoded in the PageLocation index or something?

Note also, if there is a dictionary page, which there likely will be if ColumnChunkMetaData::encodings suggests it, it MUST be the first one in the column chunk - https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

@Ted-Jiang
Copy link
Member Author

@tustvold Thanks a lot, i miss the dictionary pateMUST be the first one in the column chunk part.

Last two weeks, i busy about my personal things. I will go back and work on page index!
Thanks again for your info ❤️

@alamb alamb added parquet Changes to the parquet crate and removed enhancement Any new improvement worthy of a entry in the changelog labels Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

3 participants