[feature](vparquet-reader) Implements parquet file page cache and cherry-pick #58785 #58895 #59654
…ache#58785) Related PR: apache#51329. Problem Summary: This PR primarily enables the Parquet reader to use page indexes when reading complex columns, and also fixes a data-reading error in the topn PR.
Problem Summary: Refine some metrics in the parquet reader profile.
1. Rename several `Statistics` classes to make them readable (there were too many `Statistics` structs with the same name).
2. Add a `read page header timer` to the parquet reader profile.
3. Fix invalid check logic in `MergeRangeFileReader` when setting the prefetch buffer size.
4. Fix an incorrect data cache profile for external table scans.
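As a rough illustration of items 1 and 2 above, the sketch below shows how a per-column statistics struct and a reader-level struct might carry a page-header timer that is filled by a scoped timer. The struct names `ColumnStatistics` and `ReaderStatistics` come from the review summary; every field name, the `ScopedTimerNs` helper, and `read_page_header` are hypothetical and are not taken from the Doris code.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical per-column counters; field names are illustrative only.
struct ColumnStatistics {
    int64_t read_page_header_time_ns = 0;
    int64_t page_header_read_calls = 0;
};

// Hypothetical reader-level counters that aggregate the per-column ones.
struct ReaderStatistics {
    int64_t read_page_header_time_ns = 0;
    int64_t page_header_read_calls = 0;

    void merge(const ColumnStatistics& col) {
        read_page_header_time_ns += col.read_page_header_time_ns;
        page_header_read_calls += col.page_header_read_calls;
    }
};

// RAII timer: adds the elapsed nanoseconds to *target when it goes out of scope.
class ScopedTimerNs {
public:
    explicit ScopedTimerNs(int64_t* target)
            : _target(target), _start(std::chrono::steady_clock::now()) {
        assert(target != nullptr);
    }
    ~ScopedTimerNs() {
        auto elapsed = std::chrono::steady_clock::now() - _start;
        *_target += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    }
    ScopedTimerNs(const ScopedTimerNs&) = delete;
    ScopedTimerNs& operator=(const ScopedTimerNs&) = delete;

private:
    int64_t* _target;
    std::chrono::steady_clock::time_point _start;
};

// Stand-in for parsing a page header; only the timing pattern matters here.
inline void read_page_header(ColumnStatistics& stats) {
    ScopedTimerNs timer(&stats.read_page_header_time_ns);
    ++stats.page_header_read_calls;
}
```

Giving the two structs distinct names keeps it clear at each call site whether a counter belongs to a single column reader or to the whole file reader, which appears to be the motivation for the rename.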
Pull request overview
This PR implements a parquet file page cache feature and cherry-picks changes from #58785 and #58895. The implementation adds caching capabilities for parquet pages to improve read performance by avoiding repeated decompression and I/O operations.
Key changes:
- Adds parquet page cache functionality with configurable options (enable/disable, compression thresholds)
- Refactors statistics collection (renamed from `Statistics` to `ReaderStatistics` and `ColumnStatistics` for clarity)
- Adds comprehensive test coverage including unit tests for cache hit/miss scenarios, compression handling, and multi-page cases
- Introduces new session variables to control page cache behavior
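To make the caching idea in the overview concrete, here is a minimal sketch of a byte-bounded LRU page cache. It is not the PR's implementation: the class `PageCacheSketch`, the `PageKey` layout (path, modification time, page offset), and the eviction policy are all assumptions chosen for illustration. Including the modification time in the key is one plausible reason the test mocks gained an `mtime()` method, but the source does not confirm that.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical cache key: a rewritten file gets a new mtime, so stale pages never match.
struct PageKey {
    std::string path;
    int64_t mtime;
    uint64_t offset;
    bool operator<(const PageKey& other) const {
        return std::tie(path, mtime, offset) <
               std::tie(other.path, other.mtime, other.offset);
    }
};

using PageBytes = std::shared_ptr<const std::vector<uint8_t>>;

class PageCacheSketch {
public:
    explicit PageCacheSketch(size_t capacity_bytes) : _capacity(capacity_bytes) {}

    // Returns the cached page, or nullptr on a miss. A hit marks the entry most recent.
    PageBytes lookup(const PageKey& key) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            ++_misses;
            return nullptr;
        }
        ++_hits;
        _lru.splice(_lru.begin(), _lru, it->second);
        return it->second->second;
    }

    // Inserts or replaces a page, then evicts least recently used entries
    // until the cache fits its byte budget (the newest entry is always kept).
    void insert(const PageKey& key, PageBytes page) {
        assert(page != nullptr);
        auto it = _index.find(key);
        if (it != _index.end()) {
            _used -= it->second->second->size();
            _lru.erase(it->second);
            _index.erase(it);
        }
        _used += page->size();
        _lru.emplace_front(key, std::move(page));
        _index[key] = _lru.begin();
        while (_used > _capacity && _lru.size() > 1) {
            auto& victim = _lru.back();
            _used -= victim.second->size();
            _index.erase(victim.first);
            _lru.pop_back();
        }
    }

    size_t hits() const { return _hits; }
    size_t misses() const { return _misses; }

private:
    using Entry = std::pair<PageKey, PageBytes>;
    size_t _capacity;
    size_t _used = 0;
    size_t _hits = 0;
    size_t _misses = 0;
    std::list<Entry> _lru;
    std::map<PageKey, std::list<Entry>::iterator> _index;
};
```

A reader would call `lookup` before issuing I/O and `insert` after reading (and possibly decompressing) a page. Whether the cached bytes are compressed or decompressed is the kind of decision the configurable compression thresholds mentioned above would govern; this sketch leaves that choice to the caller.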
Reviewed changes
Copilot reviewed 46 out of 54 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| gensrc/thrift/PaloInternalService.thrift | Adds thrift field for enabling parquet file page cache in query options |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Adds session variable for parquet page cache with duplicate assignment bug |
| regression-test/suites/external_table_p0/hive/test_hive_topn_lazy_mat.groovy | Adds test cases for complex parquet tables with multiple pages |
| regression-test/data/external_table_p0/hive/test_hive_topn_lazy_mat.out | Expected test output data for new complex table queries |
| docker/thirdparties/.../parquet_topn_lazy_complex_table*/data_part_*.parquet | Binary parquet test data files for single and multi-page scenarios |
| docker/thirdparties/.../run80.hql | Hive table creation scripts for new test tables |
| be/test/vec/exec/format/parquet/parquet_page_cache_test.cpp | Comprehensive unit tests for page cache functionality |
| be/test/vec/exec/format/parquet/parquet_thrift_test.cpp | Updates test to use new API signatures |
| be/test/vec/exec/orc/orc_file_reader_test.cpp | Adds missing mtime() method to mock |
| be/test/vec/exec/format/file_reader/file_meta_cache_test.cpp | Adds missing mtime() method to mock |
| be/test/io/fs/buffered_reader_test.cpp | Adds missing mtime() methods to test readers |
| be/src/vec/exec/scan/file_scanner.cpp | Removes extraneous blank line |
| be/src/vec/exec/format/parquet/vparquet_reader.h | Renames Statistics to ReaderStatistics, adds page cache counters |
| be/src/vec/exec/format/parquet/vparquet_group_reader.h | Updates method signatures for new statistics structure and created_by parameter |
| be/src/vec/exec/format/parquet/vparquet_group_reader.cpp | Implements changes to support page cache and refactored statistics |
| be/src/vec/exec/format/parquet/level_decoder.cpp | Contains a typo ("toto" instead of "TODO") |
    tResult.setEnableParquetLazyMat(enableParquetLazyMat);
    tResult.setEnableOrcLazyMat(enableOrcLazyMat);
    tResult.setEnableParquetFilterByMinMax(enableParquetFilterByMinMax);
    tResult.setEnableParquetFilePageCache(enableParquetFilePageCache);
Copilot AI, Jan 7, 2026
The field `enableParquetFilePageCache` is set twice in the `toThrift()` method. Line 4903 already sets this field, so the assignment on line 4918 is redundant. Remove the duplicate assignment to avoid confusion and potential bugs.
    }

    size_t doris::vectorized::LevelDecoder::get_levels(doris::vectorized::level_t* levels, size_t n) {
        // toto template.
Copilot AI, Jan 7, 2026
There is a typo in the comment. "toto" should be "TODO" or "Note:" depending on the intended meaning.
    - // toto template.
    + // TODO: template.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)