@kaka11chen
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

hubgeter and others added 3 commits January 7, 2026 21:10
…ache#58785)

Related PR: apache#51329
Problem Summary:
This PR primarily enables the Parquet reader to use page indexes when
reading complex columns, and also fixes a data-reading error in the
topn PR.

Problem Summary:

Refine some metrics in the Parquet reader profile.
1. Rename several `Statistics` classes for readability (too many structs
shared the name `Statistics`).
2. Add a `read page header timer` to the Parquet reader profile.
3. Fix invalid check logic in `MergeRangeFileReader` when setting the
prefetch buffer size.
4. Fix the incorrect data cache profile for external table scans.
Copilot AI review requested due to automatic review settings January 7, 2026 16:59
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

This PR implements a parquet file page cache feature and cherry-picks changes from #58785 and #58895. The implementation adds caching capabilities for parquet pages to improve read performance by avoiding repeated decompression and I/O operations.

Key changes:

  • Adds parquet page cache functionality with configurable options (enable/disable, compression thresholds)
  • Refactors statistics collection (renamed from Statistics to ReaderStatistics and ColumnStatistics for clarity)
  • Adds comprehensive test coverage including unit tests for cache hit/miss scenarios, compression handling, and multi-page cases
  • Introduces new session variables to control page cache behavior
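
To make the caching idea in the overview concrete, here is a minimal sketch of a byte-bounded LRU page cache with hit/miss counters. It is not the PR's implementation: `PageKey`, `PageCacheStats`, and `PageCache` are hypothetical names, and keying pages by file path, modification time, and byte offset is an assumption (the `mtime()` methods added to the test mocks hint at something similar, but the PR text does not confirm it).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical key: one page is identified by file path, file mtime, and byte offset.
// Including mtime means a rewritten file never matches stale entries.
struct PageKey {
    std::string path;
    int64_t mtime;
    uint64_t offset;
    bool operator<(const PageKey& o) const {
        return std::tie(path, mtime, offset) < std::tie(o.path, o.mtime, o.offset);
    }
};

// Hypothetical counters, in the spirit of the page cache counters the PR adds.
struct PageCacheStats {
    size_t hits = 0;
    size_t misses = 0;
};

class PageCache {
public:
    explicit PageCache(size_t capacity_bytes) : _capacity(capacity_bytes) {}

    // Returns the cached page bytes, or nullptr on a miss.
    // The pointer stays valid until the entry is evicted or replaced.
    const std::vector<uint8_t>* lookup(const PageKey& key) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            ++_stats.misses;
            return nullptr;
        }
        // Move the entry to the front: it is now the most recently used.
        _lru.splice(_lru.begin(), _lru, it->second);
        ++_stats.hits;
        return &it->second->second;
    }

    // Stores a page, replacing any entry with the same key and evicting
    // least recently used entries until the byte budget is respected.
    void insert(const PageKey& key, std::vector<uint8_t> page) {
        auto it = _index.find(key);
        if (it != _index.end()) {
            _used -= it->second->second.size();
            _lru.erase(it->second);
            _index.erase(it);
        }
        if (page.size() > _capacity) {
            return; // A page larger than the whole cache is not stored.
        }
        _used += page.size();
        _lru.emplace_front(key, std::move(page));
        _index[key] = _lru.begin();
        while (_used > _capacity) {
            const Entry& victim = _lru.back();
            _used -= victim.second.size();
            _index.erase(victim.first);
            _lru.pop_back();
        }
    }

    const PageCacheStats& stats() const { return _stats; }

private:
    using Entry = std::pair<PageKey, std::vector<uint8_t>>;
    size_t _capacity;
    size_t _used = 0;
    std::list<Entry> _lru; // Front is the most recently used entry.
    std::map<PageKey, std::list<Entry>::iterator> _index;
    PageCacheStats _stats;
};
```

A reader built on this sketch would call `lookup` before issuing I/O and call `insert` after reading (and, if it chooses to cache decompressed bytes, after decompressing) a page. Whether to cache compressed or decompressed bytes is a policy choice the sketch leaves open.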

Reviewed changes

Copilot reviewed 46 out of 54 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| gensrc/thrift/PaloInternalService.thrift | Adds thrift field for enabling parquet file page cache in query options |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Adds session variable for parquet page cache with duplicate assignment bug |
| regression-test/suites/external_table_p0/hive/test_hive_topn_lazy_mat.groovy | Adds test cases for complex parquet tables with multiple pages |
| regression-test/data/external_table_p0/hive/test_hive_topn_lazy_mat.out | Expected test output data for new complex table queries |
| docker/thirdparties/.../parquet_topn_lazy_complex_table*/data_part_*.parquet | Binary parquet test data files for single and multi-page scenarios |
| docker/thirdparties/.../run80.hql | Hive table creation scripts for new test tables |
| be/test/vec/exec/format/parquet/parquet_page_cache_test.cpp | Comprehensive unit tests for page cache functionality |
| be/test/vec/exec/format/parquet/parquet_thrift_test.cpp | Updates test to use new API signatures |
| be/test/vec/exec/orc/orc_file_reader_test.cpp | Adds missing mtime() method to mock |
| be/test/vec/exec/format/file_reader/file_meta_cache_test.cpp | Adds missing mtime() method to mock |
| be/test/io/fs/buffered_reader_test.cpp | Adds missing mtime() methods to test readers |
| be/src/vec/exec/scan/file_scanner.cpp | Removes extraneous blank line |
| be/src/vec/exec/format/parquet/vparquet_reader.h | Renames Statistics to ReaderStatistics, adds page cache counters |
| be/src/vec/exec/format/parquet/vparquet_group_reader.h | Updates method signatures for new statistics structure and created_by parameter |
| be/src/vec/exec/format/parquet/vparquet_group_reader.cpp | Implements changes to support page cache and refactored statistics |
| be/src/vec/exec/format/parquet/level_decoder.cpp | Contains a typo ("toto" instead of "TODO") |

tResult.setEnableParquetLazyMat(enableParquetLazyMat);
tResult.setEnableOrcLazyMat(enableOrcLazyMat);
tResult.setEnableParquetFilterByMinMax(enableParquetFilterByMinMax);
tResult.setEnableParquetFilePageCache(enableParquetFilePageCache);

Copilot AI Jan 7, 2026

The field enableParquetFilePageCache is set twice in the toThrift() method. Line 4903 already sets it, so the assignment on line 4918 is redundant. Remove the duplicate to avoid confusion and potential bugs.

}

size_t doris::vectorized::LevelDecoder::get_levels(doris::vectorized::level_t* levels, size_t n) {
// toto template.

Copilot AI Jan 7, 2026

There is a typo in the comment. "toto" should be "TODO" or "Note:" depending on the intended meaning.

Suggested change
- // toto template.
+ // TODO: template.

@kaka11chen kaka11chen closed this Jan 8, 2026
4 participants