[feature](reader) Optimize Complex Type Column Reading with Column Pruning #59286

kaka11chen · 2025-12-23T06:58:53Z

What problem does this PR solve?

Problem Summary:

Release note

Cherry-pick #58370 #58354 #59043 #58851 #58485 #58682 #58614 #58373 #57204 #58719 #58471 #58573 #58657

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

…uning (#57204)

Problem Summary: Optimize Complex Type Column Reading with Column Pruning

This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed.

**Key changes:**

- **FE (Frontend)**: Added column access path calculation and type pruning
  - Collects and analyzes access paths for complex type fields
  - Performs type pruning based on access paths
  - Implements projection pushdown for complex types
- **BE (Backend)**: Added selective column reading
  - Uses columnAccessPath array from FE to identify required sub-columns
  - Implements selective reading to skip unnecessary sub-columns

**Performance Improvement**: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with `struct<int a, int b> s`, when only `s.a` is referenced, we can avoid reading `s.b` entirely.

**Technical Benefits**: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals.

- **Lazy Materialization for Complex Type Sub-columns**: Defer materialization of unused sub-columns
- **Predicate Pushdown for Complex Type Sub-columns**: Push predicates to storage layer for better filtering
- **Parquet RL/DL Optimization**: Read only repetition levels and definition levels without data in appropriate scenarios
- **Array Size Optimization**: Read only offset and null values for `array_size()` operations
- **Null Check Optimization**: Read only offset and null values for `!= null` checks

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
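To make the idea of pruning a type by access paths concrete, here is a minimal sketch in Python. It is not the Doris FE implementation: the dict-based type model, the `prune_type` name, and the list-of-field-names path format are all assumptions made for illustration, and only struct fields are modelled (arrays and maps are left out).

```python
from typing import Dict, List, Union

# Hypothetical type model: a struct is a dict of field name -> type,
# and a leaf type is just its name as a string.
Type = Union[str, Dict[str, "Type"]]


def prune_type(col_type: Type, access_paths: List[List[str]]) -> Type:
    """Keep only the sub-fields reachable through at least one access path."""
    # A leaf cannot be pruned further; an empty path means the whole value
    # is accessed, so nothing below this point may be dropped.
    if isinstance(col_type, str) or any(len(p) == 0 for p in access_paths):
        return col_type
    pruned: Dict[str, Type] = {}
    for name, sub_type in col_type.items():
        # Remainders of the paths that pass through this field.
        rest = [p[1:] for p in access_paths if p[0] == name]
        if rest:
            pruned[name] = prune_type(sub_type, rest)
    return pruned


# struct<int a, int b> s, with only s.a referenced by the query.
s_type = {"a": "int", "b": "int"}
print(prune_type(s_type, [["a"]]))  # {'a': 'int'}
```

With the `struct<int a, int b> s` example from the description, the path `["a"]` leaves only field `a` in the pruned type, so a reader driven by that type would have no reason to touch `s.b`.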

…complex-type column pruning functionality (#58373)

A TabletSchema with pruned column types should not be cached.

…a change (#58614)

The column name of each sub-iterator of StructIterator should be set.

…ns (#58682)

The `next_batch` method should accumulate the row count.

…tched map access (#58485)

- Replace the per-row seek + next_batch(1) loop in MapFileColumnIterator::read_by_rowids with batched offset reads via OffsetFileColumnIterator::read_by_rowids to derive key/value ranges for the requested rowids.
- Compute per-row map sizes from offset[rowid] and offset[rowid+1], using the page-tail next_array_item_ordinal sentinel for the last row when rowid+1 is out of bounds.
- Skip key/value decoding for null rows by consulting a pre-fetched null map, and add a safety check to reject non-nullable destination columns when the underlying map reader is nullable.
- Reuse a small peek column in OffsetFileColumnIterator::_peek_one_offset to avoid repeated temporary column allocations when reading page sentinels.
- Add a unit test (MapReadByRowidsSkipReadingResizesDestination) to verify that read_by_rowids honors the SKIP_READING flag and only resizes the destination column without touching sub-iterators.
- Improve performance from ~19s to ~0.1s in the worst-case access pattern, and from ~6s to ~3s in the normal case.
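The size computation in the second and third bullets can be sketched as follows. This is a simplified Python model, not the C++ iterator code: the `map_sizes` name, the flat `offsets` list, and treating a null row as size 0 are assumptions made for illustration.

```python
from typing import List, Optional


def map_sizes(offsets: List[int], rowids: List[int], next_item_ordinal: int,
              null_map: Optional[List[bool]] = None) -> List[int]:
    """Derive the number of key/value entries for each requested rowid."""
    sizes = []
    for rowid in rowids:
        # Assumption: a null row contributes no entries, so decoding is skipped.
        if null_map is not None and null_map[rowid]:
            sizes.append(0)
            continue
        start = offsets[rowid]
        # For the last row there is no offsets[rowid + 1]; fall back to the
        # sentinel that marks where the next item would begin.
        if rowid + 1 < len(offsets):
            end = offsets[rowid + 1]
        else:
            end = next_item_ordinal
        sizes.append(end - start)
    return sizes


# Four rows starting at items 0, 2, 2 and 5; the sentinel says items end at 9.
print(map_sizes([0, 2, 2, 5], [0, 2, 3], next_item_ordinal=9))  # [2, 3, 4]
```

Because every size comes from two offset lookups, the requested rowids can be resolved from one batched offset read instead of a seek plus a one-row read per rowid.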

…d_by_rowids in scenarios where the rowids are continuous (#58851)

Avoid seeking and reading row by row.

Read as many consecutive rows as possible.
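One way to picture "read as many consecutive rows as possible" is to coalesce the requested rowids into runs, so that each run needs a single seek followed by one batched read. The sketch below is illustrative Python, not the Doris implementation; the `consecutive_runs` helper is a hypothetical name and it assumes the rowids arrive in ascending order.

```python
from typing import List, Tuple


def consecutive_runs(rowids: List[int]) -> List[Tuple[int, int]]:
    """Group ascending rowids into (start, count) runs of consecutive rows."""
    runs: List[List[int]] = []
    for rowid in rowids:
        # Extend the current run when this rowid directly follows it.
        if runs and runs[-1][0] + runs[-1][1] == rowid:
            runs[-1][1] += 1
        else:
            runs.append([rowid, 1])
    return [(start, count) for start, count in runs]


# Six rowids collapse into three seek + batched-read operations.
print(consecutive_runs([3, 4, 5, 9, 10, 20]))  # [(3, 3), (9, 2), (20, 1)]
```

When the rowids are fully continuous, the whole request collapses into one run, which is the case the preceding commit targets.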

add some tests for prune nested column (cherry picked from commit 70f3e9d)

Optimize project pushdown; this can reduce the scan bytes and shuffle bytes by pruning nested columns. Related to #57204.

The SQL:

```sql
select coalesce(struct_element(t1.s, 'city'), 'beijing')
from t1 join t2
on t1.id = t2.id
```

Original plan:

```
Project(coalesce(struct_element(t1.s, 'city'), 'beijing'))
                    |
            Join(t1.id=t2.id)
            /               \
Project(t1.id, t1.s)     Project(t2.id)
           |                  |
       Scan(t1)            Scan(t2)
```

Optimized plan:

```
Project(coalesce(slot#3, 'beijing'))
                    |
            Join(t1.id=t2.id)
            /               \
Project(t1.id, struct_element(t1.s, 'city')#3)     Project(t2.id)
           |                                             |
       Scan(t1)                                       Scan(t2)
```

(cherry picked from commit c30c0ff)
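The rewrite between the two plans can be sketched as an expression transformation: each `struct_element(...)` call in the top project is replaced by a slot reference, and the call itself is recorded so it can be added to the project below the join. The Python below is only an illustration under stated assumptions; the tuple-based expression encoding, the `push_down_struct_elements` name, and the slot numbering starting at 0 are hypothetical and do not mirror the Nereids rule.

```python
from typing import Dict


def push_down_struct_elements(expr, pushed: Dict[tuple, str]):
    """Replace each struct_element(...) call with a slot and record it in `pushed`."""
    # Plain strings (function names, literals) are returned unchanged.
    if not isinstance(expr, tuple):
        return expr
    if expr[0] == "struct_element":
        # Reuse the same slot when an identical expression appears twice.
        slot = pushed.setdefault(expr, f"slot#{len(pushed)}")
        return ("slot", slot)
    return tuple(push_down_struct_elements(e, pushed) for e in expr)


# coalesce(struct_element(t1.s, 'city'), 'beijing')
top = ("coalesce", ("struct_element", ("col", "t1.s"), "city"), "beijing")
pushed: Dict[tuple, str] = {}
new_top = push_down_struct_elements(top, pushed)

print(new_top)  # ('coalesce', ('slot', 'slot#0'), 'beijing')
# Expressions the lower project must now produce instead of the whole t1.s.
child_project = [("col", "t1.id")] + list(pushed)
print(child_project)
```

After the rewrite, the lower project outputs only `t1.id` and the extracted `city` field, so the full struct no longer has to flow through the join.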

hello-stephen · 2025-12-23T06:58:58Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

kaka11chen · 2025-12-23T06:59:09Z

run buildall

fix `Input slot(s) not in child's output`, introduced by #57204 (cherry picked from commit b788842)

Fix a backend core caused by pruning a map type: when the map type is changed, the nested column type should not be pruned. Introduced by #57204 (cherry picked from commit 1d7f6c4)

Fix dereference expressions that could not be pruned, introduced by #57532 (cherry picked from commit 2b23693)

kaka11chen · 2025-12-23T07:15:31Z

run buildall

doris-robot · 2025-12-23T07:36:13Z

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 82.10% (1573/1916) |
| Line Coverage | 67.12% (28073/41824) |
| Region Coverage | 67.61% (13809/20426) |
| Branch Coverage | 58.05% (7363/12684) |

hello-stephen · 2025-12-23T08:29:37Z

FE UT Coverage Report

Increment line coverage 63.13% (916/1451) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2025-12-23T10:22:32Z

BE UT Coverage Report

Increment line coverage 63.99% (1217/1902) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 53.36% (18616/34887) |
| Line Coverage | 39.14% (172541/440806) |
| Region Coverage | 33.81% (133335/394362) |
| Branch Coverage | 34.83% (57693/165650) |

github-actions · 2025-12-24T02:35:07Z

PR approved by at least one committer and no changes requested.

github-actions · 2025-12-24T02:35:10Z

PR approved by anyone and no changes requested.

kaka11chen and others added 9 commits December 23, 2025 11:54

[chore](test) add some tests for prune nested column (#58354)

109bcca

kaka11chen requested a review from yiguolei as a code owner December 23, 2025 06:58

924060929 added 3 commits December 23, 2025 15:12

[fix](nereids) fix Input slot(s) not in child's output (#58471)

adcdfd4

[fix](nereids) fix prune map type cause backend core (#58573)

4a29af8

[fix](nereids) fix can not prune dereference expression (#58657)

eac26e9

924060929 force-pushed the cherry-pick-nested_column_prune_4.0 branch from 645394e to eac26e9 Compare December 23, 2025 07:14

yiguolei approved these changes Dec 24, 2025

github-actions bot added the approved and reviewed labels Dec 24, 2025

yiguolei merged commit eea2586 into branch-4.0 Dec 24, 2025
24 of 30 checks passed
