@kaka11chen kaka11chen commented Dec 23, 2025

What problem does this PR solve?

Problem Summary:

Release note

Cherry-pick #58370 #58354 #59043 #58851 #58485 #58682 #58614 #58373 #57204 #58719 #58471 #58573 #58657

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

kaka11chen and others added 9 commits December 23, 2025 11:54
…uning (#57204)

Problem Summary:

Optimize Complex Type Column Reading with Column Pruning

This PR implements column pruning for complex types (Struct, Array, Map)
to optimize read performance. Previously, Doris would read entire
complex type fields before processing, which was simple to implement but
inefficient when only specific sub-columns were needed.

**Key changes:**
- **FE (Frontend)**: Added column access path calculation and type
pruning
  - Collects and analyzes access paths for complex type fields
  - Performs type pruning based on access paths
  - Implements projection pushdown for complex types

- **BE (Backend)**: Added selective column reading
  - Uses columnAccessPath array from FE to identify required sub-columns
  - Implements selective reading to skip unnecessary sub-columns

**Performance Improvement**: When a struct contains hundreds or
thousands of columns but the query only accesses a few sub-columns, this
optimization can significantly reduce I/O and improve query performance.
For example, with `struct<int a, int b> s`, when only `s.a` is
referenced, we can avoid reading `s.b` entirely.

**Technical Benefits**: Reduces unnecessary data scanning and decoding
overhead for complex types, aligning with Doris's continuous performance
optimization goals.
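
A minimal Python sketch of the type-pruning idea described above, not the actual FE or BE code. It assumes a struct type is modeled as a dict from field name to either a leaf type name or a nested dict, and an access path as a tuple of field names. `prune_type` and both representations are hypothetical, used for illustration only.

```python
def prune_type(col_type, access_paths):
    """Return a copy of a struct type that keeps only the accessed sub-fields.

    col_type: dict of field name -> leaf type name (str) or nested struct (dict).
    access_paths: list of tuples of field names, relative to this struct.
    """
    # An empty path means the whole value is referenced, so nothing can be pruned.
    if any(len(path) == 0 for path in access_paths):
        return col_type
    pruned = {}
    for name, sub_type in col_type.items():
        sub_paths = [path[1:] for path in access_paths if path[0] == name]
        if not sub_paths:
            continue  # never referenced: the reader can skip this sub-column
        if isinstance(sub_type, dict):
            pruned[name] = prune_type(sub_type, sub_paths)
        else:
            pruned[name] = sub_type
    return pruned


# struct<int a, int b> s, with only s.a referenced
print(prune_type({"a": "int", "b": "int"}, [("a",)]))
```

With the `struct<int a, int b> s` example, the only access path is `("a",)`, so the pruned type keeps `a` and drops `b`.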

- **Lazy Materialization for Complex Type Sub-columns**: Defer
materialization of unused sub-columns
- **Predicate Pushdown for Complex Type Sub-columns**: Push predicates
to storage layer for better filtering
- **Parquet RL/DL Optimization**: Read only repetition levels and
definition levels without data in appropriate scenarios
- **Array Size Optimization**: Read only offset and null values for
`array_size()` operations
- **Null Check Optimization**: Read only offset and null values for
`!= null` checks

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
…complex-type column pruning functionality (#58373)

### What problem does this PR solve?

TabletSchema with pruned column type should not be cached.

Related PR: #xxx

Problem Summary:

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
…a change (#58614)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

The column name of each sub-iterator of StructIterator should be set.

### Release note

None

…ns (#58682)

### What problem does this PR solve?

The `next_batch` method should accumulate the row count.
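
One plausible reading of this fix, sketched in Python under stated assumptions: a batch read that spans several pages has to sum the rows taken from each page rather than report only the last page's count. `PagedIterator` and its fields are hypothetical and do not mirror the real BE iterator classes.

```python
class PagedIterator:
    """Toy iterator over a list of pages, where each page is a list of values."""

    def __init__(self, pages):
        self.pages = pages
        self.page_idx = 0
        self.offset = 0  # position inside the current page

    def next_batch(self, n, dst):
        """Append up to n rows to dst and return how many rows were read."""
        total_read = 0
        while total_read < n and self.page_idx < len(self.pages):
            page = self.pages[self.page_idx]
            take = min(n - total_read, len(page) - self.offset)
            dst.extend(page[self.offset:self.offset + take])
            self.offset += take
            # Accumulate across pages; assigning `take` here instead would
            # report only the rows read from the last page.
            total_read += take
            if self.offset == len(page):
                self.page_idx += 1
                self.offset = 0
        return total_read
```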

Related PR: #xxx

Problem Summary:

### Release note

None

…tched map access (#58485)

- Replace the per-row seek + next_batch(1) loop in
MapFileColumnIterator::read_by_rowids with batched offset reads via
OffsetFileColumnIterator::read_by_rowids to derive key/value ranges for
the requested rowids.
- Compute per-row map sizes from offset[rowid] and offset[rowid+1],
using the page-tail next_array_item_ordinal sentinel for the last row
when rowid+1 is out of bounds.
- Skip key/value decoding for null rows by consulting a pre-fetched null
map, and add a safety check to reject non-nullable destination columns
when the underlying map reader is nullable.
- Reuse a small peek column in
OffsetFileColumnIterator::_peek_one_offset to avoid repeated temporary
column allocations when reading page sentinels.
- Add a unit test (MapReadByRowidsSkipReadingResizesDestination) to
verify that read_by_rowids honors the SKIP_READING flag and only resizes
the destination column without touching sub-iterators.
- Improve performance from ~19s to ~0.1s in the worst-case access
pattern, and from ~6s to ~3s in the normal case.
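
The offset arithmetic in the bullets above can be sketched as follows. This is a simplified Python model, not the `MapFileColumnIterator` code: it assumes `offsets[i]` holds the starting item ordinal of row `i` within one page, `next_item_ordinal` stands in for the page-tail sentinel, and `map_ranges` is a hypothetical helper name.

```python
def map_ranges(offsets, next_item_ordinal, null_map, rowids):
    """Return one (start, size) key/value range per requested rowid.

    offsets: starting item ordinal of each row (one entry per row).
    next_item_ordinal: sentinel ordinal just past the last row's items.
    null_map: True where the map value of the row is NULL.
    """
    ranges = []
    for rid in rowids:
        start = offsets[rid]
        # The last row has no offsets[rid + 1]; fall back to the sentinel.
        end = offsets[rid + 1] if rid + 1 < len(offsets) else next_item_ordinal
        if null_map[rid]:
            ranges.append((start, 0))  # NULL row: no keys/values to decode
        else:
            ranges.append((start, end - start))
    return ranges


offsets = [0, 2, 2, 5]
print(map_ranges(offsets, 7, [False, True, False, False], [0, 1, 3]))
```

All offsets for the requested rowids are read up front, so the key and value sub-readers only have to visit the derived ranges instead of doing a seek plus `next_batch(1)` per row.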
…d_by_rowids in scenarios where the rowids are continuous (#58851)

### What problem does this PR solve?

Avoid seeking and reading row by row.

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

### Release note

None

### What problem does this PR solve?

Read as many consecutive rows as possible.
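
A small Python sketch of the idea, assuming the requested rowids arrive sorted in ascending order: group them into runs of consecutive values so that each run needs one seek and one batched read. `consecutive_runs` is a hypothetical helper, not a function from the Doris code base.

```python
def consecutive_runs(rowids):
    """Split ascending rowids into (start, count) runs of consecutive values."""
    runs = []
    for rid in rowids:
        if runs and rid == runs[-1][0] + runs[-1][1]:
            # rid directly follows the current run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap before rid: start a new run.
            runs.append((rid, 1))
    return runs


# Each run maps to one seek(start) followed by one batched read of `count` rows.
print(consecutive_runs([3, 4, 5, 9, 10, 20]))
```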

Problem Summary:

### Release note

None

Add tests for nested column pruning.

(cherry picked from commit 70f3e9d)
Optimize projection pushdown: pruning nested columns below the join
reduces both scan bytes and shuffle bytes. Related to #57204.

The SQL:
```sql
select coalesce(struct_element(t1.s, 'city'), 'beijing')
from t1 join t2
on t1.id = t2.id
```

original plan:
```
Project(coalesce(struct_element(t1.s, 'city'), 'beijing'))
                             |
                    Join(t1.id=t2.id)
                    /               \
            Project(t1.id, t1.s)    Project(t2.id)
                 |                    |
            Scan(t1)                Scan(t2)
```

optimized plan:
```

                       Project(coalesce(slot#3, 'beijing'))
                                      |
                               Join(t1.id=t2.id)
                    /                                       \
Project(t1.id, struct_element(t1.s, 'city')#3)              Project(t2.id)
              |                                                |
            Scan(t1)                                       Scan(t2)
```
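
The rewrite between the two plans can be sketched in Python, assuming expressions are modeled as nested tuples of the form `(function_name, *args)` with string leaves. `push_down` and the slot numbering are hypothetical and only illustrate the transformation; the optimizer rule itself is not shown here.

```python
def push_down(expr, pushed):
    """Replace struct_element(...) sub-expressions with slot references.

    expr: a str leaf or a tuple (function_name, *args).
    pushed: dict of extracted expression -> slot name, filled in place;
            these are the expressions the lower Project has to compute.
    """
    if not isinstance(expr, tuple):
        return expr
    if expr[0] == "struct_element":
        if expr not in pushed:
            pushed[expr] = f"slot#{len(pushed) + 1}"
        return pushed[expr]
    return (expr[0],) + tuple(push_down(arg, pushed) for arg in expr[1:])


pushed = {}
top = push_down(
    ("coalesce", ("struct_element", "t1.s", "city"), "'beijing'"), pushed
)
print(top)     # expression left in the upper Project
print(pushed)  # expressions moved into the Project below the join
```

Because only the extracted `struct_element` result crosses the join, the scan of `t1` needs just the `city` sub-column and the shuffle carries one value instead of the whole struct.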

(cherry picked from commit c30c0ff)
@kaka11chen kaka11chen requested a review from yiguolei as a code owner December 23, 2025 06:58
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaka11chen
Contributor Author

run buildall

Fix the `Input slot(s) not in child's output` error introduced by #57204.

(cherry picked from commit b788842)
Fix a backend core dump caused by pruning a map type: when the map type
is changed, the nested column type should not be pruned. Introduced by #57204.

(cherry picked from commit 1d7f6c4)
Fix dereference expressions that could not be pruned, introduced by #57532.

(cherry picked from commit 2b23693)
@924060929 924060929 force-pushed the cherry-pick-nested_column_prune_4.0 branch from 645394e to eac26e9 Compare December 23, 2025 07:14
@kaka11chen
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

| Category | Coverage |
| --- | --- |
| Function Coverage | 82.10% (1573/1916) |
| Line Coverage | 67.12% (28073/41824) |
| Region Coverage | 67.61% (13809/20426) |
| Branch Coverage | 58.05% (7363/12684) |

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 63.13% (916/1451) 🎉

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 63.99% (1217/1902) 🎉

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.36% (18616/34887) |
| Line Coverage | 39.14% (172541/440806) |
| Region Coverage | 33.81% (133335/394362) |
| Branch Coverage | 34.83% (57693/165650) |

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Dec 24, 2025
@github-actions
Contributor

PR approved by anyone and no changes requested.

@yiguolei yiguolei merged commit eea2586 into branch-4.0 Dec 24, 2025
24 of 30 checks passed
Labels

approved Indicates a PR has been approved by one committer. reviewed

8 participants