@kaka11chen
Contributor

@kaka11chen kaka11chen commented Oct 21, 2025

What problem does this PR solve?

Problem Summary:

Release note

Optimize Complex Type Column Reading with Column Pruning

Description

This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed.

Key changes:

  • FE (Frontend): Added column access path calculation and type pruning

    • Collects and analyzes access paths for complex type fields
    • Performs type pruning based on access paths
    • Implements projection pushdown for complex types
  • BE (Backend): Added selective column reading

    • Uses columnAccessPath array from FE to identify required sub-columns
    • Implements selective reading to skip unnecessary sub-columns
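As an illustration of the FE-side type pruning described above, here is a minimal sketch. It assumes a struct type can be modeled as a nested dict and an access path as a list of field names. `prune_type`, `s_type`, and `access_paths` are hypothetical names chosen for this example, not identifiers from the Doris code base.

```python
def prune_type(dtype, paths):
    """Keep only the struct fields reached by at least one access path.

    dtype: a leaf type name (str) or a dict mapping field name -> dtype.
    paths: list of access paths, each a list of field names relative to dtype.
           An empty path means the whole value is needed, so nothing is pruned.
    """
    if not isinstance(dtype, dict) or any(not p for p in paths):
        return dtype
    pruned = {}
    for name, child in dtype.items():
        # Paths that go through this field, with the field name stripped off.
        sub_paths = [p[1:] for p in paths if p[0] == name]
        if sub_paths:
            pruned[name] = prune_type(child, sub_paths)
    return pruned


# Hypothetical struct column `s` and the paths a query touches (s.a, s.addr.city).
s_type = {"a": "int", "b": "int", "addr": {"city": "string", "zip": "string"}}
access_paths = [["a"], ["addr", "city"]]
pruned = prune_type(s_type, access_paths)
print(pruned)
```

Fields that no path reaches (`b` and `addr.zip` here) drop out of the pruned type, which is the information the BE would need to skip them.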

Why

Performance Improvement: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with struct<int a, int b> s, when only s.a is referenced, we can avoid reading s.b entirely.

Technical Benefits: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals.
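To make the I/O saving concrete, the following sketch simulates selective reading on the BE side. It assumes each leaf sub-column is stored separately and addressed by a dotted path. `ColumnStore`, `read_leaf`, and `read_struct` are hypothetical names for this example, not Doris APIs.

```python
class ColumnStore:
    """Toy columnar storage: one value list per leaf sub-column."""

    def __init__(self, leaves):
        self.leaves = leaves      # dotted leaf path -> list of values
        self.read_log = []        # records which leaves were actually read

    def read_leaf(self, path):
        self.read_log.append(path)
        return self.leaves[path]


def read_struct(store, column, access_paths):
    """Read only the leaves of `column` covered by some access path."""
    wanted = [".".join([column, *p]) for p in access_paths]
    selected = [
        leaf for leaf in store.leaves
        if any(leaf == w or leaf.startswith(w + ".") for w in wanted)
    ]
    return {leaf: store.read_leaf(leaf) for leaf in selected}


# struct<int a, int b> s, with only s.a referenced by the query.
store = ColumnStore({"s.a": [1, 2, 3], "s.b": [10, 20, 30]})
result = read_struct(store, "s", [["a"]])
print(result, store.read_log)
```

In this toy model `s.b` never appears in `read_log`, which mirrors the claim that `s.b` does not need to be read at all.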

TODO & Future Optimizations

  • Lazy Materialization for Complex Type Sub-columns: Defer materialization of unused sub-columns
  • Predicate Pushdown for Complex Type Sub-columns: Push predicates to storage layer for better filtering
  • Parquet RL/DL Optimization: Read only repetition levels and definition levels without data in appropriate scenarios
  • Array Size Optimization: Read only offset and null values for array_size() operations
  • Null Check Optimization: Read only offset and null values for != null checks
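The array size item rests on the observation that element counts can be derived from offsets and null flags alone. The sketch below assumes an Arrow-style layout with `n + 1` offsets for `n` rows, which may differ from the actual Doris storage layout. `array_sizes` is a hypothetical helper, not the `array_size()` implementation.

```python
def array_sizes(offsets, null_map):
    """Compute per-row array sizes without touching element data.

    offsets:  n + 1 integers; row i spans elements offsets[i]:offsets[i + 1].
    null_map: n booleans; True marks a NULL array.
    """
    sizes = []
    for i, is_null in enumerate(null_map):
        sizes.append(None if is_null else offsets[i + 1] - offsets[i])
    return sizes


# Rows: [1, 2, 3], [], NULL, [4]
sizes = array_sizes([0, 3, 3, 3, 4], [False, False, True, False])
print(sizes)
```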

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaka11chen kaka11chen force-pushed the nested_column_prune_external_table_no_late_mat branch from 1c99dc6 to d47ffd5 Compare October 21, 2025 13:27
@kaka11chen
Contributor Author

run buildall

@kaka11chen kaka11chen force-pushed the nested_column_prune_external_table_no_late_mat branch from 5642997 to 3fc502e Compare October 21, 2025 13:45
@kaka11chen
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.77% (1647/2039)
Line Coverage 67.04% (29059/43346)
Region Coverage 67.31% (14371/21352)
Branch Coverage 57.66% (7638/13246)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 52.00% (156/300) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 88.67% (266/300) 🎉
Increment coverage report
Complete coverage report

@kaka11chen kaka11chen force-pushed the nested_column_prune_external_table_no_late_mat branch 2 times, most recently from 3627661 to 3647221 Compare October 22, 2025 02:58
@kaka11chen
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.77% (1647/2039)
Line Coverage 67.03% (29054/43346)
Region Coverage 67.32% (14374/21352)
Branch Coverage 57.69% (7641/13246)

@doris-robot

ClickBench: Total hot run time: 29.15 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3647221f1bd5d8e52cb32f746bfb6833b2a6494f, data reload: false

query1	0.05	0.05	0.05
query2	0.13	0.07	0.07
query3	0.31	0.07	0.07
query4	1.60	0.09	0.09
query5	0.27	0.26	0.25
query6	1.17	0.67	0.65
query7	0.03	0.02	0.03
query8	0.07	0.06	0.06
query9	0.66	0.54	0.52
query10	0.59	0.59	0.59
query11	0.27	0.14	0.14
query12	0.27	0.15	0.14
query13	0.66	0.62	0.63
query14	1.07	1.06	1.04
query15	0.96	0.90	0.88
query16	0.39	0.40	0.39
query17	1.07	1.05	1.07
query18	0.24	0.22	0.24
query19	1.99	1.89	1.81
query20	0.01	0.01	0.02
query21	15.41	0.29	0.24
query22	5.00	0.09	0.10
query23	15.38	0.38	0.23
query24	2.92	0.48	0.30
query25	0.10	0.09	0.09
query26	0.19	0.18	0.17
query27	0.09	0.09	0.08
query28	3.66	1.26	1.05
query29	12.62	4.06	3.34
query30	0.34	0.12	0.10
query31	2.84	0.64	0.44
query32	3.24	0.63	0.55
query33	3.16	3.10	3.13
query34	17.03	5.50	4.77
query35	4.89	4.87	4.89
query36	0.67	0.53	0.52
query37	0.22	0.09	0.09
query38	0.19	0.06	0.06
query39	0.06	0.05	0.05
query40	0.21	0.18	0.19
query41	0.11	0.07	0.06
query42	0.06	0.04	0.04
query43	0.06	0.06	0.06
Total cold run time: 100.26 s
Total hot run time: 29.15 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 52.00% (156/300) 🎉
Increment coverage report
Complete coverage report

@kaka11chen kaka11chen force-pushed the nested_column_prune_external_table_no_late_mat branch 3 times, most recently from 34a95f7 to 087f4e0 Compare October 23, 2025 13:22
@kaka11chen
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.62% (1647/2043)
Line Coverage 66.94% (29038/43376)
Region Coverage 67.26% (14372/21368)
Branch Coverage 57.62% (7637/13254)

@doris-robot

ClickBench: Total hot run time: 28.24 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 087f4e08af2665a1082f04139fe559371526a6c1, data reload: false

query1	0.06	0.05	0.05
query2	0.10	0.06	0.05
query3	0.25	0.09	0.08
query4	1.61	0.12	0.11
query5	0.28	0.28	0.25
query6	1.18	0.67	0.66
query7	0.04	0.03	0.02
query8	0.06	0.05	0.04
query9	0.63	0.54	0.52
query10	0.59	0.59	0.58
query11	0.17	0.11	0.12
query12	0.16	0.12	0.13
query13	0.63	0.61	0.61
query14	1.04	1.02	1.02
query15	0.86	0.86	0.85
query16	0.39	0.41	0.40
query17	1.05	1.06	1.05
query18	0.22	0.21	0.20
query19	1.87	1.84	1.81
query20	0.01	0.01	0.02
query21	15.46	0.20	0.13
query22	5.04	0.07	0.05
query23	15.68	0.27	0.10
query24	1.63	1.13	0.88
query25	0.08	0.08	0.09
query26	0.15	0.14	0.13
query27	0.07	0.07	0.06
query28	5.20	1.17	0.94
query29	12.61	4.03	3.29
query30	0.29	0.15	0.14
query31	2.83	0.58	0.39
query32	3.24	0.56	0.48
query33	3.12	3.12	3.04
query34	15.67	5.16	4.53
query35	4.54	4.59	4.62
query36	0.67	0.51	0.50
query37	0.11	0.07	0.07
query38	0.07	0.05	0.04
query39	0.04	0.04	0.03
query40	0.18	0.16	0.14
query41	0.09	0.03	0.04
query42	0.04	0.04	0.03
query43	0.04	0.04	0.03
Total cold run time: 98.05 s
Total hot run time: 28.24 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 66.60% (674/1012) 🎉
Increment coverage report
Complete coverage report

@kaka11chen kaka11chen force-pushed the nested_column_prune_external_table_no_late_mat branch 2 times, most recently from 0d12c7d to 33c5e80 Compare October 24, 2025 04:41
@kaka11chen
Contributor Author

run buildall

@kaka11chen kaka11chen force-pushed the nested_column_prune_external_table_no_late_mat branch from 33c5e80 to f059d14 Compare October 24, 2025 05:09
@kaka11chen
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.64% (1649/2045)
Line Coverage 67.00% (29104/43437)
Region Coverage 67.32% (14419/21420)
Branch Coverage 57.73% (7674/13294)

@doris-robot

ClickBench: Total hot run time: 27.8 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f059d1417b9ec9216ace8201487208836abebf05, data reload: false

query1	0.06	0.05	0.05
query2	0.10	0.05	0.05
query3	0.26	0.08	0.09
query4	1.60	0.11	0.11
query5	0.28	0.26	0.26
query6	1.17	0.65	0.64
query7	0.04	0.03	0.03
query8	0.05	0.05	0.05
query9	0.64	0.53	0.53
query10	0.58	0.57	0.58
query11	0.17	0.13	0.12
query12	0.15	0.12	0.12
query13	0.61	0.61	0.60
query14	1.00	1.01	1.02
query15	0.84	0.83	0.86
query16	0.39	0.38	0.38
query17	1.01	1.03	1.02
query18	0.21	0.20	0.20
query19	1.88	1.79	1.77
query20	0.02	0.01	0.01
query21	15.46	0.21	0.12
query22	5.04	0.07	0.05
query23	15.67	0.27	0.10
query24	3.28	0.64	0.94
query25	0.09	0.07	0.06
query26	0.14	0.13	0.13
query27	0.06	0.07	0.05
query28	5.52	1.14	0.93
query29	12.56	3.92	3.26
query30	0.28	0.14	0.11
query31	2.83	0.60	0.38
query32	3.22	0.54	0.48
query33	3.01	3.05	3.00
query34	15.85	5.12	4.59
query35	4.55	4.62	4.60
query36	0.66	0.50	0.49
query37	0.10	0.07	0.07
query38	0.07	0.04	0.04
query39	0.04	0.03	0.03
query40	0.19	0.14	0.14
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 99.85 s
Total hot run time: 27.8 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 75.89% (768/1012) 🎉
Increment coverage report
Complete coverage report

@kaka11chen kaka11chen force-pushed the nested_column_prune_external_table_no_late_mat branch from f059d14 to bb96ea9 Compare October 25, 2025 05:53
924060929 added a commit that referenced this pull request Dec 2, 2025
Fix a backend core dump caused by pruning a map type: when the map type
is changed, the nested column type should not be pruned. Introduced by #57204.
morrySnow added a commit that referenced this pull request Dec 8, 2025
#58765)

### What problem does this PR solve?

Related PR: #57204

Problem Summary:

This pull request refactors and improves the `PushDownProject` rule in
the Nereids optimizer, mainly focusing on the logic for pushing down
projections through `UNION` operations. It also introduces a
comprehensive unit test to verify the new logic, making the relevant
methods more testable and robust.

**Refactoring and Logic Improvements:**

* Refactored the `pushThroughUnion` logic by extracting it into a new
static method, making it easier to test and use independently. The main
logic now takes explicit arguments instead of relying on the context
object.
* Improved the handling of projections and child outputs when pushing
down through `UNION`, ensuring correct mapping and replacement of slots.
This includes using regulator outputs for children and constant
expressions, and making the slot replacement logic static for better
testability.

**Testing Enhancements:**

* Added a new unit test class `PushDownProjectTest` to rigorously test
the pushdown logic in various scenarios, including unions with and
without children. The tests verify both the structure and the
correctness of the rewritten plans.

**Code Quality Improvements:**

* Added the `@VisibleForTesting` annotation and imported necessary
dependencies to clarify method visibility and intent for testing.
* Replaced some usages of `Collection` with `List` for better type
safety and clarity in projection handling.

These changes make the projection pushdown logic more modular, testable,
and robust, and provide strong test coverage for future maintenance.
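The slot replacement idea behind pushing a projection through `UNION` can be sketched as follows. This is a generic illustration under simplifying assumptions, not the `pushThroughUnion` method itself: expressions are nested tuples of the form `(function, arg, ...)`, slots are plain strings, union outputs map to each child's outputs by position, and constant-expression branches are not modeled. All names are hypothetical.

```python
def substitute(expr, mapping):
    """Replace slots in expr according to mapping; function names are kept."""
    if isinstance(expr, tuple):
        func, *args = expr
        return (func, *(substitute(a, mapping) for a in args))
    return mapping.get(expr, expr)


def push_project_through_union(projections, union_outputs, children_outputs):
    """Rewrite the projections once per union child.

    Each union output slot is replaced by the child's slot at the same
    position, so the projection can be evaluated below the union.
    """
    rewritten = []
    for child_outputs in children_outputs:
        mapping = dict(zip(union_outputs, child_outputs))
        rewritten.append([substitute(e, mapping) for e in projections])
    return rewritten


per_child = push_project_through_union(
    projections=[("struct_element", "u.s", "city")],
    union_outputs=["u.id", "u.s"],
    children_outputs=[["t1.id", "t1.s"], ["t2.id", "t2.s"]],
)
print(per_child)
```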
nagisa-kunhah pushed a commit to nagisa-kunhah/doris that referenced this pull request Dec 14, 2025
…uning (apache#57204)

### What problem does this PR solve?

Problem Summary:

### Release note

Optimize Complex Type Column Reading with Column Pruning

#### Description
This PR implements column pruning for complex types (Struct, Array, Map)
to optimize read performance. Previously, Doris would read entire
complex type fields before processing, which was simple to implement but
inefficient when only specific sub-columns were needed.

**Key changes:**
- **FE (Frontend)**: Added column access path calculation and type
pruning
  - Collects and analyzes access paths for complex type fields
  - Performs type pruning based on access paths
  - Implements projection pushdown for complex types

- **BE (Backend)**: Added selective column reading
  - Uses columnAccessPath array from FE to identify required sub-columns
  - Implements selective reading to skip unnecessary sub-columns

#### Why
**Performance Improvement**: When a struct contains hundreds or
thousands of columns but the query only accesses a few sub-columns, this
optimization can significantly reduce I/O and improve query performance.
For example, with `struct<int a, int b> s`, when only `s.a` is
referenced, we can avoid reading `s.b` entirely.

**Technical Benefits**: Reduces unnecessary data scanning and decoding
overhead for complex types, aligning with Doris's continuous performance
optimization goals.

#### TODO & Future Optimizations
- **Lazy Materialization for Complex Type Sub-columns**: Defer
materialization of unused sub-columns
- **Predicate Pushdown for Complex Type Sub-columns**: Push predicates
to storage layer for better filtering
- **Parquet RL/DL Optimization**: Read only repetition levels and
definition levels without data in appropriate scenarios
- **Array Size Optimization**: Read only offset and null values for
`array_size()` operations
- **Null Check Optimization**: Read only offset and null values for `!=
null` checks

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
nagisa-kunhah pushed a commit to nagisa-kunhah/doris that referenced this pull request Dec 14, 2025
Optimize project pushdown: pruning nested columns reduces the bytes
scanned and shuffled. Related to apache#57204.

the sql:
```sql
select coalesce(struct_element(t1.s, 'city'), 'beijing')
from t1 join t2
on t1.id = t2.id
```

original plan:
```
Project(coalesce(struct_element(t1.s, 'city'), 'beijing'))
                             |
                    Join(t1.id=t2.id)
                    /               \
            Project(t1.id, t1.s)    Project(t2.id)
                 |                    |
            Scan(t1)                Scan(t2)
```

optimized plan:
```

                       Project(coalesce(slot#3, 'beijing'))
                                      |
                               Join(t1.id=t2.id)
                    /                                       \
Project(t1.id, struct_element(t1.s, 'city')#3)                    Project(t2.id)
              |                                                |
            Scan(t1)                                       Scan(t2)
```
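The rewrite shown in the two plans can be sketched as a small expression transformation. This is only an illustration under stated assumptions, not the Nereids `PushDownProject` rule: `StructElement`, `Slot`, `Call`, and `push_down` are hypothetical names, and slot numbering here starts at 0 rather than matching `#3` in the plan above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructElement:
    column: str   # e.g. "t1.s"
    field: str    # e.g. "city"


@dataclass(frozen=True)
class Slot:
    name: str


@dataclass(frozen=True)
class Call:
    func: str
    args: tuple


def push_down(expr, pushed):
    """Replace each StructElement in expr with a Slot.

    `pushed` maps slot name -> StructElement and collects the expressions
    that the lower Project (below the join) must now compute.
    """
    if isinstance(expr, StructElement):
        for name, existing in pushed.items():
            if existing == expr:
                return Slot(name)          # reuse an already pushed expression
        name = f"slot#{len(pushed)}"
        pushed[name] = expr
        return Slot(name)
    if isinstance(expr, Call):
        return Call(expr.func, tuple(push_down(a, pushed) for a in expr.args))
    return expr                            # literals pass through unchanged


top = Call("coalesce", (StructElement("t1.s", "city"), "beijing"))
pushed = {}
rewritten = push_down(top, pushed)
# The lower Project on t1 keeps the join key plus the pushed expressions,
# so the full struct t1.s no longer has to flow through the join.
t1_project = ["t1.id", *pushed.values()]
print(rewritten, t1_project)
```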
nagisa-kunhah pushed a commit to nagisa-kunhah/doris that referenced this pull request Dec 14, 2025
fix `Input slot(s) not in child's output`, introduced by apache#57204
nagisa-kunhah pushed a commit to nagisa-kunhah/doris that referenced this pull request Dec 14, 2025
Fix a backend core dump caused by pruning a map type: when the map type
is changed, the nested column type should not be pruned. Introduced by apache#57204.
nagisa-kunhah pushed a commit to nagisa-kunhah/doris that referenced this pull request Dec 14, 2025
apache#58765)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 22, 2025
…uning (apache#57204)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 22, 2025
…uning (apache#57204)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 22, 2025
…uning (apache#57204)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 22, 2025
…uning (apache#57204)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 22, 2025
…uning (apache#57204)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 22, 2025
…uning (apache#57204)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 22, 2025
…uning (apache#57204)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Dec 23, 2025
…uning (apache#57204)

924060929 added a commit that referenced this pull request Dec 23, 2025
Optimize project pushdown: pruning nested columns before the join reduces
both scan bytes and shuffle bytes. Related to #57204.

the sql:
```sql
select coalesce(struct_element(t1.s, 'city'), 'beijing')
from t1 join t2
on t1.id = t2.id
```

original plan:
```
Project(coalesce(struct_element(t1.s, 'city'), 'beijing'))
                             |
                    Join(t1.id=t2.id)
                    /               \
            Project(t1.id, t1.s)    Project(t2.id)
                 |                    |
            Scan(t1)                Scan(t2)
```

optimized plan:
```

                       Project(coalesce(slot#3, 'beijing'))
                                      |
                               Join(t1.id=t2.id)
                    /                                       \
Project(t1.id, struct_element(t1.s, 'city')#3)              Project(t2.id)
              |                                                |
            Scan(t1)                                       Scan(t2)
```
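
The rewrite between the two plans can be sketched as a small expression transformation. This is only an illustration under stated assumptions: the `StructElement` and `Call` classes, the `push_down` function, and the slot naming are invented for the example and are not the Doris planner's API. It also assumes the struct column is used above the child only through `struct_element`.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class StructElement:
    column: str  # e.g. "t1.s"
    field: str   # e.g. "city"


@dataclass(frozen=True)
class Call:
    name: str
    args: Tuple["Expr", ...]


# A plain str stands for a column name, a slot reference, or a literal.
Expr = Union[str, StructElement, Call]


def push_down(expr: Expr, child_output: List[Expr]) -> Tuple[Expr, List[Expr]]:
    """Move struct_element calls from `expr` into the child projection."""
    pushed: List[StructElement] = []

    def rewrite(e: Expr) -> Expr:
        if isinstance(e, StructElement):
            if e not in pushed:
                pushed.append(e)
            # Refer to the pushed-down expression by a (hypothetical) slot name.
            return f"slot#{pushed.index(e)}"
        if isinstance(e, Call):
            return Call(e.name, tuple(rewrite(a) for a in e.args))
        return e

    new_expr = rewrite(expr)
    # Assumption: the struct column is not needed as a whole above the child,
    # so it is replaced by the extracted sub-fields in the child's output.
    pushed_columns = {p.column for p in pushed}
    new_child = [c for c in child_output if c not in pushed_columns] + pushed
    return new_expr, new_child


top = Call("coalesce", (StructElement("t1.s", "city"), "'beijing'"))
new_top, new_child = push_down(top, ["t1.id", "t1.s"])
print(new_top)
print(new_child)
```

After the rewrite the child projection outputs only `t1.id` and the extracted `city` field, so the full struct `t1.s` no longer travels through the join.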

(cherry picked from commit c30c0ff)
924060929 added a commit that referenced this pull request Dec 23, 2025
Fix a backend core dump caused by pruning map types: when the map type is
changed, the nested column type must not be pruned. Introduced by #57204.

(cherry picked from commit 1d7f6c4)
924060929 added a commit that referenced this pull request Dec 23, 2025
Fix the `Input slot(s) not in child's output` error introduced by #57204.

(cherry picked from commit b788842)
yiguolei pushed a commit that referenced this pull request Dec 24, 2025
…rning (#59286)

### What problem does this PR solve?

Problem Summary:

### Release note

Cherry-pick #58370 #58354 #59043 #58851 #58485 #58682 #58614 #58373
#57204 #58719 #58471 #58573 #58657

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->

---------

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
Co-authored-by: Jerry Hu <hushenggang@selectdb.com>
Co-authored-by: lihangyu <lihangyu@selectdb.com>
924060929 added a commit that referenced this pull request Dec 26, 2025
…58776)

Support pruning nested columns through lateral views that use the
functions explode, explode_outer, explode_map, explode_map_outer,
posexplode, and posexplode_outer. Related to #57204.
Labels

approved Indicates a PR has been approved by one committer. dev/4.0.3-merged reviewed
