[feature](reader) Optimize Complex Type Column Reading with Column Pruning #57204
ClickBench: Total hot run time: 29.15 s |
ClickBench: Total hot run time: 28.24 s |
ClickBench: Total hot run time: 27.8 s |
Fix a backend core dump caused by pruning a map type: when the map type is changed, the nested column type should not be pruned. Introduced by #57204.
#58765)

### What problem does this PR solve?

Related PR: #57204

Problem Summary:

This pull request refactors and improves the `PushDownProject` rule in the Nereids optimizer, mainly focusing on the logic for pushing down projections through `UNION` operations. It also introduces a comprehensive unit test to verify the new logic, making the relevant methods more testable and robust.

**Refactoring and Logic Improvements:**

* Refactored the `pushThroughUnion` logic by extracting it into a new static method, making it easier to test and use independently. The main logic now takes explicit arguments instead of relying on the context object.
* Improved the handling of projections and child outputs when pushing down through `UNION`, ensuring correct mapping and replacement of slots. This includes using regulator outputs for children and constant expressions, and making the slot replacement logic static for better testability.

**Testing Enhancements:**

* Added a new unit test class `PushDownProjectTest` to rigorously test the pushdown logic in various scenarios, including unions with and without children. The tests verify both the structure and the correctness of the rewritten plans.

**Code Quality Improvements:**

* Added the `@VisibleForTesting` annotation and imported necessary dependencies to clarify method visibility and intent for testing.
* Replaced some usages of `Collection` with `List` for better type safety and clarity in projection handling.

These changes make the projection pushdown logic more modular, testable, and robust, and provide strong test coverage for future maintenance.
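The slot mapping described above can be illustrated with a small sketch. This is not the `pushThroughUnion` code from the PR: the tuple-based expression encoding, the function names `replace_slots` and `push_through_union`, and the slot names are all hypothetical, and the sketch assumes that each child's output list is positionally aligned with the union's output list.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical expression encoding: ("slot", name) is a column reference,
# any other tuple is (function_name, *arguments), and a plain str is a literal.
Expr = Union[str, Tuple]


def replace_slots(expr: Expr, mapping: Dict[str, str]) -> Expr:
    """Rewrite every slot reference in expr according to mapping."""
    if isinstance(expr, tuple) and expr[0] == "slot":
        return ("slot", mapping.get(expr[1], expr[1]))
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(replace_slots(arg, mapping) for arg in expr[1:])
    return expr  # literal, nothing to rewrite


def push_through_union(
    projections: List[Expr],
    union_output: List[str],
    children_outputs: List[List[str]],
) -> List[List[Expr]]:
    """Return one projection list per union child.

    Assumption: children_outputs[i][k] is the child slot that feeds
    union_output[k], so a positional zip gives the slot mapping.
    """
    pushed = []
    for child_output in children_outputs:
        mapping = dict(zip(union_output, child_output))
        pushed.append([replace_slots(p, mapping) for p in projections])
    return pushed


if __name__ == "__main__":
    projections = [("struct_element", ("slot", "u.s"), "city")]
    for child_projections in push_through_union(
        projections, ["u.s"], [["t1.s"], ["t2.s"]]
    ):
        print(child_projections)
```

Taking explicit arguments rather than a context object, as the PR description mentions, is what makes a function like this easy to call directly from a unit test.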
…uning (apache#57204)

### What problem does this PR solve?

Problem Summary:

### Release note

Optimize Complex Type Column Reading with Column Pruning

#### Description

This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed.

**Key changes:**

- **FE (Frontend)**: Added column access path calculation and type pruning
  - Collects and analyzes access paths for complex type fields
  - Performs type pruning based on access paths
  - Implements projection pushdown for complex types
- **BE (Backend)**: Added selective column reading
  - Uses columnAccessPath array from FE to identify required sub-columns
  - Implements selective reading to skip unnecessary sub-columns

#### Why

**Performance Improvement**: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with `struct<int a, int b> s`, when only `s.a` is referenced, we can avoid reading `s.b` entirely.

**Technical Benefits**: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals.
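As a rough illustration of type pruning from access paths, the following sketch reduces a struct type to the fields that some path reaches. It is not the FE implementation: `StructType`, `prune_type`, and the list-of-field-names path format are hypothetical, and only structs are modeled (arrays and maps are left out).

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class StructType:
    """Hypothetical struct type: field name -> field type."""
    fields: Dict[str, "DataType"]


# A primitive type is represented by its name, e.g. "int".
DataType = Union[str, StructType]


def prune_type(data_type: DataType, paths: List[List[str]]) -> DataType:
    """Keep only the sub-fields of data_type reached by at least one path.

    Each path is a list of field names relative to data_type. An empty
    path means the whole value is accessed, so nothing can be pruned.
    """
    if not isinstance(data_type, StructType):
        return data_type
    if any(len(path) == 0 for path in paths):
        return data_type
    # Group the remaining path suffixes by the first field they touch.
    grouped: Dict[str, List[List[str]]] = {}
    for path in paths:
        grouped.setdefault(path[0], []).append(path[1:])
    return StructType({
        name: prune_type(sub_type, grouped[name])
        for name, sub_type in data_type.fields.items()
        if name in grouped
    })


if __name__ == "__main__":
    s = StructType({"a": "int", "b": "int"})
    # Only s.a is referenced, so b is dropped from the pruned type.
    print(prune_type(s, [["a"]]))
```

With `struct<int a, int b> s` and the single path `["a"]`, the pruned type contains only `a`, which matches the example in the description.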
#### TODO & Future Optimizations

- **Lazy Materialization for Complex Type Sub-columns**: Defer materialization of unused sub-columns
- **Predicate Pushdown for Complex Type Sub-columns**: Push predicates to storage layer for better filtering
- **Parquet RL/DL Optimization**: Read only repetition levels and definition levels without data in appropriate scenarios
- **Array Size Optimization**: Read only offset and null values for `array_size()` operations
- **Null Check Optimization**: Read only offset and null values for `!= null` checks

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
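The array size item in the TODO list above rests on the fact that element data is not needed to compute a length. A minimal sketch, assuming an offsets layout with `n + 1` entries for `n` rows (row `i` spans `offsets[i]` to `offsets[i + 1]`) and a per-row null map; both the layout and the function name `array_sizes` are assumptions for illustration, not the Doris storage format:

```python
from typing import List, Optional, Sequence


def array_sizes(offsets: Sequence[int], null_map: Sequence[bool]) -> List[Optional[int]]:
    """Compute per-row array sizes from offsets and nulls only.

    Assumed layout: offsets has len(null_map) + 1 entries, and row i
    holds the elements in [offsets[i], offsets[i + 1]). The element
    values themselves are never touched.
    """
    if len(offsets) != len(null_map) + 1:
        raise ValueError("offsets must have one more entry than null_map")
    sizes: List[Optional[int]] = []
    for row, is_null in enumerate(null_map):
        sizes.append(None if is_null else offsets[row + 1] - offsets[row])
    return sizes


if __name__ == "__main__":
    # Three rows: a 2-element array, a NULL, and a 3-element array.
    print(array_sizes([0, 2, 2, 5], [False, True, False]))
```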
Optimize push down project; this can reduce the scan bytes and shuffle bytes by pruning nested columns. Related to apache#57204.

The SQL:

```sql
select coalesce(struct_element(t1.s, 'city'), 'beijing')
from t1 join t2
on t1.id = t2.id
```

Original plan:

```
   Project(coalesce(struct_element(t1.s, 'city'), 'beijing'))
                         |
                 Join(t1.id=t2.id)
                /                 \
Project(t1.id, t1.s)          Project(t2.id)
        |                           |
    Scan(t1)                    Scan(t2)
```

Optimized plan:

```
              Project(coalesce(slot#3, 'beijing'))
                              |
                      Join(t1.id=t2.id)
                     /                 \
Project(t1.id, struct_element(t1.s, 'city')#3)   Project(t2.id)
                    |                                  |
                Scan(t1)                           Scan(t2)
```
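The rewrite shown in the two plans can be sketched as an expression transformation: each `struct_element(...)` call in the top projection is replaced by a fresh slot, and the call itself is recorded so it can be added to the projection below the join. The tuple encoding, the function name `extract_struct_elements`, and the `slot#N` numbering scheme are hypothetical; the real optimizer assigns its own slot IDs (`#3` in the plan above).

```python
from typing import Dict, Tuple, Union

# Hypothetical expression encoding: ("slot", name) is a column reference,
# any other tuple is (function_name, *arguments), and a plain str is a literal.
Expr = Union[str, Tuple]


def extract_struct_elements(expr: Expr, pushed: Dict[str, Expr]) -> Expr:
    """Replace struct_element calls in expr with slot references.

    Every replaced call is recorded in pushed (alias -> original call),
    which is what the lower Project would need to compute.
    """
    if not isinstance(expr, tuple):
        return expr  # literal
    if expr[0] == "struct_element":
        # Reuse the alias if the same call was already pushed down.
        for alias, original in pushed.items():
            if original == expr:
                return ("slot", alias)
        alias = f"slot#{len(pushed) + 1}"
        pushed[alias] = expr
        return ("slot", alias)
    return (expr[0],) + tuple(extract_struct_elements(a, pushed) for a in expr[1:])


if __name__ == "__main__":
    top = ("coalesce", ("struct_element", ("slot", "t1.s"), "city"), "beijing")
    pushed: Dict[str, Expr] = {}
    print(extract_struct_elements(top, pushed))  # expression kept above the join
    print(pushed)                                # expressions to add below the join
```

Because only the extracted sub-column crosses the join, the full struct `t1.s` no longer has to be scanned or shuffled.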
fix `Input slot(s) not in child's output`, introduced by apache#57204
Optimize push down project to reduce the scan bytes and shuffle bytes by pruning nested columns; related to #57204 (cherry picked from commit c30c0ff).
…rning (#59286)

### What problem does this PR solve?

Problem Summary:

### Release note

Cherry-pick #58370 #58354 #59043 #58851 #58485 #58682 #58614 #58373 #57204 #58719 #58471 #58573 #58657

### Check List (For Author)

- Test
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason
- Behavior changed:
    - [ ] No.
    - [ ] Yes.
- Does this need documentation?
    - [ ] No.
    - [ ] Yes.

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label

---------

Co-authored-by: 924060929 <lanhuajian@selectdb.com>
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
Co-authored-by: Jerry Hu <hushenggang@selectdb.com>
Co-authored-by: lihangyu <lihangyu@selectdb.com>
### What problem does this PR solve?

Problem Summary:

### Release note

Optimize Complex Type Column Reading with Column Pruning

#### Description

This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed.

**Key changes:**

- **FE (Frontend)**: Added column access path calculation and type pruning
- **BE (Backend)**: Added selective column reading

#### Why
**Performance Improvement**: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with `struct<int a, int b> s`, when only `s.a` is referenced, we can avoid reading `s.b` entirely.

**Technical Benefits**: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals.
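To make the `s.a` versus `s.b` example concrete, here is a minimal sketch of selective reading on the storage side. It is not the BE reader: `StructColumnReader`, its in-memory sub-column layout, and the rule that an empty path list means "read everything" are assumptions for illustration.

```python
from typing import Dict, List, Sequence


class StructColumnReader:
    """Hypothetical reader over a struct stored as one list per sub-column."""

    def __init__(self, sub_columns: Dict[str, Sequence[int]]) -> None:
        self._sub_columns = sub_columns
        self.columns_read: List[str] = []  # records which sub-columns were touched

    def read(self, access_paths: List[List[str]]) -> Dict[str, List[int]]:
        """Read only the sub-columns named by the first element of some path.

        Assumption: no paths at all, or any empty path, means the whole
        struct is needed, so every sub-column is read.
        """
        read_all = not access_paths or any(len(p) == 0 for p in access_paths)
        wanted = {p[0] for p in access_paths if p}
        result: Dict[str, List[int]] = {}
        for name, values in self._sub_columns.items():
            if read_all or name in wanted:
                self.columns_read.append(name)
                result[name] = list(values)
        return result


if __name__ == "__main__":
    reader = StructColumnReader({"a": [1, 2, 3], "b": [10, 20, 30]})
    print(reader.read([["a"]]))   # only s.a is materialized
    print(reader.columns_read)    # s.b was never touched
```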
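#### TODO & Future Optimizations

- **Lazy Materialization for Complex Type Sub-columns**: Defer materialization of unused sub-columns
- **Predicate Pushdown for Complex Type Sub-columns**: Push predicates to storage layer for better filtering
- **Parquet RL/DL Optimization**: Read only repetition levels and definition levels without data in appropriate scenarios
- **Array Size Optimization**: Read only offset and null values for `array_size()` operations
- **Null Check Optimization**: Read only offset and null values for `!= null` checks

### Check List (For Author)

- Test
- Behavior changed:
- Does this need documentation?

### Check List (For Reviewer who merge this PR)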