-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feature](reader) Optimize Complex Type Column Reading with Column Puning #59249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
ec16ea5 to
0acc3aa
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
0acc3aa to
364e09f
Compare
|
run buildall |
364e09f to
b6db778
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
b6db778 to
b8cb3da
Compare
|
run buildall |
b8cb3da to
e270345
Compare
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
e270345 to
bd4d163
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
…uning (apache#57204) Problem Summary: Optimize Complex Type Column Reading with Column Pruning This PR implements column pruning for complex types (Struct, Array, Map) to optimize read performance. Previously, Doris would read entire complex type fields before processing, which was simple to implement but inefficient when only specific sub-columns were needed. **Key changes:** - **FE (Frontend)**: Added column access path calculation and type pruning - Collects and analyzes access paths for complex type fields - Performs type pruning based on access paths - Implements projection pushdown for complex types - **BE (Backend)**: Added selective column reading - Uses columnAccessPath array from FE to identify required sub-columns - Implements selective reading to skip unnecessary sub-columns **Performance Improvement**: When a struct contains hundreds or thousands of columns but the query only accesses a few sub-columns, this optimization can significantly reduce I/O and improve query performance. For example, with `struct<int a, int b> s`, when only `s.a` is referenced, we can avoid reading `s.b` entirely. **Technical Benefits**: Reduces unnecessary data scanning and decoding overhead for complex types, aligning with Doris's continuous performance optimization goals . - **Lazy Materialization for Complex Type Sub-columns**: Defer materialization of unused sub-columns - **Predicate Pushdown for Complex Type Sub-columns**: Push predicates to storage layer for better filtering - **Parquet RL/DL Optimization**: Read only repetition levels and definition levels without data in appropriate scenarios - **Array Size Optimization**: Read only offset and null values for `array_size()` operations - **Null Check Optimization**: Read only offset and null values for `!= null` checks Co-authored-by: 924060929 <lanhuajian@selectdb.com> Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
bd4d163 to
f4e84e7
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Problem Summary:
Release note
Cherry-pick #57204 #58719
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)