-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: prevent UnionExec panic with empty inputs #17449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: prevent UnionExec panic with empty inputs #17449
Conversation
This commit fixes a panic in UnionExec when constructed with empty inputs. Previously, UnionExec::new(vec![]) would cause an index out of bounds panic at union.rs:542 when trying to access inputs[0]. Changes: - Made UnionExec::new() return Result<Self> with proper validation - Made union_schema() return Result<SchemaRef> with empty input checks - Added descriptive error messages for empty input cases - Updated all call sites to handle the new Result return type - Added comprehensive tests for edge cases Error messages: - "UnionExec requires at least one input" - "Cannot create union schema from empty inputs" The fix maintains backward compatibility for valid inputs while preventing crashes and providing clear error messages for invalid usage. Fixes apache#17052
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @EeshanBembi
| fn test_union_empty_inputs() { | ||
| // Test that UnionExec::new fails with empty inputs | ||
| let result = UnionExec::new(vec![]); | ||
| assert!(result.is_err()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the assertion check for is_err is redundant as unwrap_err will panic if result is not an err
| /// Create a new UnionExec | ||
| pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Self { | ||
| let schema = union_schema(&inputs); | ||
| pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is technically an API change -- maybe to make it easier on others, we can make a new function called try_new that has the error checking, and deprecate the existing new function per https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on the API lifecycle. On separate note, can we make the new try_new method return Box<<dyn ExecutionPlan>>? This would allow it to return the only child in case input vector is a singleton. There is no point keeping UnionExec(a) in the plan.
Or maybe, the new method can simply require the input to have at least two elements?
| /// Create a new UnionExec | ||
| pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Self { | ||
| let schema = union_schema(&inputs); | ||
| pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on the API lifecycle. On separate note, can we make the new try_new method return Box<<dyn ExecutionPlan>>? This would allow it to return the only child in case input vector is a singleton. There is no point keeping UnionExec(a) in the plan.
Or maybe, the new method can simply require the input to have at least two elements?
| } | ||
|
|
||
| #[test] | ||
| fn test_union_multiple_inputs_still_works() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| fn test_union_multiple_inputs_still_works() -> Result<()> { | |
| fn test_union_schema_multiple_inputs() -> Result<()> { |
Not related to the PR, but it'll be better for df to have a optimizer rule to remove empty inputs from union |
@xudong963 , exactly See also #17449 (comment) |
- Add new try_new method that returns Result<Arc<dyn ExecutionPlan>> - Deprecate existing new method in favor of try_new - Optimize single-input case: try_new returns the input directly - Remove redundant assert!(result.is_err()) from tests - Rename test_union_multiple_inputs_still_works to test_union_schema_multiple_inputs - Update all call sites to use appropriate API (try_new for new code, deprecated new for tests) This maintains backward compatibility while providing better error handling and optimization for single-input cases.
| ); | ||
|
|
||
| // Union | ||
| #[allow(deprecated)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have an issue to clean these up and add a // TODO (issue link) resolve deprecation comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Add proper feature gates for parquet_encryption in datasource-parquet - Format code to pass cargo fmt checks - All tests passing
|
I merged up from main and fixed a clippy lint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @EeshanBembi and @findepi
* fix: prevent UnionExec panic with empty inputs This commit fixes a panic in UnionExec when constructed with empty inputs. Previously, UnionExec::new(vec![]) would cause an index out of bounds panic at union.rs:542 when trying to access inputs[0]. Changes: - Made UnionExec::new() return Result<Self> with proper validation - Made union_schema() return Result<SchemaRef> with empty input checks - Added descriptive error messages for empty input cases - Updated all call sites to handle the new Result return type - Added comprehensive tests for edge cases Error messages: - "UnionExec requires at least one input" - "Cannot create union schema from empty inputs" The fix maintains backward compatibility for valid inputs while preventing crashes and providing clear error messages for invalid usage. Fixes apache#17052 * refactor: address PR review comments for UnionExec empty inputs fix - Add new try_new method that returns Result<Arc<dyn ExecutionPlan>> - Deprecate existing new method in favor of try_new - Optimize single-input case: try_new returns the input directly - Remove redundant assert!(result.is_err()) from tests - Rename test_union_multiple_inputs_still_works to test_union_schema_multiple_inputs - Update all call sites to use appropriate API (try_new for new code, deprecated new for tests) This maintains backward compatibility while providing better error handling and optimization for single-input cases. * Fix cargo fmt and clippy warnings - Add proper feature gates for parquet_encryption in datasource-parquet - Format code to pass cargo fmt checks - All tests passing * Fix clippy --------- Co-authored-by: Eeshan <eeshan@Eeshans-MacBook-Pro.local> Co-authored-by: ebembi-crdb <ebembi@cockroachlabs.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> (cherry picked from commit b122a16)
* Allow filter pushdown through AggregateExec (apache#18404) ## Which issue does this PR close? - Closes apache#18399 ## Rationale for this change Right now filters cannot pass through `AggregateExec` nodes, preventing filter pushdown optimization in queries with GROUP BY/DISTINCT operations. ## What changes are included in this PR? - Implemented `gather_filters_for_pushdown()` for `AggregateExec` that allows filters on grouping columns to pass through to children - Supports both Pre phase (static filters) and Post phase (dynamic filters from joins) Essentially, filter will pass through in the scenarios @asolimando mentioned [here](apache#18399 (comment)) ## Are these changes tested? Yes, added three tests: - `test_aggregate_filter_pushdown`: Positive case with aggregate functions - `test_no_pushdown_aggregate_filter_on_non_grouping_column`: Negative case ensuring filters on aggregate results are not pushed ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> (cherry picked from commit 076b091) * physical-plan: push filters down to UnionExec children (apache#18054) Filters are safe to be pushed down, so we can override the default behavior here. Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com> (cherry picked from commit 0ecd59b) * fix: prevent UnionExec panic with empty inputs (apache#17449) * fix: prevent UnionExec panic with empty inputs This commit fixes a panic in UnionExec when constructed with empty inputs. Previously, UnionExec::new(vec![]) would cause an index out of bounds panic at union.rs:542 when trying to access inputs[0]. Changes: - Made UnionExec::new() return Result<Self> with proper validation - Made union_schema() return Result<SchemaRef> with empty input checks - Added descriptive error messages for empty input cases - Updated all call sites to handle the new Result return type - Added comprehensive tests for edge cases Error messages: - "UnionExec requires at least one input" - "Cannot create union schema from empty inputs" The fix maintains backward compatibility for valid inputs while preventing crashes and providing clear error messages for invalid usage. Fixes apache#17052 * refactor: address PR review comments for UnionExec empty inputs fix - Add new try_new method that returns Result<Arc<dyn ExecutionPlan>> - Deprecate existing new method in favor of try_new - Optimize single-input case: try_new returns the input directly - Remove redundant assert!(result.is_err()) from tests - Rename test_union_multiple_inputs_still_works to test_union_schema_multiple_inputs - Update all call sites to use appropriate API (try_new for new code, deprecated new for tests) This maintains backward compatibility while providing better error handling and optimization for single-input cases. * Fix cargo fmt and clippy warnings - Add proper feature gates for parquet_encryption in datasource-parquet - Format code to pass cargo fmt checks - All tests passing * Fix clippy --------- Co-authored-by: Eeshan <eeshan@Eeshans-MacBook-Pro.local> Co-authored-by: ebembi-crdb <ebembi@cockroachlabs.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> (cherry picked from commit b122a16) --------- Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com> Co-authored-by: Alfonso Subiotto Marqués <alfonso.subiotto@polarsignals.com> Co-authored-by: EeshanBembi <33062610+EeshanBembi@users.noreply.github.com> Co-authored-by: Eeshan <eeshan@Eeshans-MacBook-Pro.local> Co-authored-by: ebembi-crdb <ebembi@cockroachlabs.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Summary
This PR fixes a panic in
UnionExecwhen constructed with empty inputs, replacing the crash with proper error handling and descriptive error messages.Fixes: #17052
Problem
When
UnionExec::new(vec![])was called with an empty input vector, it would panic with:This occurred because
union_schema()directly accessedinputs[0]without checking if the array was empty.Solution
Core Changes
Made
UnionExec::new()returnResult<Self>:inputs.is_empty()"UnionExec requires at least one input"Made
union_schema()returnResult<SchemaRef>:inputs[0]"Cannot create union schema from empty inputs"Updated all call sites (7 files):
physical_planner.rs- Core DataFusion integrationrepartition/mod.rs- Internal dependenciesResultreturn typeError Handling
Testing
Added 4 comprehensive tests:
test_union_empty_inputs()- Verifies empty input validationtest_union_schema_empty_inputs()- Tests schema creation with empty inputstest_union_single_input()- Ensures single input still workstest_union_multiple_inputs_still_works()- Verifies existing functionality unchangedTest Results:
Backward Compatibility
UnionExec::new()now returnsResult<Self>instead ofSelfThis is a breaking change but justified because:
Unionwhich requires ≥2 inputsFiles Changed
datafusion/physical-plan/src/union.rs- Core fix + tests (main changes)datafusion/core/src/physical_planner.rs- HandleResultreturndatafusion/physical-plan/src/repartition/mod.rs- Update internal callsThe fix provides robust error handling while maintaining all existing functionality for valid use cases.