@EeshanBembi
Contributor

Summary

This PR fixes a panic in UnionExec when constructed with empty inputs, replacing the crash with proper error handling and descriptive error messages.

Fixes: #17052

Problem

When UnionExec::new(vec![]) was called with an empty input vector, it would panic with:

thread '...' panicked at datafusion/physical-plan/src/union.rs:542:24:
index out of bounds: the len is 0 but the index is 0

This occurred because union_schema() directly accessed inputs[0] without checking if the array was empty.
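
The failure mode is ordinary slice indexing: `inputs[0]` panics on an empty vector, whereas `inputs.first()` returns an `Option` that can be converted into an error. A minimal sketch of the checked form, assuming a hypothetical `Schema` alias and a `String` error in place of DataFusion's `SchemaRef` and its error type:

```rust
// Hypothetical stand-in for illustration; not the real DataFusion schema type.
type Schema = Vec<String>;

/// Checked variant: report an error instead of indexing into an empty slice.
/// (The unchecked form, `&inputs[0]`, panics when `inputs` is empty.)
fn first_schema(inputs: &[Schema]) -> Result<&Schema, String> {
    inputs
        .first()
        .ok_or_else(|| "Cannot create union schema from empty inputs".to_string())
}
```

Callers can then propagate the error with `?` rather than crashing the process.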

Solution

Core Changes

  1. Made UnionExec::new() return Result<Self>:

    • Added validation: returns error if inputs.is_empty()
    • Provides clear error message: "UnionExec requires at least one input"
  2. Made union_schema() return Result<SchemaRef>:

    • Added empty input validation before accessing inputs[0]
    • Returns descriptive error: "Cannot create union schema from empty inputs"
  3. Updated all call sites (7 files):

    • physical_planner.rs - Core DataFusion integration
    • repartition/mod.rs - Internal dependencies
    • 4 test files - Updated to handle Result return type

Error Handling

  • Before: Index out of bounds panic (unhelpful)
  • After: Clear error messages that guide users
// Before: panic!
let union = UnionExec::new(vec![]); // PANIC!

// After: proper error handling
match UnionExec::new(vec![]) {
    Ok(_) => { /* use union */ }
    Err(e) => println!("Error: {}", e), // "UnionExec requires at least one input"
}

Testing

Added 4 comprehensive tests:

  1. test_union_empty_inputs() - Verifies empty input validation
  2. test_union_schema_empty_inputs() - Tests schema creation with empty inputs
  3. test_union_single_input() - Ensures single input still works
  4. test_union_multiple_inputs_still_works() - Verifies existing functionality unchanged

Test Results:

  • ✅ All new tests pass
  • ✅ All existing union tests pass (8/8)
  • ✅ All physical planner integration tests pass

Backward Compatibility

  • Existing functionality unchanged for valid inputs (≥1 input)
  • Only adds error handling for previously crashing invalid inputs
  • API change: UnionExec::new() now returns Result<Self> instead of Self

This is a breaking change but justified because:

  1. The previous behavior (panic) was incorrect
  2. Empty inputs are invalid by design (no logical meaning)
  3. Consistent with logical Union which requires ≥2 inputs
  4. Better error handling improves user experience

Files Changed

  • datafusion/physical-plan/src/union.rs - Core fix + tests (main changes)
  • datafusion/core/src/physical_planner.rs - Handle Result return
  • datafusion/physical-plan/src/repartition/mod.rs - Update internal calls
  • 4 test files - Update test utilities and test cases

The fix provides robust error handling while maintaining all existing functionality for valid use cases.

This commit fixes a panic in UnionExec when constructed with empty inputs.
Previously, UnionExec::new(vec![]) would cause an index out of bounds panic
at union.rs:542 when trying to access inputs[0].

Changes:
- Made UnionExec::new() return Result<Self> with proper validation
- Made union_schema() return Result<SchemaRef> with empty input checks
- Added descriptive error messages for empty input cases
- Updated all call sites to handle the new Result return type
- Added comprehensive tests for edge cases

Error messages:
- "UnionExec requires at least one input"
- "Cannot create union schema from empty inputs"

The fix maintains backward compatibility for valid inputs while preventing
crashes and providing clear error messages for invalid usage.

Fixes apache#17052
@github-actions bot added the core (Core DataFusion crate) and physical-plan (Changes to the physical-plan crate) labels Sep 5, 2025
Contributor

@alamb left a comment

Thanks @EeshanBembi

fn test_union_empty_inputs() {
// Test that UnionExec::new fails with empty inputs
let result = UnionExec::new(vec![]);
assert!(result.is_err());
Contributor

I think the assertion check for is_err is redundant, as unwrap_err will panic if result is not an Err.
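
The point can be seen in isolation: `Result::unwrap_err` panics when the value is `Ok`, so a preceding `assert!(result.is_err())` adds nothing. A small sketch, where `make_union` is a hypothetical stand-in for the constructor under test and not DataFusion code:

```rust
// Hypothetical stand-in for the constructor under test.
fn make_union(inputs: Vec<u32>) -> Result<Vec<u32>, String> {
    if inputs.is_empty() {
        return Err("UnionExec requires at least one input".to_string());
    }
    Ok(inputs)
}

/// Returns the error message for empty input.
/// `unwrap_err` panics on its own if `make_union` unexpectedly succeeds,
/// so no separate `is_err` assertion is needed.
fn empty_input_error() -> String {
    make_union(vec![]).unwrap_err()
}
```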

 /// Create a new UnionExec
-pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Self {
-let schema = union_schema(&inputs);
+pub fn new(inputs: Vec<Arc<dyn ExecutionPlan>>) -> Result<Self> {
Contributor

This is technically an API change -- maybe to make it easier on others, we can make a new function called try_new that has the error checking, and deprecate the existing new function per https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines

Member

Good point on the API lifecycle. On a separate note, can we make the new try_new method return Box<dyn ExecutionPlan>? This would allow it to return the only child when the input vector is a singleton. There is no point keeping UnionExec(a) in the plan.
Or maybe the new method can simply require the input to have at least two elements?

}

#[test]
fn test_union_multiple_inputs_still_works() -> Result<()> {
Member

Suggested change
-fn test_union_multiple_inputs_still_works() -> Result<()> {
+fn test_union_schema_multiple_inputs() -> Result<()> {

@xudong963
Member

> This PR fixes a panic in UnionExec when constructed with empty inputs,

Not related to the PR, but it'll be better for df to have an optimizer rule to remove empty inputs from union.

@findepi
Member

findepi commented Sep 10, 2025

> > This PR fixes a panic in UnionExec when constructed with empty inputs,
>
> Not related to the PR, but it'll be better for df to have an optimizer rule to remove empty inputs from union

@xudong963, exactly.

See also #17449 (comment)
This is, however, related to this PR: we're adding a new API method, so we can address the problem directly in the new code and leave the legacy code path intact.

- Add new try_new method that returns Result<Arc<dyn ExecutionPlan>>
- Deprecate existing new method in favor of try_new
- Optimize single-input case: try_new returns the input directly
- Remove redundant assert!(result.is_err()) from tests
- Rename test_union_multiple_inputs_still_works to test_union_schema_multiple_inputs
- Update all call sites to use appropriate API (try_new for new code, deprecated new for tests)

This maintains backward compatibility while providing better error handling
and optimization for single-input cases.
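
The shape of this change can be sketched as follows. This is an illustrative model only: `Plan`, `Scan`, and `Union` are hypothetical stand-ins for DataFusion's `ExecutionPlan` trait and `UnionExec`, and a `String` error replaces the real error type.

```rust
use std::sync::Arc;

// Simplified, hypothetical stand-in for the ExecutionPlan trait.
trait Plan {
    fn name(&self) -> String;
}

// Hypothetical leaf plan used only to have something to union.
struct Scan(&'static str);
impl Plan for Scan {
    fn name(&self) -> String {
        format!("Scan({})", self.0)
    }
}

struct Union {
    inputs: Vec<Arc<dyn Plan>>,
}
impl Plan for Union {
    fn name(&self) -> String {
        format!("Union({} inputs)", self.inputs.len())
    }
}

impl Union {
    /// Legacy constructor, kept so existing callers still compile.
    #[deprecated(note = "use Union::try_new instead")]
    fn new(inputs: Vec<Arc<dyn Plan>>) -> Self {
        Union { inputs }
    }

    /// Validating constructor: errors on empty input, returns a lone input
    /// unchanged, and only builds a Union for two or more inputs.
    fn try_new(mut inputs: Vec<Arc<dyn Plan>>) -> Result<Arc<dyn Plan>, String> {
        match inputs.len() {
            0 => Err("UnionExec requires at least one input".to_string()),
            1 => Ok(inputs.remove(0)),
            _ => Ok(Arc::new(Union { inputs }) as Arc<dyn Plan>),
        }
    }
}
```

Returning a trait object rather than `Self` is what lets the single-input case hand back the child directly, at the cost of callers no longer receiving a concrete union type.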
@github-actions bot added the proto (Related to proto crate) label Sep 13, 2025
);

// Union
#[allow(deprecated)]
Member

Let's have an issue to clean these up and add a // TODO (issue link) resolve deprecation comment
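
A call site annotated this way might look like the sketch below. The function names are hypothetical, and the issue link is left as a placeholder because the conversation does not give one.

```rust
// Hypothetical deprecated function standing in for the legacy constructor.
#[deprecated(note = "use try_build instead")]
fn build(n: usize) -> usize {
    n
}

fn call_site() -> usize {
    // TODO (issue link): resolve deprecation
    #[allow(deprecated)]
    let plan = build(3);
    plan
}
```

Scoping `#[allow(deprecated)]` to the single statement keeps the lint active for the rest of the function.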

- Add proper feature gates for parquet_encryption in datasource-parquet
- Format code to pass cargo fmt checks
- All tests passing
@github-actions bot added the datasource (Changes to the datasource crate) label Sep 14, 2025
@alamb
Contributor

alamb commented Sep 15, 2025

I merged up from main and fixed a clippy lint

Contributor

@alamb left a comment

Thank you @EeshanBembi and @findepi

@findepi merged commit b122a16 into apache:main Sep 15, 2025
28 checks passed
LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Nov 3, 2025
* fix: prevent UnionExec panic with empty inputs

* refactor: address PR review comments for UnionExec empty inputs fix

* Fix cargo fmt and clippy warnings

* Fix clippy

---------

Co-authored-by: Eeshan <eeshan@Eeshans-MacBook-Pro.local>
Co-authored-by: ebembi-crdb <ebembi@cockroachlabs.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
(cherry picked from commit b122a16)
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Nov 4, 2025
* Allow filter pushdown through AggregateExec (apache#18404)

## Which issue does this PR close?

- Closes apache#18399

## Rationale for this change

Right now filters cannot pass through `AggregateExec` nodes, preventing
filter pushdown optimization in queries with GROUP BY/DISTINCT
operations.

## What changes are included in this PR?

- Implemented `gather_filters_for_pushdown()` for `AggregateExec` that
allows filters on grouping columns to pass through to children
- Supports both Pre phase (static filters) and Post phase (dynamic
filters from joins)

Essentially, filter will pass through in the scenarios @asolimando
mentioned
[here](apache#18399 (comment))

## Are these changes tested?

Yes, added three tests:
- `test_aggregate_filter_pushdown`: Positive case with aggregate
functions
- `test_no_pushdown_aggregate_filter_on_non_grouping_column`: Negative
case ensuring filters on aggregate results are not pushed

## Are there any user-facing changes?

(cherry picked from commit 076b091)

* physical-plan: push filters down to UnionExec children (apache#18054)

Filters are safe to be pushed down, so we can override the default behavior
here.

Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>
(cherry picked from commit 0ecd59b)

* fix: prevent UnionExec panic with empty inputs (apache#17449)

---------

Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>
Co-authored-by: Alfonso Subiotto Marqués <alfonso.subiotto@polarsignals.com>
Co-authored-by: EeshanBembi <33062610+EeshanBembi@users.noreply.github.com>
Co-authored-by: Eeshan <eeshan@Eeshans-MacBook-Pro.local>
Co-authored-by: ebembi-crdb <ebembi@cockroachlabs.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Labels

core (Core DataFusion crate), datasource (Changes to the datasource crate), physical-plan (Changes to the physical-plan crate), proto (Related to proto crate)

4 participants