18644: optimizer: Support dynamic filter in MIN/MAX aggregates #12

Open
martin-augment wants to merge 6 commits into main from pr-18644-2025-11-12-06-51-21

Conversation

@martin-augment
Owner

18644: To review by AI

@coderabbitai
coderabbitai bot commented Nov 12, 2025

Walkthrough

Implements dynamic filter pushdown for aggregate functions (MIN/MAX) by adding a new configuration option enable_aggregate_dynamic_filter_pushdown to OptimizerOptions and extending AggregateExec with dynamic filter state management, construction, and pushdown logic, along with comprehensive tests and documentation.

Changes

Cohort: Configuration and Settings
File(s): datafusion/common/src/config.rs, datafusion/sqllogictest/test_files/information_schema.slt, docs/source/user-guide/configs.md
Summary: Added enable_aggregate_dynamic_filter_pushdown boolean configuration option (default true) to OptimizerOptions. Updated ConfigOptions::set to propagate dynamic filter pushdown setting to the new aggregate option. Updated documentation and schema to reflect aggregate dynamic filter support and clarified interaction with existing join/topk options.

Cohort: Aggregate Execution Implementation
File(s): datafusion/physical-plan/src/aggregates/mod.rs, datafusion/physical-plan/src/aggregates/no_grouping.rs
Summary: Extended AggregateExec with dynamic_filter field and initialization logic for MIN/MAX aggregate expressions. Implemented dynamic filter state management in AggregateStream including accumulator bounds tracking, filter construction from current values, and predicate synchronization. Added helper functions for scalar MIN/MAX operations with proper null semantics. Integrated dynamic filter into gather_filters_for_pushdown and child pushdown result handling.

Cohort: Filter Pushdown Logic
File(s): datafusion/physical-plan/src/filter.rs
Summary: Updated FilterExec::gather_filters_for_pushdown to immediately return FilterDescription with mapped parent filters for non-Pre phase handling.

Cohort: Tests and Verification
File(s): datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs, datafusion/sqllogictest/test_files/dynamic_filter_pushdown_config.slt, datafusion/sqllogictest/test_files/simplify_expr.slt
Summary: Added comprehensive test suite for aggregate dynamic filter pushdown covering MIN, MAX, mixed expressions, NULL handling, parquet end-to-end scenarios, and single-mode validation. Added integration tests for configuration control of aggregate dynamic filter pushdown behavior. Minor formatting adjustment to test output.
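
The flag propagation described in the Configuration cohort can be sketched as follows. This is a minimal model, not the DataFusion code: the struct is a hypothetical stand-in for OptimizerOptions, the keys are shortened, and the topk/join field names are assumptions modeled on the "existing join/topk options" wording above.

```rust
/// Hypothetical, trimmed-down stand-in for the optimizer options in this PR.
/// Field names other than the aggregate flag are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct OptimizerOptions {
    pub enable_dynamic_filter_pushdown: bool,
    pub enable_topk_dynamic_filter_pushdown: bool,
    pub enable_join_dynamic_filter_pushdown: bool,
    pub enable_aggregate_dynamic_filter_pushdown: bool,
}

impl Default for OptimizerOptions {
    fn default() -> Self {
        Self {
            enable_dynamic_filter_pushdown: true,
            enable_topk_dynamic_filter_pushdown: true,
            enable_join_dynamic_filter_pushdown: true,
            enable_aggregate_dynamic_filter_pushdown: true,
        }
    }
}

impl OptimizerOptions {
    /// Sets one boolean option by (shortened, hypothetical) key.
    /// Setting the umbrella flag overwrites every per-operator flag,
    /// including the new aggregate one.
    pub fn set(&mut self, key: &str, value: bool) -> Result<(), String> {
        match key {
            "enable_dynamic_filter_pushdown" => {
                self.enable_dynamic_filter_pushdown = value;
                self.enable_topk_dynamic_filter_pushdown = value;
                self.enable_join_dynamic_filter_pushdown = value;
                self.enable_aggregate_dynamic_filter_pushdown = value;
            }
            "enable_topk_dynamic_filter_pushdown" => {
                self.enable_topk_dynamic_filter_pushdown = value
            }
            "enable_join_dynamic_filter_pushdown" => {
                self.enable_join_dynamic_filter_pushdown = value
            }
            "enable_aggregate_dynamic_filter_pushdown" => {
                self.enable_aggregate_dynamic_filter_pushdown = value
            }
            other => return Err(format!("unknown option: {other}")),
        }
        Ok(())
    }
}
```

Under this model, turning the umbrella flag off and then re-enabling only the aggregate flag leaves the join and topk flags off, which is the kind of interaction the updated docs are said to clarify.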

@claude
claude bot commented Nov 12, 2025

PR Review: Dynamic Filter in MIN/MAX Aggregates

Summary

This PR implements dynamic filter pushdown for MIN/MAX aggregates, enabling runtime pruning of parquet files based on aggregate bounds. The implementation is well-structured and includes comprehensive tests.

Code Quality & Best Practices

Strengths ✅

  1. Excellent Documentation: The AggrDynFilter struct has comprehensive module-level documentation explaining the feature, use cases, and examples (lines 396-438 in mod.rs).

  2. Comprehensive Test Coverage: The PR includes 10+ test cases covering:

    • Basic MIN/MAX operations
    • Unsupported expressions (MIN(a+1))
    • Multiple columns (same/different)
    • Mixed supported/unsupported aggregates
    • NULL handling
    • End-to-end parquet integration
    • Non-partial aggregate modes
  3. Clean Abstraction: Good separation of concerns with AggrDynFilter, PerAccumulatorDynFilter, and DynamicFilterAggregateType structs.

  4. Proper Configuration: New config flag enable_aggregate_dynamic_filter_pushdown follows existing patterns and integrates with the master enable_dynamic_filter_pushdown flag.

Potential Issues & Suggestions

1. Function Name Detection (Medium Priority) ⚠️

Location: datafusion/physical-plan/src/aggregates/mod.rs:946-955

// HACK: Should check the function type more precisely
// Issue: <https://github.com/apache/datafusion/issues/18643>
let aggr_type = if fun_name.eq_ignore_ascii_case("min") {
    DynamicFilterAggregateType::Min
} else if fun_name.eq_ignore_ascii_case("max") {
    DynamicFilterAggregateType::Max
}

Issue: String-based function name matching is fragile and could incorrectly match user-defined functions named "min" or "max".

Recommendation: The code acknowledges this as a HACK with a tracking issue. Consider adding a check for the function origin (e.g., verify it's from datafusion_functions_aggregate) or use a more type-safe approach if possible before merging.
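
One possible shape for a more type-safe check is sketched below. Everything except the DynamicFilterAggregateType name is hypothetical: the AggregateImpl trait and the BuiltinMin, BuiltinMax, and UserDefinedMin types are illustration-only stand-ins, not DataFusion types. The idea is to compare concrete types through std::any::Any instead of comparing names.

```rust
use std::any::Any;

#[derive(Debug, PartialEq)]
pub enum DynamicFilterAggregateType {
    Min,
    Max,
}

/// Hypothetical minimal aggregate trait (not the DataFusion trait).
pub trait AggregateImpl {
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}

pub struct BuiltinMin;
pub struct BuiltinMax;
/// A user-defined aggregate that merely happens to be named "min".
pub struct UserDefinedMin;

impl AggregateImpl for BuiltinMin {
    fn name(&self) -> &str { "min" }
    fn as_any(&self) -> &dyn Any { self }
}
impl AggregateImpl for BuiltinMax {
    fn name(&self) -> &str { "max" }
    fn as_any(&self) -> &dyn Any { self }
}
impl AggregateImpl for UserDefinedMin {
    fn name(&self) -> &str { "min" }
    fn as_any(&self) -> &dyn Any { self }
}

/// Name-based detection, as in the quoted snippet: matches any function called "min"/"max".
pub fn detect_by_name(f: &dyn AggregateImpl) -> Option<DynamicFilterAggregateType> {
    if f.name().eq_ignore_ascii_case("min") {
        Some(DynamicFilterAggregateType::Min)
    } else if f.name().eq_ignore_ascii_case("max") {
        Some(DynamicFilterAggregateType::Max)
    } else {
        None
    }
}

/// Type-based detection: only the known built-in implementations qualify.
pub fn detect_by_type(f: &dyn AggregateImpl) -> Option<DynamicFilterAggregateType> {
    let any = f.as_any();
    if any.is::<BuiltinMin>() {
        Some(DynamicFilterAggregateType::Min)
    } else if any.is::<BuiltinMax>() {
        Some(DynamicFilterAggregateType::Max)
    } else {
        None
    }
}
```

With these stand-ins, detect_by_name accepts UserDefinedMin while detect_by_type rejects it. Whether the real aggregate types can be downcast this way is not established by this review.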

2. Error Handling in Dynamic Filter Update ⚠️

Location: datafusion/physical-plan/src/aggregates/no_grouping.rs:331

let _ = this.maybe_update_dyn_filter();

Issue: Errors from maybe_update_dyn_filter() are silently ignored. If the dynamic filter update fails, it could lead to:

  • Incorrect pruning decisions
  • Silent degradation of performance optimization

Recommendation: At minimum, log the error. Better yet, propagate it if the failure indicates a serious issue:

if let Err(e) = this.maybe_update_dyn_filter() {
    warn!("Failed to update dynamic filter: {}", e);
    // Consider disabling dynamic filtering for this stream
}
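
A self-contained sketch of the "log and disable" option follows. StreamState and its fields are hypothetical stand-ins for the real stream, and eprintln! replaces the warn! macro so the example needs only std. It assumes that a filter which stops updating stays at an older, less restrictive bound and is therefore still safe to apply.

```rust
/// Hypothetical stand-in for the aggregate stream state.
pub struct StreamState {
    pub dyn_filter_enabled: bool,
    /// Test hook that forces the next update to fail.
    pub fail_next_update: bool,
}

impl StreamState {
    /// Stand-in for the real update; fails only when the test hook is set.
    fn maybe_update_dyn_filter(&mut self) -> Result<(), String> {
        if !self.dyn_filter_enabled {
            return Ok(());
        }
        if self.fail_next_update {
            return Err("incompatible bound types".to_string());
        }
        Ok(())
    }

    /// Called once per batch. A failed update is reported and the
    /// optimization is switched off for this stream instead of being
    /// silently dropped.
    pub fn after_batch(&mut self) {
        if let Err(e) = self.maybe_update_dyn_filter() {
            eprintln!("Failed to update dynamic filter: {e}");
            self.dyn_filter_enabled = false;
        }
    }
}
```

Propagating the error with `?` is the stricter alternative. Which is appropriate depends on whether a failed update can indicate a correctness problem rather than a lost optimization.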

3. Race Condition Potential (Low Priority) 🔍

Location: datafusion/physical-plan/src/aggregates/no_grouping.rs:173-183

let current_bound = acc.evaluate()?;
{
    let mut bound = acc_info.shared_bound.lock();
    match acc_info.aggr_type {
        DynamicFilterAggregateType::Max => {
            *bound = scalar_max(&bound, &current_bound)?;
        }
        // ...
    }
}

Observation: Each partition updates shared bounds independently. While Mutex protects concurrent writes, there's a brief window where:

  1. Multiple partitions could compute bounds
  2. Updates happen in arbitrary order
  3. The dynamic filter might oscillate between bounds

Assessment: This is probably acceptable since:

  • MIN/MAX merges are commutative, associative, and idempotent, so the final shared bound does not depend on the order of partition updates
  • The filter becomes more restrictive over time
  • Worst case is suboptimal (but still correct) pruning

Recommendation: Add a comment explaining this is expected behavior and safe.
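
The order-independence argument can be illustrated with a small std-only model. Option<i64> stands in for ScalarValue and std::sync::Mutex for the lock used in the PR; merge_max and run_partitions are hypothetical names. Because each merge takes the maximum of the shared and local values, the shared bound in this model only ever moves in one direction.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Merges one partition's local MAX into the shared bound.
/// `None` models "bound not yet determined".
pub fn merge_max(shared: &Mutex<Option<i64>>, local: i64) {
    let mut bound = shared.lock().unwrap();
    *bound = Some(match *bound {
        Some(current) => current.max(local),
        None => local,
    });
}

/// Runs one thread per partition; the threads may merge in any order.
pub fn run_partitions(local_maxima: Vec<i64>) -> Option<i64> {
    let shared: Arc<Mutex<Option<i64>>> = Arc::new(Mutex::new(None));
    let handles: Vec<_> = local_maxima
        .into_iter()
        .map(|m| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || merge_max(&shared, m))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *shared.lock().unwrap();
    result
}
```

Whatever interleaving the scheduler picks, the result equals the maximum of the inputs, which is the property a code comment in the PR could state.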

4. NULL Handling Edge Case 🔍

Location: datafusion/physical-plan/src/aggregates/no_grouping.rs:104-110

let bound = {
    let guard = acc_info.shared_bound.lock();
    if (*guard).is_null() {
        continue;
    }
    guard.clone()
};

Question: What happens if a partition processes only NULL values? The test at line 2195 shows it correctly returns DynamicFilter [ true ], but it might be worth verifying that:

  • Null bounds don't interfere with other partitions' bounds
  • The scalar_min/scalar_max functions handle the null semantics correctly across all data types

Recommendation: The NULL handling looks correct based on scalar_cmp_null_short_circuit (lines 240-249), but consider adding a test with mixed NULL/non-NULL across partitions.

5. Memory Considerations 🔍

Location: datafusion/physical-plan/src/aggregates/mod.rs:468

shared_bound: Arc<Mutex<ScalarValue>>,

Observation: For large string/binary types, cloning ScalarValues in the hot path (lines 109 and 172) could be expensive.

Recommendation: This is likely fine for typical MIN/MAX use cases (numbers, dates), but document the potential performance impact for very large string/binary columns.

Performance Considerations

Positive ✅

  1. Effective Pruning: The parquet e2e test (line 2224) demonstrates actual file pruning.
  2. Lazy Evaluation: Dynamic filters only update after processing each batch, not per row.
  3. Config-Driven: Can be disabled if overhead becomes problematic.

Potential Concerns ⚠️

  1. Lock Contention: Multiple partitions locking shared_bound on every batch. Consider:

    • Using atomic operations for numeric types
    • Batching updates less frequently
    • Profiling under high partition counts
  2. Filter Rebuild Cost: build_dynamic_filter_from_accumulator_bounds() is called on every batch and creates new Binary/Or expressions. Consider memoization if this becomes a bottleneck.
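
One way to cut the rebuild cost, sketched under assumptions: have the merge report whether the shared bound actually changed, and rebuild the predicate only then. FilterState, merge_max_changed, and the rebuilds counter are hypothetical; the counter stands in for building and publishing the new expression.

```rust
use std::sync::Mutex;

/// Merges a partition-local MAX into the shared bound and reports
/// whether the shared bound changed.
pub fn merge_max_changed(shared: &Mutex<Option<i64>>, local: i64) -> bool {
    let mut bound = shared.lock().unwrap();
    let current: Option<i64> = *bound;
    match current {
        // The shared bound already covers this partition's value.
        Some(c) if c >= local => false,
        _ => {
            *bound = Some(local);
            true
        }
    }
}

/// Hypothetical per-stream filter state.
pub struct FilterState {
    pub shared_max: Mutex<Option<i64>>,
    /// Counts how often the predicate would have been rebuilt.
    pub rebuilds: usize,
}

impl FilterState {
    /// Called once per batch with the partition's current MAX.
    pub fn on_batch(&mut self, local_max: i64) {
        if merge_max_changed(&self.shared_max, local_max) {
            self.rebuilds += 1;
        }
    }
}
```

In this model a run of batches that never improves the bound triggers no rebuilds at all. The lock is still taken once per batch, so this addresses the rebuild cost but not the contention concern.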

Security Concerns

No security issues identified. The dynamic filter operates on trusted internal data structures and doesn't expose user-controlled input to unsafe operations.

Test Coverage Assessment

Excellent Coverage ✅

  • Unit tests for all supported scenarios
  • Edge cases (NULL, unsupported expressions)
  • Integration test with real parquet files
  • Negative test for unsupported modes (Single mode)

Suggested Additions

  1. Concurrency test: Verify behavior with multiple partitions updating bounds simultaneously
  2. Different data types: Test with timestamps, dates, decimals (not just Int32)
  3. Large scale: Test with many aggregate expressions (e.g., 100+ MIN/MAX columns)
  4. Filter effectiveness metrics: Verify the actual pruning ratio in tests

Documentation

Good ✅

  • Inline comments explain the algorithm
  • Examples in doc comments
  • Config documentation updated

Could Improve 📝

  • Add a high-level architecture diagram showing how the filter flows from AggregateExec → Scanner
  • Document performance characteristics and when to disable
  • Add examples to the user guide (docs/source/user-guide/)

Minor Nits

  1. Typo in datafusion/physical-plan/src/aggregates/mod.rs:430:

    "it will gets update" → "it will get updated"

  2. Inconsistent spacing in datafusion/physical-plan/src/aggregates/mod.rs:464:

    "vec inside AggregateStream aggregate_expressions" (extra spaces)

  3. Comment clarity in datafusion/physical-plan/src/aggregates/no_grouping.rs:68:
    Consider renaming agg_dyn_filter_state to dynamic_filter for consistency with AggregateExec

Recommendation

Approve with minor changes

The implementation is solid and well-tested. Address the error handling issue (#2) before merging, and consider the function name detection concern (#1) if feasible. The other issues are minor and could be addressed in follow-up PRs.

Priority fixes before merge:

  1. Handle errors from maybe_update_dyn_filter() (don't silently ignore)
  2. Fix typos

Nice-to-have improvements:

  1. More robust function type detection
  2. Add comments about race condition safety
  3. Additional test coverage for edge cases

Great work on this optimization! 🎉


Review generated with assistance from Claude Code

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

Some(Ordering::Less | Ordering::Equal) => Ok(v1.clone()),
Some(Ordering::Greater) => Ok(v2.clone()),
None => datafusion_common::internal_err!(
"cannot compare values of different or incompatible types: {v1:?} vs {v2:?}"

If either bound is NaN (for float types), partial_cmp returns None and this branch returns an internal error, which can fail the query during dynamic filter updates.

Owner Author

value:annoying; category:bug; feedback:The Augment AI reviewer is not correct. https://github.com/apache/datafusion/blob/becc71be04732ca2f62139192dcc76caf6c08fd3/datafusion/common/src/scalar/mod.rs#L515-L529 shows that when both scalars have the same type, e.g. a Float64 compared with another Float64, the comparison uses total ordering, so NaN is not an issue. None is returned only when the scalars being compared have different types, which would be an application logic error.
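
The distinction between partial and total ordering can be seen with plain f64 values from the standard library. This sketch does not reproduce DataFusion's ScalarValue comparison; float_max_total is a hypothetical helper that only shows why a total order never yields a "cannot compare" case for NaN.

```rust
use std::cmp::Ordering;

/// Picks the larger of two floats under IEEE 754 total ordering.
/// `total_cmp` always returns an `Ordering`, even when NaN is involved.
pub fn float_max_total(a: f64, b: f64) -> f64 {
    match a.total_cmp(&b) {
        Ordering::Less => b,
        Ordering::Equal | Ordering::Greater => a,
    }
}

/// Returns true when `partial_cmp` cannot order the two values,
/// which happens when either operand is NaN.
pub fn partial_cmp_fails(a: f64, b: f64) -> bool {
    a.partial_cmp(&b).is_none()
}
```

So the reviewer's concern would apply to a comparison built on partial_cmp over raw floats, but not to one built on a total order.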

/// --> dynamic filter PhysicalExpr: (c1 < 10) OR (c1>100) OR (c2 < 20)
///
/// # Errors
/// Returns internal errors if the dynamic filter is not enabled

The docstring states this returns an internal error when the dynamic filter is not enabled, but the function returns a literal true in that case; the comment seems inaccurate.

Owner Author

value:good-to-have; category:documentation; feedback:The Augment AI reviewer is correct that the docstring does not match the actual implementation. Fixing it prevents confusion for developers reading this documentation.
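
The behavior under discussion (an OR of per-aggregate range terms, falling back to a literal true) can be modeled in a few lines. This is a string-building sketch under assumptions: Bound and build_predicate are hypothetical, i64 stands in for ScalarValue, and the real function builds physical expressions rather than text.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DynamicFilterAggregateType {
    Min,
    Max,
}

/// Hypothetical per-aggregate bound; `None` means no bound is known yet.
pub struct Bound {
    pub column: &'static str,
    pub aggr_type: DynamicFilterAggregateType,
    pub value: Option<i64>,
}

/// Builds the textual form of the dynamic filter.
/// MIN(col) with bound b keeps rows where col < b, MAX(col) keeps col > b.
/// With no usable bound the predicate is the literal "true" (accept all).
pub fn build_predicate(bounds: &[Bound]) -> String {
    let terms: Vec<String> = bounds
        .iter()
        .filter_map(|b| {
            let v = b.value?;
            Some(match b.aggr_type {
                DynamicFilterAggregateType::Min => format!("({} < {})", b.column, v),
                DynamicFilterAggregateType::Max => format!("({} > {})", b.column, v),
            })
        })
        .collect();
    if terms.is_empty() {
        "true".to_string()
    } else {
        terms.join(" OR ")
    }
}
```

For bounds MIN(c1)=10, MAX(c1)=100, MIN(c2)=20 this yields the expression quoted in the docstring, and for no bounds it yields true rather than an error, matching the reviewer's reading of the implementation.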

/// - If yes, init one inside `AggregateExec`'s `dynamic_filter` field.
/// - If not supported, `self.dynamic_filter` should be kept `None`
fn init_dynamic_filter(&mut self) {
if (!self.group_by.is_single()) || (!matches!(self.mode, AggregateMode::Partial))

Bug: Dynamic Filters Violate Grouping Requirements

The dynamic filter initialization incorrectly checks !is_single() instead of !is_empty(). This allows dynamic filters for aggregates with GROUP BY columns (like SELECT min(a) FROM table GROUP BY b), contradicting the documented requirement that dynamic filters only work with no grouping. The check passes when there's a single grouping set with columns, but should only pass when there are zero grouping columns.

Owner Author

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that according to the Pull Request description and the docstring at line 423 the dynamic filter should be disabled if there is a GROUP BY clause, so the check should be !self.group_by.is_empty()
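
The difference between the two checks can be shown with a simplified model. GroupBy here is a hypothetical stand-in, and its definitions of is_single (at most one grouping set) and is_empty (no grouping columns) are assumptions chosen to match the bug description, not the actual PhysicalGroupBy implementation.

```rust
/// Hypothetical simplified grouping description.
pub struct GroupBy {
    /// Grouping columns, e.g. ["b"] for `GROUP BY b`.
    pub expr: Vec<String>,
    /// Number of grouping sets (more than 1 for GROUPING SETS / CUBE / ROLLUP).
    pub grouping_sets: usize,
}

impl GroupBy {
    /// True only when there are no grouping columns at all.
    pub fn is_empty(&self) -> bool {
        self.expr.is_empty()
    }
    /// True when there is at most one grouping set, with or without columns.
    pub fn is_single(&self) -> bool {
        self.grouping_sets <= 1
    }
}

/// Early-return condition as written in the PR: skips the filter only for
/// multiple grouping sets or non-partial mode.
pub fn skips_init_as_written(g: &GroupBy, is_partial: bool) -> bool {
    !g.is_single() || !is_partial
}

/// Early-return condition as proposed: skips the filter whenever any
/// grouping column is present.
pub fn skips_init_as_proposed(g: &GroupBy, is_partial: bool) -> bool {
    !g.is_empty() || !is_partial
}
```

For `SELECT min(a) FROM t GROUP BY b`, the condition as written does not skip initialization while the proposed one does. Both leave the no-grouping case enabled.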

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/physical-plan/src/filter.rs (1)

560-573: Bug: projection is lost when rebuilding FilterExec after pushdown

When new_predicate differs and a new FilterExec is created, projection is set to None. This drops any existing projection, changing the output schema/columns unexpectedly.

Preserve self.projection in the rebuilt node.

         } else {
             // Create a new FilterExec with the new predicate
             let new = FilterExec {
                 predicate: Arc::clone(&new_predicate),
                 input: Arc::clone(&filter_input),
                 metrics: self.metrics.clone(),
                 default_selectivity: self.default_selectivity,
                 cache: Self::compute_properties(
                     &filter_input,
                     &new_predicate,
                     self.default_selectivity,
                     self.projection.as_ref(),
                 )?,
-                projection: None,
+                // Preserve any existing projection
+                projection: self.projection.clone(),
                 batch_size: self.batch_size,
             };
             Some(Arc::new(new) as _)
         };

Please add/extend a test that exercises a FilterExec with a non‑empty projection through pushdown to ensure the projection is retained.

🧹 Nitpick comments (5)
datafusion/physical-plan/src/filter.rs (1)

474-485: Confirm non‑Pre phase pass‑through semantics

In the non‑Pre phase you map all parent_filters to PushedDownPredicate::supported and set self_filters to empty. If the intent is pure pass‑through (no-op at this phase), marking everything as “supported” depends on downstream logic to avoid swallowing filters.

Please confirm invariants for “supported” in later phases (e.g., with FilterPushdownPropagation::if_all) and consider an inline comment or a small unit test to lock this behavior.

datafusion/sqllogictest/test_files/information_schema.slt (1)

416-418: Nit: clarify scope is MIN/MAX aggregates

The verbose text and the umbrella flag description read “Aggregate dynamic filters”. To reduce ambiguity, consider stating that (currently) only MIN/MAX aggregates emit dynamic filters.

This likely means updating the Rust docstrings that generate both docs and SHOW output so SLT stays consistent.

docs/source/user-guide/configs.md (1)

139-141: Generated docs: consider clarifying “Aggregate” → “MIN/MAX aggregates”

To set the right expectation, recommend adjusting the source docstrings (that feed this generated table) to explicitly mention that aggregate dynamic filters are produced by MIN/MAX.

Do not edit this file directly; change the docstrings or dev/update_config_docs.sh.

datafusion/physical-plan/src/aggregates/mod.rs (2)

932-934: Consider removing the redundant "already initialized" check.

Since try_new_with_schema always initializes with dynamic_filter: None (line 664) and init_dynamic_filter is only called once immediately after (line 667), this early return for "already initialized" appears unnecessary.

If this is defensive programming for future use cases, consider adding a comment explaining when re-initialization might occur. Otherwise, removing this check would simplify the logic.


922-923: Simplify boolean condition.

The extra parentheses in the condition are unnecessary and can be simplified for readability.

Apply this diff:

-        if (!self.group_by.is_single()) || (!matches!(self.mode, AggregateMode::Partial))
+        if !self.group_by.is_single() || !matches!(self.mode, AggregateMode::Partial)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc49fc0 and 2041763.

📒 Files selected for processing (9)
  • datafusion/common/src/config.rs (2 hunks)
  • datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs (4 hunks)
  • datafusion/physical-plan/src/aggregates/mod.rs (11 hunks)
  • datafusion/physical-plan/src/aggregates/no_grouping.rs (4 hunks)
  • datafusion/physical-plan/src/filter.rs (1 hunks)
  • datafusion/sqllogictest/test_files/dynamic_filter_pushdown_config.slt (7 hunks)
  • datafusion/sqllogictest/test_files/information_schema.slt (2 hunks)
  • datafusion/sqllogictest/test_files/simplify_expr.slt (0 hunks)
  • docs/source/user-guide/configs.md (1 hunks)
💤 Files with no reviewable changes (1)
  • datafusion/sqllogictest/test_files/simplify_expr.slt
🧰 Additional context used
🧬 Code graph analysis (3)
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs (2)
datafusion/physical-plan/src/aggregates/mod.rs (7)
  • new (183-193)
  • new (2196-2200)
  • schema (2321-2323)
  • expr (220-222)
  • expr (815-815)
  • expr (867-867)
  • try_new (547-567)
datafusion/core/tests/physical_optimizer/filter_pushdown/util.rs (6)
  • new (119-134)
  • new (274-280)
  • new (338-349)
  • new (382-414)
  • new (465-475)
  • format_plan_for_test (448-455)
datafusion/physical-plan/src/aggregates/no_grouping.rs (2)
datafusion/physical-plan/src/filter.rs (2)
  • metrics (417-419)
  • new (664-671)
datafusion/physical-plan/src/metrics/value.rs (1)
  • timer (214-219)
datafusion/physical-plan/src/aggregates/mod.rs (4)
datafusion/physical-plan/src/sorts/sort.rs (3)
  • expressions (852-855)
  • gather_filters_for_pushdown (1344-1364)
  • result (1699-1699)
datafusion/physical-plan/src/execution_plan.rs (2)
  • gather_filters_for_pushdown (561-571)
  • handle_child_pushdown_result (652-659)
datafusion/physical-expr/src/expressions/literal.rs (1)
  • lit (141-146)
datafusion/physical-plan/src/filter_pushdown.rs (2)
  • parent_filters (449-455)
  • if_any (249-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (8)
datafusion/sqllogictest/test_files/information_schema.slt (1)

293-293: Good: config surfaced in SHOW ALL

New datafusion.optimizer.enable_aggregate_dynamic_filter_pushdown entry appears in SHOW ALL output as expected.

If any downstream CI is order-sensitive, ensure rowsort is applied for queries that include this row (it is for this one).

datafusion/physical-plan/src/aggregates/mod.rs (7)

396-439: Excellent documentation for dynamic filter feature.

The documentation clearly explains the concept, use cases, enable conditions, and provides concrete examples. This will help maintainers understand the feature.


440-476: Well-designed structure for dynamic filter state.

The use of Arc<Mutex<ScalarValue>> for shared_bound appropriately handles thread-safe updates across multiple execution streams.


972-972: Clarify the purpose of lit(true) in DynamicFilterPhysicalExpr.

The DynamicFilterPhysicalExpr is constructed with lit(true) as the second argument. Without context, it's unclear whether this is:

  • A placeholder that gets replaced during execution
  • An initial "accept all" predicate
  • Combined with the column expressions in some way

Adding a brief comment explaining the semantics would improve code clarity.


538-538: Correct propagation of dynamic_filter state.

The dynamic filter is properly cloned (cheap Arc clone) when creating new instances through with_new_aggr_exprs and with_new_children, ensuring the filter state is preserved through plan transformations.

Also applies to: 1165-1165


1274-1282: Dynamic filter pushdown correctly integrated.

The logic appropriately adds the self dynamic filter during the Post phase when:

  1. The configuration flag enable_aggregate_dynamic_filter_pushdown is enabled
  2. A dynamic filter has been initialized for this aggregate

This integrates well with the existing parent filter handling.


1289-1316: Correctly handles dynamic filter rejection by child.

The logic properly detects when the child cannot consume the dynamic filter by checking if self_filters for the child is non-empty. When the child rejects the filter, it appropriately clones the aggregate and disables the dynamic filter by setting it to None.

Since AggregateExec has a single child, checking .first() is appropriate.


964-964: ScalarValue::Null is correctly used as a sentinel and is safe.

The initial ScalarValue::Null is intentionally a sentinel indicating "bound not yet determined." The implementation safely handles this:

  1. Runtime checks prevent null use: Line 105-110 in no_grouping.rs explicitly checks is_null() and skips pruning if true.
  2. Null handling in comparisons: scalar_cmp_null_short_circuit (line 240-249) correctly implements null semantics—when comparing Null with any value, it returns the non-null value, ensuring the first update replaces the sentinel.
  3. Design is explicit: The sentinel pattern is intentional and well-tested through the early-exit check.

No changes required.

Comment on lines +167 to +189
                    acc_info.aggr_index
                )
            })?;
            // First get current partition's bound, then update the shared bound among
            // all partitions.
            let current_bound = acc.evaluate()?;
            {
                let mut bound = acc_info.shared_bound.lock();
                match acc_info.aggr_type {
                    DynamicFilterAggregateType::Max => {
                        *bound = scalar_max(&bound, &current_bound)?;
                    }
                    DynamicFilterAggregateType::Min => {
                        *bound = scalar_min(&bound, &current_bound)?;
                    }
                }
            }
        }

        // Step 2: Sync the dynamic filter physical expression with reader
        let predicate = self.build_dynamic_filter_from_accumulator_bounds()?;
        filter_state.filter.update(predicate)?;

@coderabbitai coderabbitai bot Nov 12, 2025

⚠️ Potential issue | 🟠 Major

Don’t clobber the global bound when a partition returns NULL

When a partition has no qualifying rows (e.g., all filtered out or NULL), acc.evaluate() comes back as ScalarValue::<T>(None). The current code still runs it through scalar_min, which (because None < Some(_) in Rust’s Option ordering) overwrites an existing non-null bound from another partition with None. That immediately turns the combined predicate into true, so we lose the pruned range we just discovered. The same situation happens for MAX/scalar_max if the operands are flipped.

We should treat a NULL current_bound as “no update” and skip the merge for that accumulator instead of erasing a valid bound.

             let current_bound = acc.evaluate()?;
+            if current_bound.is_null() {
+                continue;
+            }
             {
                 let mut bound = acc_info.shared_bound.lock();
                 match acc_info.aggr_type {
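
The ordering pitfall behind this comment can be reproduced with std types alone. Option<i64> stands in for a nullable scalar, and both helpers are hypothetical names. In Rust's derived ordering for Option, None sorts before every Some, so a plain min lets a NULL partition result replace a known bound.

```rust
/// Naive merge: uses Option's built-in ordering, where None < Some(_).
pub fn merge_min_naive(shared: Option<i64>, current: Option<i64>) -> Option<i64> {
    std::cmp::min(shared, current)
}

/// Guarded merge: a NULL (None) on either side means "no information",
/// so the other side is kept unchanged.
pub fn merge_min_guarded(shared: Option<i64>, current: Option<i64>) -> Option<i64> {
    match (shared, current) {
        (s, None) => s,
        (None, c) => c,
        (Some(s), Some(c)) => Some(s.min(c)),
    }
}
```

The guarded version has the same effect as the suggested `continue` on a NULL current_bound: the shared bound is only touched when the partition has a real value to contribute.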
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
                    acc_info.aggr_index
                )
            })?;
            // First get current partition's bound, then update the shared bound among
            // all partitions.
            let current_bound = acc.evaluate()?;
            if current_bound.is_null() {
                continue;
            }
            {
                let mut bound = acc_info.shared_bound.lock();
                match acc_info.aggr_type {
                    DynamicFilterAggregateType::Max => {
                        *bound = scalar_max(&bound, &current_bound)?;
                    }
                    DynamicFilterAggregateType::Min => {
                        *bound = scalar_min(&bound, &current_bound)?;
                    }
                }
            }
        }
        // Step 2: Sync the dynamic filter physical expression with reader
        let predicate = self.build_dynamic_filter_from_accumulator_bounds()?;
        filter_state.filter.update(predicate)?;
🤖 Prompt for AI Agents
In datafusion/physical-plan/src/aggregates/no_grouping.rs around lines 167 to
189, the code merges a partition's accumulator bound into the shared bound
unconditionally, which lets a None (NULL) current_bound overwrite a valid global
bound via scalar_min/scalar_max; change the merge to first check if
current_bound is None and, if so, skip updating the shared bound entirely;
otherwise perform the existing scalar_min/scalar_max merge while holding the
lock so non-null values update the global bound as before.

Owner Author

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that Null scalars should not be used to find the min value because they will always be the min. Prevents wrong calculation of the min value in the expression.


@martin-augment Thank you for confirming! I'm glad the analysis was helpful in catching this bug. You're absolutely right that allowing NULL scalars to participate in the min/max comparison would incorrectly overwrite valid bounds from other partitions, causing the dynamic filter to lose its effectiveness and fall back to true. The suggested fix to skip the merge when current_bound.is_null() ensures that only meaningful bounds update the shared state across partitions.


🧠 Learnings used
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.


Repository owner deleted a comment from coderabbitai bot Nov 12, 2025
@martin-augment
Owner Author

2. Error Handling in Dynamic Filter Update ⚠️

Location: datafusion/physical-plan/src/aggregates/no_grouping.rs:331

let _ = this.maybe_update_dyn_filter();

Issue: Errors from maybe_update_dyn_filter() are silently ignored. If the dynamic filter update fails, it could lead to:

  • Incorrect pruning decisions
  • Silent degradation of performance optimization

value:useful; category:bug; feedback:The Claude AI reviewer is correct that errors would be ignored this way. The finding prevents wrong behavior during runtime.
