18446: feat: Add evaluate_to_arrays function#7
Conversation
WalkthroughIntroduces a new utility function Changes
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: PR #7 - Add evaluate_to_arrays functionSummaryThis PR introduces a shared utility function Positive AspectsCode Quality & Best Practices:
Correctness:
Areas for Improvement1. Test Coverage (Medium Priority)The new utility function
Suggested test location: Example test structure: #[test]
fn test_evaluate_expressions_to_arrays_empty() -> Result<()> {
let batch = RecordBatch::new_empty(Arc::new(Schema::empty()));
let exprs: Vec<Arc<dyn PhysicalExpr>> = vec![];
let result = evaluate_expressions_to_arrays(&exprs, &batch)?;
assert!(result.is_empty());
Ok(())
}2. Documentation Enhancement (Low Priority)Consider adding a doc example showing typical usage: /// # Example
/// ```rust,ignore
/// let exprs = vec![col("a"), col("b")];
/// let arrays = evaluate_expressions_to_arrays(&exprs, &batch)?;
/// ```
3. Performance Consideration (Informational)The documentation mentions:
This is good advice. For future optimization opportunities, consider whether any of the call sites could benefit from specialized handling of ScalarValues (e.g., in projections where a constant column doesn't need to be materialized). However, this is out of scope for this PR. Security & CorrectnessNo Security Concerns Identified:
Correctness Verification:
Performance ConsiderationsNeutral to Positive Impact:
RecommendationApprove with Minor Suggestions: This is a well-executed refactoring PR. The code is correct, improves maintainability, and follows DataFusion conventions. My only recommendation is to add unit tests for the new utility function to ensure it's tested independently of the integration tests. Merge Status: ✅ Safe to merge after addressing test coverage (or accepting that integration tests provide sufficient coverage). Overall Assessment:
Great work on this cleanup! 🎯 |
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (6)
datafusion/physical-plan/src/joins/hash_join/stream.rs (1)
451-456: Add a debug assertion to guard against key length mismatches.Protects create_hashes/hashes_buffer assumptions in debug builds.
- let keys_values = evaluate_expressions_to_arrays(&self.on_right, &batch)?; + let keys_values = evaluate_expressions_to_arrays(&self.on_right, &batch)?; + debug_assert!( + keys_values.iter().all(|a| a.len() == batch.num_rows()), + "join key arrays must match probe batch length" + );datafusion/physical-plan/src/repartition/mod.rs (1)
387-393: Add a debug assertion before hashing.Validates arrays’ length equals batch.num_rows() for hash partitioning.
- let arrays = evaluate_expressions_to_arrays(exprs, &batch)?; + let arrays = evaluate_expressions_to_arrays(exprs, &batch)?; + debug_assert!( + arrays.iter().all(|a| a.len() == batch.num_rows()), + "hash partitioning keys must match batch length" + );datafusion/physical-plan/src/joins/utils.rs (1)
1677-1682: Add a debug assertion in update_hash for key lengths.Guards against accidental cardinality-changing expressions in join keys.
- let keys_values = evaluate_expressions_to_arrays(on, batch)?; + let keys_values = evaluate_expressions_to_arrays(on, batch)?; + debug_assert!( + keys_values.iter().all(|a| a.len() == batch.num_rows()), + "build-side join key arrays must match batch length" + );datafusion/physical-expr-common/src/utils.rs (1)
95-114: Add debug_assert and unit test to validate array length consistency.The function lacks dedicated test coverage despite being used across 11+ call sites in projection, aggregates, joins, sorts, and window operations. The suggested restructure with
debug_assert!is sound—it enforces the array length invariant at zero runtime cost in release builds while catching programmer errors early in debug builds.Note: The pattern search for legacy
evaluate(...).into_array(...)call sites would yield no results; the codebase is already fully migrated to this centralized helper.Apply the suggested diff and add a unit test mixing
ColumnandScalarValueexpressions to validate both branches of the evaluation path:pub fn evaluate_expressions_to_arrays( exprs: &[Arc<dyn PhysicalExpr>], batch: &RecordBatch, ) -> Result<Vec<ArrayRef>> { let num_rows = batch.num_rows(); - exprs - .iter() - .map(|e| e.evaluate(batch).and_then(|col| col.into_array(num_rows))) - .collect::<Result<Vec<ArrayRef>>>() + let arrays = exprs + .iter() + .map(|e| e.evaluate(batch).and_then(|col| col.into_array(num_rows))) + .collect::<Result<Vec<ArrayRef>>>()?; + debug_assert!( + arrays.iter().all(|a| a.len() == num_rows), + "evaluate_expressions_to_arrays: array length mismatch with input batch" + ); + Ok(arrays) }datafusion/physical-plan/src/aggregates/mod.rs (1)
1411-1428: Apply utility in evaluate_group_by for consistencyRefactor the manual loops to reuse evaluate_expressions_to_arrays; reduces duplication and keeps scalar expansion consistent.
- let exprs: Vec<ArrayRef> = group_by - .expr - .iter() - .map(|(expr, _)| { - let value = expr.evaluate(batch)?; - value.into_array(batch.num_rows()) - }) - .collect::<Result<Vec<_>>>()?; - - let null_exprs: Vec<ArrayRef> = group_by - .null_expr - .iter() - .map(|(expr, _)| { - let value = expr.evaluate(batch)?; - value.into_array(batch.num_rows()) - }) - .collect::<Result<Vec<_>>>()?; + let exprs_list: Vec<_> = group_by.expr.iter().map(|(e, _)| Arc::clone(e)).collect(); + let null_exprs_list: Vec<_> = group_by.null_expr.iter().map(|(e, _)| Arc::clone(e)).collect(); + let exprs: Vec<ArrayRef> = evaluate_expressions_to_arrays(&exprs_list, batch)?; + let null_exprs: Vec<ArrayRef> = evaluate_expressions_to_arrays(&null_exprs_list, batch)?;Also applies to: 1429-1447
datafusion/physical-plan/src/aggregates/no_grouping.rs (1)
222-224: Tiny readability nit: be explicit with CowUse as_ref() to make the Cow to &RecordBatch conversion obvious.
- let values = evaluate_expressions_to_arrays(expr, &batch)?; + let values = evaluate_expressions_to_arrays(expr, batch.as_ref())?;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
datafusion/physical-expr-common/src/utils.rs(2 hunks)datafusion/physical-expr/src/window/standard_window_function_expr.rs(2 hunks)datafusion/physical-expr/src/window/window_expr.rs(2 hunks)datafusion/physical-plan/src/aggregates/mod.rs(2 hunks)datafusion/physical-plan/src/aggregates/no_grouping.rs(2 hunks)datafusion/physical-plan/src/joins/hash_join/exec.rs(2 hunks)datafusion/physical-plan/src/joins/hash_join/stream.rs(2 hunks)datafusion/physical-plan/src/joins/symmetric_hash_join.rs(2 hunks)datafusion/physical-plan/src/joins/utils.rs(2 hunks)datafusion/physical-plan/src/projection.rs(2 hunks)datafusion/physical-plan/src/repartition/mod.rs(2 hunks)datafusion/physical-plan/src/sorts/stream.rs(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (12)
datafusion/physical-expr/src/window/standard_window_function_expr.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-plan/src/joins/symmetric_hash_join.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-plan/src/projection.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-plan/src/joins/hash_join/stream.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-plan/src/sorts/stream.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-plan/src/joins/hash_join/exec.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-expr/src/window/window_expr.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-plan/src/aggregates/mod.rs (1)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-plan/src/aggregates/no_grouping.rs (2)
datafusion/physical-plan/src/filter.rs (1)
batch_filter(651-656)datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)
datafusion/physical-expr-common/src/utils.rs (3)
datafusion/physical-expr/src/expressions/case.rs (1)
record_batch(262-266)datafusion/physical-plan/src/joins/hash_join/exec.rs (1)
batch(138-140)datafusion/physical-plan/src/aggregates/mod.rs (1)
e(1329-1331)
datafusion/physical-plan/src/repartition/mod.rs (1)
datafusion/physical-expr-common/src/utils.rs (2)
evaluate_expressions_to_arrays(104-113)exprs(109-112)
datafusion/physical-plan/src/joins/utils.rs (2)
datafusion/physical-expr-common/src/utils.rs (1)
evaluate_expressions_to_arrays(104-113)datafusion/physical-plan/src/joins/hash_join/exec.rs (7)
on(484-486)on(735-739)on(753-759)on(907-911)on(984-988)on(1016-1020)batch(138-140)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: codex
🔇 Additional comments (16)
datafusion/physical-expr-common/src/utils.rs (1)
25-25: Import looks correct.Required for the new utility signature.
datafusion/physical-expr/src/window/standard_window_function_expr.rs (2)
26-26: LGTM.Shared utility import keeps this consistent with other modules.
61-62: LGTM — centralized evaluation.Behavior matches previous logic (evaluate + into_array).
datafusion/physical-expr/src/window/window_expr.rs (2)
44-45: LGTM.Import aligns with new shared helper.
94-95: LGTM — same semantics, less duplication.Ensure window arguments do not change row counts (as already assumed here).
If any window arg can alter cardinality, keep using per-op logic instead of the helper.
datafusion/physical-plan/src/joins/hash_join/stream.rs (1)
54-55: LGTM.Import pulls in the shared helper for key evaluation.
datafusion/physical-plan/src/repartition/mod.rs (1)
65-65: LGTM.Import aligns with project-wide consolidation.
datafusion/physical-plan/src/projection.rs (2)
60-61: LGTM.Import is consistent with new helper usage.
361-370: LGTM — preserves zero-column behavior.Centralized evaluation reduces duplication; empty-projection path still correct.
datafusion/physical-plan/src/joins/utils.rs (1)
78-79: LGTM.Import aligns join utilities with the shared evaluation helper.
datafusion/physical-plan/src/sorts/stream.rs (1)
28-28: Centralize expr evaluation in sort stream — LGTMSwapping per-expr loops for evaluate_expressions_to_arrays keeps semantics, avoids duplicate num_rows plumbing, and aligns with other modules. No change to memory accounting or ordering.
Also applies to: 168-176
datafusion/physical-plan/src/joins/hash_join/exec.rs (1)
80-81: Build-side keys via evaluate_expressions_to_arrays — LGTMConsistent with the reversed concat order; indices remain aligned with the hashmap. Shallow cloning into JoinLeftData is fine.
Also applies to: 1469-1471
datafusion/physical-plan/src/aggregates/mod.rs (1)
55-57: Simplify evaluate_many via shared utility — LGTMDelegation to evaluate_expressions_to_arrays per inner group is correct and clearer; behavior unchanged.
Also applies to: 1350-1358
datafusion/physical-plan/src/aggregates/no_grouping.rs (1)
36-40: Batch expression evaluation via utility — LGTMevaluate_expressions_to_arrays after optional filter is correct and simplifies the path; accumulator merge/update logic remains unchanged.
Also applies to: 222-225
datafusion/physical-plan/src/joins/symmetric_hash_join.rs (2)
81-81: LGTM! Import correctly added for the new utility function.The import is properly placed and necessary for the centralized expression evaluation below.
1069-1070: LGTM! Clean refactoring to centralized utility.The replacement of manual expression evaluation with
evaluate_expressions_to_arraysmaintains identical behavior while improving maintainability. The function signatures match perfectly, and this aligns with the broader PR objective to centralize expression evaluation across the codebase.
value:annoying; category:bug; feedback:The Codex AI reviewer seems to hallucinate. 1) There are no changes in this PR around that line in mod.rs; 2) There is no occurrence of |
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that by using |
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct that those two occurrences could also make use of the new helper function. This will make the code even more DRY |
18446: To review by AI
Note
Introduce a shared utility to evaluate physical exprs to arrays and replace duplicated logic across window, aggregate, join, projection, repartition, and sort operators.
utils::evaluate_expressions_to_arraysindatafusion_physical_expr_common.StandardWindowFunctionExpr::evaluate_argsandWindowExpr::evaluate_args.evaluate_manyand no-grouping aggregation path.exec,stream,utils) and symmetric hash join to use the util for key evaluation.ProjectionExec,RepartitionExechash partitioning, and sort cursor streams.RecordBatch) to support the new util.Written by Cursor Bugbot for commit ff2f57a. This will update automatically on new commits. Configure here.