perf: improve performance of array_union/array_intersect with batched row conversion (#234)
martin-augment wants to merge 4 commits into main.

Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a substantial performance enhancement for the `array_union`/`array_intersect` functions.

Highlights
Walkthrough (CodeRabbit): This pull request introduces benchmarking infrastructure for array set operations (union and intersection) in DataFusion. It adds a new benchmark file with Criterion tests covering different overlap scenarios, adds the corresponding benchmark configuration, and refactors the underlying set_ops implementation. The ArrayIntersect struct is made publicly accessible with a Default implementation, and the generic_set_lists function is reworked to use batch processing via RowConverter instead of per-element iteration. Test expectations are updated to reflect changes in result ordering and type annotation formatting.
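To make the walkthrough concrete, here is a std-only sketch of the batched idea. It is not the DataFusion implementation: the real code converts every list element to an arrow `Row` up front with a single `RowConverter`, while here plain `i64` values stand in for rows, and `union_one_list` is a hypothetical helper processing one (left, right) list pair while reusing the `seen` set the way the PR reuses its hash sets across loop iterations.

```rust
use std::collections::HashSet;

// Stand-in for one iteration of the batched union loop: dedup the left
// values, then the right values, in a single first-seen pass.
fn union_one_list(seen: &mut HashSet<i64>, left: &[i64], right: &[i64]) -> Vec<i64> {
    seen.clear(); // reuse the allocation instead of building a new set per list
    let mut out = Vec::new();
    for &v in left.iter().chain(right.iter()) {
        if seen.insert(v) {
            out.push(v); // first-seen order: left values first, then right
        }
    }
    out
}

fn main() {
    let mut seen = HashSet::new();
    let u = union_one_list(&mut seen, &[1, 2, 2], &[2, 3]);
    assert_eq!(u, vec![1, 2, 3]);
    println!("{u:?}");
}
```

Note how the output order is first-seen rather than sorted, which is why the `.slt` expectations change in this PR.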
Code Review - PR #234: Performance Improvement for array_union/array_intersect

Summary

This PR implements a performance optimization for `array_union`/`array_intersect` using batched row conversion.

✅ Strengths
🔍 Issues & Concerns

1. Visibility Change Requires Justification (datafusion/functions-nested/src/set_ops.rs:187)

```diff
-pub(super) struct ArrayIntersect {
+pub struct ArrayIntersect {
```

Issue: Changed `pub(super)` to `pub`, widening the struct's visibility.

Recommendation: Consider whether a narrower visibility would suffice if the wider scope is only needed for benchmarks.

2. Test Output Changes Without Explanation (datafusion/sqllogictest/test_files/array.slt)

The PR modifies expected test outputs in multiple places:

```diff
-[] [false, true] [false]
+[] [true, false] [false]
```

Issue: The order of elements in the result has changed. While set operations don't guarantee order, changing existing behavior could break user code that depends on the current ordering.

Questions:
3. Memory Growth Concern (datafusion/functions-nested/src/set_ops.rs:376)

```rust
let mut final_rows = Vec::with_capacity(rows_l.num_rows());
```

Issue: This pre-allocates based on the left array's total row count, which can over-allocate for intersect, where the result can never exceed the smaller input.

Recommendation: Consider sizing the allocation from the smaller of the two inputs for the intersect case.

4. Potential Performance Issue with HashSet Clearing (set_ops.rs:396-397)

```rust
seen.clear();
r_set.clear();
```

Concern: While clearing and reusing the hash sets avoids repeated allocations, `clear()` retains capacity, so after one very large list the sets keep their high-water-mark memory for the rest of the batch.

Recommendation: Consider benchmarking scenarios with very large individual arrays to ensure the reuse strategy is always beneficial.

5. Missing Error Handling Context

The code maintains existing error handling patterns, but error messages could be more descriptive for the new algorithm path.

🧪 Testing Recommendations
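The capacity point in issue 3 can be sketched in isolation. The `SetOp` enum below mirrors the one in set_ops.rs, but `result_capacity` is a hypothetical helper written for illustration, not code from the PR:

```rust
// Hypothetical sketch: choose the pre-allocation per operation. A union can
// contain every row of both inputs; an intersect can never exceed the smaller
// input, so capping at min() avoids over-allocating when one side is huge.
#[derive(Clone, Copy)]
enum SetOp {
    Union,
    Intersect,
}

fn result_capacity(op: SetOp, l_len: usize, r_len: usize) -> usize {
    match op {
        SetOp::Union => l_len + r_len,        // upper bound: all rows distinct
        SetOp::Intersect => l_len.min(r_len), // upper bound: smaller input
    }
}

fn main() {
    assert_eq!(result_capacity(SetOp::Intersect, 1_000, 10), 10);
    assert_eq!(result_capacity(SetOp::Union, 1_000, 10), 1_010);
}
```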
🔒 Security Considerations

No security concerns identified. The changes don't introduce:
📊 Performance Considerations

Positive:

Watch Out For:

Suggestion: Add a comment or benchmark results in the PR description showing the performance improvement to justify the optimization.

📝 Code Style & Best Practices

Minor Issues:
✅ Final Verdict

Overall Assessment: Good performance optimization with solid implementation. The main concerns are:

Recommendation: ✅ Approve with minor changes requested

Action Items:

Great work on the performance optimization! The batched approach is a solid improvement. 🚀
Code Review
This pull request significantly improves the performance of array_union and array_intersect by introducing batched row conversion, which is a great enhancement. The new implementation converts all list values to rows at once, avoiding repeated conversions within loops, and reuses HashSets to minimize allocations. I have a couple of suggestions for further optimization.
```rust
let overlap_positions = &positions[..overlap_count];

for i in 0..array_size {
    if overlap_positions.contains(&i) {
        values2.push(base + i as i64);
    } else {
        values2.push(base + array_size as i64 + i as i64);
    }
}
```
In the benchmark data generation, overlap_positions.contains(&i) performs a linear scan inside a loop, which can be inefficient for larger array_size. Using a HashSet for overlap_positions would provide O(1) average time complexity for lookups and improve the performance of the benchmark setup itself.
```rust
let overlap_positions: std::collections::HashSet<_> =
    positions[..overlap_count].iter().copied().collect();
for i in 0..array_size {
    if overlap_positions.contains(&i) {
        values2.push(base + i as i64);
    } else {
        values2.push(base + array_size as i64 + i as i64);
    }
}
```
value:good-but-wont-fix; category:bug; feedback: The Gemini AI reviewer is correct! However, the create_arrays_with_overlap() function is setup code that is not measured by the Criterion benchmark library, so its speed is not really important here.
```rust
SetOp::Intersect => {
    // Build a hash set from the right array as a lookup table,
    // then iterate the left array to find common elements.
    for idx in r_start..r_end {
        r_set.insert(rows_r.row(idx));
    }
    for idx in l_start..l_end {
        let row = rows_l.row(idx);
        if r_set.contains(&row) && seen.insert(row) {
            final_rows.push(row);
            count += 1;
        }
    }
}
```
For SetOp::Intersect, performance can be further optimized by building the hash set from the smaller of the two arrays and iterating over the larger one. This minimizes memory usage for the hash set and can reduce the number of hash operations, especially when there's a significant size difference between the two arrays.
Note that this may change which array's iteration order determines the output order. The current implementation's output order is based on the left array. Since set operations generally don't guarantee order, and the order has already changed in this PR, this should be an acceptable trade-off for the performance gain.
```rust
SetOp::Intersect => {
    let l_len = l_end - l_start;
    let r_len = r_end - r_start;
    let (build_rows, build_start, build_end, iter_rows, iter_start, iter_end) =
        if l_len <= r_len {
            (&rows_l, l_start, l_end, &rows_r, r_start, r_end)
        } else {
            (&rows_r, r_start, r_end, &rows_l, l_start, l_end)
        };
    // Build the hash set from the smaller array for a faster lookup.
    for idx in build_start..build_end {
        r_set.insert(build_rows.row(idx));
    }
    // Iterate through the larger array to find common elements.
    for idx in iter_start..iter_end {
        let row = iter_rows.row(idx);
        if r_set.contains(&row) && seen.insert(row) {
            final_rows.push(row);
            count += 1;
        }
    }
}
```
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! Building the smaller HashSet occupies less memory, and fewer hashing operations are executed when inserting the elements. Iterating over the longer collection is inevitable either way.
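The build-side selection above can be shown with a std-only sketch. The `intersect` function below is a hypothetical stand-in in which `i64` values play the role of arrow rows; it hashes the smaller slice and probes with the larger one, matching the structure of Gemini's suggestion:

```rust
use std::collections::HashSet;

// Hash the smaller input (build side), probe with the larger (probe side).
// `seen` deduplicates the output, mirroring the PR's `seen` set.
fn intersect(left: &[i64], right: &[i64]) -> Vec<i64> {
    let (build, probe) = if left.len() <= right.len() {
        (left, right)
    } else {
        (right, left)
    };
    let build_set: HashSet<i64> = build.iter().copied().collect();
    let mut seen = HashSet::new();
    let mut out = Vec::with_capacity(build.len()); // result <= smaller input
    for &v in probe {
        if build_set.contains(&v) && seen.insert(v) {
            out.push(v);
        }
    }
    out
}

fn main() {
    // Output order follows the probe (larger) side, as the comment warns.
    assert_eq!(intersect(&[5, 1, 3], &[3, 5]), vec![5, 3]);
}
```

This also demonstrates the trade-off Gemini mentions: which side determines output order now depends on the relative sizes of the inputs.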
🤖 Augment PR Summary

Summary: Improves the performance of `array_union`/`array_intersect` with batched row conversion.

Changes:

Technical Notes: The new approach builds a single RowConverter and converts all list values to rows in one batch, instead of converting per element inside the loop.

🤖 Was this summary useful? React with 👍 or 👎
```rust
// Reuse hash sets across iterations
let mut seen = HashSet::new();
let mut r_set = HashSet::new();
for i in 0..l.len() {
```
generic_set_lists now indexes r_offsets[i]/r_offsets[i+1] using for i in 0..l.len(), whereas the previous implementation used l.iter().zip(r.iter()). If l.len() != r.len() ever happens (e.g., via broadcasting/mismatched inputs), this will panic with an out-of-bounds offset access.
Severity: medium
value:annoying; category:bug; feedback: The Augment AI reviewer is not correct! Before accessing the offsets by index, there is a check whether an element exists at that index at all, which returns early if there is no such element, i.e. if the element is null.
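The guard described in that rebuttal can be sketched without arrow. The `list_lengths` function below is a hypothetical stand-in: `Option<()>` models the list array's null bitmap, and the `continue` shows how a null element never reaches the offset indexing:

```rust
// Check nullness before touching offsets[i], mirroring the early return
// the PR's loop performs for null list elements.
fn list_lengths(elements: &[Option<()>], offsets: &[i32]) -> Vec<Option<i32>> {
    let mut out = Vec::with_capacity(elements.len());
    for i in 0..elements.len() {
        if elements[i].is_none() {
            out.push(None); // early continue: offsets are never indexed here
            continue;
        }
        out.push(Some(offsets[i + 1] - offsets[i]));
    }
    out
}

fn main() {
    let lens = list_lengths(&[Some(()), None, Some(())], &[0, 2, 2, 5]);
    assert_eq!(lens, vec![Some(2), None, Some(3)]);
}
```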
```rust
seen.clear();
r_set.clear();

match set_op {
```
This change removes the per-list .sorted() step, so array_union/array_intersect output ordering is now based on first-seen iteration order (left then right) rather than sorted order (as reflected in the updated .slt expectations). Can you confirm this ordering change is intended and matches the documented/expected function contract?
Severity: medium
value:valid-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct that this changes the result order, but since these are set operations (union and intersect) there is no guaranteed order, and omitting the .sorted() call saves some time. The pull request author has explained this in the original PR description.
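One practical consequence of dropping `.sorted()` is for test expectations: tests that pin exact element order become fragile. A small, illustrative helper (not from the PR) shows how such tests could compare results order-insensitively instead:

```rust
// Compare two slices as multisets by sorting copies, so a test passes
// whether the implementation emits sorted or first-seen order.
fn same_elements(a: &[i64], b: &[i64]) -> bool {
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_unstable();
    b.sort_unstable();
    a == b
}

fn main() {
    // first-seen order vs old sorted order: still the same set
    assert!(same_elements(&[2, 1, 3], &[1, 2, 3]));
    assert!(!same_elements(&[1, 2], &[1, 2, 2]));
}
```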
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct!
value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is correct that this changes the result order, but since these are set operations (union and intersect) there is no guaranteed order, and omitting the .sorted() call saves some time. The pull request author has explained this in the original PR description.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! There is no need to reserve more capacity than the smaller array for the intersect operation, so this could be optimized.