20243: perf: improve performance of array_union/array_intersect with batched row conversion#237
20243: perf: improve performance of array_union/array_intersect with batched row conversion#237martin-augment wants to merge 8 commits intomainfrom
array_union/array_intersect with batched row conversion#237Conversation
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on a significant performance improvement for the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThis change introduces benchmarking infrastructure for array set operations in DataFusion. A new benchmark file with Criterion tests is added for array union and intersection operations, accompanied by a Cargo.toml configuration entry. The ✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code ReviewThis PR improves the performance of Strengths ✅
Issues & Recommendations 🔍1. Potential Panic on Empty Arrays (High Priority)Location: let last_offset = *result_offsets.last().unwrap();Issue: This will panic if Recommendation: let last_offset = result_offsets.last().copied().unwrap_or_else(|| OffsetSize::usize_as(0));2. Inconsistent Element Ordering (Medium Priority)Location: Test file changes in The test changes reveal that element ordering has changed (e.g.,
Recommendation:
3. Memory Allocation Strategy (Low Priority)Location: The initial capacity calculation for let initial_capacity = if IS_UNION {
rows_l.num_rows() // Could be too large
} else {
rows_l.num_rows().min(rows_r.num_rows()) // Could still be too large
};Issue: For union operations with low overlap, this could allocate ~2x more than needed. For intersect with low overlap, it could allocate significantly more than needed. Recommendation: Consider using a fraction of the total size or implement a growth strategy: let initial_capacity = if IS_UNION {
rows_l.num_rows() + rows_r.num_rows() / 2
} else {
(rows_l.num_rows().min(rows_r.num_rows())) / 2
};4. Missing Public API DocumentationLocation: The Recommendation: Add documentation: /// Implements the `array_intersect` function which returns elements
/// present in both input arrays without duplicates.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct ArrayIntersect {5. Benchmark CoverageStrengths: Good coverage of different overlap scenarios and array sizes. Missing scenarios:
Performance Considerations 💡The optimization looks sound:
Expected performance improvement: 2-10x depending on array sizes and overlap ratios. Security Concerns 🔒No security issues identified. The code:
Test Coverage 📊Existing tests: The changes to Recommendation: Add explicit tests for:
SummaryThis is a well-implemented performance optimization with clean code and good design choices. The main concerns are:
Overall assessment: Approve with minor revisions recommended ✅ Great work on the performance optimization! The const generic technique and batch conversion approach are excellent solutions. |
There was a problem hiding this comment.
Code Review
This pull request introduces significant performance improvements for array_union and array_intersect functions by leveraging batched row conversion and optimizing the internal set operations. The changes involve refactoring the generic_set_lists function to use a const generic inner loop, which helps avoid branching in hot code paths. Additionally, hash sets are now reused across iterations, reducing allocation overhead. The benchmark file array_set_ops.rs has been added to measure these performance gains, and the ArrayIntersect struct has been made public and given a Default implementation for consistency. Test outputs in array.slt have been updated to reflect the new, potentially reordered, results from the set operations, which is expected behavior when using hash-based collections.
🤖 Augment PR SummarySummary: Improves performance of Changes:
Technical Notes: The new implementation indexes into precomputed row buffers using list offsets and builds the output list by converting the accumulated rows back into Arrow arrays at the end. 🤖 Was this summary useful? React with 👍 or 👎 |
| let l_len = l_end - l_start; | ||
| let r_len = r_end - r_start; | ||
|
|
||
| // Select shorter side for lookup, longer side for probing |
There was a problem hiding this comment.
This refactor changes the output element ordering vs the prior .sorted().dedup() approach: array_union/array_intersect now emit in input iteration order, and array_intersect’s order may come from either side depending on relative list lengths. Since arrays are ordered and this is user-visible (as reflected in the updated SLT expectations), consider documenting the ordering semantics (or lack thereof) to avoid users relying on the previous ordering.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:valid-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct but this is intentional. The set operations (like union and intersect) do not guarantee any order. The sorting is removed intentionally to make the operations faster.
value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The |
value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! This is intentional! Set operations like union and intersect do not guarantee any order. The sorting is removed to optimise the speed of these operations. |
value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! The initial capacity is an estimation and it cannot be exact. Here the author prefers using capacity that will most probably will be enough for the whole operation and avoid further resizing and copying of items. This is especially true for the intersect operation. |
value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The ArrayIntersect struct is annotated with |
20243: To review by AI