20243: perf: improve performance of `array_union`/`array_intersect` with batched row conversion by martin-augment · Pull Request #237 · martin-augment/datafusion

martin-augment · 2026-02-10T20:28:49Z

20243: To review by AI

gemini-code-assist · 2026-02-10T20:29:11Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a significant performance improvement for the array_union and array_intersect scalar functions. The core optimization involves a strategic shift from row-by-row processing to a batched row conversion approach, which reduces overhead. Additionally, the internal logic for these set operations has been refactored into a highly optimized generic loop, leveraging Rust's const generics to minimize runtime branching. To validate and monitor these performance gains, new benchmarks have been introduced.

Highlights

Performance Optimization for Array Set Operations: The array_union and array_intersect functions now leverage batched row conversion using RowConverter to significantly improve performance by processing values in bulk rather than individually.
Refactored Set Logic with Const Generics: A new generic_set_loop function was introduced, utilizing a const IS_UNION: bool generic parameter to eliminate conditional branching within the hot loop, further enhancing execution speed for set operations.
New Benchmarks for Set Operations: A dedicated benchmark file (array_set_ops.rs) was added to measure and track the performance of array_union and array_intersect under various conditions, including different array sizes and overlap ratios.
Visibility and Default Trait for ArrayIntersect: The ArrayIntersect struct's visibility was promoted to pub and a Default implementation was added, enabling its use in the new benchmarking framework.
Updated SQL Logic Tests: Existing SQL logic tests for array_intersect were updated to match the new, potentially reordered, output of the optimized implementation, ensuring correctness with the performance improvements.

Changelog

datafusion/functions-nested/Cargo.toml
- Added array_set_ops as a new benchmark target.
datafusion/functions-nested/benches/array_set_ops.rs
- Added new Criterion benchmarks for array_union and array_intersect functions, testing different array sizes and overlap ratios.
datafusion/functions-nested/src/set_ops.rs
- Refactored generic_set_lists to use batched row conversion and delegated the core logic to a new generic_set_loop function, which uses const generics for performance.
- Changed ArrayIntersect visibility to pub and implemented Default.
datafusion/sqllogictest/test_files/array.slt
- Updated expected output for array_intersect queries to reflect changes in element order due to the optimized implementation.

Activity

The pull request was created with a note "To review by AI", indicating it's ready for initial automated review.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-02-10T20:29:11Z

Walkthrough

This change introduces benchmarking infrastructure for array set operations in DataFusion. A new benchmark file with Criterion tests is added for array union and intersection operations, accompanied by a Cargo.toml configuration entry. The ArrayIntersect struct is changed from module-private to public visibility with a new Default implementation. The internal generic_set_lists function is refactored to use a batched row-based processing approach instead of per-element operations. Test expectations in the array test file are updated to reflect changes in element ordering for array intersection results.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pr-20243-2026-02-10-20-28-44

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

datafusion/functions-nested/src/set_ops.rs (1)

438-464: Intersection result ordering now depends on which side is longer – verify this is acceptable.

When l_len >= r_len, the probe side is rows_l (left), so output order follows the left array's element order. When l_len < r_len, output follows the right array's order. This produces correct set-intersection results but is a behavioral change compared to always iterating the left side. The sqllogictest updates confirm awareness of this, but any downstream consumers relying on a specific element ordering from array_intersect could be surprised.

If stable ordering is desired (e.g. always follow left-side order), the lookup/probe assignment could always use r as lookup and l as probe, at the cost of a slightly larger lookup set when l is shorter.
datafusion/functions-nested/benches/array_set_ops.rs (1)
44-61: Consider matching field nullability to actual array schema.

The return_field and arg_fields are constructed with nullable: false, but GenericListArray results from set operations can contain nulls (and the inner field is created as nullable in the implementation). For benchmarking this is unlikely to cause an issue since the UDF doesn't validate this metadata strictly, but aligning it with the actual schema (nullable: true) would make the benchmark more representative.
Proposed change
             arg_fields: vec![
-                Field::new("arr1", array1.data_type().clone(), false).into(),
-                Field::new("arr2", array2.data_type().clone(), false).into(),
+                Field::new("arr1", array1.data_type().clone(), true).into(),
+                Field::new("arr2", array2.data_type().clone(), true).into(),
             ],
             number_rows: NUM_ROWS,
-            return_field: Field::new("result", array1.data_type().clone(), false).into(),
+            return_field: Field::new("result", array1.data_type().clone(), true).into(),

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-02-10T20:30:34Z

Code Review

This PR improves the performance of array_union and array_intersect by batching row conversions and optimizing the inner loop logic. Overall, the implementation is solid with good performance optimizations. Here are my observations:

Strengths ✅

Excellent Performance Optimization: Batch converting all values to rows upfront (lines 368-370 in set_ops.rs) eliminates redundant conversions, which was a major bottleneck in the original implementation.
Smart Const Generic Usage: Using const IS_UNION: bool generic parameter in generic_set_loop (line 384) is a clever technique to avoid branching in the hot loop while maintaining code reuse.
HashSet Reuse: Reusing seen and lookup_set across iterations (lines 408-409, 423, 448) reduces allocations significantly.
Optimized Intersect Logic: For intersect operations, the code intelligently selects the shorter side for the lookup table (lines 443-447), which minimizes memory usage and hash operations.
Comprehensive Benchmarks: The new benchmark suite with high/low overlap scenarios and multiple array sizes provides good coverage for performance testing.

Issues & Recommendations 🔍

1. Potential Panic on Empty Arrays (High Priority)

Location: set_ops.rs:411

let last_offset = *result_offsets.last().unwrap();

Issue: This will panic if result_offsets is empty, though this shouldn't happen given the initialization at line 396. Consider using a more defensive approach or add a debug assertion.

Recommendation:

let last_offset = result_offsets.last().copied().unwrap_or_else(|| OffsetSize::usize_as(0));

2. Inconsistent Element Ordering (Medium Priority)

Location: Test file changes in array.slt

The test changes reveal that element ordering has changed (e.g., [false, true] → [true, false], [arrow, rust] → [rust, arrow]). This indicates:

The new implementation doesn't maintain insertion order the same way as before
This could be a breaking change for users who rely on specific ordering

Recommendation:

Document this behavior change in the PR description
Consider if stable ordering is important for this function
If ordering matters, you may need to sort results or maintain insertion order explicitly

3. Memory Allocation Strategy (Low Priority)

Location: set_ops.rs:397-403

The initial capacity calculation for final_rows could be improved:

let initial_capacity = if IS_UNION {
    rows_l.num_rows()  // Could be too large
} else {
    rows_l.num_rows().min(rows_r.num_rows())  // Could still be too large
};

Issue: For union operations with low overlap, this could allocate ~2x more than needed. For intersect with low overlap, it could allocate significantly more than needed.

Recommendation: Consider using a fraction of the total size or implement a growth strategy:

let initial_capacity = if IS_UNION {
    rows_l.num_rows() + rows_r.num_rows() / 2
} else {
    (rows_l.num_rows().min(rows_r.num_rows())) / 2
};

4. Missing Public API Documentation

Location: set_ops.rs:187

The ArrayIntersect struct was changed to pub (from pub(super)), making it part of the public API for benchmarking. This is fine, but the struct should have doc comments explaining its purpose.

Recommendation: Add documentation:

/// Implements the `array_intersect` function which returns elements 
/// present in both input arrays without duplicates.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct ArrayIntersect {

5. Benchmark Coverage

Strengths: Good coverage of different overlap scenarios and array sizes.

Missing scenarios:

Arrays with null elements within lists
Very large arrays (1000+ elements)
Arrays with different data types (currently only Int64)
Edge cases: empty arrays, single element arrays

Performance Considerations 💡

The optimization looks sound:

O(n) row conversion upfront vs O(n²) in the loop is a major win
HashSet reuse eliminates allocation overhead
Const generic eliminates branch misprediction costs

Expected performance improvement: 2-10x depending on array sizes and overlap ratios.

Security Concerns 🔒

No security issues identified. The code:

Properly handles null values
Uses safe Rust constructs
Doesn't expose unsafe operations
Validates input types appropriately

Test Coverage 📊

Existing tests: The changes to array.slt show that existing test cases pass (with ordering differences noted).

Recommendation: Add explicit tests for:

Edge cases (empty arrays, single elements)
Null handling within array elements
Large arrays to verify no performance regression
Memory usage tests to ensure bounded allocations

Summary

This is a well-implemented performance optimization with clean code and good design choices. The main concerns are:

MUST FIX: Document the ordering change - this could affect users
SHOULD FIX: Make the panic-prone unwrap() more defensive
NICE TO HAVE: Improve memory allocation strategy and expand benchmark coverage

Overall assessment: Approve with minor revisions recommended ✅

Great work on the performance optimization! The const generic technique and batch conversion approach are excellent solutions.

gemini-code-assist

Code Review

This pull request introduces significant performance improvements for array_union and array_intersect functions by leveraging batched row conversion and optimizing the internal set operations. The changes involve refactoring the generic_set_lists function to use a const generic inner loop, which helps avoid branching in hot code paths. Additionally, hash sets are now reused across iterations, reducing allocation overhead. The benchmark file array_set_ops.rs has been added to measure these performance gains, and the ArrayIntersect struct has been made public and given a Default implementation for consistency. Test outputs in array.slt have been updated to reflect the new, potentially reordered, results from the set operations, which is expected behavior when using hash-based collections.

augmentcode · 2026-02-10T20:55:54Z

🤖 Augment PR Summary

Summary: Improves performance of array_union / array_intersect by batching Arrow row conversion and streamlining the per-row set-operation loop.

Changes:

Refactored set-ops evaluation to convert all list values to arrow::row rows in one batch rather than per-row conversions.
Introduced generic_set_loop with a const-generic (IS_UNION) to avoid branching inside the hot loop.
Reused per-row HashSet allocations across iterations for union/intersect distinctness and lookup.
Adjusted ArrayIntersect visibility and added a Default impl for easier construction in external contexts.
Added a new Criterion benchmark (array_set_ops) to measure union/intersect performance across sizes and overlap ratios.
Updated SQL logic test expectations for array_intersect where the result element ordering changed.

Technical Notes: The new implementation indexes into precomputed row buffers using list offsets and builds the output list by converting the accumulated rows back into Arrow arrays at the end.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 1 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-02-10T20:55:55Z

datafusion/functions-nested/src/set_ops.rs

+            let l_len = l_end - l_start;
+            let r_len = r_end - r_start;
+
+            // Select shorter side for lookup, longer side for probing


This refactor changes the output element ordering vs the prior .sorted().dedup() approach: array_union/array_intersect now emit in input iteration order, and array_intersect’s order may come from either side depending on relative list lengths. Since arrays are ordered and this is user-visible (as reflected in the updated SLT expectations), consider documenting the ordering semantics (or lack thereof) to avoid users relying on the previous ordering.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:valid-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct but this is intentional. The set operations (like union and intersect) do not guarantee any order. The sorting is removed intentionally to make the operations faster.

martin-augment · 2026-02-11T07:04:59Z

1. Potential Panic on Empty Arrays (High Priority)

Location: set_ops.rs:411
let last_offset = *result_offsets.last().unwrap();
Issue: This will panic if result_offsets is empty, though this shouldn't happen given the initialization at line 396. Consider using a more defensive approach or add a debug assertion.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The result_offsets is fed with OffsetSize::usize_as(0) right after its creation, so there is no chance that it is empty => no chance to panic.

martin-augment · 2026-02-11T07:07:58Z

2. Inconsistent Element Ordering (Medium Priority)

Location: Test file changes in array.slt

The test changes reveal that element ordering has changed (e.g., [false, true] → [true, false], [arrow, rust] → [rust, arrow]). This indicates:

The new implementation doesn't maintain insertion order the same way as before

This could be a breaking change for users who rely on specific ordering

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! This is intentional! Set operations like union and intersect do not guarantee any order. The sorting is removed to optimise the speed of these operations.

martin-augment · 2026-02-11T07:13:26Z

3. Memory Allocation Strategy (Low Priority)

Location: set_ops.rs:397-403

The initial capacity calculation for final_rows could be improved:
let initial_capacity = if IS_UNION {
    rows_l.num_rows()  // Could be too large
} else {
    rows_l.num_rows().min(rows_r.num_rows())  // Could still be too large
};
Issue: For union operations with low overlap, this could allocate ~2x more than needed. For intersect with low overlap, it could allocate significantly more than needed.

Recommendation: Consider using a fraction of the total size or implement a growth strategy:
let initial_capacity = if IS_UNION {
    rows_l.num_rows() + rows_r.num_rows() / 2
} else {
    (rows_l.num_rows().min(rows_r.num_rows())) / 2
};

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! The initial capacity is an estimation and it cannot be exact. Here the author prefers using capacity that will most probably will be enough for the whole operation and avoid further resizing and copying of items. This is especially true for the intersect operation.

martin-augment · 2026-02-11T07:15:02Z

4. Missing Public API Documentation

Location: set_ops.rs:187

The ArrayIntersect struct was changed to pub (from pub(super)), making it part of the public API for benchmarking. This is fine, but the struct should have doc comments explaining its purpose.

Recommendation: Add documentation:
/// Implements the `array_intersect` function which returns elements 
/// present in both input arrays without duplicates.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct ArrayIntersect {

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The ArrayIntersect struct is annotated with #[user_api] attribute that is used for the generations of the end user documentation.

lyne7-sc and others added 8 commits February 9, 2026 22:14

perf array_union/array_intersect

c396950

perf array_union/array_intersect

01affff

clippy

029bfb3

license

0eb66b9

Merge branch 'apache:main' into perf/array_set

605cf42

simplify benches

8284bbb

simplify benches

0e2dc47

optimize by suggestions.

e718bc7

gemini-code-assist bot reviewed Feb 10, 2026

View reviewed changes

augmentcode bot reviewed Feb 10, 2026

View reviewed changes

Conversation

martin-augment commented Feb 10, 2026

Uh oh!

gemini-code-assist bot commented Feb 10, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

claude bot commented Feb 10, 2026

Code Review

Strengths ✅

Issues & Recommendations 🔍

1. Potential Panic on Empty Arrays (High Priority)

2. Inconsistent Element Ordering (Medium Priority)

3. Memory Allocation Strategy (Low Priority)

4. Missing Public API Documentation

5. Benchmark Coverage

Performance Considerations 💡

Security Concerns 🔒

Test Coverage 📊

Summary

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

augmentcode bot commented Feb 10, 2026

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martin-augment commented Feb 11, 2026

1. Potential Panic on Empty Arrays (High Priority)

Uh oh!

martin-augment commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

2. Inconsistent Element Ordering (Medium Priority)

Uh oh!

martin-augment commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

3. Memory Allocation Strategy (Low Priority)

Uh oh!

martin-augment commented Feb 11, 2026

4. Missing Public API Documentation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Feb 10, 2026 •

edited

Loading

martin-augment Feb 11, 2026 •

edited

Loading

martin-augment commented Feb 11, 2026 •

edited

Loading

martin-augment commented Feb 11, 2026 •

edited

Loading