Skip to content

20243: perf: improve performance of array_union/array_intersect with batched row conversion#237

Open
martin-augment wants to merge 8 commits intomainfrom
pr-20243-2026-02-10-20-28-44
Open

20243: perf: improve performance of array_union/array_intersect with batched row conversion#237
martin-augment wants to merge 8 commits intomainfrom
pr-20243-2026-02-10-20-28-44

Conversation

@martin-augment
Copy link
Owner

20243: To review by AI

@gemini-code-assist
Copy link

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a significant performance improvement for the array_union and array_intersect scalar functions. The core optimization involves a strategic shift from row-by-row processing to a batched row conversion approach, which reduces overhead. Additionally, the internal logic for these set operations has been refactored into a highly optimized generic loop, leveraging Rust's const generics to minimize runtime branching. To validate and monitor these performance gains, new benchmarks have been introduced.

Highlights

  • Performance Optimization for Array Set Operations: The array_union and array_intersect functions now leverage batched row conversion using RowConverter to significantly improve performance by processing values in bulk rather than individually.
  • Refactored Set Logic with Const Generics: A new generic_set_loop function was introduced, utilizing a const IS_UNION: bool generic parameter to eliminate conditional branching within the hot loop, further enhancing execution speed for set operations.
  • New Benchmarks for Set Operations: A dedicated benchmark file (array_set_ops.rs) was added to measure and track the performance of array_union and array_intersect under various conditions, including different array sizes and overlap ratios.
  • Visibility and Default Trait for ArrayIntersect: The ArrayIntersect struct's visibility was promoted to pub and a Default implementation was added, enabling its use in the new benchmarking framework.
  • Updated SQL Logic Tests: Existing SQL logic tests for array_intersect were updated to match the new, potentially reordered, output of the optimized implementation, ensuring correctness with the performance improvements.
Changelog
  • datafusion/functions-nested/Cargo.toml
    • Added array_set_ops as a new benchmark target.
  • datafusion/functions-nested/benches/array_set_ops.rs
    • Added new Criterion benchmarks for array_union and array_intersect functions, testing different array sizes and overlap ratios.
  • datafusion/functions-nested/src/set_ops.rs
    • Refactored generic_set_lists to use batched row conversion and delegated the core logic to a new generic_set_loop function, which uses const generics for performance.
    • Changed ArrayIntersect visibility to pub and implemented Default.
  • datafusion/sqllogictest/test_files/array.slt
    • Updated expected output for array_intersect queries to reflect changes in element order due to the optimized implementation.
Activity
  • The pull request was created with a note "To review by AI", indicating it's ready for initial automated review.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link

coderabbitai bot commented Feb 10, 2026

Walkthrough

This change introduces benchmarking infrastructure for array set operations in DataFusion. A new benchmark file with Criterion tests is added for array union and intersection operations, accompanied by a Cargo.toml configuration entry. The ArrayIntersect struct is changed from module-private to public visibility with a new Default implementation. The internal generic_set_lists function is refactored to use a batched row-based processing approach instead of per-element operations. Test expectations in the array test file are updated to reflect changes in element ordering for array intersection results.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-20243-2026-02-10-20-28-44

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
datafusion/functions-nested/src/set_ops.rs (1)

438-464: Intersection result ordering now depends on which side is longer – verify this is acceptable.

When l_len >= r_len, the probe side is rows_l (left), so output order follows the left array's element order. When l_len < r_len, output follows the right array's order. This produces correct set-intersection results but is a behavioral change compared to always iterating the left side. The sqllogictest updates confirm awareness of this, but any downstream consumers relying on a specific element ordering from array_intersect could be surprised.

If stable ordering is desired (e.g. always follow left-side order), the lookup/probe assignment could always use r as lookup and l as probe, at the cost of a slightly larger lookup set when l is shorter.

datafusion/functions-nested/benches/array_set_ops.rs (1)

44-61: Consider matching field nullability to actual array schema.

The return_field and arg_fields are constructed with nullable: false, but GenericListArray results from set operations can contain nulls (and the inner field is created as nullable in the implementation). For benchmarking this is unlikely to cause an issue since the UDF doesn't validate this metadata strictly, but aligning it with the actual schema (nullable: true) would make the benchmark more representative.

Proposed change
             arg_fields: vec![
-                Field::new("arr1", array1.data_type().clone(), false).into(),
-                Field::new("arr2", array2.data_type().clone(), false).into(),
+                Field::new("arr1", array1.data_type().clone(), true).into(),
+                Field::new("arr2", array2.data_type().clone(), true).into(),
             ],
             number_rows: NUM_ROWS,
-            return_field: Field::new("result", array1.data_type().clone(), false).into(),
+            return_field: Field::new("result", array1.data_type().clone(), true).into(),

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link

claude bot commented Feb 10, 2026

Code Review

This PR improves the performance of array_union and array_intersect by batching row conversions and optimizing the inner loop logic. Overall, the implementation is solid with good performance optimizations. Here are my observations:

Strengths ✅

  1. Excellent Performance Optimization: Batch converting all values to rows upfront (lines 368-370 in set_ops.rs) eliminates redundant conversions, which was a major bottleneck in the original implementation.

  2. Smart Const Generic Usage: Using const IS_UNION: bool generic parameter in generic_set_loop (line 384) is a clever technique to avoid branching in the hot loop while maintaining code reuse.

  3. HashSet Reuse: Reusing seen and lookup_set across iterations (lines 408-409, 423, 448) reduces allocations significantly.

  4. Optimized Intersect Logic: For intersect operations, the code intelligently selects the shorter side for the lookup table (lines 443-447), which minimizes memory usage and hash operations.

  5. Comprehensive Benchmarks: The new benchmark suite with high/low overlap scenarios and multiple array sizes provides good coverage for performance testing.

Issues & Recommendations 🔍

1. Potential Panic on Empty Arrays (High Priority)

Location: set_ops.rs:411

let last_offset = *result_offsets.last().unwrap();

Issue: This will panic if result_offsets is empty, though this shouldn't happen given the initialization at line 396. Consider using a more defensive approach or add a debug assertion.

Recommendation:

let last_offset = result_offsets.last().copied().unwrap_or_else(|| OffsetSize::usize_as(0));

2. Inconsistent Element Ordering (Medium Priority)

Location: Test file changes in array.slt

The test changes reveal that element ordering has changed (e.g., [false, true][true, false], [arrow, rust][rust, arrow]). This indicates:

  • The new implementation doesn't maintain insertion order the same way as before
  • This could be a breaking change for users who rely on specific ordering

Recommendation:

  • Document this behavior change in the PR description
  • Consider if stable ordering is important for this function
  • If ordering matters, you may need to sort results or maintain insertion order explicitly

3. Memory Allocation Strategy (Low Priority)

Location: set_ops.rs:397-403

The initial capacity calculation for final_rows could be improved:

let initial_capacity = if IS_UNION {
    rows_l.num_rows()  // Could be too large
} else {
    rows_l.num_rows().min(rows_r.num_rows())  // Could still be too large
};

Issue: For union operations with low overlap, this could allocate ~2x more than needed. For intersect with low overlap, it could allocate significantly more than needed.

Recommendation: Consider using a fraction of the total size or implement a growth strategy:

let initial_capacity = if IS_UNION {
    rows_l.num_rows() + rows_r.num_rows() / 2
} else {
    (rows_l.num_rows().min(rows_r.num_rows())) / 2
};

4. Missing Public API Documentation

Location: set_ops.rs:187

The ArrayIntersect struct was changed to pub (from pub(super)), making it part of the public API for benchmarking. This is fine, but the struct should have doc comments explaining its purpose.

Recommendation: Add documentation:

/// Implements the `array_intersect` function which returns elements 
/// present in both input arrays without duplicates.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct ArrayIntersect {

5. Benchmark Coverage

Strengths: Good coverage of different overlap scenarios and array sizes.

Missing scenarios:

  • Arrays with null elements within lists
  • Very large arrays (1000+ elements)
  • Arrays with different data types (currently only Int64)
  • Edge cases: empty arrays, single element arrays

Performance Considerations 💡

The optimization looks sound:

  • O(n) row conversion upfront vs O(n²) in the loop is a major win
  • HashSet reuse eliminates allocation overhead
  • Const generic eliminates branch misprediction costs

Expected performance improvement: 2-10x depending on array sizes and overlap ratios.

Security Concerns 🔒

No security issues identified. The code:

  • Properly handles null values
  • Uses safe Rust constructs
  • Doesn't expose unsafe operations
  • Validates input types appropriately

Test Coverage 📊

Existing tests: The changes to array.slt show that existing test cases pass (with ordering differences noted).

Recommendation: Add explicit tests for:

  • Edge cases (empty arrays, single elements)
  • Null handling within array elements
  • Large arrays to verify no performance regression
  • Memory usage tests to ensure bounded allocations

Summary

This is a well-implemented performance optimization with clean code and good design choices. The main concerns are:

  1. MUST FIX: Document the ordering change - this could affect users
  2. SHOULD FIX: Make the panic-prone unwrap() more defensive
  3. NICE TO HAVE: Improve memory allocation strategy and expand benchmark coverage

Overall assessment: Approve with minor revisions recommended

Great work on the performance optimization! The const generic technique and batch conversion approach are excellent solutions.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant performance improvements for array_union and array_intersect functions by leveraging batched row conversion and optimizing the internal set operations. The changes involve refactoring the generic_set_lists function to use a const generic inner loop, which helps avoid branching in hot code paths. Additionally, hash sets are now reused across iterations, reducing allocation overhead. The benchmark file array_set_ops.rs has been added to measure these performance gains, and the ArrayIntersect struct has been made public and given a Default implementation for consistency. Test outputs in array.slt have been updated to reflect the new, potentially reordered, results from the set operations, which is expected behavior when using hash-based collections.

@augmentcode
Copy link

augmentcode bot commented Feb 10, 2026

🤖 Augment PR Summary

Summary: Improves performance of array_union / array_intersect by batching Arrow row conversion and streamlining the per-row set-operation loop.

Changes:

  • Refactored set-ops evaluation to convert all list values to arrow::row rows in one batch rather than per-row conversions.
  • Introduced generic_set_loop with a const-generic (IS_UNION) to avoid branching inside the hot loop.
  • Reused per-row HashSet allocations across iterations for union/intersect distinctness and lookup.
  • Adjusted ArrayIntersect visibility and added a Default impl for easier construction in external contexts.
  • Added a new Criterion benchmark (array_set_ops) to measure union/intersect performance across sizes and overlap ratios.
  • Updated SQL logic test expectations for array_intersect where the result element ordering changed.

Technical Notes: The new implementation indexes into precomputed row buffers using list offsets and builds the output list by converting the accumulated rows back into Arrow arrays at the end.

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

let l_len = l_end - l_start;
let r_len = r_end - r_start;

// Select shorter side for lookup, longer side for probing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactor changes the output element ordering vs the prior .sorted().dedup() approach: array_union/array_intersect now emit in input iteration order, and array_intersect’s order may come from either side depending on relative list lengths. Since arrays are ordered and this is user-visible (as reflected in the updated SLT expectations), consider documenting the ordering semantics (or lack thereof) to avoid users relying on the previous ordering.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Owner Author

@martin-augment martin-augment Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:valid-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct but this is intentional. The set operations (like union and intersect) do not guarantee any order. The sorting is removed intentionally to make the operations faster.

@martin-augment
Copy link
Owner Author

1. Potential Panic on Empty Arrays (High Priority)

Location: set_ops.rs:411

let last_offset = *result_offsets.last().unwrap();

Issue: This will panic if result_offsets is empty, though this shouldn't happen given the initialization at line 396. Consider using a more defensive approach or add a debug assertion.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The result_offsets is fed with OffsetSize::usize_as(0) right after its creation, so there is no chance that it is empty => no chance to panic.

@martin-augment
Copy link
Owner Author

martin-augment commented Feb 11, 2026

2. Inconsistent Element Ordering (Medium Priority)

Location: Test file changes in array.slt

The test changes reveal that element ordering has changed (e.g., [false, true][true, false], [arrow, rust][rust, arrow]). This indicates:

  • The new implementation doesn't maintain insertion order the same way as before
  • This could be a breaking change for users who rely on specific ordering

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! This is intentional! Set operations like union and intersect do not guarantee any order. The sorting is removed to optimise the speed of these operations.

@martin-augment
Copy link
Owner Author

martin-augment commented Feb 11, 2026

3. Memory Allocation Strategy (Low Priority)

Location: set_ops.rs:397-403

The initial capacity calculation for final_rows could be improved:

let initial_capacity = if IS_UNION {
    rows_l.num_rows()  // Could be too large
} else {
    rows_l.num_rows().min(rows_r.num_rows())  // Could still be too large
};

Issue: For union operations with low overlap, this could allocate ~2x more than needed. For intersect with low overlap, it could allocate significantly more than needed.

Recommendation: Consider using a fraction of the total size or implement a growth strategy:

let initial_capacity = if IS_UNION {
    rows_l.num_rows() + rows_r.num_rows() / 2
} else {
    (rows_l.num_rows().min(rows_r.num_rows())) / 2
};

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is not correct! The initial capacity is an estimation and it cannot be exact. Here the author prefers using capacity that will most probably will be enough for the whole operation and avoid further resizing and copying of items. This is especially true for the intersect operation.

@martin-augment
Copy link
Owner Author

4. Missing Public API Documentation

Location: set_ops.rs:187

The ArrayIntersect struct was changed to pub (from pub(super)), making it part of the public API for benchmarking. This is fine, but the struct should have doc comments explaining its purpose.

Recommendation: Add documentation:

/// Implements the `array_intersect` function which returns elements 
/// present in both input arrays without duplicates.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct ArrayIntersect {

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The ArrayIntersect struct is annotated with #[user_api] attribute that is used for the generations of the end user documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants