
20243: perf: improve performance of array_union/array_intersect with batched row conversion#234

Open
martin-augment wants to merge 4 commits into main from
pr-20243-2026-02-09-21-53-04

Conversation

@martin-augment
Owner

20243: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a substantial performance enhancement for the array_union and array_intersect scalar functions in DataFusion. The core of the improvement lies in refactoring the internal logic to utilize batched row conversion, which significantly reduces overhead compared to the previous row-by-row processing. To validate these optimizations, a new set of performance benchmarks has been added, providing a robust way to track the efficiency of these array set operations.

Highlights

  • Performance Optimization for Array Set Operations: The array_union and array_intersect functions have been significantly optimized by transitioning from an iterative, per-row processing model to a batched row conversion approach. This change is expected to improve performance, especially for large datasets.
  • Batched Row Conversion Implementation: The generic_set_lists function now converts all values from the input ListArrays into Row objects in a single batch using RowConverter. This allows for more efficient set operations (union and intersect) by leveraging HashSets for uniqueness and membership checks across the entire batch.
  • New Performance Benchmarks: A new benchmark suite (array_set_ops) has been introduced using criterion to measure the performance of array_union and array_intersect under varying conditions, including different array sizes and overlap ratios.
  • ArrayIntersect Visibility and Default Implementation: The ArrayIntersect struct's visibility has been promoted from pub(super) to pub, and a Default implementation has been added, likely to facilitate its use in the new benchmarks or other public contexts.
  • Updated SQL Logic Tests: Existing sqllogictest files have been updated to reflect changes in the output order of array_intersect results. While the set of elements remains correct, the order may differ due to the new implementation, which is acceptable for set operations.
Changelog
  • datafusion/functions-nested/Cargo.toml
    • Added a new [[bench]] entry for array_set_ops to enable running the new performance benchmarks.
  • datafusion/functions-nested/benches/array_set_ops.rs
    • This new file introduces comprehensive criterion benchmarks for array_union and array_intersect.
    • It includes functions to create test data with controlled array sizes and overlap ratios, and measures performance for both high and low overlap scenarios.
  • datafusion/functions-nested/src/set_ops.rs
    • The ArrayIntersect struct is now pub and implements Default.
    • The generic_set_lists function, which implements the core logic for array_union and array_intersect, has been rewritten.
    • It now performs a batched conversion of all list values to Row objects at the beginning, and the set operations are performed using HashSets on these batched rows, leading to more efficient processing and result construction.
  • datafusion/sqllogictest/test_files/array.slt
    • Updated expected results for several array_intersect queries. The change reflects a different ordering of elements in the output arrays, which is a valid outcome for set operations where order is not guaranteed.
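The batched approach described in the changelog can be sketched with std-only stand-ins (a simplified illustration, not the PR's actual code: plain i64 values stand in for converted Rows, and function/parameter names are hypothetical). Each row's values are delimited by Arrow-style offsets into one flat buffer, and the two hash sets are allocated once and cleared per row:

```rust
use std::collections::HashSet;

// Which set operation to apply (mirrors the PR's SetOp enum).
#[derive(Clone, Copy)]
enum SetOp {
    Union,
    Intersect,
}

// Batched set operation over flattened value buffers: `l_values` holds the
// child values of every left list row back to back, and
// `l_offsets[i]..l_offsets[i + 1]` delimits row i (likewise for the right
// side). Hash sets are reused across rows instead of being rebuilt.
fn batched_set_op(
    l_values: &[i64],
    l_offsets: &[usize],
    r_values: &[i64],
    r_offsets: &[usize],
    set_op: SetOp,
) -> Vec<Vec<i64>> {
    let mut seen = HashSet::new();
    let mut r_set = HashSet::new();
    let mut results = Vec::with_capacity(l_offsets.len() - 1);
    for i in 0..l_offsets.len() - 1 {
        seen.clear();
        r_set.clear();
        let l_row = &l_values[l_offsets[i]..l_offsets[i + 1]];
        let r_row = &r_values[r_offsets[i]..r_offsets[i + 1]];
        let mut out = Vec::new();
        match set_op {
            SetOp::Union => {
                // First-seen order: left values, then unseen right values.
                for &v in l_row.iter().chain(r_row) {
                    if seen.insert(v) {
                        out.push(v);
                    }
                }
            }
            SetOp::Intersect => {
                // Build a lookup table from the right row, then scan the left.
                r_set.extend(r_row.iter().copied());
                for &v in l_row {
                    if r_set.contains(&v) && seen.insert(v) {
                        out.push(v);
                    }
                }
            }
        }
        results.push(out);
    }
    results
}
```

This also makes the ordering change visible: output follows first-seen iteration order rather than sorted order, which is why the `.slt` expectations were updated.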
Activity
  • The pull request was initiated by martin-augment.
  • The PR description explicitly requests an AI review.
  • The changes include adding new benchmarks, refactoring existing code for performance, and updating test expectations.

@coderabbitai

coderabbitai bot commented Feb 9, 2026

Walkthrough

This pull request introduces benchmarking infrastructure for array set operations (union and intersection) in DataFusion. It adds a new benchmark file with Criterion tests covering different overlap scenarios, adds the corresponding benchmark configuration, and refactors the underlying set_ops implementation. The ArrayIntersect struct is made publicly accessible with a Default implementation, and the generic_set_lists function is reworked to use batch processing via RowConverter instead of per-element iteration. Test expectations are updated to reflect changes in result ordering and type annotation formatting.

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
datafusion/functions-nested/src/set_ops.rs (2)

374-376: final_rows capacity underestimates for the Union case.

For SetOp::Union, final_rows may contain elements from both left and right value arrays, but the capacity is only rows_l.num_rows(). This won't cause a bug (the Vec will just reallocate), but the pre-allocation misses its goal.

Suggested fix
-    let mut final_rows = Vec::with_capacity(rows_l.num_rows());
+    let mut final_rows = Vec::with_capacity(match set_op {
+        SetOp::Union => rows_l.num_rows() + rows_r.num_rows(),
+        SetOp::Intersect => rows_l.num_rows(),
+    });

416-421: Typo in comment: "iterator" → "iterate".

Line 418: "then iterator left array" should be "then iterate left array".

Suggested fix
-                // Build hash set from right array for lookup table
-                // then iterator left array to find common elements.
+                // Build hash set from right array for lookup table,
+                // then iterate left array to find common elements.
datafusion/functions-nested/benches/array_set_ops.rs (1)

43-83: Consider deduplicating invoke_array_union and invoke_array_intersect.

These two functions are identical except for the name. A single generic helper would reduce duplication:

Suggested consolidation
-fn invoke_array_union(udf: &impl ScalarUDFImpl, array1: &ArrayRef, array2: &ArrayRef) {
-    black_box(
-        udf.invoke_with_args(ScalarFunctionArgs {
-            args: vec![
-                ColumnarValue::Array(array1.clone()),
-                ColumnarValue::Array(array2.clone()),
-            ],
-            arg_fields: vec![
-                Field::new("arr1", array1.data_type().clone(), false).into(),
-                Field::new("arr2", array2.data_type().clone(), false).into(),
-            ],
-            number_rows: NUM_ROWS,
-            return_field: Field::new("result", array1.data_type().clone(), false).into(),
-            config_options: Arc::new(ConfigOptions::default()),
-        })
-        .unwrap(),
-    );
-}
-
-fn invoke_array_intersect(
-    udf: &impl ScalarUDFImpl,
-    array1: &ArrayRef,
-    array2: &ArrayRef,
-) {
-    black_box(
-        udf.invoke_with_args(ScalarFunctionArgs {
-            args: vec![
-                ColumnarValue::Array(array1.clone()),
-                ColumnarValue::Array(array2.clone()),
-            ],
-            arg_fields: vec![
-                Field::new("arr1", array1.data_type().clone(), false).into(),
-                Field::new("arr2", array2.data_type().clone(), false).into(),
-            ],
-            number_rows: NUM_ROWS,
-            return_field: Field::new("result", array1.data_type().clone(), false).into(),
-            config_options: Arc::new(ConfigOptions::default()),
-        })
-        .unwrap(),
-    );
-}
+fn invoke_set_op(udf: &impl ScalarUDFImpl, array1: &ArrayRef, array2: &ArrayRef) {
+    black_box(
+        udf.invoke_with_args(ScalarFunctionArgs {
+            args: vec![
+                ColumnarValue::Array(array1.clone()),
+                ColumnarValue::Array(array2.clone()),
+            ],
+            arg_fields: vec![
+                Field::new("arr1", array1.data_type().clone(), false).into(),
+                Field::new("arr2", array2.data_type().clone(), false).into(),
+            ],
+            number_rows: NUM_ROWS,
+            return_field: Field::new("result", array1.data_type().clone(), false).into(),
+            config_options: Arc::new(ConfigOptions::default()),
+        })
+        .unwrap(),
+    );
+}


@claude

claude bot commented Feb 9, 2026

Code Review - PR #234: Performance Improvement for array_union/array_intersect

Summary

This PR implements a performance optimization for array_union and array_intersect functions using batched row conversion. The changes look solid overall with good performance engineering practices.


✅ Strengths

  1. Performance Optimization Strategy: The batched row conversion approach is well-thought-out:

    • Converting all values to rows upfront (lines 368-370 in set_ops.rs) is more efficient than per-row conversion
    • Reusing HashSet instances across iterations (lines 379-380, 396-397) reduces allocations
    • Pre-allocating vectors with appropriate capacity
  2. Comprehensive Benchmark Suite: The new benchmark file is well-structured:

    • Tests different array sizes (10, 50, 100)
    • Tests different overlap ratios (0.8, 0.2) which covers realistic scenarios
    • Uses proper seeding for reproducibility
    • Good use of black_box to prevent compiler optimizations
  3. Code Quality:

    • Proper ASF license headers
    • Clear comments explaining the algorithm (e.g., lines 417-418)
    • Added Default implementation for ArrayIntersect (lines 192-195)
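The controlled-overlap data generation praised above can be sketched std-only (a deterministic simplification with hypothetical names; the actual benchmark uses seeded random overlap positions for reproducibility):

```rust
// Produce two arrays of `array_size` values where roughly
// `overlap_ratio * array_size` positions share a value with the first array,
// and the remaining positions are shifted past the first array's range so
// they are guaranteed disjoint.
fn arrays_with_overlap(array_size: usize, overlap_ratio: f64, base: i64) -> (Vec<i64>, Vec<i64>) {
    let values1: Vec<i64> = (0..array_size).map(|i| base + i as i64).collect();
    let overlap_count = (array_size as f64 * overlap_ratio) as usize;
    let mut values2 = Vec::with_capacity(array_size);
    for i in 0..array_size {
        if i < overlap_count {
            // Shared with values1.
            values2.push(base + i as i64);
        } else {
            // Disjoint: shifted past the range used by values1.
            values2.push(base + array_size as i64 + i as i64);
        }
    }
    (values1, values2)
}
```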

🔍 Issues & Concerns

1. Visibility Change Requires Justification (datafusion/functions-nested/src/set_ops.rs:187)

-pub(super) struct ArrayIntersect {
+pub struct ArrayIntersect {

Issue: Changed ArrayIntersect from pub(super) to pub to allow benchmark access, but this exposes it as part of the public API.

Recommendation: Consider using #[cfg(test)] or #[cfg(any(test, feature = "bench"))] to conditionally expose this only for testing/benchmarking without making it part of the stable API.

2. Test Output Changes Without Explanation (datafusion/sqllogictest/test_files/array.slt)

The PR modifies expected test outputs in multiple places:

-[] [false, true] [false]
+[] [true, false] [false]

Issue: The order of elements in the result has changed. While set operations don't guarantee order, changing existing behavior could break user code that depends on the current ordering.

Questions:

  • Is this intentional or a side effect of the optimization?
  • Should the documentation clarify that element order is not guaranteed?
  • Have you verified this doesn't break backward compatibility expectations?

3. Memory Growth Concern (datafusion/functions-nested/src/set_ops.rs:376)

let mut final_rows = Vec::with_capacity(rows_l.num_rows());

Issue: This pre-allocates based on rows_l.num_rows(), but for intersect operations, the actual size could be much smaller. For union operations, it could be up to rows_l.num_rows() + rows_r.num_rows().

Recommendation: Consider using min(rows_l.num_rows(), rows_r.num_rows()) for intersect and appropriate heuristics for union to avoid over-allocation or re-allocations.

4. Potential Performance Issue with HashSet Clearing (set_ops.rs:396-397)

seen.clear();
r_set.clear();

Concern: While clear() is generally efficient, for very large arrays with many rows, repeatedly clearing and refilling could be less optimal than using a fresh HashSet if the backing storage grows large.

Recommendation: Consider benchmarking scenarios with very large individual arrays to ensure the reuse strategy is always beneficial.

5. Missing Error Handling Context

The code maintains existing error handling patterns, but error messages could be more descriptive for the new algorithm path.


🧪 Testing Recommendations

  1. Add Unit Tests: The PR adds benchmarks but no new unit tests for edge cases:

    • Empty arrays in both inputs
    • Arrays with all duplicates
    • Very large arrays (>10k elements)
    • Nested complex types
  2. Verify Correctness: Run extensive correctness tests to ensure the optimization doesn't change semantics beyond element ordering.

  3. Performance Regression Tests: Document baseline performance numbers to track improvements and prevent future regressions.


🔒 Security Considerations

No security concerns identified. The changes don't introduce:

  • Unvalidated user input
  • Unsafe code blocks
  • Potential for DoS through unbounded memory growth (existing bounds apply)

📊 Performance Considerations

Positive:

  • Batched conversion reduces function call overhead
  • HashSet reuse reduces allocations
  • Index-based iteration is efficient

Watch Out For:

  • Memory overhead of converting entire arrays upfront
  • For very sparse intersections, the current approach might do unnecessary work

Suggestion: Add a comment or benchmark results in the PR description showing the performance improvement to justify the optimization.


📝 Code Style & Best Practices

Minor Issues:

  1. Line 418: Typo in comment: "then iterator left array" → "then iterate over left array"
  2. Consider adding a module-level doc comment explaining the optimization strategy for future maintainers

✅ Final Verdict

Overall Assessment: Good performance optimization with solid implementation. The main concerns are:

  1. Unexplained test output changes (element ordering)
  2. Public API visibility change needs justification
  3. Need confirmation that all existing tests still pass

Recommendation: ✅ Approve with minor changes requested

Action Items:

  1. Clarify the element ordering change in test expectations
  2. Consider alternative to making ArrayIntersect fully public
  3. Add a comment documenting the performance characteristics of the new approach
  4. Fix the typo on line 418

Great work on the performance optimization! The batched approach is a solid improvement. 🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly improves the performance of array_union and array_intersect by introducing batched row conversion, which is a great enhancement. The new implementation converts all list values to rows at once, avoiding repeated conversions within loops, and reuses HashSets to minimize allocations. I have a couple of suggestions for further optimization.

Comment on lines +158 to +166
let overlap_positions = &positions[..overlap_count];

for i in 0..array_size {
if overlap_positions.contains(&i) {
values2.push(base + i as i64);
} else {
values2.push(base + array_size as i64 + i as i64);
}
}


medium

In the benchmark data generation, overlap_positions.contains(&i) performs a linear scan inside a loop, which can be inefficient for larger array_size. Using a HashSet for overlap_positions would provide O(1) average time complexity for lookups and improve the performance of the benchmark setup itself.

        let overlap_positions: std::collections::HashSet<_> =
            positions[..overlap_count].iter().copied().collect();

        for i in 0..array_size {
            if overlap_positions.contains(&i) {
                values2.push(base + i as i64);
            } else {
                values2.push(base + array_size as i64 + i as i64);
            }
        }

Owner Author


value:good-but-wont-fix; category:bug; feedback: The Gemini AI reviewer is correct! The create_arrays_with_overlap() function is not measured by Criterion benchmark library, so the speed is not really important here.

Comment on lines +416 to 429
SetOp::Intersect => {
// Build hash set from right array for lookup table
// then iterator left array to find common elements.
for idx in r_start..r_end {
r_set.insert(rows_r.row(idx));
}
for idx in l_start..l_end {
let row = rows_l.row(idx);
if r_set.contains(&row) && seen.insert(row) {
final_rows.push(row);
count += 1;
}
}
}


medium

For SetOp::Intersect, performance can be further optimized by building the hash set from the smaller of the two arrays and iterating over the larger one. This minimizes memory usage for the hash set and can reduce the number of hash operations, especially when there's a significant size difference between the two arrays.

Note that this may change which array's iteration order determines the output order. The current implementation's output order is based on the left array. Since set operations generally don't guarantee order, and the order has already changed in this PR, this should be an acceptable trade-off for the performance gain.

            SetOp::Intersect => {
                let l_len = l_end - l_start;
                let r_len = r_end - r_start;

                let (build_rows, build_start, build_end, iter_rows, iter_start, iter_end) = if l_len <= r_len {
                    (&rows_l, l_start, l_end, &rows_r, r_start, r_end)
                } else {
                    (&rows_r, r_start, r_end, &rows_l, l_start, l_end)
                };

                // Build hash set from the smaller array for a faster lookup.
                for idx in build_start..build_end {
                    r_set.insert(build_rows.row(idx));
                }

                // Iterate through the larger array to find common elements.
                for idx in iter_start..iter_end {
                    let row = iter_rows.row(idx);
                    if r_set.contains(&row) && seen.insert(row) {
                        final_rows.push(row);
                        count += 1;
                    }
                }
            }

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! A smaller HashSet occupies less memory, and fewer hashing operations are executed when inserting the elements. Iterating over the longer collection is inevitable anyway.

@augmentcode

augmentcode bot commented Feb 9, 2026

🤖 Augment PR Summary

Summary: Improves array_union / array_intersect performance by avoiding per-row row-conversion and reducing repeated allocations.

Changes:

  • Reworks `generic_set_lists` to batch-convert list child values to row format once per input array, then slices via offsets per row.
  • Reuses hash sets across rows to reduce allocation churn while computing union/intersection results.
  • Makes `ArrayIntersect` publicly constructible (and adds `Default`) to support external usage such as benchmarks.
  • Adds a Criterion benchmark (`array_set_ops`) that measures union/intersect across multiple list sizes and overlap ratios.
  • Updates SQLLogicTest expected results for `array_intersect` to match the new output ordering semantics.

Technical Notes: The new approach builds a single final_rows buffer and performs a single RowConverter::convert_rows at the end, using list offsets to delimit each row’s segment.
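The single-buffer pattern described in the technical notes can be sketched as follows (a simplified std-only illustration with hypothetical names; in the PR the final materialization is the one RowConverter::convert_rows call, while here plain i64 values stand in for Rows):

```rust
// Per-row results are appended to one flat `final_vals` buffer, and
// `out_offsets` delimits each row's segment, so only a single
// materialization pass over the buffer is needed at the end.
fn build_flat_output(rows: &[Vec<i64>]) -> (Vec<i64>, Vec<usize>) {
    let mut final_vals = Vec::new();
    let mut out_offsets = Vec::with_capacity(rows.len() + 1);
    out_offsets.push(0);
    for row in rows {
        final_vals.extend_from_slice(row);
        // Each offset marks the end of the current row's segment.
        out_offsets.push(final_vals.len());
    }
    (final_vals, out_offsets)
}
```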



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


// Reuse hash sets across iterations
let mut seen = HashSet::new();
let mut r_set = HashSet::new();
for i in 0..l.len() {


generic_set_lists now indexes r_offsets[i]/r_offsets[i+1] using for i in 0..l.len(), whereas the previous implementation used l.iter().zip(r.iter()). If l.len() != r.len() ever happens (e.g., via broadcasting/mismatched inputs), this will panic with an out-of-bounds offset access.

Severity: medium


Owner Author


value:annoying; category:bug; feedback: The Augment AI reviewer is not correct! Before the offsets are accessed by index, there is a check for whether an element exists at that index at all, and the function returns early if there is no such element, i.e. if the element is null.

seen.clear();
r_set.clear();

match set_op {


This change removes the per-list .sorted() step, so array_union/array_intersect output ordering is now based on first-seen iteration order (left then right) rather than sorted order (as reflected in the updated .slt expectations). Can you confirm this ordering change is intended and matches the documented/expected function contract?

Severity: medium


Owner Author


value:valid-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct that this changes the result order, but since these are set operations (union and intersect) there is no guaranteed order, and omitting the .sorted() call saves some time. The Pull Request author explained this in the original PR description.

@martin-augment
Owner Author

1. Visibility Change Requires Justification (datafusion/functions-nested/src/set_ops.rs:187)

-pub(super) struct ArrayIntersect {
+pub struct ArrayIntersect {

Issue: Changed ArrayIntersect from pub(super) to pub to allow benchmark access, but this exposes it as part of the public API.

Recommendation: Consider using #[cfg(test)] or #[cfg(any(test, feature = "bench"))] to conditionally expose this only for testing/benchmarking without making it part of the stable API.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The ArrayIntersect struct is not supposed to be part of the public APIs, so it could be annotated with #[doc(hidden)] to hide it from the end users.
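The #[doc(hidden)] approach mentioned above can be sketched minimally (a unit-struct stand-in, not the real ArrayIntersect definition, which has fields and trait impls): the type stays `pub` so external benchmarks can construct it, but it is omitted from rendered rustdoc and signalled as non-stable API.

```rust
// Reachable from benches via `ArrayIntersect::default()`, yet hidden
// from the crate's generated documentation.
#[doc(hidden)]
#[derive(Default, Debug)]
pub struct ArrayIntersect;
```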

@martin-augment
Owner Author

2. Test Output Changes Without Explanation (datafusion/sqllogictest/test_files/array.slt)

The PR modifies expected test outputs in multiple places:

-[] [false, true] [false]
+[] [true, false] [false]

Issue: The order of elements in the result has changed. While set operations don't guarantee order, changing existing behavior could break user code that depends on the current ordering.

Questions:

  • Is this intentional or a side effect of the optimization?
  • Should the documentation clarify that element order is not guaranteed?
  • Have you verified this doesn't break backward compatibility expectations?

value:valid-but-wont-fix; category:bug; feedback: The Claude AI reviewer is correct that this changes the result order, but since these are set operations (union and intersect) there is no guaranteed order, and omitting the .sorted() call saves some time. The Pull Request author explained this in the original PR description.

@martin-augment
Owner Author

3. Memory Growth Concern (datafusion/functions-nested/src/set_ops.rs:376)

let mut final_rows = Vec::with_capacity(rows_l.num_rows());

Issue: This pre-allocates based on rows_l.num_rows(), but for intersect operations, the actual size could be much smaller. For union operations, it could be up to rows_l.num_rows() + rows_r.num_rows().

Recommendation: Consider using min(rows_l.num_rows(), rows_r.num_rows()) for intersect and appropriate heuristics for union to avoid over-allocation or re-allocations.

value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! There is no need to reserve more capacity than the smaller array for the intersect operation, so this could be optimized.
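The capacity heuristic discussed here can be expressed as a small helper (a sketch with a hypothetical name, mirroring the PR's SetOp enum): intersect output can never exceed the smaller side, while union can reach the sum of both sides.

```rust
#[derive(Clone, Copy)]
enum SetOp {
    Union,
    Intersect,
}

// Upper bound on the number of result rows for pre-allocation.
fn final_rows_capacity(set_op: SetOp, l_rows: usize, r_rows: usize) -> usize {
    match set_op {
        // Union may keep every row from both sides.
        SetOp::Union => l_rows + r_rows,
        // Intersect is bounded by the smaller side.
        SetOp::Intersect => l_rows.min(r_rows),
    }
}
```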
