
20374: perf: Optimize array_has() for scalar needle #247

Open
martin-augment wants to merge 2 commits into main from pr-20374-2026-02-18-12-16-37

Conversation

@martin-augment
Owner

20374: To review by AI

The previous implementation tested the cost of building an array_has()
`Expr` (!), not actually evaluating the array_has() operation itself.
Refactor things along the way.
@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly improves the performance of the array_has function when a scalar value is used as the search target. The core optimization involves a more efficient algorithm within the array_has_dispatch_for_scalar function, which now directly utilizes set_indices() for rapid element matching. Alongside this, the associated benchmarks for array-related functions have been thoroughly updated to use a more modern UDF invocation approach and to generate test data more effectively, ensuring accurate and reliable performance measurements.

Highlights

  • Performance Optimization: Optimized the array_has function for scalar needle inputs, improving search efficiency.
  • Benchmark Refactoring: Refactored existing benchmarks for array_has, array_has_all, and array_has_any to use the ScalarUDFImpl::invoke_with_args method, aligning with new UDF invocation patterns.
  • Improved Data Generation: Introduced new helper functions for generating randomized ListArray data, enhancing benchmark robustness and clarity.
  • Efficient Element Search: Enhanced the array_has_dispatch_for_scalar logic by leveraging BooleanArray::set_indices() for faster and more direct identification of matching elements within arrays.
Changelog
  • datafusion/functions-nested/benches/array_has.rs
    • Updated imports to include necessary Arrow and DataFusion common types for benchmark setup.
    • Added new constants (NUM_ROWS, SEED, NULL_DENSITY, NEEDLE_SIZE) to configure benchmark parameters.
    • Adjusted the array_sizes used in benchmarks to focus on [10, 100, 500] for more relevant testing.
    • Removed the bench_array_has_edge_cases function.
    • Rewrote all bench_array_has, bench_array_has_all, bench_array_has_any, bench_array_has_strings, bench_array_has_all_strings, and bench_array_has_any_strings functions to utilize ScalarUDFImpl::invoke_with_args and new data generation helpers.
    • Added new helper functions: create_int64_list_array, create_int64_list_array_with_offset, create_string_list_array, and create_string_list_array_with_prefix for generating diverse benchmark data.
  • datafusion/functions-nested/src/array_has.rs
    • Modified array_has_dispatch_for_scalar to correctly handle null values in BooleanArray by performing a bitwise AND operation with the validity bitmap.
    • Refactored the iteration logic over haystack offsets to use windows(2) and peekable() with set_indices() for optimized match detection.
    • Changed the initialization of final_contained to vec![Some(false); haystack.len()] for better default handling.
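
The windows(2)/peekable()/set_indices() scan described in the changelog can be sketched with plain std types. This is a hypothetical toy model, not the DataFusion code: `set_indices` stands in for the sorted match positions that `BooleanBuffer::set_indices()` would yield, and `offsets` for the list array's row boundaries.

```rust
// Toy model of the optimized per-row scan: a row contains the needle iff
// some pending match index falls before that row's end offset.
fn contains_per_row(set_indices: &[usize], offsets: &[usize]) -> Vec<bool> {
    let mut matches = set_indices.iter().copied().peekable();
    let mut out = Vec::with_capacity(offsets.len().saturating_sub(1));
    for window in offsets.windows(2) {
        let end = window[1];
        // Any pending match before this row's end offset belongs to this row.
        let hit = matches.peek().is_some_and(|&p| p < end);
        out.push(hit);
        // Consume all matches inside this row before moving to the next one.
        while matches.peek().is_some_and(|&p| p < end) {
            matches.next();
        }
    }
    out
}
```

Because `matches` is consumed in order, each row is checked in O(1) plus the matches it owns, instead of slicing and counting the equality array per row.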
Activity
  • The pull request was created.
  • The author indicated that the pull request is "To review by AI".

@coderabbitai

coderabbitai bot commented Feb 18, 2026

Walkthrough

The pull request refactors array containment checking logic in DataFusion's nested functions module. The benchmark file is rewritten to use UDF invocation paths with dynamically created list array inputs (both int64 and string variants) instead of direct internal helpers. Helper functions are introduced to construct in-memory list arrays for different edge-case scenarios. The source code changes refactor the scalar-array containment computation from per-row slice-based counting to a streaming window-based check that operates over precomputed equality bits, with offset-based loops to handle nested row boundaries and a matches iterator to determine element matches while preserving null semantics in the resulting BooleanArray.


@claude

claude bot commented Feb 18, 2026

Code Review: apache#20374 – perf: Optimize array_has() for scalar needle

Summary

This PR optimises array_has_dispatch_for_scalar by replacing the per-row slice + true_count approach with a single pass over a bit-set iterator (set_indices()). It also refactors the benchmarks to operate at the ScalarUDFImpl level with realistic multi-row arrays instead of single logical-row expressions.

The intent is sound and the benchmark improvements are real. There are a few correctness concerns and design questions worth addressing before merge.


Correctness Issues

1. set_indices() relies on undocumented Arrow bit behaviour (high risk)

The key optimisation ANDs the value buffer with the validity bitmap before calling set_indices():

let eq_bits = match eq_array.nulls() {
    Some(nulls) => eq_array.values() & nulls.inner(),
    None => eq_array.values().clone(),
};

The comment explains the intent: Arrow may leave undefined bits set at null positions, so we mask them out. This is correct today based on how Arrow-rs happens to lay out bits, but it relies on an implementation detail that is not part of the Arrow spec or the public BooleanArray API. Concretely:

  • The Arrow spec says null-slot value bits are undefined – the masking is a workaround for that.
  • A future Arrow-rs release could change how it initialises value bits and silently break this path.

Recommendation: Prefer eq_array.iter() or maintain the simpler slice + true_count path if correctness guarantees matter more than the last few percent of throughput. Alternatively, leave an explicit // SAFETY: comment that calls out the specific Arrow-rs version being relied upon, and add a CI check (e.g. assert_eq!(eq_array.false_count() + eq_array.true_count(), eq_array.len() - eq_array.null_count())).
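
The masking step being discussed can be illustrated with a std-only toy model (a u64 bitmap standing in for Arrow's `BooleanBuffer`; this is a sketch of the idea, not the actual arrow-rs types):

```rust
// Toy model of the null-masking step: `values` holds raw equality bits
// (bits at null slots are undefined and may be set), `validity` has a 1
// for every valid slot. ANDing them guarantees a null slot can never
// surface as a match when the set bits are enumerated.
fn masked_set_indices(values: u64, validity: u64, len: usize) -> Vec<usize> {
    let bits = values & validity;
    (0..len).filter(|&i| (bits >> i) & 1 == 1).collect()
}
```

With values 0b1011 and validity 0b0111 (slot 3 null), the undefined bit at slot 3 is masked out and only slots 0 and 1 are reported, which is exactly the behaviour the reviewer wants guaranteed rather than assumed.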

2. offsets iterator collected into Vec – unnecessary allocation for FixedSizeList

let offsets: Vec<usize> = haystack.offsets().collect();

ArrayWrapper::offsets() for FixedSizeList already materialises a Vec internally and returns it as a Box<dyn Iterator>. This PR then collects that into a second Vec. For large arrays of fixed-size lists this doubles the allocation. Consider adding a windows method directly to ArrayWrapper to avoid the boxing + re-collection.

3. Offset range check uses only end, not start

for (i, window) in offsets.windows(2).enumerate() {
    let end = window[1];
    ...
    if matches.peek().is_some_and(|&p| p < end) {

The start offset (window[0]) is never used. For well-formed Arrow arrays offsets are monotonically non-decreasing and matches is already advanced in row order, so this happens to be correct. However it's fragile: if eq_bits.set_indices() ever yields indices out of order (it won't, but the code doesn't assert this), or if a bug causes matches to fall behind, a match from a previous row could be attributed to the current row. Adding an assertion or using start..end range membership (p >= start && p < end) would make the intent explicit and guard against future regressions.


Performance Considerations

4. UDF instantiated inside the benchmark hot loop

b.iter(|| {
    let udf = ArrayHas::new();
    // ... udf.invoke_with_args(...) ...
})

ArrayHas::new() allocates a Signature and alias Vecs. Constructing it inside b.iter() means its allocation cost is folded into every benchmark iteration, skewing results. Move UDF construction outside b.iter(), matching the pattern used for args_found, config_options, etc.

5. Removed benchmarks cover important edge cases

The following benchmarks were deleted:

  • found_at_start / found_at_end (early-exit vs worst-case for array_has)
  • all_found_medium_needle (10%-of-haystack needle size for array_has_all)
  • bench_array_has_edge_cases group (empty array, single element, duplicates, empty needle)

The edge-case group is particularly valuable because empty arrays and null-filled arrays are exactly the cases the new bit-masking code must handle correctly. Losing coverage here makes it harder to catch regressions. Please restore the edge-case benchmarks (adapted to the new invocation style).


Code Quality

6. Benchmark helper functions share the same SEED for int and string arrays

create_int64_list_array and create_string_list_array both use SEED = 42. Their "not found" variants use SEED + 1. If int and string benchmarks are ever run in the same process the RNG sequences are identical, which is fine, but it's easy to accidentally produce correlated data when adding new helpers. Consider namespacing seeds (e.g. INT_SEED, STR_SEED) or documenting the convention.

7. create_int64_list_array_with_offset "not found" guarantee is probabilistic, not certain

The haystack contains values in 0..array_size. The "not found" needle contains values in array_size..2*array_size. This guarantees no overlap by construction — good. But for create_string_list_array_with_prefix, the guarantee relies on the prefix "missing_" not appearing in "value_{idx}" strings, which is true now but could silently break if the naming convention changes. A brief comment explaining this invariant would help.

8. Unused start variable shadowed by let end = window[1]

window[0] (the start offset) is never read, and nothing signals that the discard is intentional. This is a minor style point: either name it with let [start, end] = window else { unreachable!() } or use let (_start, end) = (window[0], window[1]) to signal the intentional discard.


Test Coverage

The existing unit tests (test_array_has_list_null_haystack, test_array_has_list_empty_child) exercise the null-haystack path which the new code modifies. These are good. However, there are no tests for:

  • A haystack row where some elements are null but the row itself is not null (e.g. [1, null, 3] with needle 3 → should be true).
  • A haystack row where all elements are null ([null, null] with needle 1 → should be false, not null).

These cases exercise the eq_bits masking logic directly and would catch the correctness concern raised in point 1.
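
The two missing cases can be pinned down with a toy model of the intended semantics (a row as `Vec<Option<i64>>`; `row_has` is a hypothetical stand-in for array_has on one row, not the DataFusion implementation):

```rust
// Toy model of array_has semantics for a single row: the needle is found
// iff some NON-NULL element equals it. Null elements never match, and a
// row of all nulls yields false rather than null.
fn row_has(row: &[Option<i64>], needle: i64) -> bool {
    row.iter().any(|&v| v == Some(needle))
}
```

Unit tests asserting exactly these two outcomes against the real `array_has` would exercise the eq_bits masking path directly.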


Minor Nits

  • Benchmark IDs changed from found_at_start/found_at_end to a single found. This loses information about best-vs-worst-case scan position. Consider keeping both or using a position parameter.
  • array_size range reduced from [1, 10, 100, 1000, 10000] to [10, 100, 500]. Dropping size 1 removes coverage for the single-element fast path; dropping 10000 removes the high-cardinality case. Both are worth retaining.

Overall Assessment

The algorithmic direction (iterator over set bits instead of per-row slice allocation) is a legitimate improvement. The main blocker is the reliance on Arrow's undefined null-bit behaviour — this should either be made robust or clearly documented with a version pin. The missing edge-case benchmarks and tests are also important to restore before merge.

@augmentcode

augmentcode bot commented Feb 18, 2026

🤖 Augment PR Summary

Summary: Improves performance of array_has() when the needle is scalar by avoiding per-row slicing/scanning of the comparison results.

Changes:

  • Reworked scalar-needle dispatch to compute all equality results once and scan set bits to detect per-row matches
  • Added explicit masking of equality result bits with the null bitmap to avoid treating null comparison slots as true
  • Updated benchmarks to invoke the UDF implementations (ArrayHas/ArrayHasAll/ArrayHasAny) directly via invoke_with_args
  • Bench data generation now builds Arrow ListArray inputs (int64/string) with deterministic RNG seeding and configurable null density
  • Simplified benchmark scenarios and normalized test sizes (e.g. 10/100/500) for both numeric and string cases
  • Removed the prior “edge cases” benchmark group in favor of randomized row-based inputs

Technical Notes: The new implementation relies on BooleanBuffer::set_indices() and per-row offset ranges to avoid repeated true_count() over slices, and the benches now exercise the same UDF entrypoints used by DataFusion execution.



@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


let sliced_array = eq_array.slice(start, length);
final_contained[i] = Some(sliced_array.true_count() > 0);
// Check if any match falls within this row's range.
if matches.peek().is_some_and(|&p| p < end) {


The per-row match detection only checks p < end and implicitly assumes the first offset is 0 / that matches has already been advanced to the current row start. If haystack is a sliced ListArray/LargeListArray with a non-zero starting offset, matches from values before the first row can be mis-attributed to row 0, yielding incorrect array_has results.

Severity: high



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance optimization for the array_has() function when used with a scalar needle. The new implementation avoids repeated array slicing and counting within a loop by pre-calculating all matches and then efficiently checking for containment while iterating through the list offsets. This is a great improvement.

The changes also include a crucial correctness fix for handling null values within the haystack array, ensuring that array_has behaves correctly in their presence.

Additionally, the benchmarks for array_has and related functions have been substantially refactored. They now provide more realistic and isolated performance measurements by separating data generation from the benchmarked code, which is a very welcome change.

I have one suggestion to further improve memory efficiency by avoiding an allocation when iterating over list offsets.


for (i, (start, end)) in haystack.offsets().tuple_windows().enumerate() {
let length = end - start;
let offsets: Vec<usize> = haystack.offsets().collect();


medium

Collecting all offsets into a Vec can cause a large allocation if the ListArray has many rows. The previous implementation used itertools::tuple_windows to iterate over offsets without collecting them. A similar approach could be used here to avoid the allocation and improve memory efficiency.

Consider replacing this line and the loop at line 380 with for (i, (_start, end)) in haystack.offsets().tuple_windows().enumerate() {.
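
If avoiding the itertools dependency is preferred, the same streaming behaviour can be had with std alone. This is a hypothetical sketch (a generic `pairwise` adapter, not an existing DataFusion helper) that yields consecutive (start, end) pairs without collecting the offsets into a Vec:

```rust
// std-only stand-in for itertools::tuple_windows over a usize iterator:
// yields consecutive (start, end) pairs lazily, with no intermediate Vec.
fn pairwise<I: Iterator<Item = usize>>(
    mut it: I,
) -> impl Iterator<Item = (usize, usize)> {
    let mut prev = it.next();
    std::iter::from_fn(move || {
        let start = prev?;
        let end = it.next()?;
        prev = Some(end);
        Some((start, end))
    })
}
```

Feeding it the boxed offsets iterator would give the same `(start, end)` windows the loop needs while keeping memory usage constant.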


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/functions-nested/src/array_has.rs (1)

303-318: ⚠️ Potential issue | 🔴 Critical

Bug: offsets() for FixedSizeList produces incorrect offsets when value_length > 1.

(0..=arr.len()).step_by(value_length) iterates over row indices and skips every N-th row, rather than computing the flat offset of each row. For arr.len()=6, value_length=3, this yields [0, 3, 6] (producing 2 windows for row comparison) instead of the correct [0, 3, 6, 9, 12, 15, 18] (7 offsets for 6 rows). This silently processes only a fraction of rows, truncating results for any FixedSizeList with value_length > 1.

No tests exist for array_has with FixedSizeList columns, so this defect is undetected.

🐛 Proposed fix
 ArrayWrapper::FixedSizeList(arr) => {
-    let offsets = (0..=arr.len())
-        .step_by(arr.value_length() as usize)
-        .collect::<Vec<_>>();
+    let vl = arr.value_length() as usize;
+    let offsets = (0..=arr.len())
+        .map(|i| i * vl)
+        .collect::<Vec<_>>();
     Box::new(offsets.into_iter())
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions-nested/src/array_has.rs` around lines 303 - 318, The
offsets() implementation for ArrayWrapper::FixedSizeList is wrong: it uses
(0..=arr.len()).step_by(arr.value_length()) which steps over row indices instead
of producing flat-element offsets; change it to generate offsets by multiplying
the row index by the fixed element size (i.e. build offsets =
(0..=arr.len()).map(|i| i * arr.value_length() as usize).collect::<Vec<_>>() and
return Box::new(offsets.into_iter())) so the returned offsets are the start
positions in the underlying flat values buffer (e.g., for
arr.len()=6,value_length=3 produce [0,3,6,9,12,15,18]). Ensure you update the
ArrayWrapper::FixedSizeList arm in the offsets() method accordingly.
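
The difference between the buggy and corrected offset computations can be checked in isolation (standalone functions for illustration, not the actual ArrayWrapper methods):

```rust
// Corrected FixedSizeList offsets: each of the num_rows + 1 boundaries is
// the row index times the fixed element length.
fn fixed_size_offsets(num_rows: usize, value_length: usize) -> Vec<usize> {
    (0..=num_rows).map(|i| i * value_length).collect()
}

// Buggy variant for contrast: stepping over ROW indices by value_length
// yields far too few boundaries whenever value_length > 1.
fn buggy_offsets(num_rows: usize, value_length: usize) -> Vec<usize> {
    (0..=num_rows).step_by(value_length).collect()
}
```

For 6 rows of length 3 the corrected version produces the 7 boundaries [0, 3, 6, 9, 12, 15, 18], while the buggy version produces only [0, 3, 6], matching the failure described in the review comment.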
🧹 Nitpick comments (1)
datafusion/functions-nested/benches/array_has.rs (1)

435-549: Helper functions are well-designed with deterministic seeding and clear separation of found/not-found scenarios.

Minor observation: the four helpers share identical offset-construction and ListArray::try_new boilerplate. Extracting a common build_list_array(values, num_rows, array_size) helper could reduce duplication, but this is entirely optional for benchmark code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions-nested/benches/array_has.rs` around lines 435 - 549, The
four list-array helpers (create_int64_list_array,
create_int64_list_array_with_offset, create_string_list_array,
create_string_list_array_with_prefix) duplicate the same offsets construction
and ListArray::try_new boilerplate; refactor by extracting a helper like
build_list_array(values: ArrayRef/impl, num_rows: usize, array_size: usize) that
builds the offsets Vec, creates the OffsetBuffer and calls ListArray::try_new
(preserving the Field/DataType passed in), then update each create_* function to
call build_list_array with their generated values to reduce duplication.
