20374: perf: Optimize array_has() for scalar needle (#247)
martin-augment wants to merge 2 commits into main
Conversation
The previous benchmark measured the cost of building an array_has() `Expr` (!), not the cost of actually evaluating the array_has() operation itself. Refactor things along the way.
Walkthrough

The pull request refactors the array containment checking logic in DataFusion's nested functions module. The benchmark file is rewritten to use UDF invocation paths with dynamically created list array inputs (both int64 and string variants) instead of direct internal helpers. Helper functions are introduced to construct in-memory list arrays for different edge-case scenarios. The source code changes refactor the scalar-array containment computation from per-row slice-based counting to a streaming window-based check that operates over precomputed equality bits, with offset-based loops to handle nested row boundaries and a matches iterator to determine element matches while preserving null semantics in the resulting BooleanArray.
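The streaming approach the walkthrough describes can be sketched as follows. This is a hypothetical, dependency-free model, not the actual DataFusion code: plain slices stand in for the precomputed equality `BooleanArray` and the list offsets, and the function name is illustrative. The needle is compared against the flat child values once; each row then only peeks at the next match position instead of slicing and counting per row.

```rust
// Illustrative sketch (names and signatures hypothetical, not DataFusion's API).
fn scalar_array_has(values: &[i64], needle: i64, offsets: &[usize]) -> Vec<bool> {
    // Precompute the flat positions where the needle matches a child value
    // (stands in for the precomputed equality bits).
    let mut matches = values
        .iter()
        .enumerate()
        .filter(|(_, v)| **v == needle)
        .map(|(i, _)| i)
        .peekable();

    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0], w[1]);
            // Drop matches that precede this row (handles non-zero offsets).
            while matches.peek().is_some_and(|&p| p < start) {
                matches.next();
            }
            // Any remaining match inside [start, end) means the row
            // contains the needle.
            matches.peek().is_some_and(|&p| p < end)
        })
        .collect()
}

fn main() {
    // Rows: [5, 1], [3, 2], [9, 5] with needle 5.
    let contained = scalar_array_has(&[5, 1, 3, 2, 9, 5], 5, &[0, 2, 4, 6]);
    println!("{contained:?}"); // [true, false, true]
}
```

The win over the old per-row `slice` + `true_count` is that the match iterator is consumed monotonically, so each equality bit is visited at most once across the whole column.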
Code Review: apache#20374 – perf: Optimize array_has() for scalar needle
```diff
- let sliced_array = eq_array.slice(start, length);
- final_contained[i] = Some(sliced_array.true_count() > 0);
+ // Check if any match falls within this row's range.
+ if matches.peek().is_some_and(|&p| p < end) {
```
The per-row match detection only checks p < end and implicitly assumes the first offset is 0 / that matches has already been advanced to the current row start. If haystack is a sliced ListArray/LargeListArray with a non-zero starting offset, matches from values before the first row can be mis-attributed to row 0, yielding incorrect array_has results.
Severity: high
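The failure mode this comment describes can be reproduced in miniature. The sketch below is hypothetical (the helper name and the plain-slice representation are illustrative): with a sliced haystack whose first row starts at a non-zero flat offset, a match at a flat position before that start must be skipped first, otherwise checking `p < end` alone attributes it to row 0.

```rust
// Illustrative sketch of the reviewed logic; `row0_contains` is a
// hypothetical helper, not DataFusion code.
fn row0_contains(
    match_positions: &[usize],
    start: usize,
    end: usize,
    skip_before_start: bool,
) -> bool {
    let mut matches = match_positions.iter().copied().peekable();
    if skip_before_start {
        // Advance past matches belonging to values before this row's start.
        while matches.peek().is_some_and(|&p| p < start) {
            matches.next();
        }
    }
    matches.peek().is_some_and(|&p| p < end)
}

fn main() {
    // Sliced haystack: row 0 covers flat positions [3, 5); the only match
    // sits at flat position 1, i.e. before the slice begins.
    println!("{}", row0_contains(&[1], 3, 5, false)); // buggy path: true
    println!("{}", row0_contains(&[1], 3, 5, true)); // correct path: false
}
```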
Code Review
This pull request introduces a significant performance optimization for the array_has() function when used with a scalar needle. The new implementation avoids repeated array slicing and counting within a loop by pre-calculating all matches and then efficiently checking for containment while iterating through the list offsets. This is a great improvement.
The changes also include a crucial correctness fix for handling null values within the haystack array, ensuring that array_has behaves correctly in their presence.
Additionally, the benchmarks for array_has and related functions have been substantially refactored. They now provide more realistic and isolated performance measurements by separating data generation from the benchmarked code, which is a very welcome change.
I have one suggestion to further improve memory efficiency by avoiding an allocation when iterating over list offsets.
```diff
- for (i, (start, end)) in haystack.offsets().tuple_windows().enumerate() {
-     let length = end - start;
+ let offsets: Vec<usize> = haystack.offsets().collect();
```
Collecting all offsets into a Vec can cause a large allocation if the ListArray has many rows. The previous implementation used itertools::tuple_windows to iterate over offsets without collecting them. A similar approach could be used here to avoid the allocation and improve memory efficiency.
Consider replacing this line and the loop at line 380 with for (i, (_start, end)) in haystack.offsets().tuple_windows().enumerate() {.
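The reviewer's suggestion amounts to walking adjacent offset pairs lazily rather than materializing every offset. A dependency-free stand-in for `itertools::tuple_windows` (written here so the sketch compiles without the `itertools` crate; the name `pairwise` is hypothetical) looks like this:

```rust
// Hypothetical stand-in for itertools::tuple_windows on an arbitrary
// iterator: yields adjacent pairs without collecting into a Vec first.
fn pairwise(mut it: impl Iterator<Item = usize>) -> impl Iterator<Item = (usize, usize)> {
    let mut prev = it.next();
    std::iter::from_fn(move || {
        let a = prev?;
        let b = it.next()?;
        prev = Some(b);
        Some((a, b))
    })
}

fn main() {
    // Offsets for three rows of length 3 each.
    let rows: Vec<_> = pairwise([0usize, 3, 6, 9].into_iter()).collect();
    println!("{rows:?}"); // [(0, 3), (3, 6), (6, 9)]
}
```

Because each pair is produced on demand, peak memory stays constant regardless of the number of rows, which is the point of the review comment.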
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/functions-nested/src/array_has.rs (1)
303-318: ⚠️ Potential issue | 🔴 Critical

Bug: `offsets()` for `FixedSizeList` produces incorrect offsets when `value_length > 1`.

`(0..=arr.len()).step_by(value_length)` iterates over row indices and skips every N-th row, rather than computing the flat offset of each row. For `arr.len() = 6`, `value_length = 3`, this yields `[0, 3, 6]` (producing 2 windows for row comparison) instead of the correct `[0, 3, 6, 9, 12, 15, 18]` (7 offsets for 6 rows). This silently processes only a fraction of rows, truncating results for any `FixedSizeList` with `value_length > 1`.

No tests exist for `array_has` with `FixedSizeList` columns, so this defect is undetected.

🐛 Proposed fix

```diff
 ArrayWrapper::FixedSizeList(arr) => {
-    let offsets = (0..=arr.len())
-        .step_by(arr.value_length() as usize)
-        .collect::<Vec<_>>();
+    let vl = arr.value_length() as usize;
+    let offsets = (0..=arr.len())
+        .map(|i| i * vl)
+        .collect::<Vec<_>>();
     Box::new(offsets.into_iter())
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/functions-nested/src/array_has.rs` around lines 303 - 318, The offsets() implementation for ArrayWrapper::FixedSizeList is wrong: it uses (0..=arr.len()).step_by(arr.value_length()) which steps over row indices instead of producing flat-element offsets; change it to generate offsets by multiplying the row index by the fixed element size (i.e. build offsets = (0..=arr.len()).map(|i| i * arr.value_length() as usize).collect::<Vec<_>>() and return Box::new(offsets.into_iter())) so the returned offsets are the start positions in the underlying flat values buffer (e.g., for arr.len()=6,value_length=3 produce [0,3,6,9,12,15,18]). Ensure you update the ArrayWrapper::FixedSizeList arm in the offsets() method accordingly.
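The arithmetic behind the finding can be checked in isolation. A minimal sketch (the function name is illustrative, and plain `usize` values stand in for the Arrow array):

```rust
// Corrected offset computation for a fixed-size list: the flat start of
// row i is i * value_length, so a list of `len` rows has len + 1 offsets.
fn fixed_size_offsets(len: usize, value_length: usize) -> Vec<usize> {
    (0..=len).map(|i| i * value_length).collect()
}

fn main() {
    // The buggy form steps over row indices and stops at len:
    let buggy: Vec<usize> = (0..=6usize).step_by(3).collect();
    println!("{buggy:?}"); // [0, 3, 6] -- only 2 windows for 6 rows
    // The fix multiplies each row index by the element count:
    println!("{:?}", fixed_size_offsets(6, 3)); // [0, 3, 6, 9, 12, 15, 18]
}
```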
🧹 Nitpick comments (1)
datafusion/functions-nested/benches/array_has.rs (1)
435-549: Helper functions are well-designed with deterministic seeding and clear separation of found/not-found scenarios.

Minor observation: the four helpers share identical offset-construction and `ListArray::try_new` boilerplate. Extracting a common `build_list_array(values, num_rows, array_size)` helper could reduce duplication, but this is entirely optional for benchmark code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/functions-nested/benches/array_has.rs` around lines 435 - 549, The four list-array helpers (create_int64_list_array, create_int64_list_array_with_offset, create_string_list_array, create_string_list_array_with_prefix) duplicate the same offsets construction and ListArray::try_new boilerplate; refactor by extracting a helper like build_list_array(values: ArrayRef/impl, num_rows: usize, array_size: usize) that builds the offsets Vec, creates the OffsetBuffer and calls ListArray::try_new (preserving the Field/DataType passed in), then update each create_* function to call build_list_array with their generated values to reduce duplication.
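A rough shape for the duplicated part of those helpers: only the shared offsets construction is sketched below, since the arrow-specific pieces (`OffsetBuffer`, `ListArray::try_new`, the `Field`) are elided here and the name `build_list_offsets` is hypothetical.

```rust
// For num_rows equally sized lists of array_size elements each, the list
// offsets are 0, array_size, 2*array_size, ..., num_rows*array_size
// (Arrow list offsets are i32 for ListArray).
fn build_list_offsets(num_rows: usize, array_size: usize) -> Vec<i32> {
    (0..=num_rows).map(|i| (i * array_size) as i32).collect()
}

fn main() {
    // 3 rows of 4 elements each over a flat values buffer of length 12.
    println!("{:?}", build_list_offsets(3, 4)); // [0, 4, 8, 12]
}
```

Each `create_*` benchmark helper would then only generate its values and pass them, together with these offsets, to the shared `ListArray` construction.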