
18729: Add benchmark for array_has/array_has_all/array_has_any #31

Open

martin-augment wants to merge 9 commits into main from pr-18729-2025-11-16-13-10-59

Conversation

@martin-augment (Owner) commented:

18729: To review by AI


coderabbitai bot commented Nov 16, 2025

Walkthrough

A new Criterion-based benchmarking suite was added for nested array operations in the DataFusion project. The changes include a new benchmark entry in the Cargo.toml file and a comprehensive benchmarking file that tests array_has, array_has_all, and array_has_any functions. The benchmarks cover multiple scenarios with varying array sizes (1, 10, 100, 1000, 10000), special cases like empty and single-element arrays, string array operations, and edge cases. Helper functions organize the benchmarks into logical groups, and test data is constructed using make_array and lit utilities with black_box to prevent compiler optimizations from skewing results.
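For orientation, the suite boils down to benchmarks of roughly this shape; a minimal sketch, assuming the expr_fn wrappers and the helper organization described above (names here are illustrative, not copied from the PR):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use datafusion_expr::lit;
use datafusion_functions_nested::expr_fn::{array_has, make_array};

// Sketch of one benchmark group; the sizes mirror those listed in the walkthrough.
fn bench_array_has_i64(c: &mut Criterion) {
    for size in [1, 10, 100, 1000, 10000] {
        // Haystack of 0..size built via make_array/lit, as described above
        let haystack = make_array((0..size).map(|i| lit(i as i64)).collect());
        // Needle chosen from the middle of the generated range
        let needle = lit(size as i64 / 2);
        c.bench_function(&format!("array_has_i64_found_{size}"), |b| {
            // black_box keeps the optimizer from eliding the call
            b.iter(|| black_box(array_has(haystack.clone(), needle.clone())))
        });
    }
}

criterion_group!(benches, bench_array_has_i64);
criterion_main!(benches);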


📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between 86ebaab and e7a723b.

📒 Files selected for processing (2)
  • datafusion/functions-nested/Cargo.toml (1 hunks)
  • datafusion/functions-nested/benches/array_has.rs (1 hunks)
🔇 Additional comments (5)
datafusion/functions-nested/Cargo.toml (1)

72-74: LGTM! Benchmark entry properly configured.

The new benchmark entry follows the same pattern as existing benchmarks with harness = false, which is correct for Criterion-based benchmarks.
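For reference, a bench entry of this kind in Cargo.toml looks like the following (the criterion dev-dependency is assumed to already be declared, as for the existing benchmarks):

[[bench]]
name = "array_has"
harness = false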

datafusion/functions-nested/benches/array_has.rs (4)

18-26: Well-structured imports for benchmarking.

The imports are appropriate for a Criterion-based benchmark suite. Good use of black_box to prevent compiler optimizations from affecting benchmark results.


28-46: Excellent benchmark coverage and organization.

The benchmark suite comprehensively covers various scenarios: multiple array sizes, integer and string types, and edge cases. The organization into separate helper functions makes the code maintainable.


376-377: Correct Criterion setup.

The criterion_group! and criterion_main! macros are properly configured to execute the benchmark suite.


48-374: Cloning of test expressions is intentional and correct.

The benchmark code is correctly designed. The cloning of list_array and needle/needle_array before each iteration is necessary and intentional—the generated array_has, array_has_all, and array_has_any functions consume their Expr parameters (they take owned values, not references). Cloning is the proper pattern to reuse test expressions across multiple benchmark iterations. The benchmarks are not measuring cloning overhead; they are correctly measuring function performance with properly managed test data.
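To make the ownership point concrete, a minimal sketch, assuming the generated wrapper signatures take Expr by value:

use datafusion_expr::Expr;
use datafusion_functions_nested::expr_fn::array_has;

// array_has consumes both Expr arguments, so reusing prepared expressions
// across iterations requires cloning them each time.
fn reuse_across_iterations(haystack: Expr, needle: Expr) {
    for _ in 0..3 {
        let _expr: Expr = array_has(haystack.clone(), needle.clone());
        // Without the clones, the second pass would not compile: both
        // values were moved into array_has on the first pass.
    }
}

An alternative that keeps the clones out of the measured time, Criterion's iter_batched, comes up later in this thread.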



let needle = lit("TICKER0050");

b.iter(|| black_box(array_has(list_array.clone(), needle.clone())))
});
Cursor Bugbot commented:

Bug: Benchmark Misrepresents Search Performance

The benchmark searches for "TICKER0050" but when size=10, the array only contains TICKER0000 through TICKER0009, so the needle won't be found. This causes the "found" benchmark to actually test a "not found" case for the smallest array size, producing misleading performance measurements.


@martin-augment (Owner Author) replied:

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the needle won't be found when the size of the generated array is 10. The ticker value should be changed to 0005 to fix this issue.
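A minimal sketch of the suggested fix: derive the needle from the generated range so the "found" case holds for every benchmarked size (variable names are illustrative):

// Pick the middle of the generated range instead of hard-coding TICKER0050,
// so the needle exists even when size = 10 (yields TICKER0005 there).
let mid = size / 2;
let needle = lit(format!("TICKER{mid:04}"));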


b.iter(|| black_box(array_has_all(list_array.clone(), needle_array.clone())))
},
);
Cursor Bugbot commented:

Bug: Benchmark Misrepresents "All Found" Case

The benchmark labeled "all_found_small_needle" searches for elements [0, 1, 2] in the haystack. When array_size=1, the haystack only contains [0], so elements 1 and 2 won't be found. This causes the benchmark to test a partial match case instead of the intended "all elements found" case for the smallest array size.


@martin-augment (Owner Author) replied:

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the test is invalid for the case when the array has just one element. In this case it won't be able to find a sub-array with three elements inside it.
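A minimal sketch of one possible fix, capping the needle at the haystack length (names are illustrative):

// Never ask for more elements than the haystack can contain, so the
// "all found" label stays accurate even when array_size = 1.
let needle_len = size.min(3);
let needle_array = make_array((0..needle_len).map(|i| lit(i as i64)).collect());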

@augmentcode bot left a comment:

Review completed. 1 suggestion posted.

.map(|i| lit(format!("TICKER{i:04}")))
.collect::<Vec<_>>();
let list_array = make_array(array);
let needle = lit("TICKER0050");
In the "found" case for string arrays, using "TICKER0050" means size=10 does not actually contain the needle (only 0000..0009), which makes this benchmark mislabeled/misleading for that size (consider adjusting the needle or sizes).


@martin-augment (Owner Author) replied:
value:useful; category:bug; feedback:The Augment AI reviewer is correct that the needle won't be found when the size of the generated array is 10. The ticker value should be changed to 0005 to fix this issue.


claude bot commented Nov 16, 2025

PR Review: Add benchmark for array_has/array_has_all/array_has_any

Overview

This PR adds comprehensive benchmarks for the array_has, array_has_all, and array_has_any functions. The benchmark suite is well-structured and covers a good range of scenarios.

Code Quality ✅

Strengths:

  • ✅ Proper Apache 2.0 license header
  • ✅ Good use of criterion benchmark groups for organization
  • ✅ Comprehensive coverage of different scenarios (found at start, found at end, not found, edge cases)
  • ✅ Proper use of black_box to prevent compiler optimizations
  • ✅ Well-commented sections explaining what each benchmark tests
  • ✅ Appropriate Cargo.toml configuration matching existing benchmarks

Potential Issues ⚠️

1. Benchmark Design - Expression Building vs Runtime Performance

The benchmarks are currently measuring expression construction rather than actual function execution. This is a critical issue:

Current approach:

b.iter(|| black_box(array_has(list_array.clone(), needle.clone())))

This measures the time to build the expression tree (array_has() returns an Expr), not the time to execute the array search operation.

Comparison with other benchmarks:

  • array_reverse.rs (line 32-34) calls array_reverse_inner() which actually executes the operation
  • map.rs (line 114) uses invoke_with_args() to execute the UDF

Recommendation: You should either:

  1. Call the actual implementation functions (array_has_inner_for_scalar, array_has_all_inner, etc.) from datafusion/functions-nested/src/array_has.rs
  2. Use the invoke_with_args pattern like in map.rs to execute the UDF

2. Data Realism

The benchmarks use sequential integer arrays (0..size), which may not represent realistic data distributions:

let array = (0..size).map(|i| lit(i as i64)).collect::<Vec<_>>();

Suggestions:

  • Add benchmarks with randomized data
  • Consider duplicates (you have one test for this, which is good)
  • Test with different data distributions (skewed, uniform, etc.)
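For the randomized-data suggestion, a minimal sketch using a seeded generator so runs stay reproducible (assumes the rand 0.8 API; the seed and value range are arbitrary choices):

use rand::{rngs::StdRng, Rng, SeedableRng};

// Seeded RNG: different from the sequential 0..size pattern, but stable
// across benchmark runs so results remain comparable.
let mut rng = StdRng::seed_from_u64(42);
let array = (0..size)
    .map(|_| lit(rng.gen_range(0..1_000_i64)))
    .collect::<Vec<_>>();
let list_array = make_array(array);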

3. Memory and Cloning Overhead

Each iteration clones the arrays:

b.iter(|| black_box(array_has(list_array.clone(), needle.clone())))

If the benchmark is actually executing (after fixing issue #1), the cloning overhead might dominate small array operations. Consider using references where possible or mention this in documentation.
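One way to keep the clone out of the measured time is Criterion's iter_batched, which runs setup per iteration but excludes it from the measurement; a minimal sketch, reusing the names from the snippet above:

use criterion::BatchSize;

// The setup closure produces fresh owned inputs for each iteration;
// only the routine closure is timed.
b.iter_batched(
    || (list_array.clone(), needle.clone()),
    |(haystack, needle)| black_box(array_has(haystack, needle)),
    BatchSize::SmallInput,
);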

Performance Considerations 🚀

Good aspects:

  • ✅ Tests multiple array sizes (1, 10, 100, 1000, 10000)
  • ✅ Tests both i64 and string types
  • ✅ Covers early exit scenarios
  • ✅ Tests edge cases (empty arrays, single elements, duplicates)

Suggestions:

  1. Consider adding benchmarks for:

    • Different element types (floats, nested arrays)
    • Very large arrays (100k+ elements) to test scalability
    • Arrays with high null density
  2. String benchmarks could be more realistic:

    • Current: TICKER0000 format is good for uniform length
    • Consider: varying string lengths, unicode strings
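A minimal sketch of deterministic, varying-length strings in place of the fixed-width ticker pattern (the length cycle is an arbitrary choice):

// Lengths cycle from 1 to 32 characters; still reproducible, but no
// longer uniform-width like TICKER0000.
let array = (0..size)
    .map(|i| lit("x".repeat(1 + (i as usize) % 32)))
    .collect::<Vec<_>>();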

Security Concerns 🔒

No security issues identified. The benchmarks don't execute arbitrary code or handle untrusted input.

Test Coverage 📊

Excellent coverage of scenarios:

  • ✅ Best/worst/average cases
  • ✅ Empty arrays and null handling
  • ✅ Different needle sizes (small, medium, ~10% of haystack)
  • ✅ Early exit optimization testing
  • ✅ String-specific benchmarks

Missing (nice-to-have):

  • Integration with actual Arrow arrays (not just expression building)
  • Benchmarks comparing different array types (List vs FixedSizeList) like array_reverse.rs does

Best Practices Alignment 📋

Comparing with existing benchmarks:

| Aspect | This PR | array_reverse.rs | map.rs |
| --- | --- | --- | --- |
| License header | ✅ | ✅ | ✅ |
| extern crate | ✅ | ✅ | ✅ |
| Criterion setup | ✅ | ✅ | ✅ |
| black_box usage | ✅ | ✅ | ✅ |
| Actual execution | ❌ | ✅ | ✅ |

Recommendations

Critical (must fix):

  1. Fix the benchmark to measure actual execution instead of expression building. Reference array_reverse.rs:32-34 or map.rs:114 for patterns.

Important:

  1. Consider building arrays using Arrow's array builders instead of the expression API for more realistic benchmarks
  2. Add a comment explaining what aspect of performance each benchmark group is measuring

Nice to have:

  1. Add randomized data scenarios
  2. Test with different array types (List, LargeList, FixedSizeList)
  3. Document expected performance characteristics in comments

Example Fix

Here's how you might fix the core issue:

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, ListArray};
use arrow::datatypes::Int64Type;
use criterion::{black_box, BenchmarkId, Criterion};
// Assumes the inner kernel is (or is made) public and takes
// (&ArrayRef, &dyn Datum) -> Result<ArrayRef>; adjust to the actual API.
use datafusion_functions_nested::array_has::array_has_inner_for_scalar;

fn bench_array_has(c: &mut Criterion, array_size: usize) {
    let mut group = c.benchmark_group("array_has_i64");

    group.bench_with_input(
        BenchmarkId::new("found_at_start", array_size),
        &array_size,
        |b, &size| {
            // Build an actual Arrow ListArray holding one row of 0..size
            let list_array: ArrayRef = Arc::new(
                ListArray::from_iter_primitive::<Int64Type, _, _>(vec![Some(
                    (0..size as i64).map(Some).collect::<Vec<_>>(),
                )]),
            );
            // Scalar needle matching the first element, so the search exits early
            let needle = Int64Array::new_scalar(0);

            b.iter(|| {
                black_box(array_has_inner_for_scalar(&list_array, &needle).unwrap())
            })
        },
    );
    group.finish();
}

Conclusion

This is a solid benchmark suite with excellent scenario coverage. The main issue is that it's currently benchmarking expression construction rather than execution. Once that's fixed, this will be a valuable addition to the project!

Overall Assessment: Needs revision before merge due to the execution vs construction issue, but the structure and coverage are excellent.

@martin-augment (Owner Author) commented:

1. Benchmark Design - Expression Building vs Runtime Performance

The benchmarks are currently measuring expression construction rather than actual function execution. This is a critical issue:

Current approach:

b.iter(|| black_box(array_has(list_array.clone(), needle.clone())))

This measures the time to build the expression tree (array_has() returns an Expr), not the time to execute the array search operation.

value:useful; category:bug; feedback:The Claude AI reviewer is correct that the benchmarks are not realistic because they do not test the execution of the built expressions. The other benchmarks (array_reverse.rs and map.rs) do test the execution part too.

@martin-augment (Owner Author) commented:

Data Realism

The benchmarks use sequential integer arrays (0..size), which may not represent realistic data distributions:

let array = (0..size).map(|i| lit(i as i64)).collect::<Vec<_>>();

Suggestions:

  • Add benchmarks with randomized data
  • Consider duplicates (you have one test for this, which is good)
  • Test with different data distributions (skewed, uniform, etc.)

value:useful; category:bug; feedback:The Claude AI reviewer is correct that the generated data is not very common in real life. The tests could be made more realistic by using other stable/reproducible distributions.

@martin-augment (Owner Author) commented:

3. Memory and Cloning Overhead

Each iteration clones the arrays:

b.iter(|| black_box(array_has(list_array.clone(), needle.clone())))

If the benchmark is actually executing (after fixing issue #1), the cloning overhead might dominate small array operations. Consider using references where possible or mention this in documentation.

value:useful; category:bug; feedback:The Claude AI reviewer is correct that the benchmarks could be improved by passing references to the array and the needle because the cloning could take more time than the tested operation when the array size is not big enough.
