20102: Fix `array_repeat` handling of null count values by martin-augment · Pull Request #223 · martin-augment/datafusion

martin-augment · 2026-02-02T08:17:02Z

20102: To review by AI

coderabbitai · 2026-02-02T08:17:23Z

Walkthrough

The ArrayRepeat UDF is refactored to use a coercible signature path instead of the previous user-defined approach, enabling support for both exact Any and implicit Int64 input types. The implementation transitions from UInt64 to Int64 count arrays, with updated logic to handle negative counts as zero and propagate null values through per-element validity buffers. A helper function get_count_with_validity is introduced to manage count and validity extraction. The test suite is expanded with comprehensive coverage for array operations, including null count handling, nested list structures, and edge cases across various array types.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pr-20102-2026-02-02-08-16-58

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2026-02-02T08:17:24Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the array_repeat function in DataFusion to handle NULL and negative count values more robustly and in line with SQL standards. The changes involve updating the function's signature to use Int64 for the count argument, implementing proper null propagation, and ensuring that negative counts produce empty arrays. These modifications enhance the function's correctness and predictability across different input scenarios, supported by comprehensive new test cases.

Highlights

Corrected array_repeat null handling: The array_repeat function now correctly propagates NULL when the count argument is NULL, aligning with standard SQL semantics. Previously, this might have resulted in an empty array.
Negative count handling: Negative count values for array_repeat now consistently result in an empty array, ensuring predictable behavior.
Type coercion for count argument: The count argument for array_repeat is now explicitly coerced to Int64 using a coercible signature, simplifying internal logic and improving type safety.
Improved nullability propagation: The internal implementations of general_repeat and general_list_repeat now correctly track and propagate nullability from the count array to the output array's null buffer.
New helper function for count validity: A get_count_with_validity helper function was introduced to centralize the logic for extracting the count and its validity from an Int64Array, handling null and non-positive values.
Expanded test coverage: New SQL logic tests were added to cover various scenarios involving NULL and negative counts for both simple and nested arrays, including LargeList types, ensuring the new behavior is thoroughly validated.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

claude · 2026-02-02T08:20:13Z

Pull Request Review: Fix array_repeat handling of null count values

Summary

This PR fixes the array_repeat function to properly handle NULL count values by returning NULL when the count is NULL, and empty arrays when the count is negative. The implementation switches from using UInt64 to Int64 for the count parameter to support negative values and updates the type coercion mechanism.

Strengths

Correct NULL semantics: The change properly implements SQL NULL semantics - when count is NULL, the result is NULL (not an empty array).
Comprehensive test coverage: Excellent test additions covering NULL count values for simple/nested arrays, negative count values, edge cases with empty arrays, and both List and LargeList variants.
Type system improvement: Using the new Signature::coercible with implicit coercion is more idiomatic than the manual coerce_types implementation.
Proper null propagation: The get_count_with_validity helper correctly distinguishes between NULL counts (returns false validity) and non-positive counts (returns true validity with 0 count).

Code Quality & Best Practices

Good:

Clean separation of concerns with the get_count_with_validity helper function
Proper null buffer tracking in both general_repeat and general_list_repeat
Consistent handling across List and LargeList variants

Minor observations:

The unused import UInt64Array remains on line 23 (can be removed)
The code properly handles negative counts by treating them as 0, which is reasonable behavior

Potential Issues - CRITICAL

Integer Overflow on repeat.rs:185-189:

When count values are extremely large (e.g., i64::MAX), the cast to usize and subsequent sum could overflow. This could lead to:

Memory allocation with incorrect size
Index out of bounds errors
Potential panics

Recommendation: Add validation to ensure counts are within reasonable bounds (e.g., MAX i32). Similar bounds checking exists in other DataFusion array functions for safety.

Performance Considerations

Pre-allocation efficiency: The code properly pre-calculates capacities in general_list_repeat (lines 240-251), which is good for performance.
Iteration efficiency: In general_list_repeat, the capacity calculation iterates once, then the main loop iterates again. This is acceptable given the need for pre-allocation.
Memory efficiency: The change from skipping null checks to explicitly tracking nulls adds a small overhead but is necessary for correctness.

Security Concerns

Medium severity: No explicit bounds checking on count values could allow memory exhaustion attacks by passing very large count values, potentially leading to DoS via resource exhaustion.

Recommendation: Add bounds checking for count values.

Test Coverage - Excellent

The tests verify all critical scenarios including NULL counts, negative counts, zero counts, NULL elements, both List/LargeList variants, and edge cases with empty arrays.

Suggested additional test (optional): Very large count values to verify bounds checking (once implemented)

Summary & Recommendation

Overall assessment: This is a well-implemented bug fix with excellent test coverage. The primary concern is the lack of bounds checking on count values which could lead to integer overflow and resource exhaustion.

Recommendation: Approve with changes

Required changes:

Add bounds checking for count values to prevent overflow/resource exhaustion
Remove unused UInt64Array import

Optional improvements:

Update documentation to explicitly mention NULL and negative count behavior

The core logic is sound and the NULL semantics are correctly implemented. With bounds checking added, this would be production-ready.

Files reviewed:

datafusion/functions-nested/src/repeat.rs: Implementation
datafusion/sqllogictest/test_files/array.slt: Tests

gemini-code-assist

Code Review

This pull request aims to fix the behavior of the array_repeat function for NULL and negative count values, improve type handling using Signature::coercible, and includes comprehensive tests. However, it introduces several Denial of Service (DoS) vectors due to unchecked integer arithmetic and unwrap() calls on type conversions, which can lead to crashes with crafted inputs. Additionally, there's a minor bug in the general_list_repeat pre-allocation logic that doesn't correctly handle NULL counts.

gemini-code-assist · 2026-02-02T08:20:56Z

datafusion/functions-nested/src/repeat.rs

+    let mut outer_total = 0usize;
+    let mut inner_total = 0usize;
+    for (i, &c) in counts.iter().enumerate() {
+        if c > 0 {
+            outer_total += c as usize;
+            if !list_array.is_null(i) {
+                let len = list_offsets[i + 1].to_usize().unwrap()
+                    - list_offsets[i].to_usize().unwrap();
+                inner_total += len * (c as usize);
+            }
+        }
+    }

    // Build inner structures
    let mut inner_offsets = Vec::with_capacity(outer_total + 1);
    let mut take_indices = Vec::with_capacity(inner_total);
    let mut inner_nulls = BooleanBufferBuilder::new(outer_total);
+    let mut outer_nulls = BooleanBufferBuilder::new(count_array.len());
    let mut inner_running = 0usize;
    inner_offsets.push(O::zero());

-    for (row_idx, &count) in counts.iter().enumerate() {
-        let is_valid = !list_array.is_null(row_idx);
+    for row_idx in 0..count_array.len() {
+        let (count, count_is_valid) = get_count_with_validity(count_array, row_idx);
+        outer_nulls.append(count_is_valid);
+        let list_is_valid = !list_array.is_null(row_idx);
        let start = list_offsets[row_idx].to_usize().unwrap();
        let end = list_offsets[row_idx + 1].to_usize().unwrap();
        let row_len = end - start;

        for _ in 0..count {
            inner_running += row_len;
            inner_offsets.push(O::from_usize(inner_running).unwrap());


The general_list_repeat function is vulnerable to a Denial of Service (DoS) attack due to unchecked integer arithmetic and panics. Specifically, the calculation of inner_total (line 248) can overflow usize, and the conversion of inner_running to the offset type O (line 271) uses unwrap(), which will panic if the total number of nested elements exceeds the capacity of the offset type. These issues can lead to crashes with crafted inputs. Furthermore, the capacity pre-allocation logic for inner_total (around line 248) is incorrect as counts derived from count_array.values() doesn't account for nulls in count_array, leading to incorrect capacity calculation. It's recommended to use checked arithmetic (checked_mul, checked_add) and return an error for overflows, and iterate over count_array directly to correctly handle nulls for capacity calculation.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Prevents an application crash under the following conditions:

if the input values is ListArray, then its offset type will be i32

and if the count value is bigger than i32::MAX

then i32::from_usize(i32::MAX + 1) will return None and it will panic

gemini-code-assist · 2026-02-02T08:20:56Z

datafusion/functions-nested/src/repeat.rs

+    let total_repeated_values: usize = count_array
+        .values()
+        .iter()
+        .map(|&c| if c > 0 { c as usize } else { 0 })
+        .sum();
+
    let mut take_indices = Vec::with_capacity(total_repeated_values);
+    let mut nulls = BooleanBufferBuilder::new(count_array.len());
    let mut offsets = Vec::with_capacity(count_array.len() + 1);
    offsets.push(O::zero());
    let mut running_offset = 0usize;

-    for (idx, &count) in count_array.values().iter().enumerate() {
-        let count = count as usize;
+    for idx in 0..count_array.len() {
+        let (count, is_valid) = get_count_with_validity(count_array, idx);
+
        running_offset += count;
        offsets.push(O::from_usize(running_offset).unwrap());


The general_repeat function is vulnerable to a Denial of Service (DoS) attack via integer overflow and subsequent panic.

The summation of total_repeated_values (lines 185-189) uses the sum() method on an iterator of usize. If the total count exceeds usize::MAX, this will cause a panic in debug builds.

The conversion of running_offset to the offset type O (line 201) uses unwrap(). If O is i32 (used for ListArray) and the total number of elements exceeds i32::MAX, O::from_usize will return None, causing a panic.

An attacker can trigger these panics by providing large values in the count argument of the array_repeat function.

Recommendation: Use checked_add for the summation and handle the None case of from_usize by returning an error instead of calling unwrap().

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Prevents an application crash under the following conditions:

if the input values is ListArray, then its offset type will be i32

and if the count value is bigger than i32::MAX

then i32::from_usize(i32::MAX + 1) will return None and it will panic

augmentcode · 2026-02-02T08:22:45Z

🤖 Augment PR Summary

Summary: This PR fixes array_repeat behavior when the count argument contains NULLs so results become NULL rather than empty arrays.

Changes:

Switches the UDF signature to a coercible signature that coerces integer-like count inputs to Int64
Updates the implementation to read count as Int64Array and interpret NULL counts as “row is NULL”
Treats non-positive counts (0 / negative) as producing empty repeated results
Propagates NULL-ness at the outer list level via constructed NullBuffers
Adds SQLLogicTest coverage for NULL/negative counts across scalar, List, and LargeList inputs (including nested lists)

Technical Notes: The implementation now builds explicit validity buffers for the outer repeated list to distinguish NULL count from count=0, and adds edge-case tests for empty arrays and NULL elements.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-02-02T08:22:46Z

datafusion/functions-nested/src/repeat.rs

-    // Build offsets and take_indices
-    let total_repeated_values: usize =
-        count_array.values().iter().map(|&c| c as usize).sum();
+    let total_repeated_values: usize = count_array


Int64Array::values() includes undefined values for NULL slots, so using it to compute total_repeated_values (and similarly in general_list_repeat) can cause huge over-allocation when count has NULLs; consider iterating with validity (count_array.iter() / is_null) and treating NULL as 0. This is especially relevant now that NULL counts are explicitly supported.

Other Locations

datafusion/functions-nested/src/repeat.rs:236

datafusion/functions-nested/src/repeat.rs:242

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The Arrow specification allows undefined values at the null position, so the proposed implementation may overestimate the sum. Only the non-null values should be summed.

augmentcode · 2026-02-02T08:22:46Z

datafusion/functions-nested/src/repeat.rs

+        (0, false)
+    } else {
+        let c = count_array.value(idx);
+        if c > 0 { (c as usize, true) } else { (0, true) }


get_count_with_validity clamps negative counts to 0, but the doc comment above general_repeat still states the assumption count[i] >= 0; consider updating that comment (and any user-facing docs if applicable) to match the new semantics.

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

value:annoying; category:documentation; feedback: The Augment AI reviewer is not correct! The docstring at line 172 still states that count[i] should be a positive value and get_count_with_validity() checks exactly that - if the count is a negative value then it is clamped to 0, so it fulfils the pre-condition.

martin-augment · 2026-02-02T08:24:07Z

Minor observations:

The unused import UInt64Array remains on line 23 (can be removed)

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The UInt64Array is still used at lines 209 and 282. Removing the import will break the compilation.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@datafusion/functions-nested/src/repeat.rs`:
- Around line 185-189: total_repeated_values is computed using
count_array.values(), which reads raw buffer including null slots and overstates
the sum; change the computation to iterate get_count_with_validity(count_array)
(the same helper used later in the loop) so you only sum valid counts (e.g., map
the returned (count, valid) pairs and sum count when valid). Update the
total_repeated_values calculation to use get_count_with_validity(count_array)
and mirror the existing null-check logic used in the later loop to avoid
over-allocation.
- Around line 236-251: The preallocation loop is using count_array.values()
which ignores nulls; change it to iterate by index and call
get_count_with_validity(i, &count_array) (the same helper used in the main loop)
so null counts and non-positive counts are skipped when computing outer_total
and inner_total; use list_array.value_offsets() and list_array.is_null(i) as
before to compute each element length only for valid counts and valid list
entries, updating outer_total and inner_total accordingly (this ensures
outer_total/inner_total match the main processing logic).

🧹 Nitpick comments (1)

datafusion/sqllogictest/test_files/array.slt (1)
3259-3277: Add a NULL+NULL precedence test for array_repeat.

You cover NULL count and NULL element separately, but not both together. Adding (NULL, NULL) clarifies which NULL wins (likely count → NULL) and prevents future regressions.
➕ Proposed test addition
 CREATE TABLE array_repeat_null_count_table
 AS VALUES
   (1, 2),
   (2, null),
   (3, 1),
-  (4, -1);
+  (4, -1),
+  (NULL, NULL);

 ...
 ----
 1 [1, 1]
 2 NULL
 3 [3]
 4 []
+NULL NULL

coderabbitai · 2026-02-02T08:26:47Z

datafusion/functions-nested/src/repeat.rs

+    let total_repeated_values: usize = count_array
+        .values()
+        .iter()
+        .map(|&c| if c > 0 { c as usize } else { 0 })
+        .sum();


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '175,200p'

Repository: martin-augment/datafusion

Length of output: 1049

🏁 Script executed:

rg -A 10 -B 10 "Int64Array|values\(\)" datafusion/functions-nested/src/repeat.rs | head -100

Repository: martin-augment/datafusion

Length of output: 3232

🏁 Script executed:

rg "get_count_with_validity" datafusion/

Repository: martin-augment/datafusion

Length of output: 438

🏁 Script executed:

rg -A 5 "\.values\(\).*null" datafusion/ -t rust | head -50

Repository: martin-augment/datafusion

Length of output: 3944

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '1,190p' | tail -50

Repository: martin-augment/datafusion

Length of output: 1685

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '240,280p'

Repository: martin-augment/datafusion

Length of output: 1909

🏁 Script executed:

rg -A 10 "fn get_count_with_validity" datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 325

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '220,250p'

Repository: martin-augment/datafusion

Length of output: 1244

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '240,260p'

Repository: martin-augment/datafusion

Length of output: 1000

🏁 Script executed:

tail -50 datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 1670

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '260,265p'

Repository: martin-augment/datafusion

Length of output: 403

🏁 Script executed:

rg -B 5 -A 5 "count_array\.values\(\)" datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 420

Use get_count_with_validity() instead of values() to properly sum only valid counts.

Calling Int64Array::values() returns the raw buffer including undefined values at null positions. This inflates total_repeated_values and causes unnecessary over-allocation. The loop at line 198 already uses get_count_with_validity() for correct handling; apply the same approach here.

🔧 Suggested fix

- let total_repeated_values: usize = count_array - .values() - .iter() - .map(|&c| if c > 0 { c as usize } else { 0 }) - .sum(); + let total_repeated_values: usize = (0..count_array.len()) + .map(|i| get_count_with_validity(count_array, i).0) + .sum();

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let total_repeated_values: usize = count_array

.values()

.iter()

.map(|&c| if c > 0 { c as usize } else { 0 })

.sum();

let total_repeated_values: usize = (0..count_array.len())

.map(|i| get_count_with_validity(count_array, i).0)

.sum();

🤖 Prompt for AI Agents

In `@datafusion/functions-nested/src/repeat.rs` around lines 185 - 189, total_repeated_values is computed using count_array.values(), which reads raw buffer including null slots and overstates the sum; change the computation to iterate get_count_with_validity(count_array) (the same helper used later in the loop) so you only sum valid counts (e.g., map the returned (count, valid) pairs and sum count when valid). Update the total_repeated_values calculation to use get_count_with_validity(count_array) and mirror the existing null-check logic used in the later loop to avoid over-allocation.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The Arrow specification allows undefined values at the null position, so the proposed implementation may overestimate the sum. Only the non-null values should be summed.

coderabbitai · 2026-02-02T08:26:48Z

datafusion/functions-nested/src/repeat.rs

    let counts = count_array.values();
    let list_offsets = list_array.value_offsets();

    // calculate capacities for pre-allocation
-    let outer_total = counts.iter().map(|&c| c as usize).sum();
-    let inner_total = counts
-        .iter()
-        .enumerate()
-        .filter(|&(i, _)| !list_array.is_null(i))
-        .map(|(i, &c)| {
-            let len = list_offsets[i + 1].to_usize().unwrap()
-                - list_offsets[i].to_usize().unwrap();
-            len * (c as usize)
-        })
-        .sum();
+    let mut outer_total = 0usize;
+    let mut inner_total = 0usize;
+    for (i, &c) in counts.iter().enumerate() {
+        if c > 0 {
+            outer_total += c as usize;
+            if !list_array.is_null(i) {
+                let len = list_offsets[i + 1].to_usize().unwrap()
+                    - list_offsets[i].to_usize().unwrap();
+                inner_total += len * (c as usize);
+            }
+        }
+    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '220,270p'

Repository: martin-augment/datafusion

Length of output: 2215

🏁 Script executed:

rg -n "fn get_count_with_validity" datafusion/functions-nested/src/

Repository: martin-augment/datafusion

Length of output: 198

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '309,325p'

Repository: martin-augment/datafusion

Length of output: 381

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '1,50p'

Repository: martin-augment/datafusion

Length of output: 2244

🏁 Script executed:

rg -A 10 "fn general_list_repeat" datafusion/functions-nested/src/repeat.rs | head -30

Repository: martin-augment/datafusion

Length of output: 457

🏁 Script executed:

rg -B 5 -A 15 "for row_idx in 0..count_array.len()" datafusion/functions-nested/src/repeat.rs | head -40

Repository: martin-augment/datafusion

Length of output: 973

🏁 Script executed:

rg -n "NULL\|null" datafusion/functions-nested/src/repeat.rs | grep -i test

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

rg -B 5 -A 20 "#\[test\]" datafusion/functions-nested/src/repeat.rs | head -60

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

rg -n "test" datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

fd test datafusion/functions-nested/src/ --type f

Repository: martin-augment/datafusion

Length of output: 51

🏁 Script executed:

rg -l "array_repeat\|repeat" datafusion/functions-nested/ --type rs | head -10

Repository: martin-augment/datafusion

Length of output: 95

🏁 Script executed:

find datafusion/functions-nested -name "*.rs" -type f | xargs grep -l "array_repeat" | head -10

Repository: martin-augment/datafusion

Length of output: 201

🏁 Script executed:

cat -n datafusion/functions-nested/benches/array_repeat.rs | head -100

Repository: martin-augment/datafusion

Length of output: 4598

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '230,280p'

Repository: martin-augment/datafusion

Length of output: 2269

🏁 Script executed:

cat > /tmp/test_arrow_null.py << 'EOF' # This is a test to understand how Arrow handles NULL values # Let's check what values() returns for a nullable Int64Array # In Arrow, when you have an Int64Array with NULL values: # - The values() method returns the underlying buffer # - The validity bitmap is separate # - The raw buffer value for a NULL slot is typically undefined or 0 # - But iteration over raw values doesn't respect the null bitmap # The issue is: # 1. count_array.values() gets the raw buffer values # 2. Iteration happens on all indices, including NULL ones # 3. If a NULL value's buffer contains a positive number, it would be counted # 4. The main loop properly uses get_count_with_validity which returns (0, false) for NULLs # So the review is correct: preallocation should skip NULL entries # The fix should check validity before using the count value EOF cat /tmp/test_arrow_null.py

Repository: martin-augment/datafusion

Length of output: 903

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '236,316p'

Repository: martin-augment/datafusion

Length of output: 3517

Preallocation loop should skip null counts to match the main processing logic.

The preallocation loop uses count_array.values() which bypasses the null bitmap, causing it to count NULL slots if their buffer happens to contain a positive value. This leads to over-allocation of inner_offsets, take_indices, and inner_nulls. The main processing loop correctly handles NULLs via get_count_with_validity, so the preallocation should do the same for consistency and efficiency.

Use get_count_with_validity (already defined in the function) to skip null and non-positive entries:

Suggested fix

- let counts = count_array.values(); let list_offsets = list_array.value_offsets(); // calculate capacities for pre-allocation let mut outer_total = 0usize; let mut inner_total = 0usize; - for (i, &c) in counts.iter().enumerate() { - if c > 0 { - outer_total += c as usize; - if !list_array.is_null(i) { - let len = list_offsets[i + 1].to_usize().unwrap() - - list_offsets[i].to_usize().unwrap(); - inner_total += len * (c as usize); - } - } - } + for i in 0..count_array.len() { + let (count, is_valid) = get_count_with_validity(count_array, i); + if !is_valid || count == 0 { + continue; + } + outer_total += count; + if !list_array.is_null(i) { + let len = list_offsets[i + 1].to_usize().unwrap() + - list_offsets[i].to_usize().unwrap(); + inner_total += len * count; + } + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let counts = count_array.values();

let list_offsets = list_array.value_offsets();

// calculate capacities for pre-allocation

let outer_total = counts.iter().map(|&c| c as usize).sum();

let inner_total = counts

.iter()

.enumerate()

.filter(|&(i, _)| !list_array.is_null(i))

.map(|(i, &c)| {

let len = list_offsets[i + 1].to_usize().unwrap()

- list_offsets[i].to_usize().unwrap();

len * (c as usize)

})

.sum();

let mut outer_total = 0usize;

let mut inner_total = 0usize;

for (i, &c) in counts.iter().enumerate() {

if c > 0 {

outer_total += c as usize;

if !list_array.is_null(i) {

let len = list_offsets[i + 1].to_usize().unwrap()

- list_offsets[i].to_usize().unwrap();

inner_total += len * (c as usize);

}

}

}

let list_offsets = list_array.value_offsets();

// calculate capacities for pre-allocation

let mut outer_total = 0usize;

let mut inner_total = 0usize;

for i in 0..count_array.len() {

let (count, is_valid) = get_count_with_validity(count_array, i);

if !is_valid || count == 0 {

continue;

}

outer_total += count;

if !list_array.is_null(i) {

let len = list_offsets[i + 1].to_usize().unwrap()

- list_offsets[i].to_usize().unwrap();

inner_total += len * count;

}

}

🤖 Prompt for AI Agents

In `@datafusion/functions-nested/src/repeat.rs` around lines 236 - 251, The preallocation loop is using count_array.values() which ignores nulls; change it to iterate by index and call get_count_with_validity(i, &count_array) (the same helper used in the main loop) so null counts and non-positive counts are skipped when computing outer_total and inner_total; use list_array.value_offsets() and list_array.is_null(i) as before to compute each element length only for valid counts and valid list entries, updating outer_total and inner_total accordingly (this ensures outer_total/inner_total match the main processing logic).

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The Arrow specification allows undefined values at the null position, so the proposed implementation may overestimate the sum. Only the non-null values should be summed.

martin-augment · 2026-02-02T08:28:56Z

Integer Overflow on repeat.rs:185-189:

When count values are extremely large (e.g., i64::MAX), the cast to usize and subsequent sum could overflow. This could lead to:

Memory allocation with incorrect size

Index out of bounds errors

Potential panics

Recommendation: Add validation to ensure counts are within reasonable bounds (e.g., MAX i32). Similar bounds checking exists in other DataFusion array functions for safety.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The casting of the i64 value to usize may lead to overflow on a 32-bit system. Extra check should be added that the value is not bigger than i32::MAX. Prevents calculating wrong values on 32-bit systems, e.g. WASM32

martin-augment · 2026-02-02T12:30:53Z

3259-3277: Add a NULL+NULL precedence test for array_repeat.

You cover NULL count and NULL element separately, but not both together. Adding (NULL, NULL) clarifies which NULL wins (likely count → NULL) and prevents future regressions.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Having a test that uses NULL for both the the values array and the count will be helpful to document what result should be expected.

lyne7-sc and others added 3 commits February 1, 2026 12:47

fix array_repeat

35f018f

Merge branch 'apache:main' into fix/array_repeat

97225fd

add slt

fc12c78

gemini-code-assist bot reviewed Feb 2, 2026

View reviewed changes

augmentcode bot reviewed Feb 2, 2026

View reviewed changes

coderabbitai bot reviewed Feb 2, 2026

View reviewed changes

Comments

Conversation

martin-augment commented Feb 2, 2026

Uh oh!

coderabbitai bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

gemini-code-assist bot commented Feb 2, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

claude bot commented Feb 2, 2026

Pull Request Review: Fix array_repeat handling of null count values

Summary

Strengths

Code Quality & Best Practices

Potential Issues - CRITICAL

Performance Considerations

Security Concerns

Test Coverage - Excellent

Summary & Recommendation

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot commented Feb 2, 2026

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment commented Feb 2, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

martin-augment commented Feb 2, 2026

Uh oh!

martin-augment commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

coderabbitai bot commented Feb 2, 2026 •

edited

Loading