Skip to content

Comments

20102: Fix array_repeat handling of null count values#223

Open
martin-augment wants to merge 3 commits intomainfrom
pr-20102-2026-02-02-08-16-58
Open

20102: Fix array_repeat handling of null count values#223
martin-augment wants to merge 3 commits intomainfrom
pr-20102-2026-02-02-08-16-58

Conversation

@martin-augment
Copy link
Owner

20102: To review by AI

@coderabbitai
Copy link

coderabbitai bot commented Feb 2, 2026

Walkthrough

The ArrayRepeat UDF is refactored to use a coercible signature path instead of the previous user-defined approach, enabling support for both exact Any and implicit Int64 input types. The implementation transitions from UInt64 to Int64 count arrays, with updated logic to handle negative counts as zero and propagate null values through per-element validity buffers. A helper function get_count_with_validity is introduced to manage count and validity extraction. The test suite is expanded with comprehensive coverage for array operations, including null count handling, nested list structures, and edge cases across various array types.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-20102-2026-02-02-08-16-58

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the array_repeat function in DataFusion to handle NULL and negative count values more robustly and in line with SQL standards. The changes involve updating the function's signature to use Int64 for the count argument, implementing proper null propagation, and ensuring that negative counts produce empty arrays. These modifications enhance the function's correctness and predictability across different input scenarios, supported by comprehensive new test cases.

Highlights

  • Corrected array_repeat null handling: The array_repeat function now correctly propagates NULL when the count argument is NULL, aligning with standard SQL semantics. Previously, this might have resulted in an empty array.
  • Negative count handling: Negative count values for array_repeat now consistently result in an empty array, ensuring predictable behavior.
  • Type coercion for count argument: The count argument for array_repeat is now explicitly coerced to Int64 using a coercible signature, simplifying internal logic and improving type safety.
  • Improved nullability propagation: The internal implementations of general_repeat and general_list_repeat now correctly track and propagate nullability from the count array to the output array's null buffer.
  • New helper function for count validity: A get_count_with_validity helper function was introduced to centralize the logic for extracting the count and its validity from an Int64Array, handling null and non-positive values.
  • Expanded test coverage: New SQL logic tests were added to cover various scenarios involving NULL and negative counts for both simple and nested arrays, including LargeList types, ensuring the new behavior is thoroughly validated.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude
Copy link

claude bot commented Feb 2, 2026

Pull Request Review: Fix array_repeat handling of null count values

Summary

This PR fixes the array_repeat function to properly handle NULL count values by returning NULL when the count is NULL, and empty arrays when the count is negative. The implementation switches from using UInt64 to Int64 for the count parameter to support negative values and updates the type coercion mechanism.


Strengths

  1. Correct NULL semantics: The change properly implements SQL NULL semantics - when count is NULL, the result is NULL (not an empty array).

  2. Comprehensive test coverage: Excellent test additions covering NULL count values for simple/nested arrays, negative count values, edge cases with empty arrays, and both List and LargeList variants.

  3. Type system improvement: Using the new Signature::coercible with implicit coercion is more idiomatic than the manual coerce_types implementation.

  4. Proper null propagation: The get_count_with_validity helper correctly distinguishes between NULL counts (returns false validity) and non-positive counts (returns true validity with 0 count).


Code Quality & Best Practices

Good:

  • Clean separation of concerns with the get_count_with_validity helper function
  • Proper null buffer tracking in both general_repeat and general_list_repeat
  • Consistent handling across List and LargeList variants

Minor observations:

  • The unused import UInt64Array remains on line 23 (can be removed)
  • The code properly handles negative counts by treating them as 0, which is reasonable behavior

Potential Issues - CRITICAL

Integer Overflow on repeat.rs:185-189:

When count values are extremely large (e.g., i64::MAX), the cast to usize and subsequent sum could overflow. This could lead to:

  1. Memory allocation with incorrect size
  2. Index out of bounds errors
  3. Potential panics

Recommendation: Add validation to ensure counts are within reasonable bounds (e.g., MAX i32). Similar bounds checking exists in other DataFusion array functions for safety.


Performance Considerations

  1. Pre-allocation efficiency: The code properly pre-calculates capacities in general_list_repeat (lines 240-251), which is good for performance.

  2. Iteration efficiency: In general_list_repeat, the capacity calculation iterates once, then the main loop iterates again. This is acceptable given the need for pre-allocation.

  3. Memory efficiency: The change from skipping null checks to explicitly tracking nulls adds a small overhead but is necessary for correctness.


Security Concerns

Medium severity: No explicit bounds checking on count values could allow memory exhaustion attacks by passing very large count values, potentially leading to DoS via resource exhaustion.

Recommendation: Add bounds checking for count values.


Test Coverage - Excellent

The tests verify all critical scenarios including NULL counts, negative counts, zero counts, NULL elements, both List/LargeList variants, and edge cases with empty arrays.

Suggested additional test (optional): Very large count values to verify bounds checking (once implemented)


Summary & Recommendation

Overall assessment: This is a well-implemented bug fix with excellent test coverage. The primary concern is the lack of bounds checking on count values which could lead to integer overflow and resource exhaustion.

Recommendation: Approve with changes

Required changes:

  1. Add bounds checking for count values to prevent overflow/resource exhaustion
  2. Remove unused UInt64Array import

Optional improvements:

  1. Update documentation to explicitly mention NULL and negative count behavior

The core logic is sound and the NULL semantics are correctly implemented. With bounds checking added, this would be production-ready.


Files reviewed:

  • datafusion/functions-nested/src/repeat.rs: Implementation
  • datafusion/sqllogictest/test_files/array.slt: Tests

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix the behavior of the array_repeat function for NULL and negative count values, improve type handling using Signature::coercible, and includes comprehensive tests. However, it introduces several Denial of Service (DoS) vectors due to unchecked integer arithmetic and unwrap() calls on type conversions, which can lead to crashes with crafted inputs. Additionally, there's a minor bug in the general_list_repeat pre-allocation logic that doesn't correctly handle NULL counts.

Comment on lines +240 to 271
let mut outer_total = 0usize;
let mut inner_total = 0usize;
for (i, &c) in counts.iter().enumerate() {
if c > 0 {
outer_total += c as usize;
if !list_array.is_null(i) {
let len = list_offsets[i + 1].to_usize().unwrap()
- list_offsets[i].to_usize().unwrap();
inner_total += len * (c as usize);
}
}
}

// Build inner structures
let mut inner_offsets = Vec::with_capacity(outer_total + 1);
let mut take_indices = Vec::with_capacity(inner_total);
let mut inner_nulls = BooleanBufferBuilder::new(outer_total);
let mut outer_nulls = BooleanBufferBuilder::new(count_array.len());
let mut inner_running = 0usize;
inner_offsets.push(O::zero());

for (row_idx, &count) in counts.iter().enumerate() {
let is_valid = !list_array.is_null(row_idx);
for row_idx in 0..count_array.len() {
let (count, count_is_valid) = get_count_with_validity(count_array, row_idx);
outer_nulls.append(count_is_valid);
let list_is_valid = !list_array.is_null(row_idx);
let start = list_offsets[row_idx].to_usize().unwrap();
let end = list_offsets[row_idx + 1].to_usize().unwrap();
let row_len = end - start;

for _ in 0..count {
inner_running += row_len;
inner_offsets.push(O::from_usize(inner_running).unwrap());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The general_list_repeat function is vulnerable to a Denial of Service (DoS) attack due to unchecked integer arithmetic and panics. Specifically, the calculation of inner_total (line 248) can overflow usize, and the conversion of inner_running to the offset type O (line 271) uses unwrap(), which will panic if the total number of nested elements exceeds the capacity of the offset type. These issues can lead to crashes with crafted inputs. Furthermore, the capacity pre-allocation logic for inner_total (around line 248) is incorrect as counts derived from count_array.values() doesn't account for nulls in count_array, leading to incorrect capacity calculation. It's recommended to use checked arithmetic (checked_mul, checked_add) and return an error for overflows, and iterate over count_array directly to correctly handle nulls for capacity calculation.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Prevents an application crash under the following conditions:

  1. if the input values is ListArray, then its offset type will be i32
  2. and if the count value is bigger than i32::MAX
  3. then i32::from_usize(i32::MAX + 1) will return None and it will panic

Comment on lines +185 to 201
let total_repeated_values: usize = count_array
.values()
.iter()
.map(|&c| if c > 0 { c as usize } else { 0 })
.sum();

let mut take_indices = Vec::with_capacity(total_repeated_values);
let mut nulls = BooleanBufferBuilder::new(count_array.len());
let mut offsets = Vec::with_capacity(count_array.len() + 1);
offsets.push(O::zero());
let mut running_offset = 0usize;

for (idx, &count) in count_array.values().iter().enumerate() {
let count = count as usize;
for idx in 0..count_array.len() {
let (count, is_valid) = get_count_with_validity(count_array, idx);

running_offset += count;
offsets.push(O::from_usize(running_offset).unwrap());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The general_repeat function is vulnerable to a Denial of Service (DoS) attack via integer overflow and subsequent panic.

  1. The summation of total_repeated_values (lines 185-189) uses the sum() method on an iterator of usize. If the total count exceeds usize::MAX, this will cause a panic in debug builds.
  2. The conversion of running_offset to the offset type O (line 201) uses unwrap(). If O is i32 (used for ListArray) and the total number of elements exceeds i32::MAX, O::from_usize will return None, causing a panic.

An attacker can trigger these panics by providing large values in the count argument of the array_repeat function.

Recommendation: Use checked_add for the summation and handle the None case of from_usize by returning an error instead of calling unwrap().

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Prevents an application crash under the following conditions:

  1. if the input values is ListArray, then its offset type will be i32
  2. and if the count value is bigger than i32::MAX
  3. then i32::from_usize(i32::MAX + 1) will return None and it will panic

@augmentcode
Copy link

augmentcode bot commented Feb 2, 2026

🤖 Augment PR Summary

Summary: This PR fixes array_repeat behavior when the count argument contains NULLs so results become NULL rather than empty arrays.

Changes:

  • Switches the UDF signature to a coercible signature that coerces integer-like count inputs to Int64
  • Updates the implementation to read count as Int64Array and interpret NULL counts as “row is NULL”
  • Treats non-positive counts (0 / negative) as producing empty repeated results
  • Propagates NULL-ness at the outer list level via constructed NullBuffers
  • Adds SQLLogicTest coverage for NULL/negative counts across scalar, List, and LargeList inputs (including nested lists)

Technical Notes: The implementation now builds explicit validity buffers for the outer repeated list to distinguish NULL count from count=0, and adds edge-case tests for empty arrays and NULL elements.

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

// Build offsets and take_indices
let total_repeated_values: usize =
count_array.values().iter().map(|&c| c as usize).sum();
let total_repeated_values: usize = count_array
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int64Array::values() includes undefined values for NULL slots, so using it to compute total_repeated_values (and similarly in general_list_repeat) can cause huge over-allocation when count has NULLs; consider iterating with validity (count_array.iter() / is_null) and treating NULL as 0. This is especially relevant now that NULL counts are explicitly supported.

Other Locations
  • datafusion/functions-nested/src/repeat.rs:236
  • datafusion/functions-nested/src/repeat.rs:242

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The Arrow specification allows undefined values at the null position, so the proposed implementation may overestimate the sum. Only the non-null values should be summed.

(0, false)
} else {
let c = count_array.value(idx);
if c > 0 { (c as usize, true) } else { (0, true) }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_count_with_validity clamps negative counts to 0, but the doc comment above general_repeat still states the assumption count[i] >= 0; consider updating that comment (and any user-facing docs if applicable) to match the new semantics.

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:annoying; category:documentation; feedback: The Augment AI reviewer is not correct! The docstring at line 172 still states that count[i] should be a positive value and get_count_with_validity() checks exactly that - if the count is a negative value then it is clamped to 0, so it fulfils the pre-condition.

@martin-augment
Copy link
Owner Author

Minor observations:

  • The unused import UInt64Array remains on line 23 (can be removed)

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The UInt64Array is still used at lines 209 and 282. Removing the import will break the compilation.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@datafusion/functions-nested/src/repeat.rs`:
- Around line 185-189: total_repeated_values is computed using
count_array.values(), which reads raw buffer including null slots and overstates
the sum; change the computation to iterate get_count_with_validity(count_array)
(the same helper used later in the loop) so you only sum valid counts (e.g., map
the returned (count, valid) pairs and sum count when valid). Update the
total_repeated_values calculation to use get_count_with_validity(count_array)
and mirror the existing null-check logic used in the later loop to avoid
over-allocation.
- Around line 236-251: The preallocation loop is using count_array.values()
which ignores nulls; change it to iterate by index and call
get_count_with_validity(i, &count_array) (the same helper used in the main loop)
so null counts and non-positive counts are skipped when computing outer_total
and inner_total; use list_array.value_offsets() and list_array.is_null(i) as
before to compute each element length only for valid counts and valid list
entries, updating outer_total and inner_total accordingly (this ensures
outer_total/inner_total match the main processing logic).
🧹 Nitpick comments (1)
datafusion/sqllogictest/test_files/array.slt (1)

3259-3277: Add a NULL+NULL precedence test for array_repeat.

You cover NULL count and NULL element separately, but not both together. Adding (NULL, NULL) clarifies which NULL wins (likely count → NULL) and prevents future regressions.

➕ Proposed test addition
 CREATE TABLE array_repeat_null_count_table
 AS VALUES
   (1, 2),
   (2, null),
   (3, 1),
-  (4, -1);
+  (4, -1),
+  (NULL, NULL);

 ...
 ----
 1 [1, 1]
 2 NULL
 3 [3]
 4 []
+NULL NULL

Comment on lines +185 to +189
let total_repeated_values: usize = count_array
.values()
.iter()
.map(|&c| if c > 0 { c as usize } else { 0 })
.sum();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '175,200p'

Repository: martin-augment/datafusion

Length of output: 1049


🏁 Script executed:

rg -A 10 -B 10 "Int64Array|values\(\)" datafusion/functions-nested/src/repeat.rs | head -100

Repository: martin-augment/datafusion

Length of output: 3232


🏁 Script executed:

rg "get_count_with_validity" datafusion/

Repository: martin-augment/datafusion

Length of output: 438


🏁 Script executed:

rg -A 5 "\.values\(\).*null" datafusion/ -t rust | head -50

Repository: martin-augment/datafusion

Length of output: 3944


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '1,190p' | tail -50

Repository: martin-augment/datafusion

Length of output: 1685


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '240,280p'

Repository: martin-augment/datafusion

Length of output: 1909


🏁 Script executed:

rg -A 10 "fn get_count_with_validity" datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 325


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '220,250p'

Repository: martin-augment/datafusion

Length of output: 1244


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '240,260p'

Repository: martin-augment/datafusion

Length of output: 1000


🏁 Script executed:

tail -50 datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 1670


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '260,265p'

Repository: martin-augment/datafusion

Length of output: 403


🏁 Script executed:

rg -B 5 -A 5 "count_array\.values\(\)" datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 420


Use get_count_with_validity() instead of values() to properly sum only valid counts.

Calling Int64Array::values() returns the raw buffer including undefined values at null positions. This inflates total_repeated_values and causes unnecessary over-allocation. The loop at line 198 already uses get_count_with_validity() for correct handling; apply the same approach here.

🔧 Suggested fix
-    let total_repeated_values: usize = count_array
-        .values()
-        .iter()
-        .map(|&c| if c > 0 { c as usize } else { 0 })
-        .sum();
+    let total_repeated_values: usize = (0..count_array.len())
+        .map(|i| get_count_with_validity(count_array, i).0)
+        .sum();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let total_repeated_values: usize = count_array
.values()
.iter()
.map(|&c| if c > 0 { c as usize } else { 0 })
.sum();
let total_repeated_values: usize = (0..count_array.len())
.map(|i| get_count_with_validity(count_array, i).0)
.sum();
🤖 Prompt for AI Agents
In `@datafusion/functions-nested/src/repeat.rs` around lines 185 - 189,
total_repeated_values is computed using count_array.values(), which reads raw
buffer including null slots and overstates the sum; change the computation to
iterate get_count_with_validity(count_array) (the same helper used later in the
loop) so you only sum valid counts (e.g., map the returned (count, valid) pairs
and sum count when valid). Update the total_repeated_values calculation to use
get_count_with_validity(count_array) and mirror the existing null-check logic
used in the later loop to avoid over-allocation.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The Arrow specification allows undefined values at the null position, so the proposed implementation may overestimate the sum. Only the non-null values should be summed.

Comment on lines 236 to +251
let counts = count_array.values();
let list_offsets = list_array.value_offsets();

// calculate capacities for pre-allocation
let outer_total = counts.iter().map(|&c| c as usize).sum();
let inner_total = counts
.iter()
.enumerate()
.filter(|&(i, _)| !list_array.is_null(i))
.map(|(i, &c)| {
let len = list_offsets[i + 1].to_usize().unwrap()
- list_offsets[i].to_usize().unwrap();
len * (c as usize)
})
.sum();
let mut outer_total = 0usize;
let mut inner_total = 0usize;
for (i, &c) in counts.iter().enumerate() {
if c > 0 {
outer_total += c as usize;
if !list_array.is_null(i) {
let len = list_offsets[i + 1].to_usize().unwrap()
- list_offsets[i].to_usize().unwrap();
inner_total += len * (c as usize);
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '220,270p'

Repository: martin-augment/datafusion

Length of output: 2215


🏁 Script executed:

rg -n "fn get_count_with_validity" datafusion/functions-nested/src/

Repository: martin-augment/datafusion

Length of output: 198


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '309,325p'

Repository: martin-augment/datafusion

Length of output: 381


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '1,50p'

Repository: martin-augment/datafusion

Length of output: 2244


🏁 Script executed:

rg -A 10 "fn general_list_repeat" datafusion/functions-nested/src/repeat.rs | head -30

Repository: martin-augment/datafusion

Length of output: 457


🏁 Script executed:

rg -B 5 -A 15 "for row_idx in 0..count_array.len()" datafusion/functions-nested/src/repeat.rs | head -40

Repository: martin-augment/datafusion

Length of output: 973


🏁 Script executed:

rg -n "NULL\|null" datafusion/functions-nested/src/repeat.rs | grep -i test

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

rg -B 5 -A 20 "#\[test\]" datafusion/functions-nested/src/repeat.rs | head -60

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

rg -n "test" datafusion/functions-nested/src/repeat.rs

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

fd test datafusion/functions-nested/src/ --type f

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

rg -l "array_repeat\|repeat" datafusion/functions-nested/ --type rs | head -10

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

find datafusion/functions-nested -name "*.rs" -type f | xargs grep -l "array_repeat" | head -10

Repository: martin-augment/datafusion

Length of output: 201


🏁 Script executed:

cat -n datafusion/functions-nested/benches/array_repeat.rs | head -100

Repository: martin-augment/datafusion

Length of output: 4598


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '230,280p'

Repository: martin-augment/datafusion

Length of output: 2269


🏁 Script executed:

cat > /tmp/test_arrow_null.py << 'EOF'
# This is a test to understand how Arrow handles NULL values
# Let's check what values() returns for a nullable Int64Array

# In Arrow, when you have an Int64Array with NULL values:
# - The values() method returns the underlying buffer
# - The validity bitmap is separate
# - The raw buffer value for a NULL slot is typically undefined or 0
# - But iteration over raw values doesn't respect the null bitmap

# The issue is:
# 1. count_array.values() gets the raw buffer values
# 2. Iteration happens on all indices, including NULL ones
# 3. If a NULL value's buffer contains a positive number, it would be counted
# 4. The main loop properly uses get_count_with_validity which returns (0, false) for NULLs

# So the review is correct: preallocation should skip NULL entries
# The fix should check validity before using the count value
EOF
cat /tmp/test_arrow_null.py

Repository: martin-augment/datafusion

Length of output: 903


🏁 Script executed:

cat -n datafusion/functions-nested/src/repeat.rs | sed -n '236,316p'

Repository: martin-augment/datafusion

Length of output: 3517


Preallocation loop should skip null counts to match the main processing logic.

The preallocation loop uses count_array.values() which bypasses the null bitmap, causing it to count NULL slots if their buffer happens to contain a positive value. This leads to over-allocation of inner_offsets, take_indices, and inner_nulls. The main processing loop correctly handles NULLs via get_count_with_validity, so the preallocation should do the same for consistency and efficiency.

Use get_count_with_validity (already defined in the function) to skip null and non-positive entries:

Suggested fix
-    let counts = count_array.values();
     let list_offsets = list_array.value_offsets();

     // calculate capacities for pre-allocation
     let mut outer_total = 0usize;
     let mut inner_total = 0usize;
-    for (i, &c) in counts.iter().enumerate() {
-        if c > 0 {
-            outer_total += c as usize;
-            if !list_array.is_null(i) {
-                let len = list_offsets[i + 1].to_usize().unwrap()
-                    - list_offsets[i].to_usize().unwrap();
-                inner_total += len * (c as usize);
-            }
-        }
-    }
+    for i in 0..count_array.len() {
+        let (count, is_valid) = get_count_with_validity(count_array, i);
+        if !is_valid || count == 0 {
+            continue;
+        }
+        outer_total += count;
+        if !list_array.is_null(i) {
+            let len = list_offsets[i + 1].to_usize().unwrap()
+                - list_offsets[i].to_usize().unwrap();
+            inner_total += len * count;
+        }
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let counts = count_array.values();
let list_offsets = list_array.value_offsets();
// calculate capacities for pre-allocation
let outer_total = counts.iter().map(|&c| c as usize).sum();
let inner_total = counts
.iter()
.enumerate()
.filter(|&(i, _)| !list_array.is_null(i))
.map(|(i, &c)| {
let len = list_offsets[i + 1].to_usize().unwrap()
- list_offsets[i].to_usize().unwrap();
len * (c as usize)
})
.sum();
let mut outer_total = 0usize;
let mut inner_total = 0usize;
for (i, &c) in counts.iter().enumerate() {
if c > 0 {
outer_total += c as usize;
if !list_array.is_null(i) {
let len = list_offsets[i + 1].to_usize().unwrap()
- list_offsets[i].to_usize().unwrap();
inner_total += len * (c as usize);
}
}
}
let list_offsets = list_array.value_offsets();
// calculate capacities for pre-allocation
let mut outer_total = 0usize;
let mut inner_total = 0usize;
for i in 0..count_array.len() {
let (count, is_valid) = get_count_with_validity(count_array, i);
if !is_valid || count == 0 {
continue;
}
outer_total += count;
if !list_array.is_null(i) {
let len = list_offsets[i + 1].to_usize().unwrap()
- list_offsets[i].to_usize().unwrap();
inner_total += len * count;
}
}
🤖 Prompt for AI Agents
In `@datafusion/functions-nested/src/repeat.rs` around lines 236 - 251, The
preallocation loop is using count_array.values() which ignores nulls; change it
to iterate by index and call get_count_with_validity(i, &count_array) (the same
helper used in the main loop) so null counts and non-positive counts are skipped
when computing outer_total and inner_total; use list_array.value_offsets() and
list_array.is_null(i) as before to compute each element length only for valid
counts and valid list entries, updating outer_total and inner_total accordingly
(this ensures outer_total/inner_total match the main processing logic).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The Arrow specification allows undefined values at the null position, so the proposed implementation may overestimate the sum. Only the non-null values should be summed.

@martin-augment
Copy link
Owner Author

Integer Overflow on repeat.rs:185-189:

When count values are extremely large (e.g., i64::MAX), the cast to usize and subsequent sum could overflow. This could lead to:

  1. Memory allocation with incorrect size
  2. Index out of bounds errors
  3. Potential panics

Recommendation: Add validation to ensure counts are within reasonable bounds (e.g., MAX i32). Similar bounds checking exists in other DataFusion array functions for safety.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The casting of the i64 value to usize may lead to overflow on a 32-bit system. Extra check should be added that the value is not bigger than i32::MAX. Prevents calculating wrong values on 32-bit systems, e.g. WASM32

@martin-augment
Copy link
Owner Author

3259-3277: Add a NULL+NULL precedence test for array_repeat.

You cover NULL count and NULL element separately, but not both together. Adding (NULL, NULL) clarifies which NULL wins (likely count → NULL) and prevents future regressions.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Having a test that uses NULL for both the the values array and the count will be helpful to document what result should be expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants