Skip to content

Conversation

@Xuanwo
Copy link
Collaborator

@Xuanwo Xuanwo commented Jan 6, 2026

This PR will evolute all_null_layout to constant layout as mentioned in #5631

The logic for checking validity and constant values is a bit clumsy, but I haven't found a better approach yet.


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@Xuanwo Xuanwo requested a review from westonpace January 6, 2026 18:26
@github-actions github-actions bot added the enhancement New feature or request label Jan 6, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2026

Code Review

Overall, this PR cleanly evolves AllNullLayout to ConstantLayout with appropriate backward-compatible protobuf field numbering (slot 2 reused). The implementation is thorough with good test coverage.

P1 Issues

1. Potential integer overflow in is_constant_values (primitive.rs)

In the fixed-width byte comparison path:

let start = (base + i) * byte_width;
if buf[start..start + byte_width] != scalar_bytes[..] {

If base and i are large, and/or byte_width is significant (e.g., 16 bytes for Decimal128), this multiplication could overflow on 32-bit systems. Consider using checked_mul or ensuring the arithmetic cannot overflow via explicit usize bounds.

2. Inconsistent validity handling in encode_constant_page vs detection

The encoder detects constant pages by examining leaf validity via Self::leaf_validity(&repdef, num_values as usize), but the condition for using constant encoding only checks all_null case explicitly:

let all_null = leaf_validity
    .as_ref()
    .map(|validity| validity.count_set_bits() == 0)
    .unwrap_or(false);

When leaf_validity is None (meaning all-valid), it falls through to the constant detection path. This is correct, but the constant detection (find_constant_scalar) then re-derives validity from arrays. Ensure these two validity derivations are semantically consistent across all repdef configurations.

3. Missing validation in decode_scalar_from_inline_value

The function constructs an ArrayData with exactly one buffer (the inline value), but for Boolean data type, Arrow expects a different buffer layout (bit-packed). This could cause issues when decoding boolean constants via inline path. The code should either:

  • Handle boolean specially (bit-pack the single value)
  • Reject boolean inline values in try_inline_value

Minor Suggestions

  • The PR removes all_null_layout and simple_all_null_layout from pb21 but the macro impl_common_protobuf_utils still generates these functions. This leaves dead code that should be cleaned up or the macro simplified.

  • Consider adding a test for boolean constants to verify correct handling through both inline and value buffer paths.

@Xuanwo Xuanwo marked this pull request as draft January 6, 2026 18:34
@codecov
Copy link

codecov bot commented Jan 6, 2026

@Xuanwo Xuanwo marked this pull request as ready for review January 14, 2026 11:17
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty great for a first pass. I have a few thoughts but otherwise this is good!

}

// A layout used for pages where all values are null
// A layout used for pages where all (visible) values are the same scalar value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "visible" mean here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except NULL which is invisible, maybe I should just use non-null values?

Comment on lines +162 to +163
// This MUST only be used for types where a single non-null element is represented by a single
// fixed-width Arrow value buffer (i.e. no offsets buffer, no child data).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a single value that is represented by multiple buffers couldn't we concatenate them with a header that gives us information on how to disassemble them? Or maybe we have a different approach for that case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not support that in this PR. One reason is that it would make the encode/decode logic quite complex. Another reason is that I suspect such data is unlikely to share the same constant value.

//
// Constraints:
// - MUST be absent for an all-null page
// - MUST be <= 32 bytes if present
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? If it is larger than 32 bytes do we put it elsewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if it exceeds 32B, we will place it in a dedicated buffer instead of in the metadata.

The intention here is to avoid bloating our metadata too much. The size is set to the largest fixed data type we support (256B), though I am open to adjusting it.

Comment on lines 3870 to 3875
use arrow_data::transform::MutableArrayData;

let data = array.to_data();
let mut mutable = MutableArrayData::new(vec![&data], /*use_nulls=*/ true, 1);
mutable.extend(0, idx, idx + 1);
Ok(make_array(mutable.freeze()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MutableArrayData is pretty cool. I didn't know it existed. I wonder if we could someday replace some of our data builders with it in the future 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about it either! I learned this from GPT-5.2 😆

Ok(Some(scalar))
}

fn encode_constant_page(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this go in the submodule?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants