feat: evolute all_null_layout to constant layout #5641
base: main
Code Review
Overall, this PR cleanly evolves
P1 Issues
1. Potential integer overflow in
In the fixed-width byte comparison path:
let start = (base + i) * byte_width;
if buf[start..start + byte_width] != scalar_bytes[..] {
If
2. Inconsistent validity handling in
The encoder detects constant pages by examining leaf validity via
let all_null = leaf_validity
    .as_ref()
    .map(|validity| validity.count_set_bits() == 0)
    .unwrap_or(false);
When
3. Missing validation in
The function constructs an
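To illustrate the overflow concern in the first issue, here is a minimal sketch of the same fixed-width comparison written with checked arithmetic. The function name `all_rows_equal`, its signature, and the choice to return `None` on overflow or out-of-range access are assumptions made for illustration; they are not taken from the PR.

```rust
/// Returns Some(true) when every `byte_width`-sized value in rows
/// `base..base + len` of `buf` equals `scalar_bytes`.
/// Returns None if an offset overflows or falls outside `buf`
/// (hypothetical error handling; the PR may report errors differently).
fn all_rows_equal(
    buf: &[u8],
    base: usize,
    len: usize,
    byte_width: usize,
    scalar_bytes: &[u8],
) -> Option<bool> {
    if byte_width == 0 || scalar_bytes.len() != byte_width {
        return None;
    }
    for i in 0..len {
        // checked_* yields None instead of wrapping or panicking on overflow.
        let start = base.checked_add(i)?.checked_mul(byte_width)?;
        let end = start.checked_add(byte_width)?;
        // `get` yields None for an out-of-range slice instead of panicking.
        let value = buf.get(start..end)?;
        if value != scalar_bytes {
            return Some(false);
        }
    }
    Some(true)
}
```

Whether the real code path can actually reach such offsets depends on page size limits that this sketch does not model.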
Minor Suggestions
westonpace left a comment
This looks pretty great for a first pass. I have a few thoughts but otherwise this is good!
}

// A layout used for pages where all values are null
// A layout used for pages where all (visible) values are the same scalar value.
What does "visible" mean here?
It means everything except NULL, which is invisible. Maybe I should just say "non-null values"?
// This MUST only be used for types where a single non-null element is represented by a single
// fixed-width Arrow value buffer (i.e. no offsets buffer, no child data).
If we have a single value that is represented by multiple buffers couldn't we concatenate them with a header that gives us information on how to disassemble them? Or maybe we have a different approach for that case?
I did not support that in this PR. One reason is that it would make the encode/decode logic quite complex. Another reason is that I suspect such data is unlikely to share the same constant value.
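For reference, the "concatenate with a header" idea raised above could look roughly like the sketch below. The wire format (a little-endian `u32` buffer count, then one `u32` length per buffer, then the buffer bytes) and the function names are purely hypothetical; nothing like this is part of the PR.

```rust
use std::convert::{TryFrom, TryInto};

/// Packs several buffers into one: a u32 count, then a u32 length per buffer,
/// then the buffer bytes back to back (all little-endian). Hypothetical format.
fn pack_buffers(buffers: &[&[u8]]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    out.extend_from_slice(&u32::try_from(buffers.len()).ok()?.to_le_bytes());
    for buf in buffers {
        out.extend_from_slice(&u32::try_from(buf.len()).ok()?.to_le_bytes());
    }
    for buf in buffers {
        out.extend_from_slice(buf);
    }
    Some(out)
}

/// Reads a little-endian u32 at `pos`, or None if it is out of range.
fn read_u32(bytes: &[u8], pos: usize) -> Option<usize> {
    let raw: [u8; 4] = bytes.get(pos..pos.checked_add(4)?)?.try_into().ok()?;
    Some(u32::from_le_bytes(raw) as usize)
}

/// Reverses `pack_buffers`; returns None on truncated or malformed input.
fn unpack_buffers(packed: &[u8]) -> Option<Vec<Vec<u8>>> {
    let count = read_u32(packed, 0)?;
    // The data section starts right after the count and the length table.
    let mut data_pos = 4usize.checked_add(count.checked_mul(4)?)?;
    let mut out = Vec::new();
    for i in 0..count {
        let len = read_u32(packed, 4 + i * 4)?;
        let end = data_pos.checked_add(len)?;
        out.push(packed.get(data_pos..end)?.to_vec());
        data_pos = end;
    }
    Some(out)
}
```

The extra header parsing gives a sense of the added encode/decode complexity mentioned in the reply.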
//
// Constraints:
// - MUST be absent for an all-null page
// - MUST be <= 32 bytes if present
Why? If it is larger than 32 bytes do we put it elsewhere?
Yes, if it exceeds 32B, we will place it in a dedicated buffer instead of in the metadata.
The intention here is to avoid bloating our metadata too much. The size is set to the largest fixed-width data type we support (256 bits, i.e. 32 bytes), though I am open to adjusting it.
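The placement rule described in this reply can be sketched as follows. The enum, the constant name, and the function are hypothetical stand-ins rather than the PR's actual types; only the 32-byte threshold and the "absent for an all-null page" rule come from the thread.

```rust
/// Limit quoted in the review thread: values up to 32 bytes stay inline.
const MAX_INLINE_VALUE_BYTES: usize = 32;

/// Hypothetical placement of the constant value; these names are not the PR's.
#[derive(Debug, PartialEq)]
enum ConstantValuePlacement {
    /// All-null page: no value is stored at all.
    Absent,
    /// Small value stored directly in the page metadata.
    Inline(Vec<u8>),
    /// Larger value written to a dedicated buffer instead.
    DedicatedBuffer(Vec<u8>),
}

/// Chooses where the constant value lives based on its encoded size.
fn place_constant_value(value: Option<&[u8]>) -> ConstantValuePlacement {
    match value {
        None => ConstantValuePlacement::Absent,
        Some(bytes) if bytes.len() <= MAX_INLINE_VALUE_BYTES => {
            ConstantValuePlacement::Inline(bytes.to_vec())
        }
        Some(bytes) => ConstantValuePlacement::DedicatedBuffer(bytes.to_vec()),
    }
}
```

Keeping the threshold in one named constant would make it easy to adjust if the limit is revisited.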
use arrow_data::transform::MutableArrayData;

let data = array.to_data();
let mut mutable = MutableArrayData::new(vec![&data], /*use_nulls=*/ true, 1);
mutable.extend(0, idx, idx + 1);
Ok(make_array(mutable.freeze()))
MutableArrayData is pretty cool. I didn't know it existed. I wonder if we could someday replace some of our data builders with it in the future 🤔
I didn't know about it either! I learned this from GPT-5.2 😆
Ok(Some(scalar))
}

fn encode_constant_page(
Could this go in the submodule?
Yep, will do.
This PR evolves all_null_layout into a constant layout, as mentioned in #5631.
The logic for checking validity and constant values is a bit clumsy, but I haven't found a better approach yet.
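As a simplified picture of the validity and constant-value check described above, the sketch below classifies a page of optional values. It works on `Option<T>` slices rather than Arrow arrays and validity bitmaps, and `PageKind` and `classify_page` are hypothetical names, so it only approximates what the encoder has to do.

```rust
/// Hypothetical classification of a page; the PR's own types may differ.
#[derive(Debug, PartialEq)]
enum PageKind<T> {
    /// No non-null value exists (this sketch also reports empty pages here).
    AllNull,
    /// Every non-null value equals this scalar; nulls are ignored.
    Constant(T),
    /// At least two distinct non-null values exist.
    Mixed,
}

fn classify_page<T: PartialEq + Clone>(values: &[Option<T>]) -> PageKind<T> {
    // Iterate over the non-null values only.
    let mut non_null = values.iter().flatten();
    let first = match non_null.next() {
        Some(v) => v,
        None => return PageKind::AllNull,
    };
    // `all` stops at the first value that differs from `first`.
    if non_null.all(|v| v == first) {
        PageKind::Constant(first.clone())
    } else {
        PageKind::Mixed
    }
}
```

A single pass with early exit keeps the check cheap for pages that are not constant.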
Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.