Conversation

@mapleFU
Member

@mapleFU mapleFU commented Oct 23, 2025

Which issue does this PR close?

Rationale for this change

Previously, gc() produced a single buffer. For buffers larger than 2GiB this is buggy, because the buffer offset stored in a view is a 4-byte signed integer.
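To illustrate the limit described above, here is a minimal sketch of the range check. `checked_view_offset` is a hypothetical helper written for this note, not a function in arrow-rs; it only shows that an offset past `i32::MAX` (2GiB - 1) cannot be stored in a signed 32-bit field.

```rust
/// Hypothetical helper (not part of arrow-rs): returns the offset as the
/// signed 32-bit value the Arrow view layout stores, or `None` when the
/// offset is too large to fit.
fn checked_view_offset(offset: usize) -> Option<i32> {
    // `i32::try_from` fails for any value above i32::MAX (2^31 - 1).
    i32::try_from(offset).ok()
}
```

With a single output buffer, any value that starts at or beyond byte 2^31 gets `None` here, which is why one buffer is not enough for very large arrays.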

What changes are included in this PR?

Add a GcCopyGroup type, and run gc over each group.
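A rough sketch of the grouping idea, under stated assumptions: `CopyGroup` and `plan_copy_groups` are hypothetical names for illustration, and the real `GcCopyGroup` in this PR may carry different fields. The sketch walks the view lengths in order and starts a new group whenever the next value would push the current buffer past a byte limit.

```rust
/// Hypothetical stand-in for the PR's `GcCopyGroup`; the real fields may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CopyGroup {
    /// Total bytes of out-of-line data this group copies into one buffer.
    total_bytes: usize,
    /// Number of consecutive views assigned to this group.
    num_views: usize,
}

/// Values of at most 12 bytes are stored inline in the view and need no buffer space.
const MAX_INLINE_LEN: u32 = 12;

/// Splits consecutive views into groups whose buffer bytes stay within
/// `max_buffer_bytes` (assumed here to be at most `i32::MAX`).
fn plan_copy_groups(view_lens: &[u32], max_buffer_bytes: usize) -> Vec<CopyGroup> {
    let mut groups = Vec::new();
    let mut current = CopyGroup { total_bytes: 0, num_views: 0 };
    for &len in view_lens {
        // Inline values contribute nothing to the data buffer.
        let needed = if len > MAX_INLINE_LEN { len as usize } else { 0 };
        // Start a new group when this view would overflow the current buffer.
        if current.num_views > 0 && current.total_bytes + needed > max_buffer_bytes {
            groups.push(current);
            current = CopyGroup { total_bytes: 0, num_views: 0 };
        }
        current.total_bytes += needed;
        current.num_views += 1;
    }
    if current.num_views > 0 {
        groups.push(current);
    }
    groups
}
```

For example, lengths `[20, 5, 30, 40]` with a 50-byte limit yield two groups: the first three views (50 buffer bytes, the 5-byte value being inline) and then the last view alone.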

Are these changes tested?

Yes

Are there any user-facing changes?

gc may now produce more than one buffer.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 23, 2025
@mapleFU mapleFU requested a review from alamb October 26, 2025 13:20
@mapleFU mapleFU marked this pull request as ready for review October 26, 2025 13:20
.map(|i| unsafe { self.copy_view_to_buffer(i, &mut data_buf) })
.collect();
for view in self.views() {
let len = *view as u32;
Member Author

This part is slow, but it is correct. I can make it faster (by grouping or batching the lengths) if required.
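For context, a minimal sketch of what the per-view loop computes, assuming views are `u128` values whose low 32 bits hold the length (which is what `*view as u32` reads). `view_len` and `total_buffer_bytes` are hypothetical names for this note, not functions from the PR.

```rust
/// Values of at most 12 bytes are stored inline in the view itself.
const MAX_INLINE_LEN: u32 = 12;

/// Reads the length from the low 32 bits of a 128-bit view.
fn view_len(view: u128) -> u32 {
    view as u32
}

/// Sums the bytes that must be copied into data buffers, skipping inline values.
fn total_buffer_bytes(views: &[u128]) -> usize {
    views
        .iter()
        .map(|v| view_len(*v))
        .filter(|len| *len > MAX_INLINE_LEN)
        .map(|len| len as usize)
        .sum()
}
```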

}

#[test]
fn test_gc_huge_array() {
Member Author

This test requires about 5GiB of memory, which is huge. I don't know whether it would affect testing on some machines.

Member Author

The previous code hit the bug only when the buffer was greater than 4GiB; the current code can be tested with a buffer > 2GiB. Personally I think testing at 2GiB is fine, but 4GiB is also OK with me. I'll leave the decision to the reviewer.

@mapleFU
Member Author

mapleFU commented Oct 26, 2025

@alamb Besides, I hit this bug with a 4GiB StringViewArray. arrow-rs treats the offset as u32, but the Arrow standard uses i32, so I limit it to 2GiB.

There are other places that use u32::MAX in view handling. Should I fix them as well, in a separate patch?

@mapleFU mapleFU changed the title from "View: Fixing gc on huge batch" to "Fix: ViewType gc on huge batch would produce bad output" Oct 26, 2025
};
vec![gc_copy_group]
};
assert!(gc_copy_groups.len() <= i32::MAX as usize);
Member Author

This assertion can be removed; I added it only to make sure it passes.

Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

Array: ViewType gc() would have bug when array sum length exceed i32::MAX

1 participant