Fix: ViewType gc on huge batch would produce bad output #8694
.map(|i| unsafe { self.copy_view_to_buffer(i, &mut data_buf) })
.collect();
for view in self.views() {
let len = *view as u32;
This part is slow but correct. I can make it faster (by handling the lengths in groups or batches) if required.
}

#[test]
fn test_gc_huge_array() {
This test requires about 5GiB of memory, which is a lot. I don't know whether it would affect testing on some machines.
The previous code would hit the bug only when the buffer was larger than 4GiB; the current code can be tested with a buffer larger than 2GiB. Personally I think a 2GiB test is fine, but 4GiB is also OK with me. I'll leave the decision to the reviewer.
@alamb Besides, I hit this bug with a 4GiB StringViewArray. arrow-rs treats the offset as u32, but the Arrow standard uses i32, so I limit it to 2GiB. There are other places that use
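To illustrate the u32 vs. i32 mismatch described above, here is a minimal sketch. The function names `to_spec_offset` and `wrapping_offset` are hypothetical and are not part of arrow-rs; they only show what happens to an offset above 2GiB when it is read as a signed 32-bit integer.

```rust
/// Hypothetical helper (not an arrow-rs API): convert a buffer offset to the
/// signed 32-bit form, returning None when the value does not fit.
pub fn to_spec_offset(offset: usize) -> Option<i32> {
    i32::try_from(offset).ok()
}

/// Hypothetical helper: reinterpret a u32 offset with a plain `as` cast.
/// Values above i32::MAX wrap around to negative numbers.
pub fn wrapping_offset(offset: u32) -> i32 {
    offset as i32
}
```

An offset of exactly `i32::MAX` still converts, while anything larger is rejected by the checked conversion and turns negative under the plain cast, which is why a single output buffer has to stay under 2GiB.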
};
vec![gc_copy_group]
};
assert!(gc_copy_groups.len() <= i32::MAX as usize);
This assertion can be removed; I added it only to confirm that it passes.
Which issue does this PR close?
Rationale for this change
Previously, gc() would produce a single buffer. However, for a buffer larger than 2GiB this is buggy, because the buffer offset is a 4-byte signed integer.
What changes are included in this PR?
Add a GcCopyGroup type and perform gc per group.
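As a rough sketch of the grouping idea (not the code in this PR), views can be partitioned greedily so that each group's copied data stays within a byte limit. `CopyGroup` and `plan_copy_groups` are hypothetical names, and the sketch assumes only the view lengths are needed to plan the groups; lengths of 12 bytes or fewer are treated as inlined, so they take no buffer space.

```rust
/// Hypothetical stand-in for the PR's GcCopyGroup: a run of consecutive
/// views whose data is meant to fit in one output buffer.
#[derive(Debug, PartialEq)]
pub struct CopyGroup {
    /// Total bytes of non-inlined view data in this group.
    pub total_bytes: usize,
    /// Number of views (inlined or not) covered by this group.
    pub num_views: usize,
}

/// Greedily partition views into groups of at most `max_bytes` each.
/// A single view longer than `max_bytes` still gets a group of its own.
pub fn plan_copy_groups(view_lens: &[u32], max_bytes: usize) -> Vec<CopyGroup> {
    let mut groups = Vec::new();
    let mut current = CopyGroup { total_bytes: 0, num_views: 0 };
    for &len in view_lens {
        // Views of 12 bytes or fewer are stored inline and need no buffer space.
        let bytes = if len > 12 { len as usize } else { 0 };
        // Start a new group when adding this view would exceed the limit.
        if current.num_views > 0 && current.total_bytes + bytes > max_bytes {
            groups.push(current);
            current = CopyGroup { total_bytes: 0, num_views: 0 };
        }
        current.total_bytes += bytes;
        current.num_views += 1;
    }
    if current.num_views > 0 {
        groups.push(current);
    }
    groups
}
```

With `max_bytes` set to `i32::MAX as usize`, every offset within a group's buffer fits in a signed 32-bit integer; a small limit such as 50 makes the splitting easy to check by hand.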
Are these changes tested?
Yes
Are there any user-facing changes?
gc() may now produce multiple buffers instead of a single one.