ARROW-2744: [C++] Avoid creating list arrays with a null values buffer #2243
Codecov Report
@@ Coverage Diff @@
## master #2243 +/- ##
==========================================
+ Coverage 84.34% 86.78% +2.44%
==========================================
Files 281 237 -44
Lines 43760 41931 -1829
==========================================
- Hits 36909 36390 -519
+ Misses 6820 5541 -1279
+ Partials 31 0 -31
wesm
left a comment
As a matter of usability, I don't know that we should expect all public API users to create a length-0 buffer when there is no data. I believe that code that interacts with the buffers in an array needs to treat length-0 and null equivalently.
A possibly extreme approach to resolve the issue would be to have a length-0 singleton kNullBuffer. But to really sanitize well, we would need to check every buffer going into ArrayData::Make, which would have negative performance implications. We could do the null check in the array containers, e.g. right here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L350.
Allowing the result of values() to be null means more tests to write to make sure that code doesn't break in that edge case.
Another side of this is that for validity bitmaps, it would be incorrect to return a length-0 buffer in the event that there are no nulls, but right now we permit that buffer to be null. In Java they allocate an array of all set bits, which I don't think we should do. So any way you slice it, some code will have to deal with the null buffer case.
My gut feeling is that we should allow the null buffers and document the issue well so that users can defend themselves from untrusted data.
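To make the "treat length-0 and null equivalently" idea concrete, here is a minimal sketch of what defensive consumer code and a length-0 singleton could look like. Everything in it is an assumption for illustration: `Buffer` is a hypothetical stand-in struct rather than Arrow's actual `arrow::Buffer`, and `SafeSize`, `SafeData`, `EmptyBuffer`, and `OrEmpty` are made-up helper names, not existing Arrow APIs.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a buffer type; NOT Arrow's arrow::Buffer.
struct Buffer {
  const uint8_t* data;
  int64_t size;
};

// Number of readable bytes. A null shared_ptr and a zero-length buffer
// both report 0, so callers never need to distinguish the two.
inline int64_t SafeSize(const std::shared_ptr<Buffer>& buf) {
  return buf ? buf->size : 0;
}

// Pointer to readable bytes, or nullptr when there is nothing to read
// (null buffer or zero-length buffer).
inline const uint8_t* SafeData(const std::shared_ptr<Buffer>& buf) {
  return (buf && buf->size > 0) ? buf->data : nullptr;
}

// Length-0 singleton in the spirit of the kNullBuffer idea (hypothetical).
inline const std::shared_ptr<Buffer>& EmptyBuffer() {
  static const std::shared_ptr<Buffer> kEmpty =
      std::make_shared<Buffer>(Buffer{nullptr, 0});
  return kEmpty;
}

// Normalization at a container boundary: replace a null buffer with the
// shared empty one, and pass any non-null buffer through unchanged.
inline std::shared_ptr<Buffer> OrEmpty(std::shared_ptr<Buffer> buf) {
  if (!buf) {
    return EmptyBuffer();
  }
  return buf;
}
```

A reader that goes through `SafeSize`/`SafeData` tolerates both cases, while `OrEmpty` shows the alternative of sanitizing once where the buffer enters a container. This normalization only makes sense for values buffers; as noted above, a null validity bitmap means "no nulls" and must not be replaced by a length-0 buffer.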
I don't think this PR is doing that, except in
I agree. The parquet-cpp issue was already fixed in apache/parquet-cpp#474. However, I think it's also safer to ensure that we don't generate such buffers unintentionally. I don't think it was deliberate for
Yes, I agree for validity bitmaps code will have to deal with it. For actual values it is a bit unexpected, though (at least the person who wrote the parquet-cpp code clearly didn't expect it :-)).
Where would you document it? In ArrayData?
What would you say to adding an option to
Agreed
I think in the APIs where
On this
I'm OK with doing this later. I'll give this another review and merge since I don't think it does anything problematic.
Shoot, I think we need to get ARROW-2822 #2239 in first. Let me see if that's ready to merge.
This has conflicts now; I'm gonna rebase.
wesm
left a comment
+1. Thanks @pitrou!