Add Array::logical_null_count for inspecting number of null values #6608
I'm not sure about this. In all cases where there can be logical nulls, apart from NullArray, this will involve computing the full logical null mask only to throw it away. This feels like it could be surprising for users, especially given that null_count is precomputed and therefore very cheap. Perhaps we could discuss making is_nullable precise as opposed to best-effort, as IIUC this is what DF is using this method for.
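To make the cost concern concrete, here is a minimal sketch using a toy dictionary-encoded column. It is not the arrow-rs API: `ToyDictionary` and its fields are hypothetical, and the sketch assumes a row is logically null when either its key is null or the dictionary value it points at is null.

```rust
// Toy model for illustration only; none of these names come from arrow-rs.

/// A dictionary-encoded column: `keys[i]` indexes into `values`.
struct ToyDictionary {
    /// `None` marks a physically null key.
    keys: Vec<Option<usize>>,
    /// `None` marks a null entry inside the dictionary values.
    values: Vec<Option<i64>>,
}

impl ToyDictionary {
    /// Physical null count: only null keys are counted. A real array would
    /// cache this number; the toy recounts to stay short.
    fn null_count(&self) -> usize {
        self.keys.iter().filter(|k| k.is_none()).count()
    }

    /// Logical null mask (`true` = null). Building it visits every row and
    /// allocates a fresh buffer.
    fn logical_nulls(&self) -> Vec<bool> {
        self.keys
            .iter()
            .map(|k| match k {
                None => true,
                Some(i) => self.values[*i].is_none(),
            })
            .collect()
    }

    /// Logical null count: builds the whole mask, counts it, then drops it.
    fn logical_null_count(&self) -> usize {
        self.logical_nulls().into_iter().filter(|is_null| *is_null).count()
    }
}
```

With keys `[Some(0), Some(1), None, Some(1)]` and values `[Some(10), None]`, the physical count sees one null while the logical count sees three, and only the latter requires materialising the mask.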
@tustvold thank you for taking time to review this PR!
Good point. This is what callers that need to find out the number of logical nulls have to do today. Having this function on the Array itself allows us to provide a better implementation.
Right, my point is that an accurate logical null count can be very expensive to compute, whereas it is much cheaper to instead determine the existence of any nulls. Whilst this won't serve every use-case, my question is whether DF actually needs accurate null counts all the time, or whether most of the time it is just using them as a proxy for nullability. This in turn determines what we optimise for.
Not all the time, but often enough.
cc @joroKr21
This seems like a good idea. Especially since we have a default impl that should work in most cases. Just one question which is probably my misunderstanding around why you chose to overload the default impl in a few spots.
fn logical_null_count(&self) -> usize {
    self.null_count()
}
Why overload here? Is this more efficient somehow?
Same reasoning as for primitive arrays -- #6608 (comment)
fn logical_null_count(&self) -> usize {
    self.null_count()
}
Why overload here?
To make logical_null_count as performant as null_count for primitive types (where they happen to be equivalent), so that logical_null_count can be used without, or with fewer, performance drawbacks.
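The relationship between the default implementation and the per-type override can be sketched as follows. This is a simplified stand-in, not the arrow-rs trait: `ToyArray`, `ToyPrimitive`, and `count_via_mask` are hypothetical names, and the mask is modelled as a plain `Vec<bool>`.

```rust
// Toy model for illustration only; the real `Array` trait is richer.

trait ToyArray {
    /// Cheap physical null count (cached by the implementor).
    fn null_count(&self) -> usize;

    /// Logical null mask (`true` = null), or `None` when nothing is null.
    fn logical_nulls(&self) -> Option<Vec<bool>>;

    /// Default: derive the count from the logical mask, which works for any
    /// implementor but pays for building the mask.
    fn logical_null_count(&self) -> usize {
        count_via_mask(self)
    }
}

/// The generic path: materialise the mask, then count the set entries.
fn count_via_mask<A: ToyArray + ?Sized>(array: &A) -> usize {
    array
        .logical_nulls()
        .map(|mask| mask.into_iter().filter(|is_null| *is_null).count())
        .unwrap_or(0)
}

struct ToyPrimitive {
    values: Vec<Option<i32>>,
    cached_null_count: usize,
}

impl ToyPrimitive {
    fn new(values: Vec<Option<i32>>) -> Self {
        let cached_null_count = values.iter().filter(|v| v.is_none()).count();
        Self { values, cached_null_count }
    }
}

impl ToyArray for ToyPrimitive {
    fn null_count(&self) -> usize {
        self.cached_null_count
    }

    fn logical_nulls(&self) -> Option<Vec<bool>> {
        if self.cached_null_count == 0 {
            return None;
        }
        Some(self.values.iter().map(|v| v.is_none()).collect())
    }

    /// Override: for primitive data the logical and physical nulls coincide,
    /// so the cached count can be returned without building a mask.
    fn logical_null_count(&self) -> usize {
        self.null_count()
    }
}
```

Both paths return the same number for a primitive array; the override only avoids the allocation and the scan.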
I feel fairly strongly that we should not merge this; people will likely use it blindly without appreciating the severe performance penalty it entails. I think we should instead make is_nullable accurate, and the places that need an accurate null count should compute the logical null mask explicitly.
Thank you @alamb @westonpace @tustvold for your time reviewing this!
@tustvold Can you elaborate where the severe performance penalty comes from and what it would take to fix it? For DataFusion at least the alternative is to all
The problem is for RunArray, DictionaryArray and UnionArray computing
The problem with exposing a logical_null_count method is that it makes the fact this is effectively computing a fresh null mask implicit, hiding this problem. In fact this PR as written actually regresses
Taking a step back, apache/datafusion#13033 is a prime example of a use-case that doesn't actually care what the logical null count is, just whether there are any nulls. With some minor adjustments we could make is_nullable accurate, and this method could just use that.
EDIT: TBC I really dislike the concept of logical nulls. I really wish the arrow specification didn't make the choices it did; UnionArray in particular is extremely perverse, but our hands are somewhat tied by the specification.
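The difference between asking "are there any logical nulls?" and "how many are there?" can be sketched on a toy dictionary column. This is only an illustration under assumed names (`ToyDictionary`, `has_logical_nulls`); it does not show how arrow-rs implements `is_nullable`.

```rust
// Toy model for illustration only; none of these names come from arrow-rs.

struct ToyDictionary {
    /// `None` marks a physically null key.
    keys: Vec<Option<usize>>,
    /// `None` marks a null entry inside the dictionary values.
    values: Vec<Option<i64>>,
}

impl ToyDictionary {
    /// True when the row behind `key` is logically null.
    fn row_is_null(&self, key: &Option<usize>) -> bool {
        match key {
            None => true,
            Some(i) => self.values[*i].is_none(),
        }
    }

    /// Exact existence check: `any` stops at the first logically null row
    /// and never allocates a mask.
    fn has_logical_nulls(&self) -> bool {
        self.keys.iter().any(|k| self.row_is_null(k))
    }

    /// Exact count: has to visit every row.
    fn logical_null_count(&self) -> usize {
        self.keys.iter().filter(|k| self.row_is_null(k)).count()
    }
}
```

Note that a null dictionary value that no key refers to does not make any row null, which is one reason a best-effort check based only on "the values contain a null" can over-report.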
Add a counterpart of `Array::null_count` that counts the logical null values. This will be useful in DataFusion. The current alternative is to compute the null mask (via `Array::logical_nulls()`) and count on it. Given that this might be expensive and verbose, callers may naturally feel steered towards `Array::null_count`, which may or may not be applicable, depending on the context.
I see your point, thanks for explaining this to me. Let's turn the question around. What should the caller do if they want exactly this: to know how many (logical) null values are in the array?
If this is what you need, which it very often isn't, then you have to call
For my sake, this is fine for me. I had found myself needing the logical null count recently (for array statistics) and using
@westonpace good point! This was exactly the case in apache/datafusion#13029 too, but that's not the only place -- DataFusion aggregation accumulators often call
@tustvold I don't mind writing more code (friction), but is this efficient at runtime?
As written in this PR, it will be largely equivalent. Having slept on it, let's just proceed with this. I don't like it, but then I don't like logical nulls in general, but aside from forking the arrow format we're stuck with them. The types it impacts are relatively niche, and if people care to optimise them, they can
Thank you, that makes sense!
Which issue does this PR close?
- `Array` Logical Nullability #4691
- `Array::logical_nulls` #5208

Rationale for this change
#4691 changed the semantics of `Array::null_count` for e.g. `NullArray`. The DataFusion upgrade to an Arrow version with this change introduced a subtle bug, which is being fixed in apache/datafusion#13029. While working on a fix, it seemed that many usages of `Array::null_count` should be redirected to count logical nulls (not only the one being updated in that PR). Having a function to count logical nulls would be useful, as the alternative is computationally more expensive (it may involve creating or copying a null mask).

What changes are included in this PR?
New `Array::logical_null_count` function.

Are there any user-facing changes?
No