Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Array::logical_null_count for inspecting number of null values #6608

Merged
merged 1 commit into from
Oct 22, 2024

Conversation

findepi
Copy link
Member

@findepi findepi commented Oct 21, 2024

Add counter-part of Array::null_count, but counting the logical null values. This will be useful in DataFusion. Current alternative is to compute null mask (via Array::logical_nulls()) and do counting on it. Given this might be expensive and verbose, caller may naturally feel steer towards Array::null_count which may or may not be applicable, depending on the context.

Which issue does this PR close?

Rationale for this change

#4691 changed semantics of Array::null_count for eg NullArray. DataFusion upgrade to Arrow version with this change introduced a subtle bug, being fixed in apache/datafusion#13029. When working on a fix, it seemed that many usages of Array::null_count should be redirect to count logical nulls (not only the one being updated in that PR). Having a function to count logical nulls would be useful, as alternative is computationally more expensive (may involve creation or copying of a null mask).

What changes are included in this PR?

New Array::logical_null_count function.

Are there any user-facing changes?

No

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Oct 21, 2024
@findepi
Copy link
Member Author

findepi commented Oct 21, 2024

@alamb @tustvold please take a look

@tustvold
Copy link
Contributor

tustvold commented Oct 21, 2024

I'm not sure about this, in all cases where there can be logical nulls, apart from NullArray, this will involve computing the full logical null mask only to throw it away. This feels like it could be surprising for users, especially given null_count is precomputed and therefore very cheap.

Perhaps we could discuss making is_nullable precise as opposed to best-effort, as IIUC this is what DF is using this method for.

@findepi
Copy link
Member Author

findepi commented Oct 21, 2024

@tustvold thank you for taking time to review this PR!

I'm not sure about this, in all cases where there can be logical nulls, apart from NullArray, this will involve computing the full logical null mask only to throw it away.

Good point. this is what callers that need to find out number of logical nulls have to do today.
Examples: https://github.com/apache/datafusion/blob/e9584bc46ffc574cd65044d4199966402def1d15/datafusion/functions-aggregate/src/count.rs#L605-L607, apache/datafusion#13029

Having this function on the Array itself allows us to provide better implementation.
This PR does this for all primitive types, boolean array and null array

@tustvold
Copy link
Contributor

tustvold commented Oct 21, 2024

Right, my point is that an accurate logical null count can be very expensive to compute, whereas it is much cheaper to instead determine the existence of any nulls. Whilst this won't serve every use-case, my question is whether DF actually needs accurate null counts all the time, or whether most of the time it is just using them as a proxy for nullability. This in turn determines what we optimise for.

@findepi
Copy link
Member Author

findepi commented Oct 21, 2024

Right, my point is that an accurate logical null count can be very expensive to compute, whereas it is much cheaper to instead determine the existence of any nulls. Whilst this won't serve every use-case, my question is whether DF actually needs accurate null counts all the time

Not all the time, but often enough.

@findepi
Copy link
Member Author

findepi commented Oct 21, 2024

cc @joroKr21

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good idea. Especially since we have a default impl that should work in most cases. Just one question which is probably my misunderstanding around why you chose to overload the default impl in a few spots.

Comment on lines +319 to +321
fn logical_null_count(&self) -> usize {
self.null_count()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why overload here? Is this more efficient somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same reasoning as for primitive arrays -- #6608 (comment)

Comment on lines +1164 to +1166
fn logical_null_count(&self) -> usize {
self.null_count()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why overload here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make logical_null_count as performant as null_count for primitive types (where they happen to be equivalent), so that logical_null_count can be used without, or with fewer, performance drawbacks.

arrow-array/src/array/null_array.rs Show resolved Hide resolved
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel fairly strongly that we should not merge this, people will likely use this blindly without appreciating the severe performance penalty it entails. I think we should instead make is_nullable accurate, and the places that need an accurate null count should compute the logical null mask explicitly.

@findepi
Copy link
Member Author

findepi commented Oct 21, 2024

Thank you @alamb @westonpace @tustvold for your time reviewing this!

I feel fairly strongly that we should not merge this, people will likely use this blindly without appreciating the severe performance penalty it entails.

@tustvold Can you elaborate there the severe performance penalty comes from and what would it take to fix it?

For DataFusion at least the alternative is to all logical_nulls().map(|n| n.null_count()) which is strictly less performant than the alternative offered here. DataFusion will continue to use this slower path until a faster path exists

@findepi findepi requested a review from tustvold October 21, 2024 20:21
@tustvold
Copy link
Contributor

tustvold commented Oct 21, 2024

For DataFusion at least the alternative is to all logical_nulls().map(|n| n.null_count()) which is strictly less performant than the alternative offered here

The problem is for RunArray, DictionaryArray and UnionArray computing logical_nulls is potentially very expensive. Now I accept that there might be a marginal performance win from a specialized logical_null_count implementation, but other than perhaps for NullArray I would expect the difference to largely be a wash. For most types it is a couple of additional atomics, or will be completely dominated by the cost of computing the logical nulls.

The problem with exposing a logical_null_count method is it makes the fact this is effectively computing a fresh null mask implicit, hiding this problem. In fact this PR as written actually regresses is_nullable performance, demonstrating this 😅

Taking a step back, apache/datafusion#13033 is a prime example of a use-case that doesn't actually care what the logical null count is, just whether there are any nulls. With some minor adjustments we could make is_nullable accurate, and this method could just use that.

EDIT: TBC I really dislike the concept of logical nulls, I really wish the arrow specification didn't make the choices it did, UnionArray in particular is extremely perverse, but our hands are somewhat tied by the specification.

Add counter-part of `Array::null_count`, but counting the logical null
values. This will be useful in DataFusion. Current alternative is to
compute null mask (via `Array::logical_nulls()`) and do counting on it.
Given this might be expensive and verbose, caller may naturally feel
steer towards `Array::null_count` which may or may not be applicable,
depending on the context.
@findepi
Copy link
Member Author

findepi commented Oct 21, 2024

The problem is for RunArray, DictionaryArray and UnionArray computing logical_nulls is potentially very expensive.

I see your point, thanks for explaining this to me.

Let's turn the question around. What should the caller do, if they want exactly this: know how many (logical) null values are in the array?

@tustvold
Copy link
Contributor

know how many (logical) null values are in the array?

If this is what you need, which it very often isn't, then you have to call logical_nulls() and get the null count from it. The friction is a feature, not a bug 😅

@westonpace
Copy link
Member

If this is what you need, which it very often isn't, then you have to call logical_nulls() and get the null count from it. The friction is a feature, not a bug 😅

For my sake, this is fine for me. I had found myself needing the logical null count recently (for array statistics) and using logical_nulls wasn't much of a headache.

@findepi
Copy link
Member Author

findepi commented Oct 22, 2024

needing the logical null count recently (for array statistics)

@westonpace good point! this was exactly the case in apache/datafusion#13029 too

but that's not the only place -- DataFusion aggregation accumulators often call null_count() and this works only because for quite a few array types null_count() happens to be "logical null count". But contract-wise, this is a wrong function to call, and doesn't generalize to non-primitive types.

If this is what you need, which it very often isn't, then you have to call logical_nulls() and get the null count from it. The friction is a feature, not a bug 😅

@tustvold I don't mind writing more code (friction), but is this efficient at runtime?

@tustvold
Copy link
Contributor

but is this efficient at runtime?

As written in this PR, it will be largely equivalent.

Having slept on it, lets just proceed with this. I don't like it, but then I don't like logical nulls in general, but aside from forking the arrow format we're stuck with them. The types it impacts are relatively niche, and if people care to optimise them, they can

@tustvold tustvold merged commit 7e51d40 into apache:master Oct 22, 2024
28 checks passed
@findepi findepi deleted the findepi/logical-null-count branch October 22, 2024 09:49
@findepi
Copy link
Member Author

findepi commented Oct 22, 2024

thank you, that makes sense!
and thank you for the PR scrutiny, this is a good thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants