Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add was_valid parameter to NullState callbacks #11592

Closed
wants to merge 3 commits into from

Conversation

joroKr21
Copy link
Contributor

@joroKr21 joroKr21 commented Jul 22, 2024

Which issue does this PR close?

Closes #11591.

Rationale for this change

Provide more flexibility for implementing GroupsAccumulator which often make use of NullState.

What changes are included in this PR?

Add was_valid parameter to NullState callbacks. In this way implementations can handle nulls differently.

Are these changes tested?

Yes, extended existing tests.

Are there any user-facing changes?

Yes, changes the signatures of NullState::accumulate and NullState::accumulate_boolean.

Comment on lines 327 to 331
/// Check if the accumulated value for the group at the given `index` is valid,
/// meaning that there was at least one value passing the filter for this group.
pub fn is_valid(&self, index: usize) -> bool {
self.seen_values.get_bit(index)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I can't use it in accumulate due to mutable vs immutable borrow. This API is pretty difficult to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment 🤔

Copy link
Contributor Author

@joroKr21 joroKr21 Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the signature of accumulate:

    pub fn accumulate<T, F>(
        &mut self,
        group_indices: &[usize],
        values: &PrimitiveArray<T>,
        opt_filter: Option<&BooleanArray>,
        total_num_groups: usize,
        mut value_fn: F,
    ) where
        T: ArrowPrimitiveType + Send,
        F: FnMut(usize, T::Native) + Send

But I can't use is_null inside value_fn because the NullState is already borrowed as mutable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally BooleanBufferBuilder::set_bit should return the previous value and we can pass it along to the callback. That's in arrow I guess.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still think we should add this API @joroKr21 ? or perhaps we should mark the PR as draft while we work on other options?

@joroKr21 joroKr21 marked this pull request as draft July 22, 2024 19:29
@joroKr21
Copy link
Contributor Author

Yeah good point, I converted it to a draft. I'm not sure what to do. Adding a boolean flag to the callback is not great either...

@joroKr21
Copy link
Contributor Author

@alamb I implemented the version with an additional callback parameter, LMK what you think?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me @joroKr21

I think this PR needs:

  1. Update the docs
  2. A functional test
  3. Run the clickbench performance benchmarks to ensure we don't see a regression

I can help with the benchmarks if necessary

cc @Dandandan

|group_index, new_value| {
let value = &mut self.values[group_index];
(self.prim_fn)(value, new_value);
|group_index, was_valid, new_value| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a test that covered this, if possible

@@ -132,7 +132,7 @@ impl NullState {
mut value_fn: F,
) where
T: ArrowPrimitiveType + Send,
F: FnMut(usize, T::Native) + Send,
F: FnMut(usize, bool, T::Native) + Send,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please update the documentation to reflect this new argument and explains what it means

@joroKr21 joroKr21 force-pushed the null-state/is-null branch from 13601c8 to 477eada Compare July 23, 2024 05:48
@joroKr21 joroKr21 changed the title Add NullState::is_valid and NullState::is_null Add was_valid parameter to NullState callbacks Jul 23, 2024
@joroKr21
Copy link
Contributor Author

/benchmark

|group_index, new_value| {
let value = &mut self.values[group_index];
(self.prim_fn)(value, new_value);
|group_index, was_valid, new_value| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the existing implementation have to change, or do they?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not, although conceptually an accumulator is just a map-reduce-map via some monoid or semigroup. The current implementation supports only monoids (need an empty value) but it doesn't support semigroups (no empty value). We could use this for something like FirstValue or AnyValue but I guess we could also implement it specifically for that use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even drop the initial value parameter since it doesn't matter in this case.

@joroKr21
Copy link
Contributor Author

@alamb the benchmark didn't trigger on comment

@Dandandan
Copy link
Contributor

@alamb the benchmark didn't trigger on comment

That support has been dropped (as it was a potential security risk and wasn't super reliable).
#11165

Can you run the benchmarks and post the results here?

@joroKr21
Copy link
Contributor Author

Can you run the benchmarks and post the results here?

Sure, it might be some time until I do that though

@joroKr21
Copy link
Contributor Author

I don't like this API at all

@joroKr21 joroKr21 closed this Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add NullState::is_null public method
3 participants