Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make FirstValue an UDAF, Change AggregateUDFImpl::accumulator signature, support ORDER BY for UDAFs #9874

Merged
merged 49 commits into from
Apr 3, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Mar 30, 2024

Which issue does this PR close?

First step of #8708
Close #9249

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review April 1, 2024 12:28
@jayzhan211 jayzhan211 requested review from alamb and mustafasrepo April 1, 2024 12:28
@alamb alamb added the api change Changes the API exposed to users of the crate label Apr 1, 2024
@alamb alamb changed the title Make FirstValue an UDAF Make FirstValue an UDAF, Change AggregateUDFImpl::accumulator signature Apr 1, 2024
@alamb alamb changed the title Make FirstValue an UDAF, Change AggregateUDFImpl::accumulator signature Make FirstValue an UDAF, Change AggregateUDFImpl::accumulator signature, support ORDER BY for UDAFs Apr 1, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ -- looking very good @jayzhan211

}

fn state_type(&self, _return_type: &DataType) -> Result<Vec<DataType>> {
Ok(self.state_type.clone())
}
}

pub struct FirstValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- it seems like the issue is that the accumulator implementation requires PhysicalSortExpr.

To pull the code into its own crate maybe we could pull out the relevant pieces of datafusion-physical-expr into datafusion-physical-core or something (as a follow on PR)

@@ -710,6 +712,16 @@ pub fn create_udaf(
))
}

/// Creates a new UDAF with a specific signature, state type and return type.
/// The signature and state type must match the `Accumulator's implementation`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to add a comment explaining this is a temporary solution (i.e. that the idea is we'll pull the function out into its own crate, but for now we need to keep the physical implementation separate

f.debug_struct("FirstValue")
.field("name", &self.name)
.field("signature", &self.signature)
.field("fun", &"<FUNC>")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.field("fun", &"<FUNC>")
.field("accumulator", &"<FUNC>")

Comment on lines 44 to 46
pub data_type: &'a DataType, // the return type of the function
pub schema: &'a Schema, // the schema of the input arguments
pub ignore_nulls: bool, // whether to ignore nulls
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these field are pub I think they show up in the rustdocs. Can you please make them with three /// and above the field in question to show up in the docs?

pub schema: &'a Schema, // the schema of the input arguments
pub ignore_nulls: bool, // whether to ignore nulls

// ordering arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also please document how we would tell if there was no ORDER BY specified? (is sort_exprs empty?)

pub ignore_nulls: bool, // whether to ignore nulls

// ordering arguments
pub sort_exprs: &'a [Expr], // the expressions of `order by`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you made this Option<&'a [Expr]> that would probably be easier for UDF implementors to check if ORDER BY was specified

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have an empty vec to describe no ORDER BY is given.

Copy link
Contributor Author

@jayzhan211 jayzhan211 Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach is always to set ordering with another function with_ordering(sort_exprs: &'a [Expr]).

///
/// `arg`: the type of the argument to this accumulator
///
/// `sort_exprs`: contains a list of `Expr::SortExpr`s if the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these comments are now out of date

&self,
_name: &str,
_value_type: DataType,
_ordering_fields: Vec<Field>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are ordering_fields used for? I think they should be documented


/// Return the type used to serialize the [`Accumulator`]'s intermediate state.
/// See [`Accumulator::state()`] for more details
fn state_type(&self, return_type: &DataType) -> Result<Vec<DataType>>;

/// Return the fields of the intermediate state. It is mutually exclusive with [`Self::state_type`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to change the AggregateUDFImpl anyways, maybe we should just remove state_type and always require state_fields?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 requested a review from alamb April 2, 2024 14:31
let accumulator: AccumulatorFactoryFunction =
Arc::new(|_| Ok(Box::<AvgAccumulator>::default()));
let my_avg = AggregateUDF::from(SimpleAggregateUDF::new_with_signature(
"MY_AVG",
Signature::uniform(1, vec![DataType::Float64], Volatility::Immutable),
return_type,
accumulator,
state_type,
vec![
Field::new("count", DataType::UInt64, true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this is a nice change that the fields are now named rather than anonymous

I think @jacksonrnewhouse mentioned this might be helpful for his project as well

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- I think this looks good to me

What are you thinking of next? To try and pull first_value into its own crate (to prove it is separable)?

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Apr 3, 2024

Thanks @jayzhan211 -- I think this looks good to me

What are you thinking of next? To try and pull first_value into its own crate (to prove it is separable)?

Yes, I want to move it to its own crate and be able to register like scalar functions. And, move PhysicalSorting to physical-expr-core.

@jayzhan211 jayzhan211 merged commit dfd4442 into apache:main Apr 3, 2024
24 checks passed
@@ -85,13 +87,21 @@ impl AggregateUDFImpl for GeoMeanUdaf {
/// is supported, DataFusion will use this row oriented
/// accumulator when the aggregate function is used as a window function
/// or when there are only aggregates (no GROUP BY columns) in the plan.
fn accumulator(&self, _arg: &DataType) -> Result<Box<dyn Accumulator>> {
fn accumulator(&self, _acc_args: AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about the impact on this API for UDAF writers last night.

Specifically, about the many existing UDAFs that exist / will exist at the time this change gets released and on the first time people encounter / try to use this API. i think the args with datatypes is much easier to use (and has less mental gymnastics to use). Thus I am going to propose an easier / beginner API for this that will require fewer changes to existing UDAFs and will be easier to use for first timers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what i came up with: #9920

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants