Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename input_type --> input_types on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

Merged
merged 5 commits into from
Jul 30, 2024

Conversation

lewiszlw
Copy link
Member

Which issue does this PR close?

Closes #.

Rationale for this change

It confused me when I read these code that AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs only contain one input type.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Jul 26, 2024
@alamb alamb changed the title AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs contain input types Rename input_type --> input_types om AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs Jul 26, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lewiszlw -- even though this is an API change I agree it makes the code less confusing

@alamb alamb added the api change Changes the API exposed to users of the crate label Jul 26, 2024
@jcsherin
Copy link
Contributor

jcsherin commented Jul 27, 2024

It confused me when I read these code that AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs only contain one input type.

This confused me too the first time when I looked at code in user-defined aggregates. I thought the reason it was written like that was because for all existing user-defined aggregates, knowing just the type of the first argument is enough.

@jcsherin
Copy link
Contributor

@lewiszlw It will be very helpful if you could please also update this example.

Field::new("item", args.input_type.clone(), true /* nullable of list item */ ),

I checked your branch and this file doesn't exist. So you will need to merge from main to get it.

@lewiszlw
Copy link
Member Author

Thanks for pointing out. I'll update pr in a few days.

@alamb
Copy link
Contributor

alamb commented Jul 27, 2024

I happened to have this PR opened locally (I get anxious with PRs that are open too long 😅 ) so I took the liberty of updating the docs as well in 1a3c5ca while I was merging up from main

@jayzhan211
Copy link
Contributor

I think we don't even need vector of input type, since we only get the first one. 🤔

@alamb alamb changed the title Rename input_type --> input_types om AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs Rename input_type --> input_types on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs Jul 29, 2024
@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

I think we don't even need vector of input type, since we only get the first one. 🤔

Wouldn't we need a vector of input types if the aggregate had more than one argument? For example

https://docs.rs/datafusion/latest/datafusion/functions_aggregate/approx_percentile_cont/fn.approx_percentile_cont.html

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 29, 2024

I think we don't even need vector of input type, since we only get the first one. 🤔

Wouldn't we need a vector of input types if the aggregate had more than one argument? For example

https://docs.rs/datafusion/latest/datafusion/functions_aggregate/approx_percentile_cont/fn.approx_percentile_cont.html

Second argument is fixed type (f64), so we don't have the actual function that expect different multiple input type yet.
Btw, I think we could remove input_type, because we could compute the type from input_exprs and schema.
If there are any UDAF expect more flexible accumulator based on multiple input types, they can get the types from input_exprs and schema too.

Therefore, I suggest we remove input_type entirely and compute only if we need it in create_accumulator from input_exprs and schema. This not only make the code less confused and also let udaf be flexible and efficient (only compute types if needed)

@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

Therefore, I suggest we remove input_type entirely and compute only if we need it in create_accumulator from input_exprs and schema. This not only make the code less confused and also let udaf be flexible and efficient (only compute types if needed)

That certainly sounds cleaner.

I feel like we have churned this API a bunch recently, so maybe we can take a step back and ensure that whatever we come up with supports the usecases we know of now (and in the future) so we don't keep changing it 🤔

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 29, 2024

Therefore, I suggest we remove input_type entirely and compute only if we need it in create_accumulator from input_exprs and schema. This not only make the code less confused and also let udaf be flexible and efficient (only compute types if needed)

That certainly sounds cleaner.

I feel like we have churned this API a bunch recently, so maybe we can take a step back and ensure that whatever we come up with supports the usecases we know of now (and in the future) so we don't keep changing it 🤔

I plan to change input_exprs to physical-exprs in #11359, where logical exprs are all changed to physical exprs, so there is yet another change.

I think it would be nice to have something like unstable feature (maybe declare as part of the docs for the public function) for feature or APIs that takes over one month (release cycle). After most of the things are done, we could consider it stable feature. We still do breaking change for stable feature, but it gives clear message to others about unstable features that are still in active development and may undergo frequent changes, thereby reducing the impact on downstream projects. We can have the flexibility to iterate and refined design without to worry about API change

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 29, 2024

The end state in my mind now.

pub struct AccumulatorArgs<'a> {
    /// Keep, this is return type, the name might be quite confusing.
    pub data_type: &'a DataType,

    /// We might only need one of `schema` or `dfschema`. It is likely we keep `dfschema`, since we can get `schema` from it.
    pub schema: &'a Schema,

    pub dfschema: &'a DFSchema,

    /// Keep
    pub ignore_nulls: bool,

    /// Convert to physical sort exprs instead
    pub sort_exprs: &'a [Expr],

    /// Keep
    pub is_reversed: bool,

    /// We might able to get the name from expressions
    pub name: &'a str,

    /// Keep
    pub is_distinct: bool,

    /// Get the type from schema and expressions
    pub input_type: &'a DataType,

    /// Convert to physical expressions
    pub input_exprs: &'a [Expr],
}

@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

pub struct AccumulatorArgs<'a> {
    /// Keep, this is return type, the name might be quite confusing.
    pub data_type: &'a DataType,

I agree return_type would be clearer

/// We might only need one of `schema` or `dfschema`. It is likely we keep `dfschema`, since we can get `schema` from it.

I think schema is what makes sense as dfschema is mostly a logical thing in my mind (as it has a relation name).

Further evidence that we don't need dfschema is that you can get a DataType from a PhysicalExpr using a Schema https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#tymethod.data_type

/// Convert to physical sort exprs instead
pub sort_exprs: &'a [Expr],

Agree

/// Convert to physical expressions
pub input_exprs: &'a [Expr],

}

Make sense

Thanks @jayzhan211

@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

So what should we do with this PR? Merge it and then revamp in a follow on PR?

@jayzhan211
Copy link
Contributor

Sure, but it is better to remove input_types before next release @lewiszlw

@alamb
Copy link
Contributor

alamb commented Jul 30, 2024

Sure, but it is better to remove input_types before next release @lewiszlw

Filed #11725 to track the cleanup

Thanks again

@alamb alamb merged commit 66a8570 into apache:main Jul 30, 2024
25 checks passed
@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 1, 2024

I think schema is what makes sense as dfschema is mostly a logical thing in my mind (as it has a relation name).

@xinlifoobar I think it is not clear whether we should keep dfschema or schema, so I suggest we just remove input_types at this moment. After the refactor #11359 , it would be clear

dfschema is introduced because there is schema mismatch issue without it for array_agg, the column index is incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants