Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

lewiszlw · 2024-07-26T09:39:49Z

Which issue does this PR close?

Closes #.

Rationale for this change

It confused me when I read these code that AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs only contain one input type.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Thanks @lewiszlw -- even though this is an API change I agree it makes the code less confusing

jcsherin · 2024-07-27T06:48:43Z

It confused me when I read these code that AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs only contain one input type.

This confused me too the first time when I looked at code in user-defined aggregates. I thought the reason it was written like that was because for all existing user-defined aggregates, knowing just the type of the first argument is enough.

jcsherin · 2024-07-27T06:52:55Z

@lewiszlw It will be very helpful if you could please also update this example.

datafusion/datafusion/functions-aggregate/COMMENTS.md

Line 57 in 204e1bc

Field::new("item", args.input_type.clone(), true /* nullable of list item */ ),

I checked your branch and this file doesn't exist. So you will need to merge from main to get it.

lewiszlw · 2024-07-27T11:27:14Z

Thanks for pointing out. I'll update pr in a few days.

alamb · 2024-07-27T11:29:09Z

I happened to have this PR opened locally (I get anxious with PRs that are open too long 😅 ) so I took the liberty of updating the docs as well in 1a3c5ca while I was merging up from main

datafusion/functions-aggregate/COMMENTS.md

jayzhan211 · 2024-07-29T09:50:06Z

I think we don't even need vector of input type, since we only get the first one. 🤔

alamb · 2024-07-29T10:58:59Z

I think we don't even need vector of input type, since we only get the first one. 🤔

Wouldn't we need a vector of input types if the aggregate had more than one argument? For example

https://docs.rs/datafusion/latest/datafusion/functions_aggregate/approx_percentile_cont/fn.approx_percentile_cont.html

jayzhan211 · 2024-07-29T11:59:18Z

I think we don't even need vector of input type, since we only get the first one. 🤔

Wouldn't we need a vector of input types if the aggregate had more than one argument? For example

https://docs.rs/datafusion/latest/datafusion/functions_aggregate/approx_percentile_cont/fn.approx_percentile_cont.html

Second argument is fixed type (f64), so we don't have the actual function that expect different multiple input type yet.
Btw, I think we could remove input_type, because we could compute the type from input_exprs and schema.
If there are any UDAF expect more flexible accumulator based on multiple input types, they can get the types from input_exprs and schema too.

Therefore, I suggest we remove input_type entirely and compute only if we need it in create_accumulator from input_exprs and schema. This not only make the code less confused and also let udaf be flexible and efficient (only compute types if needed)

alamb · 2024-07-29T12:50:31Z

Therefore, I suggest we remove input_type entirely and compute only if we need it in create_accumulator from input_exprs and schema. This not only make the code less confused and also let udaf be flexible and efficient (only compute types if needed)

That certainly sounds cleaner.

I feel like we have churned this API a bunch recently, so maybe we can take a step back and ensure that whatever we come up with supports the usecases we know of now (and in the future) so we don't keep changing it 🤔

jayzhan211 · 2024-07-29T14:45:02Z

Therefore, I suggest we remove input_type entirely and compute only if we need it in create_accumulator from input_exprs and schema. This not only make the code less confused and also let udaf be flexible and efficient (only compute types if needed)

That certainly sounds cleaner.

I feel like we have churned this API a bunch recently, so maybe we can take a step back and ensure that whatever we come up with supports the usecases we know of now (and in the future) so we don't keep changing it 🤔

I plan to change input_exprs to physical-exprs in #11359, where logical exprs are all changed to physical exprs, so there is yet another change.

I think it would be nice to have something like unstable feature (maybe declare as part of the docs for the public function) for feature or APIs that takes over one month (release cycle). After most of the things are done, we could consider it stable feature. We still do breaking change for stable feature, but it gives clear message to others about unstable features that are still in active development and may undergo frequent changes, thereby reducing the impact on downstream projects. We can have the flexibility to iterate and refined design without to worry about API change

jayzhan211 · 2024-07-29T15:01:43Z

The end state in my mind now.

pub struct AccumulatorArgs<'a> {
    /// Keep, this is return type, the name might be quite confusing.
    pub data_type: &'a DataType,

    /// We might only need one of `schema` or `dfschema`. It is likely we keep `dfschema`, since we can get `schema` from it.
    pub schema: &'a Schema,

    pub dfschema: &'a DFSchema,

    /// Keep
    pub ignore_nulls: bool,

    /// Convert to physical sort exprs instead
    pub sort_exprs: &'a [Expr],

    /// Keep
    pub is_reversed: bool,

    /// We might able to get the name from expressions
    pub name: &'a str,

    /// Keep
    pub is_distinct: bool,

    /// Get the type from schema and expressions
    pub input_type: &'a DataType,

    /// Convert to physical expressions
    pub input_exprs: &'a [Expr],
}

alamb · 2024-07-29T17:44:23Z

pub struct AccumulatorArgs<'a> {
    /// Keep, this is return type, the name might be quite confusing.
    pub data_type: &'a DataType,

I agree return_type would be clearer

/// We might only need one of `schema` or `dfschema`. It is likely we keep `dfschema`, since we can get `schema` from it.

I think schema is what makes sense as dfschema is mostly a logical thing in my mind (as it has a relation name).

Further evidence that we don't need dfschema is that you can get a DataType from a PhysicalExpr using a Schema https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#tymethod.data_type

/// Convert to physical sort exprs instead
pub sort_exprs: &'a [Expr],

Agree

/// Convert to physical expressions
pub input_exprs: &'a [Expr],

}

Make sense

Thanks @jayzhan211

alamb · 2024-07-29T17:44:44Z

So what should we do with this PR? Merge it and then revamp in a follow on PR?

jayzhan211 · 2024-07-30T00:02:13Z

Sure, but it is better to remove input_types before next release @lewiszlw

alamb · 2024-07-30T17:56:22Z

Sure, but it is better to remove input_types before next release @lewiszlw

Filed #11725 to track the cleanup

Thanks again

jayzhan211 · 2024-08-01T13:20:36Z

I think schema is what makes sense as dfschema is mostly a logical thing in my mind (as it has a relation name).

@xinlifoobar I think it is not clear whether we should keep dfschema or schema, so I suggest we just remove input_types at this moment. After the refactor #11359 , it would be clear

dfschema is introduced because there is schema mismatch issue without it for array_agg, the column index is incorrect.

lewiszlw added 2 commits July 26, 2024 17:28

UDAF input types

9a419b7

Rename

cfde493

github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Jul 26, 2024

alamb changed the title ~~AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs contain input types~~ Rename input_type --> input_types om AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs Jul 26, 2024

alamb approved these changes Jul 26, 2024

View reviewed changes

alamb added the api change Changes the API exposed to users of the crate label Jul 26, 2024

Merge remote-tracking branch 'apache/main' into udaf-input-types

61ed78d

Update COMMENTS.md

1a3c5ca

jcsherin reviewed Jul 27, 2024

View reviewed changes

datafusion/functions-aggregate/COMMENTS.md Outdated Show resolved Hide resolved

Update datafusion/functions-aggregate/COMMENTS.md

23f8878

alamb changed the title ~~Rename input_type --> input_types om AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs~~ Rename input_type --> input_types on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs Jul 29, 2024

alamb mentioned this pull request Jul 30, 2024

Improve AccumulatorArgs after transition to all udafs #11725

Open

alamb merged commit 66a8570 into apache:main Jul 30, 2024
25 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

lewiszlw commented Jul 26, 2024

alamb left a comment

jcsherin commented Jul 27, 2024 •

edited

Loading

jcsherin commented Jul 27, 2024

lewiszlw commented Jul 27, 2024

alamb commented Jul 27, 2024

jayzhan211 commented Jul 29, 2024

alamb commented Jul 29, 2024

jayzhan211 commented Jul 29, 2024 •

edited

Loading

alamb commented Jul 29, 2024

jayzhan211 commented Jul 29, 2024 •

edited

Loading

jayzhan211 commented Jul 29, 2024 •

edited

Loading

alamb commented Jul 29, 2024

alamb commented Jul 29, 2024

jayzhan211 commented Jul 30, 2024

alamb commented Jul 30, 2024 •

edited

Loading

jayzhan211 commented Aug 1, 2024 •

edited

Loading

Rename input_type --> input_types on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

Rename input_type --> input_types on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

Conversation

lewiszlw commented Jul 26, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

jcsherin commented Jul 27, 2024 • edited Loading

jcsherin commented Jul 27, 2024

lewiszlw commented Jul 27, 2024

alamb commented Jul 27, 2024

jayzhan211 commented Jul 29, 2024

alamb commented Jul 29, 2024

jayzhan211 commented Jul 29, 2024 • edited Loading

alamb commented Jul 29, 2024

jayzhan211 commented Jul 29, 2024 • edited Loading

jayzhan211 commented Jul 29, 2024 • edited Loading

alamb commented Jul 29, 2024

alamb commented Jul 29, 2024

jayzhan211 commented Jul 30, 2024

alamb commented Jul 30, 2024 • edited Loading

jayzhan211 commented Aug 1, 2024 • edited Loading

Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666

jcsherin commented Jul 27, 2024 •

edited

Loading

jayzhan211 commented Jul 29, 2024 •

edited

Loading

jayzhan211 commented Jul 29, 2024 •

edited

Loading

jayzhan211 commented Jul 29, 2024 •

edited

Loading

alamb commented Jul 30, 2024 •

edited

Loading

jayzhan211 commented Aug 1, 2024 •

edited

Loading