Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review use of logical expressions in physical AggregateFunctionExpr #11359

Closed
andygrove opened this issue Jul 9, 2024 · 10 comments · Fixed by #11845
Closed

Review use of logical expressions in physical AggregateFunctionExpr #11359

andygrove opened this issue Jul 9, 2024 · 10 comments · Fixed by #11845
Assignees
Labels
enhancement New feature or request

Comments

@andygrove
Copy link
Member

andygrove commented Jul 9, 2024

Is your feature request related to a problem or challenge?

DataFusion 40.0.0 added a new logical_args: Vec<Expr> field to AggregateFunctionExpr, which seems confusing, and there is no documentation in this struct that explains what this field is used for.

In DataFusion Comet, we do not use DataFusion's logical plan or expressions because we are translating an Apache Spark physical plan into a DataFusion physical plan and therefore we have no logical expressions to pass into this new field.

I think at a minimum we should add some documentation around this new feature.

@andygrove andygrove added the enhancement New feature or request label Jul 9, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 10, 2024

@ozankabak had asked for whether there is anyway to entirely getting rid of logical expressions in discord, so I think we can review about the challenge I had before.

The reason why there are logical expressions in creeate_aggregate_expr is for customizing Accumulator. We can have different kind of accumulator based on the function's arguments. Ideally we should check with PhysicalExpr, but unfortunately, given the current design, we are not able to introduce physical expressions in AggregateUDFImpl trait, because we want to avoid adding dependency of physical-expr crate to datafusion-expr.

I think that it is also the main reason that blocking us. I propose to redesign about the role of each crate. To able to deal with physical concept for AggregateUDFImpl, we break physical-expr into two level. One is physical-expr-common, and another is physical-expr. The same applies to logical expr, datafusion-expr-common and datafusion-expr. common crate is the higher level crate so that we can add the dependency of physical-expr-common into datafusion-expr

The crate graph is like

  graph TD;
      functions-aggregates-common-->expr-common;
      functions-aggregates-common-->physical-expr-common;
      functions-aggregates-->functions-aggregates-common;
      functions-aggregates-->expr;
      physical-expr-common-->expr-common;
      expr-->expr-common;
      expr-->functions-aggregates-common;
      physical-expr-->physical-expr-common;
      core-->functions-aggregates;
      core-->physical-expr;
      third-parties-aggregate --> functions-aggregates-common;
Loading

expr-common: Things other than Expr and LogicalPlan can place it here
expr: Mainly for Expr and LogicalPlan. Import functions-aggregate-common for UDAF.
physical-common: Physical expr trait or other common things. Similar to what it is now except physical-expr like Column, Cast, Literal which should be moved to physical-expr
physical-expr: Physical Expr are here + Column, Cast, Literal
functions-aggregate-common: Import physical-common and expr-common, for other users to build their own udaf.
functions-aggregate: datafusion builtin functions

The more detail of discussion before is in #10074

With this approach, function like limited_convert_logical_expr_to_physical_expr is no longer needed after this change.

@alamb I think we can review about this idea again, the previous concern is that

What I am worried about is that physical-expr-common would end up with all the code from physical-expr

I think it is not an issue anymore.

@ozankabak
Copy link
Contributor

With this approach, function like limited_convert_logical_expr_to_physical_expr is no longer needed after this change.

We have at least two such functions and it would be great to arrive at a design that eliminates such functions.

Conceptually, we should have more info at the physical level (relative to logical level), so a refactor that makes AggregateFunctionExpr and likewise structs independent of logicals should be possible. I get the feeling that our dependency structure is currently more complex than it needs to be.

@jayzhan211
Copy link
Contributor

I get the feeling that our dependency structure is currently more complex than it needs to be.

I think we tend to split crate aggressively so others can only import the necessary crate they need.

@ozankabak
Copy link
Contributor

Splitting crates when beneficial is great, maybe we haven't arrived at the best design in terms of structure yet (as evidenced by this issue).

I think we can use your previous work as a starting point to improve the structure and also resolve this current issue 🚀

@alamb
Copy link
Contributor

alamb commented Jul 12, 2024

I agree with @jayzhan211 that the core of the problem is that the user defined API for aggregates is in datafusion_expr so can only use Expr but is invoked / instantiated as part of the physical plan.

However, the same basic problem could be claimed for ScalarUDFImpl and WindowUDFImpl 🤔

It does feel like the way out of this is to make the spit between API and implementation more explicit.

Maybe instead of expr-common maybe we could call it expr-api and physical-common --> physical-api

So like

  • expr-api: Traits like UDFetc other than Expr and LogicalPlan can place it here
  • expr: (existing crate) Mainly for Expr and LogicalPlan. Import functions-aggregate-common for UDAF.
  • physical-api: Physical expr trait or other common things. Similar to what it is now except physical-expr like Column, Cast, Literal which should be moved to physical-expr
  • physical-expr-common Physical Expr are here + Column, Cast, Literal
  • `aggregate-api: Import physical-api and expr-api -- TBD what does this have? ScalarUDAF? Accumulator?
  • functions-aggregate: datafusion builtin functions

@alamb
Copy link
Contributor

alamb commented Jul 12, 2024

I agree we should document what the fields are used for now

I personally recommend we finish #8708 before we try to do some other crate refactor. We are close with that one and once we have all the aggregates going through the same APIs I think we'll be in a better position to split things apart

@jayzhan211
Copy link
Contributor

Cool, maybe I could think about pulling down functions trait from expr instead of pulling up common things to expr-common 🤔

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 13, 2024

I think it is possible to have expr-api crate that contains ContextProvider, TableSource, FunctionRegistry, AggregateUDF, WindowUDF, and maybe ScalarUDF (this is not necessary but it would be nice to be close to other two functions)

physical-api: Physical expr trait or other common things. Similar to what it is now except physical-expr like Column, Cast, Literal which should be moved to physical-expr
physical-expr-common Physical Expr are here + Column, Cast, Literal

Not sure about the reason for physical-expr-common crate that contains Physical Expr, that is almost 60 ~ 80 % of the things in physical-expr, I perfer moving them back 🤔

physical-api / physical-expr-common for PhysicalExpr and PhysicalSortExpr and utils function for physical expr.

The original idea of functions-aggregate-common is crate for AggregateUDFImpl, we can

  1. Keep ContextProvider, FunctionRegistry, TableSource and UDF,UDAF,UDWF in the same crate. The dependency is expr-api -> expr
  2. UDF,UDAF,UDWF in expr-functions and move ContextProvider, FunctionRegistry and TableSource to a crate like expr-provider / expr-registry. The dependency is expr-provider -> expr-functions -> expr

The crate graph is like

  graph TD;
      expr-api --> physical-expr-common;
      physical-expr-common --> expr;
      functions-aggregate -->  expr-api;
      third-parties-aggregate --> expr-api;
Loading

or

  graph TD;
      expr-provider --> expr-functions;
      expr-functions --> physical-expr-common;
      physical-expr-common --> expr;
      functions-aggregate -->  expr-functions;
      third-parties-aggregate --> expr-functions;
Loading

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 5, 2024

I plan to pull out aggregate function to functions-aggregate-common (the role of expr-functions in the above graph but narrow to aggregate only) as the first step. Ok, this doesn't work 😆 , AggregateFunction and Expr should belong to the same crate.

Expr::AggregateFunction(AggregateFunction)

pub struct AggregateFunction {
    /// Name of the function
    pub func: Arc<crate::AggregateUDF>,
   ..
}

pub struct AggregateUDF {
    inner: Arc<dyn AggregateUDFImpl>,
}

I guess #10327 is the only possible solution

@jayzhan211
Copy link
Contributor

I found AggregateUDFImpl is now tightly coupled with Expr unlike the status I got in #10327, thus not possible to eliminate the dependency of Expr. Mainly due to fn call() -> Expr

The crate graph is now like

Since we would like to import PhysicalExpr for AggregateUDFImpl, and sadly it seems to be tightly coupled with Expr.

Therefore, we come out several common-level crate, those crate has no dependencies on Expr.
physical-expr-common are common things about physical-expr concept. Mainly PhysicalExpr, PhysicalSortExpr.
Their dependencies are pull out from expr to expr-common.
functions-aggregate-common are more aggregate-specific

Physical expr like Column, Cast, Literal are moved back to physical-expr for now.
AggregateFunctionExpr, AggregateExprBuilder are moved to physical-expr-functions-aggregate, they have depdency on Expr. It's level is similar to physical-expr, but more aggregate specific.

  graph TD;
      physical-expr-common --> expr-common;
      functions-aggregate-common --> physical-expr-common;
      expr --> functions-aggregate-common;
      physical-expr-functions-aggregate --> expr;
      physical-expr --> expr;
      functions-aggregate --> physical-expr;
Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants