-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce user defined SQL planner API #11180
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Arc::new(ArrayFunctionPlanner::try_new(provider).unwrap()) as _; | ||
|
||
let field_access_planner = | ||
Arc::new(FieldAccessPlanner::try_new(provider).unwrap()) as _; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_field
for struct does not need array expression, but I'm not sure should we move it out
Interesting, I got stack overflow in CI but not in local. |
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 this is very cool and I think close
I left a bunch of stylistic / naming / comment suggestions -- let me know what you think
datafusion/expr/src/planner.rs
Outdated
} | ||
|
||
/// This trait allows users to customize the behavior of the SQL planner | ||
pub trait UserDefinedPlanner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about different names for this struct that might be more specific to SQL and thus make it easier to find
How about UserDefinedSQLPlanner
? or SQLPlannerExtensions
?
datafusion/expr/src/planner.rs
Outdated
} | ||
} | ||
|
||
pub struct BinaryExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be confusing if this struct is also called BinaryExpr
given https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.BinaryExpr.html. What do you think about callng this RawBinaryExpr
or maybe SqlBinaryExpr
(and then rename FieldAccessExpr
similarly)?
Here is some suggested comments to explain the structure (it would be good to do the same for FieldAccessExpr too):
pub struct BinaryExpr { | |
/// An operator with two arguments to plan | |
/// | |
/// Note `left` and `right` are DataFusion [`Expr`]s but the `op` is the SQL AST operator. | |
/// This structure is used by [`UserDefinedPlanner`] to plan operators with custom expressions. | |
pub struct BinaryExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also struggle with this naming too 😆 . I will take RawBinaryExpr
datafusion/expr/src/planner.rs
Outdated
pub enum PlannerSimplifyResult { | ||
/// The function call was simplified to an entirely new Expr | ||
Simplified(Expr), | ||
/// the function call could not be simplified, and the arguments | ||
/// are return unmodified. | ||
OriginalBinaryExpr(BinaryExpr), | ||
OriginalFieldAccessExpr(FieldAccessExpr), | ||
OriginalArray(Vec<Expr>), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be easier to use if it were generic so that the callsites would know exactly the type of the returns Original
SO something like
pub enum PlannerSimplifyResult { | |
/// The function call was simplified to an entirely new Expr | |
Simplified(Expr), | |
/// the function call could not be simplified, and the arguments | |
/// are return unmodified. | |
OriginalBinaryExpr(BinaryExpr), | |
OriginalFieldAccessExpr(FieldAccessExpr), | |
OriginalArray(Vec<Expr>), | |
} | |
pub enum PlannerSimplifyResult<T> { | |
/// The structure was planned as an entirely new Expr by the planner | |
Simplified(Expr), | |
/// the planner did not handle planning the structure, and it is returned | |
/// unmodified. | |
OriginalBinaryExpr(T), | |
} |
datafusion/sql/src/expr/mod.rs
Outdated
field_access_expr = expr; | ||
} | ||
_ => { | ||
return exec_err!("Unexpected result encountered. Did you expect an OriginalFieldAccessExpr?") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid this type of check by using a generic in PlannerSimplifyResult
datafusion/expr/src/planner.rs
Outdated
pub expr: Expr, | ||
} | ||
|
||
pub enum PlannerSimplifyResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this isn't really "simplifying" anything maybe we could call it PlannerResult
instead of PlannerSimplifyResult
/// Get all user defined window function names | ||
fn udwf_names(&self) -> Vec<String>; | ||
} | ||
pub use datafusion_expr::planner::ContextProvider; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for backwards compatibility
datafusion/sql/src/planner.rs
Outdated
/// user defined planner extensions | ||
pub(crate) planners: Vec<Arc<dyn UserDefinedPlanner>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using a single planner
/// user defined planner extensions | |
pub(crate) planners: Vec<Arc<dyn UserDefinedPlanner>>, | |
/// user defined planner extensions | |
pub planner: Option<Arc<dyn UserDefinedPlanner>>>, |
This would match the basic pattern we have elsewhere (like proto extension codec) and people could implement composable planners like @lewiszlw did in https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/composed_extension_codec.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this
- If we only accept single planner, does that mean we need to implement all the trait functions in one struct? We can't have array function planner optional in this case.
- If the user expect their own planner + our builtin planner, how can they enable both? With current design, they just chain them with
with_user_defined_planner
.
query
.with_user_defined_planner(array_planner)
.with_user_defined_planner(field_access_planner)
.with_user_defined_planner(user's planner)
...
- The example given https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/composed_extension_codec.rs seems to have multiple codecs too 🤔
/// A PhysicalExtensionCodec that tries one of multiple inner codecs
/// until one works
#[derive(Debug)]
struct ComposedPhysicalExtensionCodec {
codecs: Vec<Arc<dyn PhysicalExtensionCodec>>,
}
@@ -4910,10 +4910,11 @@ select array_ndims(arrow_cast([null], 'List(List(List(Int64)))')); | |||
3 | |||
|
|||
# array_ndims scalar function #2 | |||
# TODO: dimensions 20 is the maximum without stack overflow, find a way to enable deep nested array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is perhaps because the stackframe got larger in debug builds with this change -- one way to workaround it might be to pull the code for planning binary_exprs and the code for planning array access into their own functions rather than leave them inline in plan_expr
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 -- I think this looks great. I pushed two commits to this branch to update comments and rename the variant to avoid having to go back and forth again. Hopefully that is ok
@samuelcolvin what do you think of this PR ? Is it enough to work with instead of #11137?
// register crate of array expressions (if enabled) | ||
#[cfg(feature = "array_expressions")] | ||
{ | ||
let array_planner = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
In general this looks great. I'm going to try to use it on datafusion-functions-json and let you know how I get on. Should be able to get this done today, or tomorrow if life get's in the way. |
/// | ||
/// This structure is used by [`UserDefinedSQLPlanner`] to plan operators with | ||
/// custom expressions. | ||
#[derive(Debug, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need Clone 🤔 Same as RawFieldAccessExpr
and PlannerResult
. Others looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree there isn't a critical usecase for it now, but I figured it didn't hurt. If you feel strongly I will remove Clone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not a critical issue.
Thanks @alamb |
I'm a bit surprised this got merged without waiting for me to try it on I said just above that I would try to do that yesterday or today. As far as I can see, this PR provides no way to register external custom query planners. I guess I'll work on a follow up... |
Thanks @samuelcolvin I agree iterating with some other PRs is a good idea. |
I wrote up #11207 to track moving the remaining SQL planning features to this API |
} | ||
} | ||
|
||
not_impl_err!("GetFieldAccess not supported by UserDefinedExtensionPlanners: {field_access_expr:?}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋 I'm curious about this default branch because I'm seeing this error when doing integration testing with the delta-rs package. To the best of my knowledge we're not using any extension planners, but now this is failing with the latest datafusion. I'm uncertain whether we're doing something wrong, or if the default behavior of falling through to not_impl_err
is causing us grief
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rtyler -- looks to me like the delta.rs code is managing its own SessionState -- like in https://github.com/delta-io/delta-rs/blob/main/crates/core/src/delta_datafusion/mod.rs#L1687 (BTW the Delta API is really nicely thought out)
So I think you'll need to register the same planners in your SessionContext 🤔
Helpfully I think @Omega359 just made a PR to make this easier: #11296
I feel in general DataFusion is hard to use / configure correctly if you are using a custom SessionState / configuration -- one potential thing we were discussing is #11182 (comment) -- I'll file a ticket to make this a thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed fix is here: #11485
Sorry, I didn't follow up this topic before. I'm thinking if we can have one UserDefinedSQLPlanner trait which all methods have default official implementation, users just need to impl method they want to customize. |
I agree the looping is not ideal. In terms of conflicts, I think the semantics are pretty well defined now (the planners are ordered and whichever successfully plans the option first will be chosen). |
* Prototype user defined sql planner might look like * at arrow Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * get field Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * plan array literal Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move to functions-array Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * license Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * change nested array test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * address comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix stack overflow issue Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * upd cli Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Rename PlannerResult::Simplified to PlannerResult::Planned * Update comments and add Debug/Clone impls --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #11168
part of #10534
Closes #11137
TLDR is to make it easier for users to customize sql planning, specifically to handle operator translation
Rationale for this change
Let sql planner customizable
Follow up from #11168
What changes are included in this PR?
TODO in next PRs
#10534, to close this, we have more functions to move.
Are these changes tested?
Are there any user-facing changes?