-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Additional placeholder datatype inferencing #15980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional placeholder datatype inferencing #15980
Conversation
only infer subquery if exactly 1 field
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kczimm -- this looks great to me except for the special case for limit -- something seems to be wrong there. Would you be willing to help debug it further?
| let fields = subquery_schema.fields(); | ||
|
|
||
| // only supports subquery with exactly 1 field | ||
| if let [first_field] = &fields[..] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if there is more than one field? Will it not rewrite any placeholders?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subquery should only return one column. https://github.com/apache/datafusion/blob/main/datafusion/sql/src/expr/subquery.rs#L120
| param_types.insert(id.clone(), Some(dt.clone())); | ||
| } | ||
| (Some(Some(_)), None) => { | ||
| // we have already inferred the datatype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When can this happen? Like what situations would one instance of the parameter like $2 be resolved but another instance would not be 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens because we earlier treated LIMIT specially. We added it to the param_types but when we get to the Placeholder expression here, we see that the type was never inferred. If we weren't in an immutable method, we could do the actual type inference above instead of directly populating param_types and let it be appropriately populated here.
| let mut param_types: HashMap<String, Option<DataType>> = HashMap::new(); | ||
|
|
||
| self.apply_with_subqueries(|plan| { | ||
| if let LogicalPlan::Limit(Limit { fetch: Some(e), .. }) = plan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the expression in the LIMIT needs special handling. It seems like it should be handled by LogicalPlan::apply_expressions here https://github.com/apache/datafusion/blob/41e7aed3a943134c40d1b18cb9d424b358b5e5b1/datafusion/expr/src/logical_plan/tree_node.rs#L462-L461
Also this code doesn't seem to handle the offset clause, only the fetch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, when I removed this special case the tests you have added definitely fail 🤔 Something really strange is going on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My test is actually really unfair, I will modify it. I constructed LogicalPlan by hand and gave it no opportunity to do any placeholder type inference.
Expr::infer_placeholder_types takes an Expr and a DFSchema and it uses the schema to infer the datatype from the field type. However, there is no schema that will provide that information for LIMIT. It's a special case and we (I believe) always know what datatype it must be and it need not be inferred. I think the case is the same for offset as you mention. In fact, it looks like we try to coerce it to be Int64,
| /// Coerce the fetch and skip expression to Int64 type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What concerns me about the code is that code for inferring data type has a special case for LIMIT but from what I know LIMIT doesn't have any special behavior for parameters
If we need a schema to infer types for LIMIT we could either
- Use its
inputplan (though all schema would be ignored) - Use
Schema::empty()perhaps
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my last remaining concern -- if we can figure out some way that LIMIT doesn't need special case it would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree with you. I've got caught up working on some other things but I've been giving this some thought and plan to work on it. We will find something that we like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 🙏
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
…81) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
…81) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
…81) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
…81) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
…81) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980
* fix: Ensure only tables or aliases that exist are projected (#52) fix: More dangling references (#54) UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is. * Support for metadata columns (`location`, `size`, `last_modified`) in ListingTableProvider (#74) UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually apache#15181 * Infer placeholder datatype for `Expr::InSubquery` (#80) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980 * Infer placeholder datatype after `LIMIT` clause as `DataType::Int64` (#81) UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49 apache#15980 * Do not double alias Exprs UPSTREAM NOTE: This was attempted to be fixed with apache#15008 but was closed This is the tracking issue on DataFusion: apache#14895 Do not double alias Exprs * Add prefix to location metadata column (#82) UPSTREAM NOTE: This will not be upstreamed as is. * Infer placeholder types for CASE expressions (#87) UPSTREAM NOTE: This has not been submitted upstream yet. * Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88) UPSTREAM NOTE: This has not been submitted upstream yet. * Fix `Expr::infer_placeholder_types` inference to not fail (#89) UPSTREAM NOTE: This has not been submitted upstream yet. * cherry-pick parquet patch (#94) * Fix array types coercion: preserve child element nullability for list types (#96) UPSTREAM NOTE: This was submitted upstream and should be available in DF50 apache#17306 * Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88) UPSTREAM NOTE: This has not been submitted upstream yet. * do not enforce type guarantees on all Expr traversed in infer_placeholder_types (#97) * Use UDTF function args in `LogicalPlan::TableScan` name (#98) * use UDTF function args in LogicalPlan::TableScan name * update test snapshots * Implement timestamp_cast_dtype for SqliteDialect (#99) * Use text for sqlite timestamp * Add test * Custom timestamp format for DuckDB (#102) * Revert "cherry-pick parquet patch (#94)" This reverts commit d780cc2. * Support ExprNamed arguments to Scalar UDFs (#104) * support ExprNamed until 17379 ships * add same exprnamed lifting to udtf * resolve projection against `ListingTable` table_schema incl. partition columns (#106) * fix: Ensure ListingTable partitions are pruned when filters are not used (#108) * fix: Prune partitions when no filters are defined * fix: Backport for DF49: * review: Address comments * FileScanConfig: Preserve schema metadata across serde boundary (#107) * FileScanConfig: preserve schema metadata across serde boundary * add test * Merge conflict fixes UPSTREAM NOTE: this should not be upstreamed. This contains conflict fixes from various cherry-picks and differences in v50. * update arrow-rs fork UPSTREAM NOTE: this should not be upstreamed --------- Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech> Co-authored-by: Kevin Zimmerman <4733573+kczimm@users.noreply.github.com> Co-authored-by: sgrebnov <sergei.grebnov@gmail.com> Co-authored-by: jeadie <jack@spice.ai> Co-authored-by: Jack Eadie <jack.eadie0@gmail.com> Co-authored-by: Viktor Yershov <krinart@gmail.com> Co-authored-by: Viktor Yershov <viktor@spice.ai> Co-authored-by: David Stancu <david@spice.ai>
Which issue does this PR close?
LIMITclause #15978Expr::InSubquery#15979Rationale for this change
Increases functionality of parameterized queries.
What changes are included in this PR?
Placeholder datatype inference after
LIMITclause andExpr::InSubqueryis now supported.Are these changes tested?
Yes
Are there any user-facing changes?
Yes, users can now use parameterized queries in more places.