
Conversation

Contributor

@kczimm kczimm commented May 7, 2025

Which issue does this PR close?

Rationale for this change

Increases functionality of parameterized queries.

What changes are included in this PR?

Placeholder datatype inference after a LIMIT clause and inside Expr::InSubquery is now supported.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, users can now use parameterized queries in more places.

Contributor

@alamb alamb left a comment

Thanks @kczimm -- this looks great to me except for the special case for limit -- something seems to be wrong there. Would you be willing to help debug it further?

let fields = subquery_schema.fields();

// only supports subquery with exactly 1 field
if let [first_field] = &fields[..] {
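A standalone sketch of the slice-pattern check above: a type can be inferred only when the subquery schema has exactly one field. `Field` here is a hypothetical stand-in for the arrow `Field` type, not DataFusion's actual definition.

```rust
// Hypothetical stand-in for the arrow Field type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

/// Returns the single field if the slice has exactly one, else None.
/// This mirrors the `if let [first_field] = &fields[..]` idiom above:
/// with zero or multiple fields no type is inferred and the
/// placeholder is left unrewritten.
fn single_field(fields: &[Field]) -> Option<&Field> {
    if let [first_field] = fields {
        Some(first_field)
    } else {
        None
    }
}
```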
Contributor

What happens if there is more than one field? Will it not rewrite any placeholders?

Contributor Author

param_types.insert(id.clone(), Some(dt.clone()));
}
(Some(Some(_)), None) => {
// we have already inferred the datatype
Contributor

When can this happen? Like what situations would one instance of the parameter like $2 be resolved but another instance would not be 🤔

Contributor Author

This happens because we earlier treated LIMIT specially. We added it to the param_types but when we get to the Placeholder expression here, we see that the type was never inferred. If we weren't in an immutable method, we could do the actual type inference above instead of directly populating param_types and let it be appropriately populated here.
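A minimal, self-contained sketch of the two-phase flow the author describes, with `String` standing in for DataFusion's `DataType`: the LIMIT special case pre-populates `param_types` with a known type, so by the time the placeholder expression itself is visited, the `(Some(Some(_)), None)` arm fires and there is nothing left to infer.

```rust
use std::collections::HashMap;

/// Merge what is already recorded for a placeholder (e.g. the entry
/// pre-populated by the LIMIT special case) with what expression
/// traversal later infers. `String` stands in for DataType.
fn observe_placeholder(
    param_types: &mut HashMap<String, Option<String>>,
    id: &str,
    inferred: Option<String>,
) {
    // Clone the existing entry so the map can be mutated in the arms.
    let existing: Option<Option<String>> = param_types.get(id).cloned();
    match (existing, inferred) {
        // First concrete type seen for this placeholder: record it.
        (None, Some(dt)) | (Some(None), Some(dt)) => {
            param_types.insert(id.to_string(), Some(dt));
        }
        // Already inferred earlier (e.g. pre-populated for LIMIT);
        // the traversal itself learned nothing new here.
        (Some(Some(_)), None) => {}
        // Placeholder seen, but no type information from either side.
        (None, None) => {
            param_types.insert(id.to_string(), None);
        }
        _ => {}
    }
}
```

For example, pre-populating `$1` as `Int64` and then visiting it with no inferred type leaves the pre-populated type intact, which is exactly the arm discussed above.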

let mut param_types: HashMap<String, Option<DataType>> = HashMap::new();

self.apply_with_subqueries(|plan| {
if let LogicalPlan::Limit(Limit { fetch: Some(e), .. }) = plan {
Contributor

I don't understand why the expression in the LIMIT needs special handling. It seems like it should be handled by LogicalPlan::apply_expressions here https://github.com/apache/datafusion/blob/41e7aed3a943134c40d1b18cb9d424b358b5e5b1/datafusion/expr/src/logical_plan/tree_node.rs#L462-L461

Also, this code doesn't seem to handle the offset clause, only the fetch.
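To illustrate the reviewer's point about OFFSET, a sketch using stand-in types (not DataFusion's actual `Limit` or `Expr`): a visitor that only inspects `fetch` would miss a placeholder used as the OFFSET, so both fields need to be walked.

```rust
/// Illustrative stand-ins, not DataFusion's actual types.
#[allow(dead_code)]
#[derive(Debug)]
enum Expr {
    Placeholder(String),
    Literal(i64),
}

struct Limit {
    skip: Option<Expr>,  // OFFSET clause
    fetch: Option<Expr>, // LIMIT clause
}

/// Collect placeholder ids from BOTH fetch and skip; code that only
/// looks at `fetch` would silently skip an OFFSET placeholder.
fn limit_placeholders(limit: &Limit) -> Vec<String> {
    let mut ids = Vec::new();
    for e in [&limit.fetch, &limit.skip] {
        if let Some(Expr::Placeholder(id)) = e {
            ids.push(id.clone());
        }
    }
    ids
}
```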

Contributor

However, when I removed this special case the tests you have added definitely fail 🤔 Something really strange is going on

Contributor Author

@kczimm kczimm commented May 8, 2025

My test is actually unfair; I will modify it. I constructed the LogicalPlan by hand, which gave it no opportunity to do any placeholder type inference.

Expr::infer_placeholder_types takes an Expr and a DFSchema, and it uses the schema to infer the datatype from the field type. However, no schema will provide that information for LIMIT. It's a special case: we (I believe) always know what datatype it must be, so it need not be inferred. I think the same holds for offset, as you mention. In fact, it looks like we try to coerce it to Int64:

/// Coerce the fetch and skip expression to Int64 type.
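Since the planner coerces LIMIT/OFFSET to Int64 anyway, a placeholder appearing there can be typed without consulting any schema. A hedged sketch of that reasoning, with illustrative names and `String` standing in for `DataType`:

```rust
/// Where a placeholder appears determines whether a schema is needed.
/// These names are illustrative, not DataFusion's actual API.
enum PlaceholderContext {
    /// LIMIT/OFFSET: the planner coerces the expression to Int64,
    /// so the type is known without any schema lookup.
    LimitOrOffset,
    /// A comparison against a column: the type must come from the
    /// field's type in the schema, if one is known.
    Column { field_type: Option<String> },
}

fn infer_placeholder_type(ctx: &PlaceholderContext) -> Option<String> {
    match ctx {
        PlaceholderContext::LimitOrOffset => Some("Int64".to_string()),
        PlaceholderContext::Column { field_type } => field_type.clone(),
    }
}
```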

Contributor

What concerns me about the code is that the type-inference code has a special case for LIMIT, but as far as I know LIMIT doesn't have any special behavior for parameters.

If we need a schema to infer types for LIMIT, we could either:

  1. Use its input plan's schema (though the fields would be ignored)
  2. Use Schema::empty(), perhaps

Does that make sense?

Contributor

This is my last remaining concern -- if we can figure out some way that LIMIT doesn't need a special case, it would be great.

Contributor Author

Yeah, I agree with you. I got caught up working on some other things, but I've been giving this some thought and plan to work on it. We will find something that we like.

Contributor

Thank you 🙏

sgrebnov pushed a commit to spiceai/datafusion that referenced this pull request May 26, 2025
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
sgrebnov pushed a commit to spiceai/datafusion that referenced this pull request May 26, 2025
…81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
@alamb alamb marked this pull request as draft June 3, 2025 16:16
@alamb
Contributor

alamb commented Jun 3, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@github-actions

github-actions bot commented Aug 5, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Aug 5, 2025
@github-actions github-actions bot closed this Aug 14, 2025
kczimm added a commit to spiceai/datafusion that referenced this pull request Aug 19, 2025
…81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 9, 2025
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 9, 2025
…81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 12, 2025
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 12, 2025
…81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
peasee pushed a commit to spiceai/datafusion that referenced this pull request Oct 27, 2025
UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
peasee pushed a commit to spiceai/datafusion that referenced this pull request Oct 27, 2025
…81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980
peasee added a commit to spiceai/datafusion that referenced this pull request Oct 27, 2025
* fix: Ensure only tables or aliases that exist are projected (#52)
fix: More dangling references (#54)

UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is.

* Support for metadata columns (`location`, `size`, `last_modified`)  in ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181

* Infer placeholder datatype for `Expr::InSubquery` (#80)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980

* Infer placeholder datatype after `LIMIT` clause as `DataType::Int64` (#81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980

* Do not double alias Exprs

UPSTREAM NOTE: This was attempted to be fixed with
apache#15008 but was closed

This is the tracking issue on DataFusion:
apache#14895 (Do not double alias Exprs)

* Add prefix to location metadata column (#82)

UPSTREAM NOTE: This will not be upstreamed as is.

* Infer placeholder types for CASE expressions (#87)

UPSTREAM NOTE: This has not been submitted upstream yet.

* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88)

UPSTREAM NOTE: This has not been submitted upstream yet.

* Fix `Expr::infer_placeholder_types` inference to not fail (#89)

UPSTREAM NOTE: This has not been submitted upstream yet.

* cherry-pick parquet patch (#94)

* Fix array types coercion: preserve child element nullability for list types (#96)

UPSTREAM NOTE: This was submitted upstream and should be available in DF50

apache#17306

* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88)

UPSTREAM NOTE: This has not been submitted upstream yet.

* do not enforce type guarantees on all Expr traversed in infer_placeholder_types (#97)

* Use UDTF function args in `LogicalPlan::TableScan` name (#98)

* use UDTF function args in LogicalPlan::TableScan name

* update test snapshots

* Implement timestamp_cast_dtype for SqliteDialect (#99)

* Use text for sqlite timestamp

* Add test

* Custom timestamp format for DuckDB (#102)

* Revert "cherry-pick parquet patch (#94)"

This reverts commit d780cc2.

* Support ExprNamed arguments to Scalar UDFs (#104)

* support ExprNamed until 17379 ships

* add same exprnamed lifting to udtf

* resolve projection against `ListingTable` table_schema incl. partition columns (#106)

* fix: Ensure ListingTable partitions are pruned when filters are not used (#108)

* fix: Prune partitions when no filters are defined

* fix: Backport for DF49:

* review: Address comments

* FileScanConfig: Preserve schema metadata across serde boundary (#107)

* FileScanConfig: preserve schema metadata across serde boundary

* add test

* Merge conflict fixes

UPSTREAM NOTE: this should not be upstreamed. This contains conflict fixes from various cherry-picks and differences in v50.

* update arrow-rs fork

UPSTREAM NOTE: this should not be upstreamed

---------

Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Co-authored-by: Kevin Zimmerman <4733573+kczimm@users.noreply.github.com>
Co-authored-by: sgrebnov <sergei.grebnov@gmail.com>
Co-authored-by: jeadie <jack@spice.ai>
Co-authored-by: Jack Eadie <jack.eadie0@gmail.com>
Co-authored-by: Viktor Yershov <krinart@gmail.com>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: David Stancu <david@spice.ai>

Labels

logical-expr (Logical plan and expressions), Stale (PR has not had any activity for some time)


Development

Successfully merging this pull request may close these issues.

Placeholder datatype not inferred for Expr::InSubquery
Placeholder datatype not inferred after LIMIT clause

2 participants