@peasee peasee commented Nov 14, 2024

Which issue does this PR close?

Closes #13027

Rationale for this change

This change removes table references that do not exist at the query level where they are used. For example, SELECT ta.j1_id FROM (SELECT j1_id FROM j1 ta) is invalid: the subquery is un-aliased, so ta is not a valid reference in the top-level projection.

This is usually caused by derived subqueries, both un-aliased and aliased.

What changes are included in this PR?

  • For each select_to_sql_expr, collects the available table identifiers at that level from the projection and table joins.
  • Iterates over the projection and order by and compares them to the list of collected identifiers. For identifiers that are not found, strips their table reference leaving just the bare column (e.g. ta.j1_id -> j1_id).
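
The stripping step described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: identifiers are modelled as plain `Vec<String>` parts instead of the unparser's AST types, and `remove_dangling_qualifier` is a hypothetical helper name.

```rust
/// Hypothetical stand-in for a possibly qualified column reference,
/// e.g. ["ta", "j1_id"] for `ta.j1_id` and ["j1_id"] for a bare column.
type CompoundIdent = Vec<String>;

/// Strips the qualifier from `ident` when its table part is not one of the
/// identifiers available at the current query level.
fn remove_dangling_qualifier(ident: &mut CompoundIdent, available: &[String]) {
    if ident.len() < 2 {
        // A bare column has no table reference to check.
        return;
    }
    // The table reference is the part directly before the column name.
    let dangling = !available.contains(&ident[ident.len() - 2]);
    if dangling {
        // Keep only the last part, the column name.
        let column = ident.pop().expect("identifier has at least two parts");
        ident.clear();
        ident.push(column);
    }
}

fn main() {
    // Assumption for this sketch: only `j1` is visible at this level.
    let available = vec!["j1".to_string()];
    let mut projection: Vec<CompoundIdent> = vec![
        vec!["ta".to_string(), "j1_id".to_string()],
        vec!["j1".to_string(), "j1_id".to_string()],
    ];
    for ident in projection.iter_mut() {
        remove_dangling_qualifier(ident, &available);
    }
    println!("{:?}", projection);
}
```

In this sketch `ta.j1_id` loses its qualifier because `ta` is not available, while `j1.j1_id` is left untouched.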

Are these changes tested?

Yes. A collection of new plan-to-SQL roundtrip tests validates the changes.

Are there any user-facing changes?

No

@github-actions github-actions bot added the sql SQL Planner label Nov 14, 2024
@peasee peasee marked this pull request as draft November 14, 2024 02:37
  assert_eq!(
  actual,
- r#"SELECT sum(users.age), users."name" FROM (SELECT users."name", users.age FROM users) GROUP BY users."name""#
+ r#"SELECT sum(age), "name" FROM (SELECT users."name", users.age FROM users) GROUP BY "name""#
Member

🚀

@peasee peasee marked this pull request as ready for review November 14, 2024 05:50
@alamb alamb changed the title fix: Remove dangling table references fix: Remove dangling table references in unparser Nov 14, 2024
Contributor

@alamb alamb left a comment

Thanks for this contribution, @peasee, and for the review, @sgrebnov.

I may be missing something, but I left some suggestions for your consideration.

_ => None,
}
}
pub fn get_alias(&self) -> Option<String> {
Contributor

You can probably avoid a bunch of copies if you make this return a &str rather than a String -- if the caller needs an owned string, they can always copy it.

Suggested change
- pub fn get_alias(&self) -> Option<String> {
+ pub fn get_alias(&self) -> Option<&str> {

Contributor Author

It looks like Ident doesn't implement anything that would return a &str, so it needs a String intermediary. I'm also not sure which copies you're referring to: I don't make any copies of the values from collect_valid_idents. The return value from get_alias isn't cloned either; collect_valid_idents takes ownership of it.
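
For readers following the Option<String> versus Option<&str> question, here is a minimal sketch of what the borrowed form could look like. It assumes the identifier type keeps its text in an accessible `String` field; the `Ident` and `Relation` types below are hypothetical stand-ins written for this example, not the real AST types, so whether the suggestion applies directly to the PR's code is not settled here.

```rust
/// Hypothetical stand-in for an identifier type that stores its text in a
/// `String` field (an assumption of this sketch).
struct Ident {
    value: String,
}

/// Hypothetical stand-in for a relation that may carry an alias.
struct Relation {
    alias: Option<Ident>,
}

impl Relation {
    /// Borrows the alias text instead of allocating a new `String`.
    fn get_alias(&self) -> Option<&str> {
        self.alias.as_ref().map(|ident| ident.value.as_str())
    }
}

fn main() {
    let relation = Relation {
        alias: Some(Ident { value: "ta".to_string() }),
    };

    // Callers that only compare can use the borrowed form directly.
    assert_eq!(relation.get_alias(), Some("ta"));

    // Callers that need ownership copy explicitly, at the point of use.
    let owned: Option<String> = relation.get_alias().map(str::to_owned);
    assert_eq!(owned.as_deref(), Some("ta"));
}
```

The trade-off is that the returned `&str` is tied to the lifetime of `&self`, so a caller that stores the alias beyond that borrow still has to copy it.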


  let mut twj = select_builder.pop_from().unwrap();
- twj.relation(relation_builder);
+ twj.relation(relation_builder.clone());
Contributor

I don't understand why this needs to have a clone now 🤔

Contributor Author

Because twj.relation() takes ownership of relation_builder, the value is moved and can't be borrowed again later.

The only thing I use the relation builder for is to retrieve the list of all the identifiers, so I could probably do that before the twj.relation() call and then just pass those along, like:

let valid_idents = select_builder.collect_valid_idents();
twj.relation(relation_builder);

which shouldn't require a clone.
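
The ordering described above (borrow first, then move) can be illustrated with a small self-contained sketch. The builder types here are hypothetical stand-ins, and placing `collect_valid_idents` on the relation builder is an assumption made for the example; the real builders in the unparser have different fields and methods.

```rust
/// Hypothetical stand-in for the relation builder.
struct RelationBuilder {
    alias: Option<String>,
}

impl RelationBuilder {
    /// Returns the identifiers this relation makes available.
    /// Only borrows `self`, so the builder can still be moved afterwards.
    fn collect_valid_idents(&self) -> Vec<String> {
        self.alias.iter().cloned().collect()
    }
}

/// Hypothetical stand-in for the table-with-joins builder.
#[derive(Default)]
struct TableWithJoinsBuilder {
    relation: Option<RelationBuilder>,
}

impl TableWithJoinsBuilder {
    /// Takes ownership of the relation builder, mirroring a by-value setter.
    fn relation(&mut self, relation: RelationBuilder) {
        self.relation = Some(relation);
    }
}

fn main() {
    let relation_builder = RelationBuilder {
        alias: Some("ta".to_string()),
    };
    let mut twj = TableWithJoinsBuilder::default();

    // Borrow first: gather what is needed while the builder is still owned here.
    let valid_idents = relation_builder.collect_valid_idents();
    // Then move: nothing reads `relation_builder` afterwards, so no clone is needed.
    twj.relation(relation_builder);

    assert_eq!(valid_idents, vec!["ta".to_string()]);
    assert!(twj.relation.is_some());
}
```

Swapping the two statements in `main` would fail to compile with a use-after-move error, which is the situation the `.clone()` in the diff works around.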


/// Takes an input list of identifiers and a list of identifiers that are available from relations or joins.
/// Removes any table identifiers that are not present in the list of available identifiers, retains original column names.
pub fn remove_dangling_identifiers(idents: &mut Vec<Ident>, available_idents: &[String]) {
Contributor

I don't understand this code super deeply, but it seems to me that it treats the symptom (incorrect qualifiers) rather than the root cause.

Specifically, did you look into fixing the code so that it doesn't create incorrect identifiers in the first place, rather than modifying the created AST after the fact to remove them?

Contributor Author

Yeah, I did look into doing this at the unparser LogicalPlan level, but I wasn't making much progress. It could be my limited understanding of LogicalPlan, but I think the symptom originates in the parser rather than the unparser.

If you'd be open to merging this still as an AST modifier, perhaps we could gate it behind a dialect option or feature flag as a non-default?

@alamb alamb added the unparser Changes to the unparser crate label Nov 20, 2024
@alamb
Contributor

alamb commented Nov 23, 2024

Maybe @sgrebnov or @phillipleblanc can offer some advice about how to proceed with this PR -- I feel like it adds significant complexity, and I am not sure it doesn't also introduce some subtle, hard-to-understand bugs with identifiers.

@phillipleblanc
Contributor

> Maybe @sgrebnov or @phillipleblanc can offer some advice about how to proceed with this PR -- I feel like it adds significant complexity, and I am not sure it doesn't also introduce some subtle, hard-to-understand bugs with identifiers.

I plan to take a shot at preventing the identifiers from getting introduced in the first place and/or adding subquery alias nodes as appropriate. I spent a little time on it this weekend and was able to get the bare aggregation + table scan working.

@peasee let's close this PR for now.

@peasee peasee closed this Nov 25, 2024
phillipleblanc pushed a commit to spiceai/datafusion that referenced this pull request Apr 8, 2025
fix: More dangling references (#54)

* fix: More dangling references

* test: Add tests for remove_dangling_identifiers

UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is.
phillipleblanc pushed a commit to spiceai/datafusion that referenced this pull request Apr 25, 2025
fix: More dangling references (#54)

sgrebnov pushed a commit to spiceai/datafusion that referenced this pull request May 22, 2025
fix: More dangling references (#54)

# Conflicts:
#	datafusion/sql/src/unparser/ast.rs
#	datafusion/sql/tests/cases/plan_to_sql.rs
sgrebnov pushed a commit to spiceai/datafusion that referenced this pull request May 26, 2025
fix: More dangling references (#54)

kczimm pushed a commit to spiceai/datafusion that referenced this pull request Aug 19, 2025
fix: More dangling references (#54)

kczimm pushed a commit to spiceai/datafusion that referenced this pull request Aug 21, 2025
fix: More dangling references (#54)

Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 9, 2025
fix: More dangling references (#54)

Jeadie pushed a commit to spiceai/datafusion that referenced this pull request Sep 12, 2025
fix: More dangling references (#54)

peasee added a commit to spiceai/datafusion that referenced this pull request Oct 27, 2025
fix: More dangling references (#54)

peasee added a commit to spiceai/datafusion that referenced this pull request Oct 27, 2025
* fix: Ensure only tables or aliases that exist are projected (#52)
fix: More dangling references (#54)

UPSTREAM NOTE: This PR was attempted to be upstreamed in apache#13405 - but it was not accepted due to the complexity it brought. Phillip needs to figure out what a good solution that solves our problem and can be upstreamed is.

* Support for metadata columns (`location`, `size`, `last_modified`)  in ListingTableProvider (#74)

UPSTREAM NOTE: This PR was attempted to be upstreamed but was not accepted. Needs to be applied manually
apache#15181

* Infer placeholder datatype for `Expr::InSubquery` (#80)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980

* Infer placeholder datatype after `LIMIT` clause as `DataType::Int64` (#81)

UPSTREAM NOTE: Upstream PR has been created but not merged yet. Should be available in DF49
apache#15980

* Do not double alias Exprs

UPSTREAM NOTE: This was attempted to be fixed with
apache#15008 but was closed

This is the tracking issue on DataFusion:
apache#14895
Do not double alias Exprs

* Add prefix to location metadata column (#82)

UPSTREAM NOTE: This will not be upstreamed as is.

* Infer placeholder types for CASE expressions (#87)

UPSTREAM NOTE: This has not been submitted upstream yet.

* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88)

UPSTREAM NOTE: This has not been submitted upstream yet.

* Fix `Expr::infer_placeholder_types` inference to not fail (#89)

UPSTREAM NOTE: This has not been submitted upstream yet.

* cherry-pick parquet patch (#94)

* Fix array types coercion: preserve child element nullability for list types (#96)

UPSTREAM NOTE: This was submitted upstream and should be available in DF50

apache#17306

* Expand `infer_placeholder_types` to infer all possible placeholder types based on their expression (#88)

UPSTREAM NOTE: This has not been submitted upstream yet.

* do not enforce type guarantees on all Expr traversed in infer_placeholder_types (#97)

* Use UDTF function args in `LogicalPlan::TableScan` name (#98)

* use UDTF function args in LogicalPlan::TableScan name

* update test snapshots

* Implement timestamp_cast_dtype for SqliteDialect (#99)

* Use text for sqlite timestamp

* Add test

* Custom timestamp format for DuckDB (#102)

* Revert "cherry-pick parquet patch (#94)"

This reverts commit d780cc2.

* Support ExprNamed arguments to Scalar UDFs (#104)

* support ExprNamed until 17379 ships

* add same exprnamed lifting to udtf

* resolve projection against `ListingTable` table_schema incl. partition columns (#106)

* fix: Ensure ListingTable partitions are pruned when filters are not used (#108)

* fix: Prune partitions when no filters are defined

* fix: Backport for DF49:

* review: Address comments

* FileScanConfig: Preserve schema metadata across serde boundary (#107)

* FileScanConfig: preserve schema metadata across serde boundary

* add test

* Merge conflict fixes

UPSTREAM NOTE: this should not be upstreamed. This contains conflict fixes from various cherry-picks and differences in v50.

* update arrow-rs fork

UPSTREAM NOTE: this should not be upstreamed

---------

Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Co-authored-by: Kevin Zimmerman <4733573+kczimm@users.noreply.github.com>
Co-authored-by: sgrebnov <sergei.grebnov@gmail.com>
Co-authored-by: jeadie <jack@spice.ai>
Co-authored-by: Jack Eadie <jack.eadie0@gmail.com>
Co-authored-by: Viktor Yershov <krinart@gmail.com>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
Co-authored-by: David Stancu <david@spice.ai>
Labels

sql SQL Planner unparser Changes to the unparser crate

Development

Successfully merging this pull request may close these issues.

Table references in subqueries cause invalid references in top-level projection

4 participants