@brunal brunal commented Jun 26, 2025

Closes #16120

In SqlToRel::parse_join(), when handling JoinConstraint::Using, the
identifiers are normalized using IdentNormalizer::normalize().
That normalization lower-cases unquoted identifiers, and keeps the case
otherwise (but not the quotes).

Before this PR, the normalized column names were passed to
LogicalPlanBuilder::join_using() as strings. When each one goes through
LogicalPlanBuilder::normalize(), Column::From<String>() is called,
leading to Column::from_qualified_name(). Since the name it receives is
unqualified, it lower-cases it.

This means that if a join is USING("SOME_COLUMN_NAME"), we end up with a
Column { name: "some_column_name", ..}. In the end, the join fails, as
that lower-case column does not exist.
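
To make the failure mode concrete, here is a minimal sketch of the two normalization passes described above. It is a simplified stand-in, not DataFusion's code: `normalize_ident`, `ToyColumn`, and `using_key_before_fix` are hypothetical names that only mimic the case handling this description attributes to IdentNormalizer::normalize() and Column::from_qualified_name().

```rust
// Toy model of the double normalization. NOT DataFusion's implementation:
// every name here is hypothetical and only mimics the case handling.

/// Models the first pass: unquoted identifiers are lower-cased, quoted
/// identifiers keep their case but lose their quotes.
fn normalize_ident(raw: &str) -> String {
    if raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"') {
        raw[1..raw.len() - 1].to_string()
    } else {
        raw.to_lowercase()
    }
}

#[derive(Debug, PartialEq)]
struct ToyColumn {
    name: String,
}

impl ToyColumn {
    /// Models the second pass: the string arrives without quotes, so the
    /// conversion lower-cases it (assumed behavior, per the description).
    fn from_qualified_name(s: &str) -> Self {
        ToyColumn { name: s.to_lowercase() }
    }
}

/// The pre-fix path: normalize the identifier, then convert the resulting
/// string a second time.
fn using_key_before_fix(raw: &str) -> ToyColumn {
    ToyColumn::from_qualified_name(&normalize_ident(raw))
}
```

In this model, `using_key_before_fix("\"SOME_COLUMN_NAME\"")` yields a column named `some_column_name`, even though the first pass alone correctly produced `SOME_COLUMN_NAME`.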

With this PR, SqlToRel::parse_join() calls Column::from_name() on
each normalized column and passes those to
LogicalPlanBuilder::join_using(). Downstream, in
LogicalPlanBuilder::normalize(), there is no need to create the Column
objects from strings, and the bug does not happen.
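
The fixed flow can be sketched in the same simplified style. This is again a toy model under stated assumptions, not the actual patch: `ToyColumn`, `normalize_ident`, `using_keys_after_fix`, and `find_field` are hypothetical names, and `ToyColumn::from_name` is assumed to store the name verbatim, as this description implies for Column::from_name().

```rust
// Toy model of the fixed flow. NOT DataFusion's implementation: every name
// here is hypothetical.

#[derive(Debug, PartialEq)]
struct ToyColumn {
    name: String,
}

impl ToyColumn {
    /// Assumed to keep the name exactly as given (no case folding).
    fn from_name(name: impl Into<String>) -> Self {
        ToyColumn { name: name.into() }
    }
}

/// Same first pass as before: lower-case unquoted identifiers, strip the
/// quotes from quoted ones and keep their case.
fn normalize_ident(raw: &str) -> String {
    if raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"') {
        raw[1..raw.len() - 1].to_string()
    } else {
        raw.to_lowercase()
    }
}

/// The post-fix path: normalize exactly once, then wrap the result in a
/// column so that no later string conversion can alter its case.
fn using_keys_after_fix(raw_idents: &[&str]) -> Vec<ToyColumn> {
    raw_idents
        .iter()
        .map(|raw| ToyColumn::from_name(normalize_ident(raw)))
        .collect()
}

/// Case-sensitive lookup of a column in a list of schema field names.
fn find_field<'a>(schema: &'a [&'a str], col: &ToyColumn) -> Option<&'a str> {
    schema.iter().copied().find(|f| *f == col.name)
}
```

With a schema containing `SOME_COLUMN_NAME`, the key built from `"SOME_COLUMN_NAME"` (quoted) now resolves, because its case survives all the way to the lookup.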

This fixes a regression introduced in 304488d#diff-0762df7208dad0e830a8f0b389945d53ef011cac958582963ab58579caa038bd -- before that commit, Column::from_name() was called on each column name.

Additionally, I remove the genericity on Columns from LogicalPlanBuilder::join_using(). I believe that genericity is bug-prone, while not providing much value.
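
As a rough illustration of why the generic parameter is bug-prone, the sketch below contrasts the two signature shapes. It is a toy model, not the DataFusion API: `ToyColumn`, `join_using_generic`, and `join_using_strict` are hypothetical, and the lossy `From<&str>` impl is an assumption that stands in for the lower-casing string conversion described above.

```rust
// Toy contrast between a generic and a concrete `using_keys` parameter.
// NOT the DataFusion API: every name here is hypothetical.

#[derive(Debug, Clone, PartialEq)]
struct ToyColumn {
    name: String,
}

impl ToyColumn {
    /// Keeps the name exactly as given.
    fn from_name(name: impl Into<String>) -> Self {
        ToyColumn { name: name.into() }
    }
}

/// Assumed lossy conversion: building a column from a bare string
/// lower-cases it.
impl From<&str> for ToyColumn {
    fn from(s: &str) -> Self {
        ToyColumn { name: s.to_lowercase() }
    }
}

/// Old shape: anything convertible into a column is accepted, so the lossy
/// conversion happens silently inside the callee.
fn join_using_generic(using_keys: Vec<impl Into<ToyColumn> + Clone>) -> Vec<ToyColumn> {
    using_keys.into_iter().map(Into::into).collect()
}

/// New shape: callers must build the columns themselves and therefore pick
/// the constructor that matches what their identifiers mean.
fn join_using_strict(using_keys: Vec<ToyColumn>) -> Vec<ToyColumn> {
    using_keys
}
```

Passing `vec!["SOME_COLUMN_NAME"]` to the generic version silently produces `some_column_name`; the strict version does not compile with strings at all, which moves the decision to the call site.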

Bruno Cauet added 2 commits June 26, 2025 09:52
In SqlToRel::parse_join(), when handling JoinConstraint::Using, the
identifiers are normalized using IdentNormalizer::normalize().

That normalization lower-cases unquoted identifiers, and keeps the case
otherwise (but not the quotes).

Until this commit, the normalized column names were passed to
LogicalPlanBuilder::join_using() as strings. When each one goes through
LogicalPlanBuilder::normalize(), Column::From<String>() is called,
leading to Column::from_qualified_name(). Since the name it receives is
unqualified, it lower-cases it.

This means that if a join is USING("SOME_COLUMN_NAME"), we end up with a
Column { name: "some_column_name", ..}. In the end, the join fails, as
that lower-case column does not exist.

With this commit, SqlToRel::parse_join() calls Column::from_name() on
each normalized column and passes those to
LogicalPlanBuilder::join_using(). Downstream, in
LogicalPlanBuilder::normalize(), there is no need to create the Column
objects from strings, and the bug does not happen.

This fixes apache#16120.

Until this commit, LogicalPlanBuilder::join_using() accepted using_keys:
Vec<impl Into<Column> + Clone>.

This commit removes this, only allowing Vec<Column>.

Motivation: passing e.g. Vec<String> for using_keys is bug-prone, as the
case of the Strings can be modified when they are converted into Column.
That logic is acceptable for an ordinary column name that may be
qualified, but some column names cannot be qualified (e.g. USING keys).

This commit changes the API. However, potential users can trivially fix
their code by calling Column::from/from_qualified_name on their
using_keys. This forces them to think about what their identifiers
represent, which removes a class of potential bugs.

Additional bonus: shorter compilation time and smaller binary size.
@github-actions bot added the sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and core (Core DataFusion crate) labels on Jun 26, 2025

@alamb alamb left a comment

Looks good to me -- thank you @brunal

@alamb alamb merged commit 8d34abb into apache:main Jun 27, 2025
27 checks passed

Labels

core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner

Development

Successfully merging this pull request may close these issues.

commit 304488d3... (2025-02-05) broke JOIN ... USING("UPPERCASE_FIELD_NAME")
