Fix ambiguous column names in substrait conversion as a result of literals having the same name during conversion. #17299

xanderbailey · 2025-08-22T22:01:59Z

Which issue does this PR close?

closes AmbiguousReference on project following join when expression name matches left or right column name #17294

Rationale for this change

Fix ambiguous column names in substrait conversion as a result of literals before and after a join being assigned the same name.

More information in the issue, but say you have a NULL literal before a join called "column1" and then you create a new NULL column after the join called column2, you would get an ambiguous column name errors like.

Error: SchemaError(AmbiguousReference { field: Column { relation: Some(Bare { table: "left" }), name: "UTF8(NULL)" } }, Some(""))

What changes are included in this PR?

Simply alias all literals as they're converted to have a UUID name.

Are these changes tested?

Yes.
Tested by using substrait-java with this query

./isthmus-cli/build/graal/isthmus --create "create table A (a int); create table B (a int, c int); create table C (a int, d int)" "select t.*, C.d, CAST(NULL AS VARCHAR) as e from (select a, CAST(NULL AS VARCHAR) as c from A UNION ALL select a, c from B) t LEFT JOIN C ON t.a = C.a"

Are there any user-facing changes?

…erals having the same names

alamb · 2025-08-24T10:49:32Z

Thak you for this contribution @xanderbailey

@LiaCastaneda and @lorenarosati , could you please help review this PR?

xanderbailey · 2025-08-26T11:11:24Z

Have changed the implementation to only apply these uuid aliases in Project plans which removes a lot of alias that aren't required. For example lowercase("foo") doesn't need to be lowercase("foo" AS uuid)

LiaCastaneda · 2025-08-26T13:22:04Z

Thak you for this contribution @xanderbailey

@LiaCastaneda and @lorenarosati , could you please help review this PR?

yes, will take a look between today and tomorrow 👀

LiaCastaneda

This makes sense to me, thanks @xanderbailey.
@alamb, can you trigger the CI to confirm all tests pass? There are some other tests that verify name disambiguation/deduplication, and should double check those are passing as well.

LiaCastaneda · 2025-08-26T15:52:14Z

datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs

+            // unique names but, it does not ensure that if a literal column exists in a previous
+            // project say before a join that it is deduplicated with respect to those columns.
+            let maybe_apply_alias = match e {
+                lit @ Expr::Literal(_, _) => {


I agree to just keep it to the observed case. maybe its a good idea to add a link to this PR as a comment?

Happy to add that

xanderbailey · 2025-08-28T23:12:47Z

Anything else needed from me?

LiaCastaneda

Sorry - forgot to approve!

xanderbailey · 2025-09-02T08:05:23Z

Sorry to be a bother, what's the process to get this merged?

LiaCastaneda · 2025-09-02T10:12:06Z

cc @alamb

alamb · 2025-09-04T13:30:10Z

Sorry to be a bother, what's the process to get this merged?

Sorry -- typically the process is to find a committer to approve and then merge. I clearly operate in such a capacity though I am trying to encourage more people to participate in the process too (there are 50 people with commit access): https://projects.apache.org/committee.html?datafusion

alamb

Seems good to me -- thank you @xanderbailey and @LiaCastaneda

cc @vbarua or @Blizzara in case you have any thoughts on this approach

vbarua

Left some minor comments and questions. The biggest thing is that using UUIDs for this can make the Datafusion plans a bit less readable, and potentially complicate testing both within and outside of Datafusion due to non-determinism in the UUID values. It's not something I would consider blocking though, because in practice it may not be an issue.

vbarua · 2025-09-04T21:32:47Z

datafusion/substrait/tests/cases/logical_plans.rs

+        // File generated with substrait-java's Isthmus:
+        // ./isthmus-cli/build/graal/isthmus --create "create table A (a int); create table B (a int, c int); create table C (a int, d int)" "select t.*, C.d, CAST(NULL AS VARCHAR) as e from (select a, CAST(NULL AS VARCHAR) as c from A UNION ALL select a, c from B) t LEFT JOIN C ON t.a = C.a"
+        let proto_plan =
+            read_json("tests/testdata/test_plans/null_literals.substrait.json");


minor: I would suggest naming the test and test file based on the property that we're trying to test, which is disambiguation of duplicate literal names in plans, instead of the contents of the file. A plan named null_literal doesn't really provide much information about what we're testing.

Changed to disambiguate_literals_with_same_name

vbarua · 2025-09-04T21:37:18Z

datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs

+            // project say before a join that it is deduplicated with respect to those columns.
+            // See: https://github.com/apache/datafusion/pull/17299
+            let maybe_apply_alias = match e {
+                lit @ Expr::Literal(_, _) => lit.alias(uuid::Uuid::new_v4().to_string()),


The one thing that's a bit wonky about this is that the usage of UUIDs injects a little bit of randomness into the conversion from Substrait plan to Datafusion plan. You're already dealing with this in your tests by using your filter for eliding UUID values:

let mut settings = insta::Settings::clone_current(); settings.add_filter( r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", "[UUID]", );

Using UUIDs makes it easy to guarantee that names are unique, but make the plans less readable and can complicate testing.

Figuring out a deterministic scheme for this would be nice. We could potentially apply the name tracker to the inputs of multi-input relations (i.e. JoinRel, CrossRel, SetRel). That's not as nice as the UUID solution you have because it would require extra handling in every multi-input relation, but it could potentially improve readability. I'm not wedded to this though. The UUID solutions works well enough and maybe in practice it won't be that much of an issue. If it does, we can always tweak the plan conversion later.

vbarua · 2025-09-04T21:47:12Z

datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs

+            // ambiguous names when the same literal is used before and after a join.
+            // The name tracker will ensure that two literals in the same project would have
+            // unique names but, it does not ensure that if a literal column exists in a previous
+            // project say before a join that it is deduplicated with respect to those columns.


To clarify my understanding of this. The name tracker guarantees name uniqueness in the output names of the relation it is applied to (Projects, Aggregates). In cases of relations that consume multiple inputs (Joins, Unions), if the individual inputs have names that are unique but duplicated between them, we get duplicate name issues in the output schema when we combine the input schemas?

Yes so the name tracker makes sure that if you have a project that creates two null string columns, it will create two unique names for those two columns. But, say you create one of those null columns before a join and then another in a project immediately after a join, the plan fails with an ambiguous column error because there is a UTF8(NULL) from say the left and then another UTF8(NULL) from the project after the join which has no source and it's therefore an ambiguous reference.

I think you're also correct that if you have a null in the left and the right side of a join then this will also be an issue today

But, say you create one of those null columns before a join and then another in a project immediately after a join, the plan fails with an ambiguous column error because there is a UTF8(NULL) from say the left and then another UTF8(NULL) from the project after the join which has no source and it's therefore an ambiguous reference.

Interesting, I would have expected

datafusion/datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs

Lines 74 to 78 in 3d5863b

let mut final_exprs: Vec<Expr> = vec![];

for index in 0..original_schema.fields().len() {

let e = Expr::Column(Column::from(original_schema.qualified_field(index)));

final_exprs.push(name_tracker.get_uniquely_named_expr(e)?);

}

to generate a unique name in that scenario.

I think the problem is the NameTracker doesn't ignore qualifiers, but the "ambiguous schema" check does. Thus if the input to the Project has e.g. "table1.NULL" column and adds a "NULL" column (from lit(NULL)), the NameTracker doesn't rename the newly added column, and then we get both table1.NULL and NULL columns which fails the ambiguous check.

I think my recommendation would be to make the NameTracker more robust instead, so that it ignores the qualifiers at least when there is also a non-qualified name. While this UUID-aliasing of literals seems like it should work for this specific case, I can imagine there might be some other case where the clash happens with non-literal columns (though I'm not able to come up with an example right now).

(Also hey 👋 @xanderbailey!)

I've updated some of the comments made by @vbarua. I'm happy to merge as is to unblock us and then I can have a go at improving the name tracker in a follow-up?

I think 1 is perfectly reasonable. This is still an improvement even if it doesn't solve every case, and we can always iterate it on it further.

Fine by me.

FWIW, I looked a bit at what it'd take to fix the tracker. I think a core of the issue is that DF checks name ambiguity in two ways: there's the AmbiguousColumn exception you're running into, and then there is a validate_unique_names() function which gets called on the creation of the Project. The former needs unique non-qualified names, while the latter needs unique schema names (which can be qualified).

An easy fix for the former would be to change name_for_alias() into qualified_name()._1 here

datafusion/datafusion/substrait/src/logical_plan/consumer/utils.rs

Line 398 in 1d9e138

match self.get_unique_name(expr.name_for_alias()?) {

. However, that then regresses the latter check (including in the test case for this PR), since there will then be a project node with an expr CAST(B.C as Utf8) with a qualified name ([no qualifier], "B.C") and a schema name "B.C", as well as a reference to the original column B.C with a qualified name ("B", "C") and also schema name "B.C". As the qualified name's name parts are different, it wouldn't be renamed (after the change I propose), and then it'd fail the validate_unique_names() check. So maybe for a proper fix, NameTracker would need to track both the schema name and the name-part of the qualified name, and rename until both are unique.

(A simple example of the behavior of the CAST and validate_unique_names() is that SELECT data.a, CAST(data.a as string) from data; also fails in datafusion-cli.)

@alamb seems like we're okay to merge as it is.

I filed a follow on ticket to track the improvement idea here:

Improve substrait NameTracker so it doesn't require uuids #17508

vbarua · 2025-09-04T21:54:41Z

datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs

            }
-            explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?);
+            // Since substrait removes aliases, we need to assign literals with a UUID alias to avoid
+            // ambiguous names when the same literal is used before and after a join.


In what context does "substrait removes aliases"? When converting from Datafusion to Substrait? This is a bit a of a weird statement to me because Substrait doesn't care about names at all.

Yeah maybe my phrasing isn't best here but whatever method you use to construct the substrait, it removes column names and aliases from within the plan which means literals columns are just assigned their default names (which I think come from arrow? I didn't get as far as finding where the UTF8(NULL) name comes from)

which means literals columns are just assigned their default names

Ok yeah, I figured there was some sort of default name thing going on. Another way to phrase this might be:

Substrait plans are ordinal based, so they do not provide names for columns. Names for columns are generated by Datafusion during conversion, and for literals Datafusion produces names based on the literal value. It is possible to construct valid Substrait plans that result in duplicated names if the same literal value is used in multiple relations. To avoid this issue, we alias literals with unique names.

vbarua · 2025-09-08T16:25:07Z

datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs

+            // ambiguous names when the same literal is used before and after a join.
+            // The name tracker will ensure that two literals in the same project would have
+            // unique names but, it does not ensure that if a literal column exists in a previous
+            // project say before a join that it is deduplicated with respect to those columns.


I think 1 is perfectly reasonable. This is still an improvement even if it doesn't solve every case, and we can always iterate it on it further.

alamb · 2025-09-10T18:50:26Z

I'll plan to merge this PR once the CI passes again

alamb · 2025-09-10T19:26:30Z

Thank you again @xanderbailey @LiaCastaneda @vbarua and @Blizzara

hareshkh · 2025-10-13T14:39:41Z

@alamb Can we please have this cherry-picked on top of 50.0.x release?

alamb · 2025-10-14T19:56:16Z

@alamb Can we please have this cherry-picked on top of 50.0.x release?

Hi @hareshkh -- we don't currently have a plan for a 50.3.0 release, but if you would like to help organize / coordinate it I will be happy to run the actual release/vote process

You can follow the model of #17849:

Create a tracking ticket (following Release DataFusion 50.2.0 (minor) #17849)
Create the backport PRs into the branch-50 branch

…erals having the same name during conversion. (apache#17299) * Fix ambigious column names in substrate conversion as a result of literals having the same names * move it to the project * do it only for projects * comment * fmt * comments --------- Co-authored-by: Xander Bailey <xbailey@palantir.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

alamb

FYI @hareshkh has a backport of this issue to branch-50:

#18077

…#18077) ## Which issue does this PR close? - Related to #18072 - Related to #17294 - Backports #17299 to branch-50 ## Rationale for this change Fix ambiguous column names in substrait conversion as a result of literals before and after a join being assigned the same name. More information in the issue, but say you have a NULL literal before a join called "column1" and then you create a new NULL column after the join called column2, you would get an ambiguous column name errors like. ``` Error: SchemaError(AmbiguousReference { field: Column { relation: Some(Bare { table: "left" }), name: "UTF8(NULL)" } }, Some("")) ``` ## What changes are included in this PR? Simply alias all literals as they're converted to have a UUID name. ## Are these changes tested? Yes. Tested by using substrait-java with this query ``` ./isthmus-cli/build/graal/isthmus --create "create table A (a int); create table B (a int, c int); create table C (a int, d int)" "select t.*, C.d, CAST(NULL AS VARCHAR) as e from (select a, CAST(NULL AS VARCHAR) as c from A UNION ALL select a, c from B) t LEFT JOIN C ON t.a = C.a" ``` ## Are there any user-facing changes? Co-authored-by: Xander <zander181@googlemail.com> Co-authored-by: Xander Bailey <xbailey@palantir.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

…#17299 (apache#18077) ## Which issue does this PR close? - Related to apache#18072 - Related to apache#17294 - Backports apache#17299 to branch-50 ## Rationale for this change Fix ambiguous column names in substrait conversion as a result of literals before and after a join being assigned the same name. More information in the issue, but say you have a NULL literal before a join called "column1" and then you create a new NULL column after the join called column2, you would get an ambiguous column name errors like. ``` Error: SchemaError(AmbiguousReference { field: Column { relation: Some(Bare { table: "left" }), name: "UTF8(NULL)" } }, Some("")) ``` ## What changes are included in this PR? Simply alias all literals as they're converted to have a UUID name. ## Are these changes tested? Yes. Tested by using substrait-java with this query ``` ./isthmus-cli/build/graal/isthmus --create "create table A (a int); create table B (a int, c int); create table C (a int, d int)" "select t.*, C.d, CAST(NULL AS VARCHAR) as e from (select a, CAST(NULL AS VARCHAR) as c from A UNION ALL select a, c from B) t LEFT JOIN C ON t.a = C.a" ``` ## Are there any user-facing changes? Co-authored-by: Xander <zander181@googlemail.com> Co-authored-by: Xander Bailey <xbailey@palantir.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> (cherry picked from commit e3f8e37)

github-actions bot added the substrait Changes to the substrait crate label Aug 22, 2025

Fix ambigious column names in substrate conversion as a result of lit…

d09586b

…erals having the same names

xanderbailey force-pushed the xb/fix_literals_substrait branch from 198888f to d09586b Compare August 22, 2025 22:02

Merge branch 'main' into xb/fix_literals_substrait

4e6688d

xbailey added 2 commits August 26, 2025 12:08

move it to the project

39b159a

do it only for projects

517425d

LiaCastaneda reviewed Aug 26, 2025

View reviewed changes

comment

5581ddf

xanderbailey requested a review from LiaCastaneda August 26, 2025 17:59

xanderbailey and others added 4 commits August 26, 2025 18:59

Merge branch 'main' into xb/fix_literals_substrait

52f6832

Merge branch 'main' into xb/fix_literals_substrait

619cced

Merge branch 'main' into xb/fix_literals_substrait

d39522c

fmt

32f10b0

LiaCastaneda approved these changes Aug 29, 2025

View reviewed changes

Merge branch 'main' into xb/fix_literals_substrait

0d13ab2

Merge branch 'main' into xb/fix_literals_substrait

40f2602

alamb approved these changes Sep 4, 2025

View reviewed changes

vbarua reviewed Sep 4, 2025

View reviewed changes

xbailey and others added 2 commits September 8, 2025 09:28

comments

aacbe31

Merge branch 'main' into xb/fix_literals_substrait

a6143f7

vbarua approved these changes Sep 8, 2025

View reviewed changes

alamb mentioned this pull request Sep 10, 2025

Improve substrait NameTracker so it doesn't require uuids #17508

Open

alamb merged commit 14a7ade into apache:main Sep 10, 2025
30 checks passed

alamb mentioned this pull request Oct 14, 2025

Fix bug in LimitPushPastWindows #18029

Merged

hareshkh mentioned this pull request Oct 15, 2025

Release DataFusion 50.3.0 (minor) #18072

Closed

26 tasks

alamb mentioned this pull request Oct 15, 2025

[branch-50] Fix ambiguous column names in substrait conversion #17299 #18077

Merged

alamb reviewed Oct 15, 2025

View reviewed changes

	let mut final_exprs: Vec<Expr> = vec![];
	for index in 0..original_schema.fields().len() {
	let e = Expr::Column(Column::from(original_schema.qualified_field(index)));
	final_exprs.push(name_tracker.get_uniquely_named_expr(e)?);
	}

Fix ambiguous column names in substrait conversion as a result of literals having the same name during conversion. #17299

Fix ambiguous column names in substrait conversion as a result of literals having the same name during conversion. #17299

Uh oh!

Conversation

xanderbailey commented Aug 22, 2025 • edited by alamb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

alamb commented Aug 24, 2025

Uh oh!

xanderbailey commented Aug 26, 2025

Uh oh!

LiaCastaneda commented Aug 26, 2025

Uh oh!

LiaCastaneda left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xanderbailey commented Aug 28, 2025

Uh oh!

LiaCastaneda left a comment

Choose a reason for hiding this comment

Uh oh!

xanderbailey commented Sep 2, 2025

Uh oh!

LiaCastaneda commented Sep 2, 2025

Uh oh!

alamb commented Sep 4, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

vbarua left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xanderbailey commented Aug 22, 2025 •

edited by alamb

Loading