
Improve substrait NameTracker so it doesn't require uuids #17508

@alamb


The following PR adds UUIDs to certain Substrait identifiers to disambiguate them, but this may make the plans non-reproducible. @Blizzara has some ideas on how we can avoid the UUIDs:

FWIW, I looked a bit at what it'd take to fix the tracker. I think the core of the issue is that DF checks name ambiguity in two ways: there's the AmbiguousColumn exception you're running into, and then there is a validate_unique_names() function which gets called on the creation of the Project. The former needs unique non-qualified names, while the latter needs unique schema names (which can be qualified).
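To make the difference between the two checks concrete, here is a minimal sketch in plain Rust. It is a simplified model of the checks as described above, not DataFusion's actual implementation: the `QualifiedName` alias and the functions `schema_name`, `first_ambiguous_name`, and `first_duplicate_schema_name` are hypothetical names made up for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a qualified name: (optional qualifier, name part).
type QualifiedName = (Option<String>, String);

/// Schema name in this model: "qualifier.name" if qualified, else just "name".
fn schema_name(q: &QualifiedName) -> String {
    match &q.0 {
        Some(qualifier) => format!("{}.{}", qualifier, q.1),
        None => q.1.clone(),
    }
}

/// Models the first check: unqualified names (the name parts) must be unique.
/// Returns the first name part that appears twice, if any.
fn first_ambiguous_name(cols: &[QualifiedName]) -> Option<String> {
    let mut seen = HashSet::new();
    cols.iter()
        .map(|q| q.1.clone())
        .find(|name| !seen.insert(name.clone()))
}

/// Models the second check: schema names must be unique.
/// Returns the first schema name that appears twice, if any.
fn first_duplicate_schema_name(cols: &[QualifiedName]) -> Option<String> {
    let mut seen = HashSet::new();
    cols.iter()
        .map(schema_name)
        .find(|name| !seen.insert(name.clone()))
}
```

In this model, the columns ("A", "C") and ("B", "C") pass the schema-name check ("A.C" vs "B.C") but fail the unqualified-name check (both are "C"), so satisfying one check does not imply satisfying the other.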

An easy fix for the former would be to change name_for_alias() into qualified_name().1 (the name part of the qualified name) here:

match self.get_unique_name(expr.name_for_alias()?) {

However, that then regresses the latter check (including in the test case for this PR), since there will then be a Project node with an expr CAST(B.C as Utf8) with a qualified name ([no qualifier], "B.C") and a schema name "B.C", as well as a reference to the original column B.C with a qualified name ("B", "C") and also schema name "B.C". As the name parts of the two qualified names are different, the CAST expr wouldn't be renamed (after the change I propose), and it would then fail the validate_unique_names() check. So maybe for a proper fix, NameTracker would need to track both the schema name and the name part of the qualified name, and rename until both are unique.
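A rough sketch of that idea, in plain Rust with only the standard library, might look like the following. Everything here is an assumption made for illustration: `Names`, `DualNameTracker`, `track`, and the `__temp__` counter suffix are hypothetical and do not reflect the existing NameTracker API. The point is only that a deterministic counter, checked against both sets of names, could stand in for a UUID.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an expression's names (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub struct Names {
    pub qualifier: Option<String>,
    /// Name part of the qualified name.
    pub name: String,
}

impl Names {
    /// Schema name in this model: "qualifier.name" if qualified, else "name".
    pub fn schema_name(&self) -> String {
        match &self.qualifier {
            Some(q) => format!("{}.{}", q, self.name),
            None => self.name.clone(),
        }
    }
}

/// Tracks both the name parts and the schema names seen so far.
#[derive(Default)]
pub struct DualNameTracker {
    seen_names: HashSet<String>,
    seen_schema_names: HashSet<String>,
}

impl DualNameTracker {
    /// Returns the names to use for an expression. If either the name part or
    /// the schema name collides with an earlier one, the expression gets an
    /// unqualified alias with a counter suffix, retried until both are unique.
    pub fn track(&mut self, original: &Names) -> Names {
        let mut candidate = original.clone();
        let mut counter = 0;
        while self.seen_names.contains(&candidate.name)
            || self.seen_schema_names.contains(&candidate.schema_name())
        {
            counter += 1;
            candidate = Names {
                qualifier: None, // an alias carries no qualifier in this model
                name: format!("{}__temp__{}", original.name, counter),
            };
        }
        self.seen_names.insert(candidate.name.clone());
        self.seen_schema_names.insert(candidate.schema_name());
        candidate
    }
}
```

With this sketch, tracking the column ("B", "C") and then the CAST with names ([no qualifier], "B.C") leaves the column untouched and renames the CAST, because its schema name "B.C" is already taken even though its name part is new. Since the suffix comes from a counter rather than a UUID, running the same sequence through a fresh tracker gives the same names again.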

(A simple example of the behavior of the CAST and validate_unique_names() is that SELECT data.a, CAST(data.a as string) from data; also fails in datafusion-cli.)

Originally posted by @Blizzara in #17299 (comment)
