Consider introducing unique expression IDs in Logical/Physical plan #8379

Jefffrey · 2023-11-30T21:31:18Z

Is your feature request related to a problem or challenge?

In Spark, they have a concept of ExprId which is used to uniquely identify named expressions:

https://github.com/apache/spark/blob/9bb358b51e30b5041c0cd20e27cf995aca5ed4c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L41-L57

/**
 * A globally unique id for a given named expression.
 * Used to identify which attribute output by a relation is being
 * referenced in a subsequent computation.
 *
 * The `id` field is unique within a given JVM, while the `uuid` is used to uniquely identify JVMs.
 */
case class ExprId(id: Long, jvmId: UUID) {

  override def equals(other: Any): Boolean = other match {
    case ExprId(id, jvmId) => this.id == id && this.jvmId == jvmId
    case _ => false
  }

  override def hashCode(): Int = id.hashCode()

}

Is it worth as attempting to introduce something similar in DataFusion?

There are issues being caused by rules in the optimizer comparing directly on column name leading to bugs when duplicate names appear, such as #8374

If during the analysis of a plan we can assign unique numeric IDs for columns, we could check for column equality based on these IDs and not need to compare string names.

The obvious downside would be this seems like a large effort in refactoring, not to mention breaking changes.

Describe the solution you'd like

Consider introduction of unique ID for columns/expressions to potentially simplify optimization/planning code

Describe alternatives you've considered

Don't do this (large refactoring effort? breaking changes?)

Additional context

Just a thought I had bouncing in my head, would appreciate to hear more thoughts on this (even if this seems unfeasible), or if there was already some prior discussion on a similar topic

The text was updated successfully, but these errors were encountered:

alamb · 2023-12-01T20:37:00Z

I wonder what "unique" means ? Like every newly created Expr gets some sort of id?

Some examples:

// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = col("foo");

// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = expr1.clone()

Jefffrey · 2023-12-01T21:43:41Z

Actually I think I was off the mark on what ExprId is intended to do, it seems it would be more useful if there were a new LogicalExpr enum such as AttributeReference, which would refer to an expr from the parent plan by ExprId

Like given a logical plan:

Projection: a.int_col, b.double_col, CAST(a.date_string_col AS Utf8)
  Inner Join: a.int_col = b.int_col
    SubqueryAlias: a
      Projection: alltypes_plain.int_col, alltypes_plain.date_string_col
        Filter: alltypes_plain.id > Int32(1)
          TableScan: alltypes_plain projection=[id, int_col, date_string_col], partial_filters=[alltypes_plain.id > Int32(1)]
    SubqueryAlias: b
      Projection: alltypes_plain.int_col, alltypes_plain.double_col
        Filter: CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col
          TableScan: alltypes_plain projection=[tinyint_col, int_col, double_col], partial_filters=[CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col]

That top level projection has a.int_col as a Column for example, which when turned into physical plan needs to search the parent schema by name

https://github.com/apache/arrow-datafusion/blob/a6e6d3fab083839239ef81cf3a3546dd8929a541/datafusion/core/src/physical_planner.rs#L879-L891

Whereas with exprid's, it could be possible for a.int_col to be an AttributeReference which references the parent expr list to point to which expr it references by id.

And I think each new expr would have a new ID.

Honestly I could be way off the mark here on the usages/benefits of exprid 😅

It's just something I was thinking about, especially in relation to how verbose it can be to check if columns are the same when taking into account table, schema and catalog parts of the identifier for a column

See troubles with ambiguity check here Cross-schema/cross-catalog qualified column doesn't do ambiguity check #6012

So instead of having to find the original column of a projected column in a logical plan via name during logical optimization/physical planning, could have that done once off in an analyzer rule pass then afterwards use exprids

alamb · 2023-12-03T12:49:02Z

What you describe seems similar to the usize index in physical_plan Column references. And as I recall @houqp added exactly to handle this case of ambiguous schema 🤔

I believe Postgres has a similar way of addressing fields (it does so by index rather than column name) in its logical exprs.

The downside I remember from working wit postgres was that there were many potential hard to track down issues when the indexes got messed up.

It might be interesting to see what this would look like in DataFusion 🤔

tv42 · 2023-12-08T22:11:48Z

Related to this, SQLite and Postgres allow duplicate column names in results. From SQLite tests:

SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1

Datafusion errors out with

query failed: Error during planning: Projections require unique expression names but the expression "COUNT(*) AS col1" at position 0 and "Int64(-92) - Int64(-47) AS col1" at position 1 have the same name. Consider aliasing ("AS") one of them.

Maybe after/as part of implementing this, Datafusion should relax that restriction?

alamb · 2023-12-09T12:35:07Z

Maybe after/as part of implementing this, Datafusion should relax that restriction?

One challenge is that most of the arrow-rs functionality in RecordBatch, for example, assumes unique column names (find_by_name for example). I don't think the Arrow spec itself restricts names to be unique, but I think the assumption may run pretty deep in the implementations

tv42 · 2023-12-11T22:25:50Z

I assume you mean column_by_name, https://docs.rs/datafusion/latest/datafusion/common/arrow/record_batch/struct.RecordBatch.html#method.column_by_name?

Ecosystem survey:

SQLite picks the first matching column:

sqlite> select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1);
1

Postgres errors at column access time if there are duplicates:

postgres=# select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1) as foo;
ERROR:  column reference "col1" is ambiguous
LINE 1: select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47...
               ^

tv42 · 2024-04-22T17:23:14Z

This might be a duplicate of #6543?

alamb · 2024-04-22T20:22:04Z

@tv42 I agree it is certainly related

findepi · 2024-11-18T17:31:30Z

This would be great item in the broader scope of #12723, which intents to make DataFusion Logical Plans be state of the art.

Jefffrey added the enhancement New feature or request label Nov 30, 2023

tv42 mentioned this issue Dec 8, 2023

Tracking: run SQLite sqllogictest suite risinglightdb/sqllogictest-rs#148

Open

findepi mentioned this issue Nov 18, 2024

[Epic] Make DataFusion a reliable foundation for building query engines #12723

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider introducing unique expression IDs in Logical/Physical plan #8379

Consider introducing unique expression IDs in Logical/Physical plan #8379

Jefffrey commented Nov 30, 2023

alamb commented Dec 1, 2023

Jefffrey commented Dec 1, 2023

alamb commented Dec 3, 2023

tv42 commented Dec 8, 2023

alamb commented Dec 9, 2023

tv42 commented Dec 11, 2023 •

edited

Loading

tv42 commented Apr 22, 2024

alamb commented Apr 22, 2024

findepi commented Nov 18, 2024

Consider introducing unique expression IDs in Logical/Physical plan #8379

Consider introducing unique expression IDs in Logical/Physical plan #8379

Comments

Jefffrey commented Nov 30, 2023

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Dec 1, 2023

Jefffrey commented Dec 1, 2023

alamb commented Dec 3, 2023

tv42 commented Dec 8, 2023

alamb commented Dec 9, 2023

tv42 commented Dec 11, 2023 • edited Loading

tv42 commented Apr 22, 2024

alamb commented Apr 22, 2024

findepi commented Nov 18, 2024

tv42 commented Dec 11, 2023 •

edited

Loading