Project relation: do we always need to just add columns? #678

EpsilonPrime · 2024-08-08T23:10:26Z

Project relations are currently defined as adding columns to our output. Two Substrait implementations do not implement project relations this way (DuckDB and Datafusion) and the semantics of the plan still work if you keep this in mind (instead of using emit to remove the items you would add the columns you wish to keep instead). The two approaches have their advantages -- not copying allows you to just get the fields you want without needing to identify them explicitly in the output mapping and copying makes it easier to add an additional column.

I'm told that correcting the column handling in one engine will require a substantial amount of work (likely weeks) affecting the bookkeeping in every relation. Are there any potential alternatives such as additionally defining no copy project relation semantics (perhaps an option in common)? Or would this make the landscape more complicated for Substrait producers?

jacques-n · 2024-08-08T23:50:56Z

The original design was done to reduce repetition of reference expressions. Non-scientific observation from myself (and maybe others) was that most projects copy many columns while mutating/adding a few. Trying to embrace the less-is-more/convention over configuration, we picked the current behavior. This behavior also matches most operations including filter, join, sort, etc. Rather than the operation having special properties to manage the subset of fields that should be output, emit consistently works the same.

I'm not really understanding the difficulty for engines to comply with this. Either behavior can already expressed entirely within the project relation. I don't see how an option is any different than what is already possible without touching any other relations/transformations. If people want convenience methods to construct or consume these structures, I see that as the domain of the language libraries, not the format itself.

vbarua · 2024-08-08T23:58:44Z

perhaps an option in common

I think adding an option for this will result in more divergence in how systems handle it.

There is also prior art in how to handle this relatively straightforwardly in Isthmus, as the Calcite Project relation also has this behaviour.

When converting from Substrait To Calcite, for direct output mode the converter adds field references to all of the inputs to the start of the expression list. For remap mode, the converter checks if the remap field maps to an input field or an input expression.

When converting from Calcite to Substrait, we add a remap to the output Project along the lines of [<first_expression_field_ordinal>, ... <last_expression_field_ordinal>].

I think this scheme is relatively portable and only requires intervention when converting into and out of the systems Projects.

EpsilonPrime · 2024-08-27T19:38:41Z

Good discussion, the current design definitely works best here.

EpsilonPrime closed this as completed Aug 27, 2024

tokoko mentioned this issue Sep 3, 2024

fix: workaround for aggregates in datafusion ibis-project/ibis-substrait#1120

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project relation: do we always need to just add columns? #678

Project relation: do we always need to just add columns? #678

EpsilonPrime commented Aug 8, 2024

jacques-n commented Aug 8, 2024

vbarua commented Aug 8, 2024

EpsilonPrime commented Aug 27, 2024

Project relation: do we always need to just add columns? #678

Project relation: do we always need to just add columns? #678

Comments

EpsilonPrime commented Aug 8, 2024

jacques-n commented Aug 8, 2024

vbarua commented Aug 8, 2024

EpsilonPrime commented Aug 27, 2024