Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project relation: do we always need to just add columns? #678

Closed
EpsilonPrime opened this issue Aug 8, 2024 · 3 comments
Closed

Project relation: do we always need to just add columns? #678

EpsilonPrime opened this issue Aug 8, 2024 · 3 comments

Comments

@EpsilonPrime
Copy link
Member

Project relations are currently defined as adding columns to our output. Two Substrait implementations do not implement project relations this way (DuckDB and Datafusion) and the semantics of the plan still work if you keep this in mind (instead of using emit to remove the items you would add the columns you wish to keep instead). The two approaches have their advantages -- not copying allows you to just get the fields you want without needing to identify them explicitly in the output mapping and copying makes it easier to add an additional column.

I'm told that correcting the column handling in one engine will require a substantial amount of work (likely weeks) affecting the bookkeeping in every relation. Are there any potential alternatives such as additionally defining no copy project relation semantics (perhaps an option in common)? Or would this make the landscape more complicated for Substrait producers?

@jacques-n
Copy link
Contributor

The original design was done to reduce repetition of reference expressions. Non-scientific observation from myself (and maybe others) was that most projects copy many columns while mutating/adding a few. Trying to embrace the less-is-more/convention over configuration, we picked the current behavior. This behavior also matches most operations including filter, join, sort, etc. Rather than the operation having special properties to manage the subset of fields that should be output, emit consistently works the same.

I'm not really understanding the difficulty for engines to comply with this. Either behavior can already expressed entirely within the project relation. I don't see how an option is any different than what is already possible without touching any other relations/transformations. If people want convenience methods to construct or consume these structures, I see that as the domain of the language libraries, not the format itself.

@vbarua
Copy link
Member

vbarua commented Aug 8, 2024

perhaps an option in common

I think adding an option for this will result in more divergence in how systems handle it.

There is also prior art in how to handle this relatively straightforwardly in Isthmus, as the Calcite Project relation also has this behaviour.

When converting from Substrait To Calcite, for direct output mode the converter adds field references to all of the inputs to the start of the expression list. For remap mode, the converter checks if the remap field maps to an input field or an input expression.

When converting from Calcite to Substrait, we add a remap to the output Project along the lines of [<first_expression_field_ordinal>, ... <last_expression_field_ordinal>].

I think this scheme is relatively portable and only requires intervention when converting into and out of the systems Projects.

@EpsilonPrime
Copy link
Member Author

Good discussion, the current design definitely works best here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants