fix(substrait): remove optimize calls from substrait consumer #12800
cc @Blizzara @westonpace and @vbarua
Thanks @tokoko - I think this makes sense overall and is the right thing to do, we'll just need to ensure the extract_projection doesn't break as a result :)
LGTM, I left a couple more nits/ideas, feel free to look at them or disregard :)
Looks good to me, left a few comments (or perhaps more just random thoughts)
I refactored a fair bit to address the comments.
…something other than a TableScan
@alamb can you force a rerun on the clippy job? It seems to have failed on apt update. Thanks
LGTM as well, thanks @tokoko for bearing with all the comments! :)
@alamb just a reminder... this just needs the clippy job to be rerun. Should I do a dummy commit?
Another trick I have found is to close the PR and then reopen it. I will do so
Thank you @tokoko, and @Blizzara and @westonpace for the reviews 🙏
thanks
Not sure how one gets access to it, but I think you should have a "re-run failed jobs" button to re-run a single job w/o triggering the whole thing.
Yes, you can do this if you are a committer. But the author can also close/reopen the PR
Thanks again @tokoko
Which issue does this PR close?
This is a necessary prerequisite for #12798
Rationale for this change
The Substrait consumer should make a best effort at one-to-one translation w/o invoking the optimizer. Doing otherwise makes round-trip tests too complicated.
What changes are included in this PR?
When translating Substrait ReadRel nodes, the consumer constructs a dataframe first, applies projections with `select`, and hopes that subsequent `into_optimized_plan` calls will push the projections down to `TableScan`. In practice, this sometimes adds unnecessary projection nodes to the plan, and also unnecessary "pushed down" projections to `TableScan` even when Substrait doesn't specify any such thing.
This PR:
- removes `into_optimized_plan` calls from the consumer.
- changes `ensure_schema_compatability` to handcraft projections that are put into `TableScan` only when necessary (when the base_schema provided in ReadRel has fewer fields than the actual table schema).
Are these changes tested?
This is essentially a refactor and is covered by the existing tests. I had to alter some asserts in the schema compatibility tests, as the expected assertions were less than ideal in the first place (they contained duplicated projections).
Are there any user-facing changes?
Minor. Substrait plans are functionally the same, but may no longer contain unnecessary projections.
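For illustration only, the "project only when necessary" rule described under "What changes are included in this PR?" could be sketched roughly as below. This is a standalone sketch and not the actual consumer code: the helper name `scan_projection` and the use of plain field names in place of Arrow/Substrait schemas are assumptions made for the example.

```rust
/// Hypothetical helper (not part of DataFusion): decide which projection,
/// if any, should be placed on a table scan.
///
/// `table_fields` stands in for the actual table schema and `base_fields`
/// for the base_schema carried by the ReadRel. Real code would work with
/// schema types rather than bare names.
fn scan_projection(
    table_fields: &[&str],
    base_fields: &[&str],
) -> Result<Option<Vec<usize>>, String> {
    // Resolve every field requested by the base schema to its index in the
    // table schema, failing if the table does not provide it.
    let mut indices = Vec::with_capacity(base_fields.len());
    for name in base_fields {
        match table_fields.iter().position(|f| f == name) {
            Some(i) => indices.push(i),
            None => return Err(format!("field '{name}' not found in table schema")),
        }
    }

    // Only emit a projection when the base schema is narrower than the
    // table schema; otherwise leave the scan untouched.
    if base_fields.len() < table_fields.len() {
        Ok(Some(indices))
    } else {
        Ok(None)
    }
}
```

With a table of `a, b, c` and a base schema of `a, c`, this returns a projection over indices 0 and 2; when both schemas list the same fields it returns no projection at all, so nothing extra ends up on the scan.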