Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(substrait): remove optimize calls from substrait consumer #12800

Merged
merged 7 commits into from
Oct 15, 2024

Conversation

tokoko
Copy link
Contributor

@tokoko tokoko commented Oct 7, 2024

Which issue does this PR close?

This is a necessary prerequisite for #12798

Rationale for this change

substrait consumer should make the best effort for one-to-one traslation w/o invoking optimizer. Doing otherwise makes round-trip tests too complicated.

What changes are included in this PR?

When translating substrait ReadRel nodes, consumer constructs a dataframe first, applies projections with select and hopes that subsequent into_optimized_plan calls with push projections down to TableScan. In practice, this sometimes adds unnecessary projection nodes to the plan and also unnecessary "pushed down" projections to TableScan even when substrait doesn't specify any such thing.

This PR:

  • removes into_optimized_plan calls from consumer.
  • removes unnecessary projection nodes introduced in ensure_schema_compatability, instead opting to handcraft projections that are put into TableScan only when necessary (when base_schema provided in ReadRel has fewer fields than the actual table schema).
  • Also fixes a bug during schema comparison in ensure_schema_compatability where comparison never yielded true because qualified and unqualified schemas were being compared. This led to consumer adding unnecessary projections to TableScan even when schemas match. Had to manually remove these projections from test assertions (incl. tpch queries).

Are these changes tested?

This is essentially a refactor, being covered by the existing tests. I had to alter some asserts in schema compatibility tests as expected assertions were less than ideal in the first place (duplicated projections)

Are there any user-facing changes?

minor, substrait plans are functionally the same, but may lack unnecessary projections.

@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

cc @Blizzara @westonpace and @vbarua

@Blizzara
Copy link
Contributor

Blizzara commented Oct 8, 2024

Thanks @tokoko - I think this makes sense overall and is the right thing to do, we'll just need to ensure the extract_projection doesn't break as a result :)

Copy link
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left couple more nits/ideas, feel free to look at them or disregard :)

datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/tests/cases/substrait_validations.rs Outdated Show resolved Hide resolved
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, left a few comments (or perhaps more just random thoughts)

datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
@tokoko
Copy link
Contributor Author

tokoko commented Oct 9, 2024

I refactored a fair bit to address the comments.

  • renamed apply_projection to apply_masking
  • Split ensure_schema_compatibility into two functions. new ensure_schema_compatibility takes two schemas and validates them only w/o applying any sort of transformations. Also this is now happening before we call apply_masking to the substrait schema, the reason being that even if some of the fields are being masked out, ReadRel's filter and best_effort_filter expressions might still depend on them as @vbarua pointed out.
  • The actual transform part from the original ensure_schema_compatibility function is now called apply_projection (i know... didn't have better ideas 😆) and is used to construct TableScan using masked substrait schema, no validations here anymore.

@tokoko
Copy link
Contributor Author

tokoko commented Oct 11, 2024

@alamb can you force a rerun on the clippy job? seems like it failed on apt update. thanks

Copy link
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well, thanks @tokoko for bearing with all the comments! :)

@tokoko
Copy link
Contributor Author

tokoko commented Oct 15, 2024

@alamb just a reminder... this just needs clippy job to be rerun. Should I do a dummy commit?

@alamb
Copy link
Contributor

alamb commented Oct 15, 2024

@alamb just a reminder... this just needs clippy job to be rerun. Should I do a dummy commit?

Another trick I have found is to close the PR and then reopen it. I will do so

@alamb alamb closed this Oct 15, 2024
@alamb alamb reopened this Oct 15, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tokoko and @Blizzara @Blizzara and @westonpace for the reviews 🙏

@tokoko
Copy link
Contributor Author

tokoko commented Oct 15, 2024

thanks

Another trick I have found is to close the PR and then reopen it. I will do so

not sure how one gets access to it, but I think you should have a re-run failed jobs button just to re-run a single job w/o triggering the whole thing.

@alamb
Copy link
Contributor

alamb commented Oct 15, 2024

not sure how one gets access to it, but I think you should have a re-run failed jobs button just to re-run a single job w/o triggering the whole thing.

yes, you can do this if you are a committer. But the author can also close/reopen the PR

@alamb alamb merged commit 5a0ea0b into apache:main Oct 15, 2024
46 of 47 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 15, 2024

Thanks again @tokoko

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants