-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(substrait): set ProjectRel output_mapping in producer #12495
feat(substrait): set ProjectRel output_mapping in producer #12495
Conversation
advanced_extension: None, | ||
}) | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is potentially a useful optimization, however it becomes a bit more complicated with the introduction of the output_mapping
because you a need to modify it along with the expressions. I've opted to simplify this for now to favour simplicity and correctness.
As well, I think this is better handled in when consuming plans and/or by the optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the datafusion optimizer already handles projection pushdown quite well -- so keeping the substrait producer simpler makes the most sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @vbarua -- these PRs are easy to review as they clearly explain why they are changing something, are well documented, and well tested 🏆
advanced_extension: None, | ||
}) | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the datafusion optimizer already handles projection pushdown quite well -- so keeping the substrait producer simpler makes the most sense to me
FYI @Blizzara |
Thanks again @vbarua |
}; | ||
|
||
if let Some(rel::RelType::Project(p1)) = root_rel.rel_type.as_ref() { | ||
// The WindowAggr outputs 4 columns, the Projection has 4 columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both are 3 columns?
Or well the window outputs 4 but the project outputs only 3.
I guess that's what you're saying here anyways 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that is a typo. It should have been
The WindowAggr outputs 4 columns, the Projection has 3 columns
^^^
which is what the assert below is actually (correctly) checking.
I'll make a note to myself to update this when I make the consumer changes.
let mut window_exprs = vec![]; | ||
|
||
// create a field reference for each input field | ||
let mut expressions = (0..window.input.schema().fields().len()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you could not add these ref expressions, and then use a direct emit (or remap starting from 0) instead, would that not be the same? That said it'd break roundtrip until you've gotten the consumer part done as well 😅
Which issue does this PR close?
Continues #12347
This PR sets the output_mapping field on ProjectRels for plans produced by DataFusion.
Rationale for this change
Given a plan
When creating the Substrait ProjectRel, DataFusion will now set the output mapping to [2,3,4] so that the Substrait ProjectRel will match the DataFusion behaviour of only outputting expressions.
By default, if the
emit_kind
field is not set with anoutput_mapping
, a Substrait ProjectRel emits all input columns followed by all expressions. This does not match the DataFusion behaviour.This change only includes producer changes in order to ease the migration for users. Because DataFusion ignores the
emit_kind
field when consuming plans, DataFusion will be able to consume plans that set the output mapping correctly and plans that don't.Once the consumer is updated to respect
emit_kind
, plans that don't set theoutput_mapping
will not work as before. By shipping the producer change by itself, we allow users who serialize plans between versions the opportunity to update all of their plans before making a consumer change that would break them.Are these changes tested?
Yes
Are there any user-facing changes?
Substrait plans with ProjectRels will now set the output_mapping field.