Projection Expression - Input Field Inconsistencies during Projection #10088
@@ -67,6 +67,9 @@ impl ProjectionMapping {
    // Conceptually, `source_expr` and `expression` should be the same.
    let idx = col.index();
    let matching_input_field = input_schema.field(idx);
    if col.name() != matching_input_field.name() {
        return Err(DataFusionError::Internal(format!("Input field name {} does not match with the projection expression {}", matching_input_field.name(), col.name())));
    }
This check did not exist before. Now, input field names and projection expressions must match.
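For readers without the DataFusion sources at hand, here is a minimal, dependency-free sketch of the idea behind the check. The `Field`, `Column`, and `ProjectionError` types and the `check_column` function are hypothetical stand-ins invented for illustration; they are not DataFusion's actual types or API.

```rust
/// Hypothetical stand-in for a schema field; only the name matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
}

/// Hypothetical stand-in for a physical column expression:
/// a name plus the index of the input field it reads.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub name: String,
    pub index: usize,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ProjectionError {
    IndexOutOfBounds { index: usize, len: usize },
    NameMismatch { field: String, column: String },
}

/// Fail fast if the column's name differs from the name of the
/// input field found at the column's index.
pub fn check_column(col: &Column, input_schema: &[Field]) -> Result<(), ProjectionError> {
    let field = input_schema
        .get(col.index)
        .ok_or(ProjectionError::IndexOutOfBounds {
            index: col.index,
            len: input_schema.len(),
        })?;
    if col.name != field.name {
        return Err(ProjectionError::NameMismatch {
            field: field.name.clone(),
            column: col.name.clone(),
        });
    }
    Ok(())
}
```

In this sketch a column that points at the right index but carries a stale name is rejected immediately, which mirrors the intent described in the comment above.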
LGTM
@alamb, can you PTAL? This will start generating errors for plans that fail to treat names correctly, so it will have downstream effects.
Looks like a great job to me
02)--AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]], ordering_mode=Sorted
03)----SortExec: expr=[a@0 ASC NULLS LAST]
04)------CoalesceBatchesExec: target_batch_size=8192
05)--------RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=4
06)----------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1))], ordering_mode=Sorted
06)----------AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[NTH_VALUE(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]], ordering_mode=Sorted
👍
@@ -895,6 +891,31 @@ fn convert_to_sort_cols(
        .collect::<Vec<_>>()
}

fn replace_order_by_clause(order_by: &mut String) {
Rather than doing string manipulation here, maybe we could call `create_function_physical_name` to just create the right name in the first place 🤔
I see -- `create_function_physical_name` doesn't have sufficient information (`Expr`s etc.) to do this.
I suppose the alternative is to remember the relevant parts of the expression, but that also seems brittle.
I can't think of anything better at the moment.
I think erroring when there is a mismatch between physical schemas is a reasonable thing to do (it is probably better to fail fast than to hit some other, harder-to-diagnose error later on).
Thanks @berkaysynnada
…apache#10088)
* agg fixes
* test updates
* fixing count mismatch
* Update aggregate_statistics.rs
* catch different names
* minor
# Description
Updates the arrow and datafusion dependencies to 52 and 39(-rc1) respectively. This is necessary for updating pyo3. While most changes were trivial, some required big rewrites. Namely, the logic for the Updates operation had to be rewritten (and simplified) to accommodate some new sanity checks inside datafusion (apache/datafusion#10088).
Depends on delta-kernel having its arrow and object-store version bumped as well.
This PR doesn't include any major changes for pyo3; I'll open a separate PR depending on this PR.
# Related Issue(s)
# Documentation
---------
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Which issue does this PR close?
Closes #.
Rationale for this change
During projection, three cases were found where the input field names do not match the projection expressions.
These are not bugs in themselves, since the projection does not check names during execution. However, some optimizer rules add projections by looking at the input schema, and these inconsistencies then cause errors.
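To illustrate why such a mismatch can stay hidden until an optimizer rule runs, here is a small, self-contained sketch. The `ColumnExpr` type, the helper functions, and the column names `COUNT(*)` and `COUNT(1)` are hypothetical and chosen only for illustration; they are not taken from the PR. The assumption is that a rule derives columns from the input schema and then compares them, by name and index, with the expressions an existing projection already holds.

```rust
/// Hypothetical column expression: equality covers both name and index.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnExpr {
    pub name: String,
    pub index: usize,
}

/// What a hypothetical optimizer rule might do: build one column
/// per input field, taking names straight from the input schema.
pub fn columns_from_schema(field_names: &[&str]) -> Vec<ColumnExpr> {
    field_names
        .iter()
        .enumerate()
        .map(|(index, name)| ColumnExpr {
            name: name.to_string(),
            index,
        })
        .collect()
}

/// Return the existing projection expressions that have no equal
/// counterpart among the schema-derived columns.
pub fn unmatched(existing: &[ColumnExpr], derived: &[ColumnExpr]) -> Vec<ColumnExpr> {
    existing
        .iter()
        .filter(|c| !derived.contains(c))
        .cloned()
        .collect()
}
```

With an input schema of `["a", "COUNT(*)"]` and an existing projection that refers to index 1 as `COUNT(1)`, `unmatched` reports that expression even though both sides read the same physical column. Execution by index alone would not notice the difference, but a name-sensitive comparison does.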
What changes are included in this PR?
Are these changes tested?
One test was added for the COUNT case; updated versions of existing tests cover the other cases.
Are there any user-facing changes?