-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: made generalize_filter less permissive, also added more cases #2149
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Blajda
approved these changes
Feb 1, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for contributing this fix!
roeap
approved these changes
Feb 1, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
RobinLin666
pushed a commit
to RobinLin666/delta-rs
that referenced
this pull request
Feb 2, 2024
…elta-io#2149) # Description This fixes an observed bug where the partition generalization was failing. A minimal repro was: ```python from deltalake import DeltaTable, write_deltalake import pyarrow as pa import pandas as pd data = pd.DataFrame.from_dict( { "a": [], "b": [], "c": [], } ) schema = pa.schema( [ ("a", pa.string()), ("b", pa.int32()), ("c", pa.int32()), ] ) table = pa.Table.from_pandas(data, schema=schema) write_deltalake( "test", table, mode="overwrite", partition_by="a" ) new_data = pd.DataFrame.from_dict( { "a": ["a", "a", "a"], "b": [None, 2, 4], "c": [5, 6, 7], } ) new_table = pa.Table.from_pandas(new_data, schema) dt = DeltaTable("test") dt.merge( source=new_table, predicate="s.b IS NULL", source_alias="s", target_alias="t", ).when_matched_update_all().when_not_matched_insert_all().execute() ``` This would cause a DataFusion error: ``` _internal.DeltaError: Generic DeltaTable error: Optimizer rule 'simplify_expressions' failed caused by Schema error: No field named s.b. Valid fields are t.a, t.b, t.c, t.__delta_rs_path. ``` This was because when generalizing the match predicate to use as a partition filter, an expression `IsNull(Column('b', 's'))` was deemed to not reference the table `s`. This PR does two things: 1. **Tightens up the referencing logic.** Previous it conflated 'definitely does not reference' with 'don't know if it references or not'. This tightens up the logic and means the plan is less likely to generalize out a partition filter if we can't be sure that it can be generalized. The ability to generalize over arbitrary binary expressions has been tightened too - previously behaviour would permit that `generalizable_expression OR non_generalizable_expression` would reduce to `generalizable_expression`. This isn't correct in the case of partition filters, because this would cause us to leave out half the cases that should be extracted from the target table. 2. **Adds a couple of extra cases where we know if a target reference exists.** Namely, `is null` can now be checked for source table references and `literal` is now re-covered, as previously it was working by taking advantage of looser logic that has since been tightened. Co-authored-by: David Blajda <db@davidblajda.com>
This was referenced Feb 2, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This fixes an observed bug where the partition generalization was failing. A minimal repro was:
This would cause a DataFusion error:
This was because when generalizing the match predicate to use as a partition filter, an expression
IsNull(Column('b', 's'))
was deemed to not reference the tables
.This PR does two things:
Tightens up the referencing logic. Previous it conflated 'definitely does not reference' with 'don't know if it references or not'. This tightens up the logic and means the plan is less likely to generalize out a partition filter if we can't be sure that it can be generalized. The ability to generalize over arbitrary binary expressions has been tightened too - previously behaviour would permit that
generalizable_expression OR non_generalizable_expression
would reduce togeneralizable_expression
. This isn't correct in the case of partition filters, because this would cause us to leave out half the cases that should be extracted from the target table.Adds a couple of extra cases where we know if a target reference exists. Namely,
is null
can now be checked for source table references andliteral
is now re-covered, as previously it was working by taking advantage of looser logic that has since been tightened.