forked from delta-io/delta-rs
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: made generalize_filter less permissive, also added more cases (d…
…elta-io#2149) # Description This fixes an observed bug where the partition generalization was failing. A minimal repro was: ```python from deltalake import DeltaTable, write_deltalake import pyarrow as pa import pandas as pd data = pd.DataFrame.from_dict( { "a": [], "b": [], "c": [], } ) schema = pa.schema( [ ("a", pa.string()), ("b", pa.int32()), ("c", pa.int32()), ] ) table = pa.Table.from_pandas(data, schema=schema) write_deltalake( "test", table, mode="overwrite", partition_by="a" ) new_data = pd.DataFrame.from_dict( { "a": ["a", "a", "a"], "b": [None, 2, 4], "c": [5, 6, 7], } ) new_table = pa.Table.from_pandas(new_data, schema) dt = DeltaTable("test") dt.merge( source=new_table, predicate="s.b IS NULL", source_alias="s", target_alias="t", ).when_matched_update_all().when_not_matched_insert_all().execute() ``` This would cause a DataFusion error: ``` _internal.DeltaError: Generic DeltaTable error: Optimizer rule 'simplify_expressions' failed caused by Schema error: No field named s.b. Valid fields are t.a, t.b, t.c, t.__delta_rs_path. ``` This was because when generalizing the match predicate to use as a partition filter, an expression `IsNull(Column('b', 's'))` was deemed to not reference the table `s`. This PR does two things: 1. **Tightens up the referencing logic.** Previous it conflated 'definitely does not reference' with 'don't know if it references or not'. This tightens up the logic and means the plan is less likely to generalize out a partition filter if we can't be sure that it can be generalized. The ability to generalize over arbitrary binary expressions has been tightened too - previously behaviour would permit that `generalizable_expression OR non_generalizable_expression` would reduce to `generalizable_expression`. This isn't correct in the case of partition filters, because this would cause us to leave out half the cases that should be extracted from the target table. 2. **Adds a couple of extra cases where we know if a target reference exists.** Namely, `is null` can now be checked for source table references and `literal` is now re-covered, as previously it was working by taking advantage of looser logic that has since been tightened. Co-authored-by: David Blajda <db@davidblajda.com>
- Loading branch information
1 parent
2b6e171
commit 0b03177
Showing
1 changed file
with
121 additions
and
22 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters