-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate #588
Conversation
/// | ||
/// * `Some(Predicate)` if the expression could be successfully converted. | ||
/// * `None` if the expression couldn't be converted to an Iceberg predicate. | ||
fn expr_to_predicate(expr: &Expr) -> Option<Predicate> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could use visitor pattern
to make code more cleaner to avoid code bomb if supporting more expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I'm not a DataFusion expert but I think this makes sense being implemented as a visitor of Expr
, probably by implementing a https://docs.rs/datafusion-common/41.0.0/datafusion_common/tree_node/trait.TreeNodeVisitor.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd, @FANNG1 , I appriciate it
I will try to refactor this, though I am wondering whether Visitor will indeed be the most suitable. Let me know what you think, but I think visitor shines when we have to run different logic on different kinds of elements (chained in some way) while we want to keep the logic in one place - i.e., the Visitor. whereas here we have one kind of element - Expr - which is an enum that can be deconstructed in different ways, for example -
Expr::Column(col), op, Expr::Literal(lit)
OR
Expr::BinaryExpr(left_expr), Operator::Or, Expr::BinaryExpr(right_expr)
so what Im trying to say is that using visitor will simply move the matching complexity to another place - to the visitor.
Does this make sense?
I will continue to try and refactor this but please let me know what you think
- refactored scan to use new predicate converter as visitor and seperated it to a new mod - added support for simple predicates with column cast expressions - added testing, mostly around date functions
Hey @sdd, @FANNG1 and @liurenjie1024 fn convert_filters_to_predicate(filters: &[Expr]) -> Option<Predicate> {
PredicateConverter.visit_many(filters)
}
also note that I have added a fix to the schema mod that correctly converts Date type to its Arrow type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work you've put into this! I think there are some structural changes that could be made, by aligning more closely to the visitor style that we already have, but I also think that perfect is the enemy of done, and since this datafusion integration is still quite experimental, and that Rust makes refactorings quite easy, I'm happy to proceed here with something working that we can evolve further as we go forwards.
My main concern is that it seems like you have taken a "soft-fail" approach, where, for unsupported operators or nodes, you return None and proceed by effectively ignoring as-yet-unimplemented operations? I think the whole thing should return a Result rather than an Option, and when an unimplemented aspect of Expr is encountered, we should fail, returning a Err with ErrorKind::Unimplemented. This prevents confusion and unexpected behaviour by users whose datafusion queries succeed but give them unexpected results due to partial filters.
Thanks for the review and comment, @sdd ! for example, suppose that we have an expression like: |
you can see what I mean in this test case: #[test]
fn test_predicate_conversion_with_unsupported_condition_or() {
let sql = "(foo > 1 and bar in ('test', 'test2')) or foo < 0 ";
let df_schema = create_test_schema();
let expr = SessionContext::new()
.parse_sql_expr(sql, &df_schema)
.unwrap();
let predicate = convert_filters_to_predicate(&[expr]).unwrap();
let expected_predicate = Predicate::or(
Reference::new("foo").greater_than(Datum::long(1)),
Reference::new("foo").less_than(Datum::long(0)),
);
assert_eq!(predicate, expected_predicate);
} although we don't support |
Aah I see! That makes complete sense. So, what you are saying is, we do the best we can to filter via metadata within iceberg, but whatever operations we can't handle in iceberg will get applied at row-level anyway once the data is passed back to DataFusion? If so, then I withdraw my objection as this seems totally sensible 👍 |
Precisely! |
Hey @liurenjie1024 , |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @a-agmon for this pr! This helps to improve datafusio integration, but there are still some missing part.
crates/integrations/datafusion/src/physical_plan/predicate_converter.rs
Outdated
Show resolved
Hide resolved
crates/integrations/datafusion/src/physical_plan/predicate_converter.rs
Outdated
Show resolved
Hide resolved
Thanks much for the review and comments @liurenjie1024 |
impl<'n> TreeNodeVisitor<'n> for ExprToPredicateVisitor { | ||
type Node = Expr; | ||
|
||
fn f_down(&mut self, _node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn f_down(&mut self, _node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> { | |
fn f_down(&mut self, _node: &Expr) -> Result<TreeNodeRecursion, DataFusionError> { |
I would suggest to use concrete type to make it more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Ok(TreeNodeRecursion::Continue) | ||
} | ||
|
||
fn f_up(&mut self, node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if let Expr::BinaryExpr(binary) = node { | ||
match (&*binary.left, &binary.op, &*binary.right) { | ||
// process simple expressions (involving a column, operator and literal) | ||
(Expr::Column(col), op, Expr::Literal(lit)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this visitor visit leaf nodes such as Expr::Column
? This pattern doesn't seem flexible to me, for example, I believe we could also support things like a > 6
, 6 < a
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment @liurenjie1024
I have added support, and test, for both "directions" of binary statement: X > 1
and 1 < X
.
But I think there is something more to be clarified:
Both x>1
and 1<x
are binary statements.
When the binary statement foo > 1
is processed then there are 3 main leafs to handle:
1 - Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" })
This ^ will be ignored as its not a binary expr (its a column expr)
2 - Literal(Int64(1))
This ^ will also be ignored as its not a binary expr (its a literal expr)
3 - BinaryExpr(BinaryExpr { left: Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" }), op: Gt, right: Literal(Int64(1)) })
This ^ binary expr will be captured, deconstructed, and processed by this match arm
match (&*binary.left, &binary.op, &*binary.right) {
(Expr::Column(col), op, Expr::Literal(lit)) => {
...
}
Therefore, in order to support a>6
or 6<a
we dont need to capture Expr:Column
, only the binary statement that includes it.
On the other hand, I dont think we want to be too flexible because there are expr we want to avoid. For example, Expr::SimilarTo
(regexp comparison) is something I think we should leave to DataFusion to handle. TBH, I think there are actually very few Exp we want to handle, rather than leaving them to DataFusion, which are mostly where Iceberg can improve the datafusion scan by using predicates at the metadata level and pruning data files to be scanned.
fn convert_compound_expr(&self, valid_preds: &[Predicate], op: &Operator) -> Option<Predicate> { | ||
let valid_preds_count = valid_preds.len(); | ||
match (op, valid_preds_count) { | ||
(Operator::And, 1) => valid_preds.first().cloned(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confusing about this case, when will this be valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I basically designed the conversion of filters to be greedy, which means we need to try and convert as many filters as we can in order to scan less files.
suppose we have a case like this:
SELECT * FROM T WHERE A > 10 AND REGEXP_LIKE(name, '[a-z]A[a-zA-Z]{2}');
This will translate into an expr that looks something like this
BinaryExpr(BinaryExpr AND RegExpExpr)
After we processed both leafs: ( BinaryExpr and RegExpExpr)
, we have on the stack just one valid expression because the RegExpExpr
is not supported.
Then we get to the parent: BinaryExpr(A AND B)
, we try to get the 2 child expressions from the stack, but in this case we get only one (A > 10
) because the second one is not supported (REGEXP_LIKE
).
In this case, because its a logical AND expr, we proceed with just one valid predicate. Thats the case you are highlighinting - cases where we process logical AND expr where just one side is valid. in this case - we are processing with just one predicate.
If it was an OR expr that was wrapping both then we would have failed the whole predicate.
let compound_pred = self.convert_compound_expr(&children, op); | ||
self.stack.push_back(compound_pred); | ||
} | ||
_ => {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there sth not supported, I think the correct behavior should be TreeNodeRecursion::Stop
and clear the stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all these tests, really appreciate it!
Thank you very much for the review and comments, @liurenjie1024 I just want to make a comment which might explain some of my design choices. |
Hi @liurenjie1024 , |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for being so patient on this and explaining the rationale, @a-agmon. I think that there has been a lot of confusion on this because the approach of ignoring unsupported predicates (because they will be handled downstream by DataFusion) was not quite what we're used to seeing, but I think your approach makes sense. And, as I said before, perfect is the enemy of done and I'm happy to move forward with what we have here as there is nothing preventing us from further iterating on this.
Thanks again!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @a-agmon for working this and thanks @liurenjie1024, @sdd, @FANNG1's review. Let's move!
commit cda4a0c Author: congyi wang <58715567+wcy-fdu@users.noreply.github.com> Date: Mon Sep 30 11:07:36 2024 +0800 chore: fix typo in FileIO Schemes (apache#653) * fix typo * fix typo commit af9609d Author: Scott Donnelly <scott@donnel.ly> Date: Mon Sep 30 04:06:14 2024 +0100 fix: page index evaluator min/max args inverted (apache#648) * fix: page index evaluator min/max args inverted * style: fix clippy lint in test commit a6a3fd7 Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com> Date: Sat Sep 28 10:10:08 2024 +0300 test (datafusion): add test for table provider creation (apache#651) * add test for table provider creation * fix formatting * fixing yet another formatting issue * testing schema using data fusion --------- Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com> commit 87483b4 Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com> Date: Fri Sep 27 04:40:08 2024 +0300 making table provider pub (apache#650) Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com> commit 984c91e Author: ZENOTME <43447882+ZENOTME@users.noreply.github.com> Date: Thu Sep 26 17:56:02 2024 +0800 avoid to create memory schema operator every time (apache#635) Co-authored-by: ZENOTME <st810918843@gmail.com> commit 4171275 Author: Matheus Alcantara <mths.dev@pm.me> Date: Wed Sep 25 08:28:42 2024 -0300 scan: change ErrorKind when table dont have spanshots (apache#608) commit ab51355 Author: xxchan <xxchan22f@gmail.com> Date: Tue Sep 24 21:25:45 2024 +0800 fix: compile error due to merge stale PR (apache#646) Signed-off-by: xxchan <xxchan22f@gmail.com> commit 420b4e2 Author: Scott Donnelly <scott@donnel.ly> Date: Tue Sep 24 08:20:23 2024 +0100 Table Scan: Add Row Selection Filtering (apache#565) * feat(scan): add row selection capability via PageIndexEvaluator * test(row-selection): add first few row selection tests * feat(scan): add more tests, fix bug where min/max args swapped * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler * feat: changes suggested from PR review commit b3709ba Author: Christian <christian@hansetag.com> Date: Tue Sep 24 04:47:04 2024 +0200 feat: Add NamespaceIdent.parent() (apache#641) * Add NamespaceIdent.parent() * Use split_last commit 1533c43 Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com> Date: Mon Sep 23 13:39:46 2024 +0300 feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588) * adding main function and tests * adding tests, removing integration test for now * fixing typos and lints * fixing typing issue * - added support in schmema to convert Date32 to correct arrow type - refactored scan to use new predicate converter as visitor and seperated it to a new mod - added support for simple predicates with column cast expressions - added testing, mostly around date functions * fixing format and lic * reducing number of tests (17 -> 7) * fix formats * fix naming * refactoring to use TreeNodeVisitor * fixing fmt * small refactor * adding swapped op and fixing CR comments --------- Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com> commit e967deb Author: xxchan <xxchan22f@gmail.com> Date: Mon Sep 23 18:34:59 2024 +0800 feat: expose remove_all in FileIO (apache#643) Signed-off-by: xxchan <xxchan22f@gmail.com> commit d03c4f8 Author: Scott Donnelly <scott@donnel.ly> Date: Mon Sep 23 08:28:52 2024 +0100 Migrate to arrow-* v53 (apache#626) * chore: migrate to arrow-* v53 * chore: update datafusion to 42 * test: fix incorrect test assertion * chore: update python bindings to arrow 53 commit 88e5e4a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Sep 23 15:26:18 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.5...v1.24.6) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit c354983 Author: xxchan <xxchan22f@gmail.com> Date: Mon Sep 23 14:50:18 2024 +0800 doc: improve FileIO doc (apache#642) Signed-off-by: xxchan <xxchan22f@gmail.com> commit 12e12e2 Author: xxchan <xxchan22f@gmail.com> Date: Fri Sep 20 19:59:55 2024 +0800 feat: expose arrow type <-> iceberg type (apache#637) * feat: expose arrow type <-> iceberg type Previously we only exposed the schema conversion. Signed-off-by: xxchan <xxchan22f@gmail.com> * add tests Signed-off-by: xxchan <xxchan22f@gmail.com> --------- Signed-off-by: xxchan <xxchan22f@gmail.com> commit 3b27c9e Author: xxchan <xxchan22f@gmail.com> Date: Fri Sep 20 18:32:31 2024 +0800 feat: add Sync to TransformFunction (apache#638) Signed-off-by: xxchan <xxchan22f@gmail.com> commit 34cb81c Author: Xuanwo <github@xuanwo.io> Date: Wed Sep 18 20:18:40 2024 +0800 chore: Bump opendal to 0.50 (apache#634) commit cde35ab Author: FANNG <xiaojing@datastrato.com> Date: Fri Sep 13 10:01:16 2024 +0800 feat: support projection pushdown for datafusion iceberg (apache#594) * support projection pushdown for datafusion iceberg * support projection pushdown for datafusion iceberg * fix ci * fix field id * remove depencences * remove depencences commit eae9464 Author: Xuanwo <github@xuanwo.io> Date: Thu Sep 12 02:06:31 2024 +0800 refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628) commit 8a3de4e Author: Christian <christian@hansetag.com> Date: Mon Sep 9 14:45:16 2024 +0200 Feat: Normalize TableMetadata (apache#611) * Normalize Table Metadata * Improve readability & comments commit e08c0e5 Author: Renjie Liu <liurenjie2008@gmail.com> Date: Mon Sep 9 11:57:22 2024 +0800 fix: Correctly calculate highest_field_id in schema (apache#590) commit f78c59b Author: Jack <56563911+jdockerty@users.noreply.github.com> Date: Mon Sep 9 03:35:16 2024 +0100 feat: add `client.region` (apache#623) commit a5aba9a Author: Christian <christian@hansetag.com> Date: Sun Sep 8 18:36:05 2024 +0200 feat: SortOrder methods should take schema ref if possible (apache#613) * SortOrder methods should take schema ref if possible * Fix test type * with_order_id should not take reference commit 5812399 Author: Christian <christian@hansetag.com> Date: Sun Sep 8 18:18:41 2024 +0200 feat: partition compatibility (apache#612) * Partition compatability * Partition compatability * Rename compatible_with -> is_compatible_with commit ede4720 Author: Christian <christian@hansetag.com> Date: Sun Sep 8 16:49:39 2024 +0200 fix: Less Panics for Snapshot timestamps (apache#614) commit ced661f Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Sep 8 22:43:38 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.3...v1.24.5) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit cbbd086 Author: Xuanwo <github@xuanwo.io> Date: Sun Sep 8 10:29:31 2024 +0800 feat: Add more fields in FileScanTask (apache#609) Signed-off-by: Xuanwo <github@xuanwo.io> commit 620d58e Author: Callum Ryan <19956159+callum-ryan@users.noreply.github.com> Date: Thu Sep 5 03:44:55 2024 +0100 feat: SQL Catalog - namespaces (apache#534) * feat: SQL Catalog - namespaces Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * feat: use transaction for updates and creates Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: pull out query param builder to fn Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * feat: add drop and tests Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: String to str, remove pub and optimise query builder Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: nested match, remove ok() Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: remove pub, add set, add comments Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: refactor list_namespaces slightly Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: add default properties to all new namespaces Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: remove check for nested namespace Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * chore: add more comments to the CatalogConfig to explain bind styles Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: edit test for nested namespaces Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> --------- Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> commit ae75f96 Author: Søren Dalby Larsen <sdlarsen@gmail.com> Date: Tue Sep 3 13:46:48 2024 +0200 chore: bump crate-ci/typos to 1.24.3 (apache#598) commit 7aa8bdd Author: Scott Donnelly <scott@donnel.ly> Date: Thu Aug 29 04:37:48 2024 +0100 Table Scan: Add Row Group Skipping (apache#558) * feat(scan): add row group and page index row selection filtering * fix(row selection): off-by-one error * feat: remove row selection to defer to a second PR * feat: better min/max val conversion in RowGroupMetricsEvaluator * test(row_group_filtering): first three tests * test(row_group_filtering): next few tests * test: add more tests for RowGroupMetricsEvaluator * chore: refactor test assertions to silence clippy lints * refactor: consolidate parquet stat min/max parsing in one place commit da08e8d Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Aug 28 14:35:55 2024 +0800 chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583) commit ecbb4c3 Author: Sung Yun <107272191+sungwy@users.noreply.github.com> Date: Mon Aug 26 23:57:01 2024 -0400 Expose Transforms to Python Binding (apache#556) * bucket transform rust binding * format * poetry x maturin * ignore poetry.lock in license check * update bindings_python_ci to use makefile * newline * python-poetry/poetry#9135 * use hatch instead of poetry * refactor * revert licenserc change * adopt review feedback * comments * unused dependency * adopt review comment * newline * I like this approach a lot better * more tests commit 905ebd2 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Aug 26 20:49:07 2024 +0800 chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582) --- updated-dependencies: - dependency-name: typed-builder dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit f9c92b7 Author: FANNG <xiaojing@datastrato.com> Date: Sun Aug 25 22:31:36 2024 +0800 fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584) commit ba66665 Author: FANNG <xiaojing@datastrato.com> Date: Sat Aug 24 12:35:36 2024 +0800 fix: correct partition-id to field-id in UnboundPartitionField (apache#576) * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * xx
commit 2c9d7e4 Author: Christian Thiel <christian@hansetag.com> Date: Wed Oct 2 07:51:50 2024 +0200 New builder commit fea1817 Author: Christian Thiel <christian@hansetag.com> Date: Tue Oct 1 12:56:10 2024 +0200 Merge branch 'feat/safe-partition-spec' commit 1f49c95 Author: Christian Thiel <christian@hansetag.com> Date: Tue Oct 1 12:55:02 2024 +0200 Merge branch 'feat/schema-reassign-field-ids' commit cda4a0c Author: congyi wang <58715567+wcy-fdu@users.noreply.github.com> Date: Mon Sep 30 11:07:36 2024 +0800 chore: fix typo in FileIO Schemes (apache#653) * fix typo * fix typo commit af9609d Author: Scott Donnelly <scott@donnel.ly> Date: Mon Sep 30 04:06:14 2024 +0100 fix: page index evaluator min/max args inverted (apache#648) * fix: page index evaluator min/max args inverted * style: fix clippy lint in test commit a6a3fd7 Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com> Date: Sat Sep 28 10:10:08 2024 +0300 test (datafusion): add test for table provider creation (apache#651) * add test for table provider creation * fix formatting * fixing yet another formatting issue * testing schema using data fusion --------- Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com> commit 87483b4 Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com> Date: Fri Sep 27 04:40:08 2024 +0300 making table provider pub (apache#650) Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com> commit 984c91e Author: ZENOTME <43447882+ZENOTME@users.noreply.github.com> Date: Thu Sep 26 17:56:02 2024 +0800 avoid to create memory schema operator every time (apache#635) Co-authored-by: ZENOTME <st810918843@gmail.com> commit 4171275 Author: Matheus Alcantara <mths.dev@pm.me> Date: Wed Sep 25 08:28:42 2024 -0300 scan: change ErrorKind when table dont have spanshots (apache#608) commit ab51355 Author: xxchan <xxchan22f@gmail.com> Date: Tue Sep 24 21:25:45 2024 +0800 fix: compile error due to merge stale PR (apache#646) Signed-off-by: xxchan <xxchan22f@gmail.com> commit 420b4e2 Author: Scott Donnelly <scott@donnel.ly> Date: Tue Sep 24 08:20:23 2024 +0100 Table Scan: Add Row Selection Filtering (apache#565) * feat(scan): add row selection capability via PageIndexEvaluator * test(row-selection): add first few row selection tests * feat(scan): add more tests, fix bug where min/max args swapped * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler * feat: changes suggested from PR review commit b3709ba Author: Christian <christian@hansetag.com> Date: Tue Sep 24 04:47:04 2024 +0200 feat: Add NamespaceIdent.parent() (apache#641) * Add NamespaceIdent.parent() * Use split_last commit 1533c43 Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com> Date: Mon Sep 23 13:39:46 2024 +0300 feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588) * adding main function and tests * adding tests, removing integration test for now * fixing typos and lints * fixing typing issue * - added support in schmema to convert Date32 to correct arrow type - refactored scan to use new predicate converter as visitor and seperated it to a new mod - added support for simple predicates with column cast expressions - added testing, mostly around date functions * fixing format and lic * reducing number of tests (17 -> 7) * fix formats * fix naming * refactoring to use TreeNodeVisitor * fixing fmt * small refactor * adding swapped op and fixing CR comments --------- Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com> commit e967deb Author: xxchan <xxchan22f@gmail.com> Date: Mon Sep 23 18:34:59 2024 +0800 feat: expose remove_all in FileIO (apache#643) Signed-off-by: xxchan <xxchan22f@gmail.com> commit d03c4f8 Author: Scott Donnelly <scott@donnel.ly> Date: Mon Sep 23 08:28:52 2024 +0100 Migrate to arrow-* v53 (apache#626) * chore: migrate to arrow-* v53 * chore: update datafusion to 42 * test: fix incorrect test assertion * chore: update python bindings to arrow 53 commit 88e5e4a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Sep 23 15:26:18 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.5...v1.24.6) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit c354983 Author: xxchan <xxchan22f@gmail.com> Date: Mon Sep 23 14:50:18 2024 +0800 doc: improve FileIO doc (apache#642) Signed-off-by: xxchan <xxchan22f@gmail.com> commit 12e12e2 Author: xxchan <xxchan22f@gmail.com> Date: Fri Sep 20 19:59:55 2024 +0800 feat: expose arrow type <-> iceberg type (apache#637) * feat: expose arrow type <-> iceberg type Previously we only exposed the schema conversion. Signed-off-by: xxchan <xxchan22f@gmail.com> * add tests Signed-off-by: xxchan <xxchan22f@gmail.com> --------- Signed-off-by: xxchan <xxchan22f@gmail.com> commit 3b27c9e Author: xxchan <xxchan22f@gmail.com> Date: Fri Sep 20 18:32:31 2024 +0800 feat: add Sync to TransformFunction (apache#638) Signed-off-by: xxchan <xxchan22f@gmail.com> commit 34cb81c Author: Xuanwo <github@xuanwo.io> Date: Wed Sep 18 20:18:40 2024 +0800 chore: Bump opendal to 0.50 (apache#634) commit cde35ab Author: FANNG <xiaojing@datastrato.com> Date: Fri Sep 13 10:01:16 2024 +0800 feat: support projection pushdown for datafusion iceberg (apache#594) * support projection pushdown for datafusion iceberg * support projection pushdown for datafusion iceberg * fix ci * fix field id * remove depencences * remove depencences commit eae9464 Author: Xuanwo <github@xuanwo.io> Date: Thu Sep 12 02:06:31 2024 +0800 refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628) commit 8a3de4e Author: Christian <christian@hansetag.com> Date: Mon Sep 9 14:45:16 2024 +0200 Feat: Normalize TableMetadata (apache#611) * Normalize Table Metadata * Improve readability & comments commit e08c0e5 Author: Renjie Liu <liurenjie2008@gmail.com> Date: Mon Sep 9 11:57:22 2024 +0800 fix: Correctly calculate highest_field_id in schema (apache#590) commit f78c59b Author: Jack <56563911+jdockerty@users.noreply.github.com> Date: Mon Sep 9 03:35:16 2024 +0100 feat: add `client.region` (apache#623) commit a5aba9a Author: Christian <christian@hansetag.com> Date: Sun Sep 8 18:36:05 2024 +0200 feat: SortOrder methods should take schema ref if possible (apache#613) * SortOrder methods should take schema ref if possible * Fix test type * with_order_id should not take reference commit 5812399 Author: Christian <christian@hansetag.com> Date: Sun Sep 8 18:18:41 2024 +0200 feat: partition compatibility (apache#612) * Partition compatability * Partition compatability * Rename compatible_with -> is_compatible_with commit ede4720 Author: Christian <christian@hansetag.com> Date: Sun Sep 8 16:49:39 2024 +0200 fix: Less Panics for Snapshot timestamps (apache#614) commit ced661f Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Sep 8 22:43:38 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.3...v1.24.5) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit cbbd086 Author: Xuanwo <github@xuanwo.io> Date: Sun Sep 8 10:29:31 2024 +0800 feat: Add more fields in FileScanTask (apache#609) Signed-off-by: Xuanwo <github@xuanwo.io> commit 620d58e Author: Callum Ryan <19956159+callum-ryan@users.noreply.github.com> Date: Thu Sep 5 03:44:55 2024 +0100 feat: SQL Catalog - namespaces (apache#534) * feat: SQL Catalog - namespaces Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * feat: use transaction for updates and creates Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: pull out query param builder to fn Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * feat: add drop and tests Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: String to str, remove pub and optimise query builder Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: nested match, remove ok() Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: remove pub, add set, add comments Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: refactor list_namespaces slightly Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: add default properties to all new namespaces Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: remove check for nested namespace Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * chore: add more comments to the CatalogConfig to explain bind styles Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> * fix: edit test for nested namespaces Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> --------- Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com> commit ae75f96 Author: Søren Dalby Larsen <sdlarsen@gmail.com> Date: Tue Sep 3 13:46:48 2024 +0200 chore: bump crate-ci/typos to 1.24.3 (apache#598) commit 7aa8bdd Author: Scott Donnelly <scott@donnel.ly> Date: Thu Aug 29 04:37:48 2024 +0100 Table Scan: Add Row Group Skipping (apache#558) * feat(scan): add row group and page index row selection filtering * fix(row selection): off-by-one error * feat: remove row selection to defer to a second PR * feat: better min/max val conversion in RowGroupMetricsEvaluator * test(row_group_filtering): first three tests * test(row_group_filtering): next few tests * test: add more tests for RowGroupMetricsEvaluator * chore: refactor test assertions to silence clippy lints * refactor: consolidate parquet stat min/max parsing in one place commit da08e8d Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Aug 28 14:35:55 2024 +0800 chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583) commit ecbb4c3 Author: Sung Yun <107272191+sungwy@users.noreply.github.com> Date: Mon Aug 26 23:57:01 2024 -0400 Expose Transforms to Python Binding (apache#556) * bucket transform rust binding * format * poetry x maturin * ignore poetry.lock in license check * update bindings_python_ci to use makefile * newline * python-poetry/poetry#9135 * use hatch instead of poetry * refactor * revert licenserc change * adopt review feedback * comments * unused dependency * adopt review comment * newline * I like this approach a lot better * more tests commit 905ebd2 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Aug 26 20:49:07 2024 +0800 chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582) --- updated-dependencies: - dependency-name: typed-builder dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit f9c92b7 Author: FANNG <xiaojing@datastrato.com> Date: Sun Aug 25 22:31:36 2024 +0800 fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584) commit ba66665 Author: FANNG <xiaojing@datastrato.com> Date: Sat Aug 24 12:35:36 2024 +0800 fix: correct partition-id to field-id in UnboundPartitionField (apache#576) * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * xx
This PR closes #585
IcebergTableScan
structExp
) to IcebergPredicate
and apply them to the TableScan On data fusionexecute()
thereby enabling data files pruning by data fusion to boost scan.(X > 1 AND Y = 'something') OR Z > 100)
, and any combination of them.Update: