feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate #588

a-agmon · 2024-08-28T06:57:05Z

This PR closes #585

Adds datafusion filters to the IcebergTableScan struct
Converts DataFusion filters (Expr) to an Iceberg Predicate and applies them to the TableScan in DataFusion's execute(), which lets the scan prune data files and run faster.
Currently supports binary expressions (e.g., (X > 1 AND Y = 'something') OR Z > 100), and any combination of them.

Update:

added support in schema to convert Date32 to correct arrow type
refactored scan to use new predicate converter as visitor and separated it to a new mod
added support for simple predicates

crates/integrations/datafusion/src/table.rs

FANNG1 · 2024-08-30T12:20:32Z

crates/integrations/datafusion/src/physical_plan/scan.rs

+///
+/// * `Some(Predicate)` if the expression could be successfully converted.
+/// * `None` if the expression couldn't be converted to an Iceberg predicate.
+fn expr_to_predicate(expr: &Expr) -> Option<Predicate> {


Maybe we could use the visitor pattern to make the code cleaner and avoid a code explosion as more expressions are supported.

I agree. I'm not a DataFusion expert but I think this makes sense being implemented as a visitor of Expr, probably by implementing a https://docs.rs/datafusion-common/41.0.0/datafusion_common/tree_node/trait.TreeNodeVisitor.html

Thanks @sdd, @FANNG1, I appreciate it.
I will try to refactor this, though I wonder whether a visitor is really the most suitable approach. Let me know what you think, but as I see it a visitor shines when we need to run different logic on different kinds of elements (chained in some way) while keeping that logic in one place, i.e., the visitor. Here we have just one kind of element, Expr, which is an enum that can be destructured in different ways, for example:

Expr::Column(col), op, Expr::Literal(lit) OR Expr::BinaryExpr(left_expr), Operator::Or, Expr::BinaryExpr(right_expr)

So what I'm trying to say is that using a visitor would simply move the matching complexity to another place: the visitor.
Does this make sense?
I will keep trying to refactor this, but please let me know what you think.
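The two shapes mentioned above can be illustrated with a minimal, self-contained sketch. Note that `Expr`, `Op`, `Predicate`, and the helper constructors here are hypothetical, simplified stand-ins defined only for this example; they are not the DataFusion or Iceberg types, and this is not the code from the PR.

```rust
// Hypothetical stand-ins for DataFusion's `Expr` and Iceberg's `Predicate`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op { Gt, Lt, Eq, And, Or }

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, Op, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    Cmp(String, Op, i64),
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
}

// Small helpers to keep expression construction readable.
fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
fn lit(value: i64) -> Expr { Expr::Literal(value) }
fn bin(left: Expr, op: Op, right: Expr) -> Expr {
    Expr::Binary(Box::new(left), op, Box::new(right))
}

// Converts an expression by matching on its shape; anything else yields None.
fn to_predicate(expr: &Expr) -> Option<Predicate> {
    let Expr::Binary(left, op, right) = expr else {
        return None;
    };
    match (left.as_ref(), op, right.as_ref()) {
        // Shape 1: column <comparison> literal
        (Expr::Column(name), Op::Gt | Op::Lt | Op::Eq, Expr::Literal(value)) => {
            Some(Predicate::Cmp(name.clone(), *op, *value))
        }
        // Shape 2: <binary expr> AND/OR <binary expr>, converted recursively
        (Expr::Binary(..), Op::And, Expr::Binary(..)) => Some(Predicate::And(
            Box::new(to_predicate(left)?),
            Box::new(to_predicate(right)?),
        )),
        (Expr::Binary(..), Op::Or, Expr::Binary(..)) => Some(Predicate::Or(
            Box::new(to_predicate(left)?),
            Box::new(to_predicate(right)?),
        )),
        _ => None,
    }
}
```

In this sketch, `to_predicate(&bin(bin(col("x"), Op::Gt, lit(1)), Op::Or, bin(col("z"), Op::Gt, lit(100))))` yields an `Or` of two comparisons, while a bare `col("x")` yields `None`. All of the complexity sits in the match arms, which is the point being made about where the matching logic lives.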

- refactored scan to use new predicate converter as visitor and separated it to a new mod
- added support for simple predicates with column cast expressions
- added testing, mostly around date functions

a-agmon · 2024-09-01T09:52:14Z

Hey @sdd, @FANNG1 and @liurenjie1024
Please see the updated PR. I have refactored predicate_converter into a separate module that follows a more visitor-ish approach.
The scan mod is now clean and converts filters to a predicate using:

fn convert_filters_to_predicate(filters: &[Expr]) -> Option<Predicate> {
    PredicateConverter.visit_many(filters)
}

PredicateConverter is now more modular and allows adding more expressions and patterns, though I think it already covers many of the cases commonly supported in pushdown strategies. Most of the complexity lies in identifying and matching the particular patterns of Expr that need to be handled in different ways.
Please let me know what you think about this.

Also note that I have added a fix to the schema mod that correctly converts the Date type to its Arrow type.

sdd

Thanks for the work you've put into this! I think there are some structural changes that could be made, by aligning more closely to the visitor style that we already have, but I also think that perfect is the enemy of done, and since this datafusion integration is still quite experimental, and that Rust makes refactorings quite easy, I'm happy to proceed here with something working that we can evolve further as we go forwards.

My main concern is that it seems like you have taken a "soft-fail" approach, where, for unsupported operators or nodes, you return None and proceed by effectively ignoring as-yet-unimplemented operations? I think the whole thing should return a Result rather than an Option, and when an unimplemented aspect of Expr is encountered, we should fail, returning an Err with ErrorKind::Unimplemented. This prevents confusion and unexpected behaviour for users whose datafusion queries succeed but give them unexpected results due to partial filters.

a-agmon · 2024-09-01T10:46:01Z

Thanks for the review and comment, @sdd !
I will be happy to update this accordingly (I will also study the visitor approach in the codebase more carefully so that I can perhaps refactor this further in the future).
I do have one question though. The idea I have followed here is that the conversion process operates in an eager-like mode. It works through the expressions, some of which are not implemented yet, but some of which will not be implemented by design (for example: Expr::ScalarFunction(ScalarFunction)), and basically returns a predicate that might not reflect exactly your filters but will filter as much as possible to speed up the query.

For example, suppose that we have an expression like:
(A > 1 AND B = func(X)) OR (A < 0)
and we know that B = func(X) is not an expression we support or can evaluate. The idea of soft-fail is to not fail the entire conversion but to return (A > 1) OR (A < 0), which may still let through unneeded data but will filter out many irrelevant files and let you benefit from metadata-based file pruning.
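A rough sketch of this soft-fail behaviour follows, using hypothetical, simplified `Expr` and `Predicate` types defined only for this example (they are not the DataFusion or Iceberg types, and this is not the PR's code). One detail worth spelling out: dropping an unsupported side of an AND only widens the result, so it is safe for pruning, whereas dropping a side of an OR would narrow it, so in this sketch an OR is converted only when both sides convert.

```rust
// Hypothetical stand-ins for DataFusion's `Expr` and Iceberg's `Predicate`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op { Gt, Lt, Eq, And, Or }

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Func(String), // stands for something we cannot convert, e.g. func(X)
    Binary(Box<Expr>, Op, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    Cmp(String, Op, i64),
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
}

fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
fn lit(value: i64) -> Expr { Expr::Literal(value) }
fn bin(left: Expr, op: Op, right: Expr) -> Expr {
    Expr::Binary(Box::new(left), op, Box::new(right))
}

// Best-effort conversion: returns a predicate that is never stricter than
// the original expression, or None if nothing useful can be pushed down.
fn to_predicate(expr: &Expr) -> Option<Predicate> {
    let Expr::Binary(left, op, right) = expr else {
        return None;
    };
    match (left.as_ref(), op, right.as_ref()) {
        (Expr::Column(name), Op::Gt | Op::Lt | Op::Eq, Expr::Literal(value)) => {
            Some(Predicate::Cmp(name.clone(), *op, *value))
        }
        // AND: keep whichever sides converted; a missing side only widens the filter.
        (_, Op::And, _) => match (to_predicate(left), to_predicate(right)) {
            (Some(l), Some(r)) => Some(Predicate::And(Box::new(l), Box::new(r))),
            (Some(p), None) | (None, Some(p)) => Some(p),
            (None, None) => None,
        },
        // OR: both sides are required, otherwise matching rows could be lost.
        (_, Op::Or, _) => Some(Predicate::Or(
            Box::new(to_predicate(left)?),
            Box::new(to_predicate(right)?),
        )),
        _ => None,
    }
}
```

With these definitions, `(A > 1 AND B = func(X)) OR (A < 0)` converts to `(A > 1) OR (A < 0)`, while `(B = func(X)) OR (A < 0)` converts to `None`, meaning no pruning predicate is pushed down at all.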

a-agmon · 2024-09-01T10:50:12Z

you can see what I mean in this test case:

    #[test]
    fn test_predicate_conversion_with_unsupported_condition_or() {
        let sql = "(foo > 1 and bar in ('test', 'test2')) or foo < 0 ";
        let df_schema = create_test_schema();
        let expr = SessionContext::new()
            .parse_sql_expr(sql, &df_schema)
            .unwrap();
        let predicate = convert_filters_to_predicate(&[expr]).unwrap();
        let expected_predicate = Predicate::or(
            Reference::new("foo").greater_than(Datum::long(1)),
            Reference::new("foo").less_than(Datum::long(0)),
        );
        assert_eq!(predicate, expected_predicate);
    }

Although we don't support bar in ('test', 'test2'), there is no reason to fail the entire conversion and prevent you from filtering files according to what we can support.
Do you see my point here? Or perhaps I misunderstood your comment.

sdd · 2024-09-01T11:20:27Z

Aah I see! That makes complete sense. So, what you are saying is, we do the best we can to filter via metadata within iceberg, but whatever operations we can't handle in iceberg will get applied at row-level anyway once the data is passed back to DataFusion? If so, then I withdraw my objection as this seems totally sensible 👍

a-agmon · 2024-09-01T11:28:55Z

Precisely!
DataFusion is pretty fast at scanning and processing parquet files and record batches, but I think the major performance boost that Iceberg can bring comes from using its metadata to filter out and prune data files. So we make a best effort to prune using the predicate and let DF handle the rest (some of my tests on huge tables show a great performance boost from this).
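To make the pruning idea concrete, here is a deliberately simplified sketch of metadata-based file pruning using per-file min/max statistics for a single column. The `DataFile` struct, its fields, and the `Predicate` enum are assumptions made up for this illustration; Iceberg's actual metadata evaluation is considerably more involved.

```rust
// Hypothetical per-file statistics for a single integer column `foo`.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    path: &'static str,
    foo_min: i64,
    foo_max: i64,
}

// Hypothetical predicate over `foo` only.
enum Predicate {
    FooGreaterThan(i64),
    FooLessThan(i64),
    Or(Box<Predicate>, Box<Predicate>),
}

// Returns false only when the statistics prove that no row can match.
fn may_contain_matches(file: &DataFile, predicate: &Predicate) -> bool {
    match predicate {
        Predicate::FooGreaterThan(v) => file.foo_max > *v,
        Predicate::FooLessThan(v) => file.foo_min < *v,
        Predicate::Or(l, r) => may_contain_matches(file, l) || may_contain_matches(file, r),
    }
}

// Keeps only the files that might contain matching rows; the remaining
// row-level filtering is left to the query engine.
fn prune<'a>(files: &'a [DataFile], predicate: &Predicate) -> Vec<&'a DataFile> {
    files
        .iter()
        .filter(|f| may_contain_matches(f, predicate))
        .collect()
}
```

For `foo > 1 OR foo < 0`, a file whose `foo` values all lie in `[0, 1]` is skipped without being read, while files with ranges `[5, 50]` or `[-10, -1]` are kept and filtered row by row afterwards.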

crates/integrations/datafusion/src/physical_plan/scan.rs

a-agmon · 2024-09-02T08:47:15Z

Hey @liurenjie1024 ,
I think this one is ready to go :)

liurenjie1024

Thanks @a-agmon for this PR! This helps to improve the DataFusion integration, but there are still some missing parts.

crates/integrations/datafusion/src/physical_plan/predicate_converter.rs

a-agmon · 2024-09-02T19:55:13Z

Thanks much for the review and comments @liurenjie1024
I refactored the code to be based on TreeNodeVisitor now, and all unit tests moved there.
I initially thought that TreeNodeVisitor would add an unnecessary layer of complexity, but it actually seems to simplify things in some sense. Thanks
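As a rough illustration of why a bottom-up visitor can simplify the conversion, the sketch below uses a hand-rolled `ExprVisitor` trait with a single `f_up` hook and an explicit stack of intermediate results. The trait, the `walk` function, and all types are hypothetical stand-ins loosely modelled on the idea of a post-order tree visitor; they are not DataFusion's `TreeNodeVisitor` API and not the code from this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op { Gt, Lt, Eq, And, Or }

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Func(String), // an unsupported leaf
    Binary(Box<Expr>, Op, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    Cmp(String, Op, i64),
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
}

// Hypothetical visitor trait: `f_up` runs after a node's children were visited.
trait ExprVisitor {
    fn f_up(&mut self, node: &Expr);
}

// Post-order traversal: children first, then the node itself.
fn walk(expr: &Expr, visitor: &mut impl ExprVisitor) {
    if let Expr::Binary(left, _, right) = expr {
        walk(left, visitor);
        walk(right, visitor);
    }
    visitor.f_up(expr);
}

// Intermediate result of converting one node.
enum Item { Column(String), Literal(i64), Pred(Predicate), Unsupported }

#[derive(Default)]
struct Converter { stack: Vec<Item> }

impl ExprVisitor for Converter {
    fn f_up(&mut self, node: &Expr) {
        let item = match node {
            Expr::Column(name) => Item::Column(name.clone()),
            Expr::Literal(value) => Item::Literal(*value),
            Expr::Func(_) => Item::Unsupported,
            Expr::Binary(_, op, _) => {
                // Children were pushed left then right, so pop in reverse order.
                let right = self.stack.pop().expect("right child was visited");
                let left = self.stack.pop().expect("left child was visited");
                match (left, *op, right) {
                    (Item::Column(c), Op::Gt | Op::Lt | Op::Eq, Item::Literal(v)) => {
                        Item::Pred(Predicate::Cmp(c, *op, v))
                    }
                    (Item::Pred(l), Op::And, Item::Pred(r)) => {
                        Item::Pred(Predicate::And(Box::new(l), Box::new(r)))
                    }
                    // Soft-fail: an AND keeps the side that could be converted.
                    (Item::Pred(p), Op::And, Item::Unsupported)
                    | (Item::Unsupported, Op::And, Item::Pred(p)) => Item::Pred(p),
                    (Item::Pred(l), Op::Or, Item::Pred(r)) => {
                        Item::Pred(Predicate::Or(Box::new(l), Box::new(r)))
                    }
                    _ => Item::Unsupported,
                }
            }
        };
        self.stack.push(item);
    }
}

fn convert(expr: &Expr) -> Option<Predicate> {
    let mut converter = Converter::default();
    walk(expr, &mut converter);
    match converter.stack.pop() {
        Some(Item::Pred(p)) => Some(p),
        _ => None,
    }
}
```

Because each node only inspects the already-converted results of its children, the match arms stay flat: there is no nested destructuring of sub-expressions, which is one way the visitor form may end up simpler than a recursive match.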

cc @sdd @FANNG1

liurenjie1024 · 2024-09-08T03:28:22Z