feat(substrait): modular substrait producer #13931

vbarua · 2024-12-27T19:04:50Z

Which issue does this PR close?

Rationale for this change

This is the producer equivalent to the consumer changes in #13803

Improves the reusability of the Substrait producer for users who wish to customize how DataFusion relations and expressions are converted into Substrait.

This is especially useful for controlling how UserDefinedLogicalNodes are converted into Substrait.

What changes are included in this PR?

Refactoring

Relation and expression handling code has been extracted into a series of from_* functions (e.g. from_projection, from_filter, from_between) to aid reuse.
A SubstraitProducer trait has been introduced with default methods that handle relations and expressions using the above functions.
All conversion methods take a &mut impl SubstraitProducer as their first argument.
SubstraitPlanningState has been fully removed, as it is no longer used anywhere.
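
The refactoring described above can be pictured with a small, self-contained sketch. Everything in it is a simplified stand-in: `LogicalPlan`, `Rel`, `DefaultProducer` and `CustomProducer` are toy types invented for illustration, and the method signatures are assumptions rather than the signatures in the DataFusion crate. It only shows the shape of the pattern: trait default methods delegate to free `from_*` functions, and a custom producer overrides the one case it cares about.

```rust
// Toy stand-ins for DataFusion and Substrait types (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalPlan {
    Projection(Vec<String>),
    Filter(String),
    Extension(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Rel(pub String);

pub type Result<T> = std::result::Result<T, String>;

pub trait SubstraitProducer: Sized {
    // Default methods delegate to free functions, so an implementor only
    // overrides the conversions it wants to customize.
    fn handle_plan(&mut self, plan: &LogicalPlan) -> Result<Rel> {
        to_substrait_rel(self, plan)
    }
    fn handle_projection(&mut self, columns: &[String]) -> Result<Rel> {
        from_projection(self, columns)
    }
    fn handle_filter(&mut self, predicate: &str) -> Result<Rel> {
        from_filter(self, predicate)
    }
    fn handle_extension(&mut self, name: &str) -> Result<Rel> {
        Err(format!("unsupported extension node: {}", name))
    }
}

// Dispatch goes back through the producer, so overrides are always honoured.
pub fn to_substrait_rel(producer: &mut impl SubstraitProducer, plan: &LogicalPlan) -> Result<Rel> {
    match plan {
        LogicalPlan::Projection(cols) => producer.handle_projection(cols),
        LogicalPlan::Filter(pred) => producer.handle_filter(pred),
        LogicalPlan::Extension(name) => producer.handle_extension(name),
    }
}

pub fn from_projection(_producer: &mut impl SubstraitProducer, columns: &[String]) -> Result<Rel> {
    Ok(Rel(format!("ProjectRel({})", columns.join(", "))))
}

pub fn from_filter(_producer: &mut impl SubstraitProducer, predicate: &str) -> Result<Rel> {
    Ok(Rel(format!("FilterRel({})", predicate)))
}

// Uses every default method unchanged.
pub struct DefaultProducer;
impl SubstraitProducer for DefaultProducer {}

// Overrides only the handling of extension (user-defined) nodes.
pub struct CustomProducer;
impl SubstraitProducer for CustomProducer {
    fn handle_extension(&mut self, name: &str) -> Result<Rel> {
        Ok(Rel(format!("ExtensionLeafRel({})", name)))
    }
}
```

In this sketch, `CustomProducer.handle_plan(...)` still uses the shared projection and filter code, while extension nodes take the overridden path.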

Code Changes

The conversion of joins has been simplified to no longer require a column offset when converting the join condition. This allowed for the removal of the col_ref_offset argument from all methods that used it, simplifying the API.

Are these changes tested?

These changes refactor existing code and rely on its existing tests.

A test was added to verify the behaviour when converting joins, as there is a code change there.

Are there any user-facing changes?

The top-level to_substrait_plan function now takes a &SessionState directly instead of a &dyn SubstraitPlanningState, which most users should not notice.
Public functions like to_substrait_rel and to_substrait_rex have had their signatures changed, but this should not affect most users.
The SubstraitPlanningState has been entirely removed. Its functionality has been superseded by the SubstraitConsumer and SubstraitProducer traits.

vbarua · 2024-12-27T19:49:16Z

datafusion/substrait/src/logical_plan/producer.rs

+
+    fn consume_plan(&mut self, plan: &LogicalPlan) -> Result<Box<Rel>> {
+        to_substrait_rel(self, plan)
+    }


Even though this is the SubstraitProducer, I used consume as the verb for the API as it consumes DataFusion and produces Substrait.

I thought about using produce_plan, produce_projection, etc. but found that pattern a little weird reading-wise.

For example, does produce_between create a Substrait Between expression (which does not exist), or does it convert a Between expression into a Substrait equivalent? Because DataFusion relations and expressions don't map 1-1 with Substrait, I found it easier to think of this as consuming DataFusion. Just my 2 cents.

Yeah, I agree that "produce" doesn't make sense here, as it's more logical to think of the functions in terms of processing DF concepts rather than in producing Substrait things. However, the "consume" in producer can be a bit confusing w.r.t "consumer" - would it make sense to use some alternative, like "from" (which is already used for the functions) or "handle", "process", or something?

There's actually a lint that stops me from using from. I think handle makes sense, I'll switch to that.

vbarua · 2024-12-27T19:51:28Z

datafusion/substrait/src/logical_plan/producer.rs

+        self.extensions
+    }
+
+    fn consume_extension(&mut self, plan: &Extension) -> Result<Box<Rel>> {


The following was copied from the existing code for handling LogicalPlan::Extension nodes found later on.

vbarua · 2024-12-27T19:53:50Z

datafusion/substrait/src/logical_plan/producer.rs

-    state: &dyn SubstraitPlanningState,
-) -> Result<Box<Plan>> {
-    let mut extensions = Extensions::default();
+pub fn to_substrait_plan(plan: &LogicalPlan, state: &SessionState) -> Result<Box<Plan>> {


The public API stays mostly the same, taking a &SessionState instead of a &dyn SubstraitPlanningState which most users shouldn't notice.

vbarua · 2024-12-27T19:55:52Z

datafusion/substrait/src/logical_plan/producer.rs

-                maintain_singular_struct: false,
-            });
+pub fn from_table_scan(
+    _producer: &mut impl SubstraitProducer,


This currently isn't used. However, in the future we're likely going to want to use this producer when converting the DataFusion schema into Substrait, especially after the logical type work lands and we can potentially add user-defined logical types.

likely

Indicates to me we shouldn't add it there yet, since there's a risk it won't be used :) And I think it'll be fine to add it later - it'll be an API break, but only for those customizing the usage, and at least it'll be a clear break.

since there's a risk it won't be used :)

Looking at this again, the presence of the filters field on the TableScan

datafusion/datafusion/expr/src/logical_plan/plan.rs

Line 2496 in fb1d4bc

pub filters: Vec<Expr>,

means that we can upgrade my "likely" to 100% as we'll need the producer to convert the filter expressions.

vbarua · 2024-12-27T19:58:14Z

datafusion/substrait/src/logical_plan/producer.rs

+    } else {
+        Operator::Eq
+    };
+    let join_on = to_substrait_join_expr(producer, &join.on, eq_op, &in_join_schema)?;


The code here has changed slightly. to_substrait_join_expr now takes the output schema of the join which makes it easier to process the join condition. More details below.

vbarua · 2024-12-27T19:58:51Z

datafusion/substrait/src/logical_plan/producer.rs

-            };
-            Ok(Box::new(Rel {
-                rel_type: Some(rel_type),
-            }))


Moved up into the DefaultSubstraitProducer

vbarua · 2024-12-27T20:10:40Z

datafusion/substrait/src/logical_plan/producer.rs

-            extensions,
-        )?;
+        let l = producer.consume_expr(left, join_schema)?;
+        let r = producer.consume_expr(right, join_schema)?;


We no longer need to track the column offset explicitly.

The column offset code was added as part of #6135 to handle queries like

SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.b = d2.e

which caused issues because the left and right inputs both had the same name. This could potentially cause column name references in DataFusion to be converted incorrectly into Substrait column indices in some cases. Additionally, there were issues with duplicate schema errors.

However, the introduction and usage of

datafusion/datafusion/substrait/src/logical_plan/consumer.rs

Lines 1772 to 1777 in a08dc0a

/// (Re)qualify the sides of a join if needed, i.e. if the columns from one side would otherwise
/// conflict with the columns from the other.
/// Substrait doesn't currently allow specifying aliases, neither for columns nor for tables. For
/// Substrait the names don't matter since it only refers to columns by indices, however DataFusion
/// requires columns to be uniquely identifiable, in some places (see e.g. DFSchema::check_names).
fn requalify_sides_if_needed(

which was needed because different tables might have columns with the same name, now means that we can concatenate the left and right schemas of a join without issues. Then if the DataFusion column references are unambiguous, we can simply look up the columns in the join schema to get the correct index.
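
To make the lookup idea concrete, here is a minimal sketch of resolving qualified column references against a concatenated join schema. The `Field` type and the helper functions are hypothetical and are not DataFusion's `DFSchema` API. The five-column layout of `data` is also an assumption, chosen only so that the columns from the query above (`b`, `c`, `e`) exist.

```rust
// Hypothetical, simplified schema field: a table qualifier plus a column name.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub qualifier: String,
    pub name: String,
}

// Build the fields of one join input under a given qualifier (e.g. "d1").
pub fn qualified(qualifier: &str, names: &[&str]) -> Vec<Field> {
    names
        .iter()
        .map(|n| Field {
            qualifier: qualifier.to_string(),
            name: n.to_string(),
        })
        .collect()
}

// Concatenate the two input schemas: left fields first, then right fields.
pub fn join_schema(left: &[Field], right: &[Field]) -> Vec<Field> {
    left.iter().chain(right.iter()).cloned().collect()
}

// Position of a qualified column in the combined schema. Because right-side
// fields already sit after the left-side ones, no separate offset is needed.
pub fn index_of(schema: &[Field], qualifier: &str, name: &str) -> Option<usize> {
    schema
        .iter()
        .position(|f| f.qualifier == qualifier && f.name == name)
}
```

With `d1` and `d2` each exposing columns `a` through `e`, `d1.b` resolves to index 1 and `d2.e` to index 9 in the combined schema, which is the same result an explicit right-side offset of 5 would have produced.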

This was the only place where the column offset was used. Removing it here allowed me to remove the col_ref_offset argument from a number of functions, which IMO simplifies the API substantially.

For further verification, a test has been added in roundtrip_logical_plan.rs

I'm not sure I follow how the requalify_sides_if_needed (added by me in #11049, just for reference) affects the need for this handling, given it's on the consumer side and this is on the producer. https://github.com/apache/datafusion/pull/6135/files#r1215611954 seems to indicate the re-added test doesn't catch the issue. Does this change affect the produced substrait plan?

Does this change affect the produced substrait plan?

Checking with the test I added, it does not actually.

Taking a step back as well. When I was updating this code I was curious why the code in #6135 was added, and I may have misunderstood what it was trying to fix.

What I did notice when refactoring this code was that we were already concatenating the schemas and then using that to convert the filter on the join

datafusion/datafusion/substrait/src/logical_plan/producer.rs

Lines 465 to 475 in 9b5995f

let in_join_schema = join.left.schema().join(join.right.schema())?;
let join_filter = match &join.filter {
    Some(filter) => Some(to_substrait_rex(
        state,
        filter,
        &Arc::new(in_join_schema),
        0,
        extensions,
    )?),
    None => None,
};

If it can be used to convert the filter, which itself can contain expressions referencing either side of the join, then it should be possible to use that schema to convert the expressions in the join condition as well. Based on this, I removed the column offset code.

As I understand it, something like

SELECT * FROM foo JOIN foo ON id = id

where foo has an id column, is rejected with

Schema error: Ambiguous reference to unqualified field table_name

If we qualify both sides

SELECT * FROM foo l JOIN foo r ON l.id = r.id

it works fine. If DataFusion rejects queries where a column name reference is ambiguous, it should generally be possible to look up the column in the combined schema.
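
The argument can be illustrated with a toy resolver. This is a simplified model and not DataFusion's actual name resolution: `Field`, `ResolveError` and `resolve` are hypothetical names. The point is only that an unqualified reference matching both sides is rejected, while a qualified one maps to exactly one index in the combined schema.

```rust
// Hypothetical schema field with an optional table qualifier.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    NotFound,
    Ambiguous,
}

// Resolve a column reference to its index in the combined join schema.
// An unqualified reference (qualifier == None) matches on name alone.
pub fn resolve(
    schema: &[Field],
    qualifier: Option<&str>,
    name: &str,
) -> Result<usize, ResolveError> {
    let mut matches = schema
        .iter()
        .enumerate()
        .filter(|(_, f)| {
            f.name == name
                && match qualifier {
                    Some(q) => f.qualifier.as_deref() == Some(q),
                    None => true,
                }
        })
        .map(|(i, _)| i);

    // Pull at most two matches: zero is an error, one is the answer,
    // two or more means the reference is ambiguous.
    match (matches.next(), matches.next()) {
        (None, _) => Err(ResolveError::NotFound),
        (Some(i), None) => Ok(i),
        (Some(_), Some(_)) => Err(ResolveError::Ambiguous),
    }
}
```

Once ambiguous references have been rejected up front, every remaining reference has a single position in the concatenated schema, so a plain index lookup is enough.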

Your work in #11049 made it possible to read plans where both sides of the join had columns with the same name, which would otherwise fail. That probably affected the testing code, but not the producer behaviour.

vbarua · 2024-12-27T20:19:38Z

datafusion/substrait/src/logical_plan/producer.rs

+        Expr::Negative(arg) => ("negate", arg),
+        expr => not_impl_err!("Unsupported expression: {expr:?}")?,
+    };
+    to_substrait_unary_scalar_fn(producer, fn_name, arg, schema)


Consolidated the handling of unary expressions like Not, IsNull, IsNotNull, etc. into a single function for improved readability.
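
A rough sketch of this consolidation, using a toy `Expr` enum rather than DataFusion's: one match picks a function name and the single argument, and one shared code path builds the call. Only `"negate"` appears in the diff above. The other function names, and the `render` helper that prints the result as text, are assumptions made for illustration.

```rust
// Toy expression type (hypothetical, much smaller than DataFusion's Expr).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Not(Box<Expr>),
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
    Negative(Box<Expr>),
}

// Single place that maps every unary variant to (function name, argument).
pub fn unary_fn_name_and_arg(expr: &Expr) -> Result<(&'static str, &Expr), String> {
    match expr {
        Expr::Not(arg) => Ok(("not", arg.as_ref())),
        Expr::IsNull(arg) => Ok(("is_null", arg.as_ref())),
        Expr::IsNotNull(arg) => Ok(("is_not_null", arg.as_ref())),
        Expr::Negative(arg) => Ok(("negate", arg.as_ref())),
        other => Err(format!("Unsupported expression: {:?}", other)),
    }
}

// Shared conversion path: here it just renders a textual call, standing in
// for building a Substrait scalar function with one argument.
pub fn render(expr: &Expr) -> Result<String, String> {
    if let Expr::Column(name) = expr {
        return Ok(name.clone());
    }
    let (fn_name, arg) = unary_fn_name_and_arg(expr)?;
    Ok(format!("{}({})", fn_name, render(arg)?))
}
```

Adding another unary expression then means adding one match arm instead of another near-identical conversion function.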

vbarua · 2024-12-27T21:42:01Z

@Blizzara, @ccciudatu I would appreciate if y'all could take a look when you have an opportunity.

Blizzara

Looks great, thanks @vbarua! I left some comments or thoughts, but nothing major.

Blizzara · 2024-12-28T12:16:17Z

datafusion/substrait/src/logical_plan/producer.rs

+
+    fn consume_plan(&mut self, plan: &LogicalPlan) -> Result<Box<Rel>> {
+        to_substrait_rel(self, plan)
+    }


Yeah, I agree that "produce" doesn't make sense here, as it's more logical to think of the functions in terms of processing DF concepts rather than in producing Substrait things. However, the "consume" in producer can be a bit confusing w.r.t "consumer" - would it make sense to use some alternative, like "from" (which is already used for the functions) or "handle", "process", or something?

Blizzara · 2024-12-28T12:28:49Z

datafusion/substrait/src/logical_plan/producer.rs

+
+    fn consume_scalar_function(
+        &mut self,
+        scalar_fn: &expr::ScalarFunction,


I guess this is to de-conflict with Substrait's ScalarFunction, which is imported? 👍

That is indeed the case.

Blizzara · 2024-12-28T12:50:17Z

datafusion/substrait/src/logical_plan/producer.rs

-                maintain_singular_struct: false,
-            });
+pub fn from_table_scan(
+    _producer: &mut impl SubstraitProducer,


likely

Indicates to me we shouldn't add it there yet, since there's a risk it won't be used :) And I think it'll be fine to add it later - it'll be an API break, but only for those customizing the usage, and at least it'll be a clear break.

Blizzara · 2024-12-28T12:51:08Z

datafusion/substrait/src/logical_plan/producer.rs

+    if e.produce_one_row {
+        return not_impl_err!("Producing a row from empty relation is unsupported");
+    }
+    #[allow(deprecated)]


I think previously it was allowed at an even higher level so this is fine, but ooc, what's deprecated in all these?

The deprecation warnings are for fields on the generated protobufs. For example, here it's the values field on the VirtualTable.

My intent with moving the #[allow(deprecated)] to the statement declarations was to associate it more tightly with the code that uses the deprecated fields.
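
As an illustration of that scoping, the sketch below puts `#[allow(deprecated)]` on a single `let` statement. The `VirtualTable` struct here is a hand-written stand-in with `String` fields, not the generated Substrait protobuf type, and `empty_virtual_table` is a hypothetical helper.

```rust
// Hand-written stand-in for a generated struct with one deprecated field.
pub struct VirtualTable {
    #[deprecated(note = "use `expressions` instead")]
    pub values: Vec<String>,
    pub expressions: Vec<String>,
}

pub fn empty_virtual_table() -> VirtualTable {
    // The exemption covers only this statement, which has to name the
    // deprecated field in order to construct the struct.
    #[allow(deprecated)]
    let table = VirtualTable {
        values: vec![],
        expressions: vec![],
    };
    table
}
```

Any other use of `values` elsewhere in the same function or module would still trigger the deprecation warning, which is the benefit of keeping the attribute narrow.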

Blizzara · 2024-12-28T13:48:50Z

datafusion/substrait/src/logical_plan/producer.rs

+        let substrait_expr = producer.consume_expr(expr.as_ref(), schema)?;
+        let substrait_low = producer.consume_expr(low.as_ref(), schema)?;
+        let substrait_high = producer.consume_expr(high.as_ref(), schema)?;


unrelated to this PR and probs better not to change now to keep diff small(er), but I think there's no reason to duplicate these below, they could just happen above the if

probs better not to change now to keep diff small(er),

I agree. Do small code improvements also require a full issue to be linked to them?

Blizzara · 2024-12-28T14:05:33Z

datafusion/substrait/src/logical_plan/producer.rs

+
+    fn consume_extension(&mut self, plan: &Extension) -> Result<Box<Rel>> {
+        let extension_bytes = self
+            .state


I think this is the only use for SessionState in the DefaultSubstraitProducer, so presumably it wouldn't need the full state to operate with... But given users have the option of making their own producer if they care, maybe that's fine and better to just have the state here for future needs?

Actually, this makes me want to only store the SerializerRegistry. If we need the state (or other data) we can always add it in later.

Part of the reason to switch to the producer trait is that we can modify the internal details of the DefaultSubstraitProducer without it impacting users.
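
A minimal sketch of that idea, with toy types: the constructor still accepts the whole session state, but the producer keeps only the serializer registry. The `SerializerRegistry` trait, `SessionState` struct, `NameBytesRegistry` and method names below are simplified assumptions and do not mirror the real DataFusion definitions. Sharing the registry through an `Arc` is likewise just one possible choice.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for a registry that serializes user-defined nodes.
pub trait SerializerRegistry {
    fn serialize(&self, node_name: &str) -> Result<Vec<u8>, String>;
}

// Toy registry: "serializes" a node as the bytes of its name.
pub struct NameBytesRegistry;

impl SerializerRegistry for NameBytesRegistry {
    fn serialize(&self, node_name: &str) -> Result<Vec<u8>, String> {
        Ok(node_name.as_bytes().to_vec())
    }
}

// Toy session state holding the registry plus unrelated configuration.
pub struct SessionState {
    pub serializer_registry: Arc<dyn SerializerRegistry>,
    pub config: HashMap<String, String>,
}

// The producer stores only what it needs today. The field is private, so it
// can be swapped or extended later without changing the public constructor.
pub struct DefaultProducer {
    serializer_registry: Arc<dyn SerializerRegistry>,
}

impl DefaultProducer {
    pub fn new(state: &SessionState) -> Self {
        DefaultProducer {
            serializer_registry: Arc::clone(&state.serializer_registry),
        }
    }

    pub fn handle_extension(&self, node_name: &str) -> Result<Vec<u8>, String> {
        self.serializer_registry.serialize(node_name)
    }
}
```

Because callers only ever see `DefaultProducer::new(&state)`, adding more state to the producer later would not change how it is constructed.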

Blizzara · 2024-12-28T14:22:05Z

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

@@ -571,6 +571,21 @@ async fn roundtrip_self_implicit_cross_join() -> Result<()> {
    roundtrip("SELECT left.a left_a, left.b, right.a right_a, right.c FROM data AS left, data AS right").await
 }

+#[tokio::test]
+async fn self_join_introduces_aliases() -> Result<()> {


This is adding back this test, right? I seem to have argued back then that it is unnecessary given the roundtrip_self_join test.

This is adding that test back with the SubqueryAlias, yes. If you think it's redundant with roundtrip_self_join I'm happy to remove it.

alamb · 2024-12-30T10:38:29Z

This PR is looking nice -- @vbarua and @Blizzara let me know when it is ready for a final review.

vbarua · 2024-12-30T20:55:30Z

Thanks for the feedback @Blizzara. Went ahead and made changes based on what you suggested, and also answered some questions.

Blizzara

Thanks, LGTM! 🚀

vbarua added 3 commits December 27, 2024 11:02

feat(substrait): modular substrait producer

dcb9db4

refactor(substrait): simplify col_ref_offset handling in producer

1183d45

refactor(substrait): remove column offset tracking from producer

8a14160

github-actions bot added the substrait label Dec 27, 2024

docs(substrait): document SubstraitProducer

4a464cb

vbarua force-pushed the vbarua/modular-substrait-producer branch from a8273c0 to 4a464cb Compare December 27, 2024 19:42

vbarua commented Dec 27, 2024

vbarua added 3 commits December 27, 2024 12:21

refactor: minor cleanup

22bcc94

feature: remove unused SubstraitPlanningState

6839d33

BREAKING CHANGE: SubstraitPlanningState is no longer available

refactor: cargo fmt

1f547dc

vbarua marked this pull request as ready for review December 27, 2024 21:40

Blizzara reviewed Dec 28, 2024

vbarua added 6 commits December 30, 2024 10:05

refactor(substrait): consume_ -> handle_

d3400f4

refactor(substrait): expand match blocks

d962dc3

refactor: DefaultSubstraitProducer only needs serializer_registry

aa9e6f3

refactor: remove unnecessary warning suppression

af9c8a5

fix(substrait): route expr conversion through handle_expr

cf762a2

cargo fmt

85106f3

Blizzara approved these changes Jan 3, 2025
