Implement TPCH substrait integration test, support tpch_1 #10842

Lordworms · 2024-06-09T20:41:18Z

Which issue does this PR close?

part of #10710

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Lordworms · 2024-06-09T20:47:53Z

datafusion/substrait/src/logical_plan/consumer.rs

@@ -1253,6 +1335,8 @@ fn from_substrait_type(
            r#type::Kind::Struct(s) => Ok(DataType::Struct(from_substrait_struct_type(
                s, dfs_names, name_idx,
            )?)),
+            r#type::Kind::Varchar(_) => Ok(DataType::Utf8),


For now, Varchar is mapped directly to Utf8.

Lordworms · 2024-06-09T21:25:58Z

Another problem is that the JSON file seems to do the aggregation after the projection, so the logical plan generated from this proto also aggregates after projecting.

We also do not have alias support, so the expressions are recalculated in the Aggregate. The resulting logical plan does the same calculations twice, once in the projection and once in the aggregation.

Since we just generate the plan from a JSON file, I don't know whether we need to integrate the optimizer in our test here.

Lordworms · 2024-06-09T22:06:43Z

datafusion/substrait/tests/testdata/tpch/lineitem.csv

@@ -0,0 +1,2 @@
+l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
+1,1,1,1,17,21168.23,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','egular courts above the'


To minimize the repo size, I uploaded only a CSV version of the lineitem table.

alamb

Thank you @Lordworms -- this is very cool.

I have some readability / organization / documentation suggestions that I think would help this PR, but we could also make them in a follow-on PR.

cc @waynexia and @Blizzara

Once we merge this PR what do you think about creating tickets to track supporting the other queries in TPCH? Or maybe we can just use #10710 🤔

alamb · 2024-06-10T14:51:41Z

datafusion/substrait/tests/testdata/tpch/lineitem.csv

@@ -0,0 +1,2 @@
+l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
+1,1,1,1,17,21168.23,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','egular courts above the'


I think a single line row is fine 👍

alamb · 2024-06-10T14:55:56Z

datafusion/substrait/tests/cases/mod.rs

@@ -19,3 +19,4 @@ mod logical_plans;
 mod roundtrip_logical_plan;
 mod roundtrip_physical_plan;
 mod serialize;
+mod tpch;


What do you think about renaming this module to consumer_integration to make it clearer that this is an integration test of existing substrait plans?

alamb · 2024-06-10T14:57:52Z

datafusion/substrait/tests/testdata/query_1.json

@@ -0,0 +1,810 @@
+{


Can you please:

1. Move this file into a directory that makes it clearer where it came from, perhaps datafusion/substrait/tests/testdata/tpch_substrait_plans/query_1.json
2. Add a README.md file in datafusion/substrait/tests/testdata/tpch_substrait_plans that explains that the files came from https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans

alamb · 2024-06-10T15:00:27Z

datafusion/substrait/tests/cases/tpch.rs

+// specific language governing permissions and limitations
+// under the License.
+
+//! tests contains in <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>


This is very cool -- thank you. I think the context of this PR may be lost after merge, so some more documentation might help.

Something like

Suggested change

-//! tests contains in <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>
+//! TPCH `substrait_consumer` tests
+//!
+//! This module tests that substrait plans as json encoded protobuf can be
+//! correctly read as DataFusion plans.
+//!
+//! The input data comes from <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans>

Got it

alamb · 2024-06-10T15:02:50Z

datafusion/substrait/src/logical_plan/consumer.rs

+                fn extract_filename(name: &str) -> Option<String> {
+                    let corrected_url =
+                        if name.starts_with("file://") && !name.starts_with("file:///") {
+                            name.replacen("file://", "file:///", 1)


This makes all URLs absolute. Is that intended?

The file names in those JSON files all start with FILE://, which the url library cannot parse, so I added this transformation. Alternatively, we could parse the string directly.
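The direct string parsing mentioned here could look roughly like the sketch below. It is only an illustration, not the code in this PR: the helper name `file_name_from_uri` is hypothetical, it uses only the standard library, and it assumes that the last path segment is the part that matters.

```rust
/// Hypothetical helper (not part of this PR): return the last path segment of
/// a `file://` URI, accepting the scheme in any letter case (e.g. `FILE://`).
fn file_name_from_uri(uri: &str) -> Option<String> {
    let prefix = "file://";
    // `get` returns None if the input is shorter than the prefix.
    let head = uri.get(..prefix.len())?;
    if !head.eq_ignore_ascii_case(prefix) {
        return None;
    }
    // Everything after the scheme; works for both `file://x` and `file:///x`.
    let rest = &uri[prefix.len()..];
    let name = rest.rsplit('/').next()?;
    if name.is_empty() {
        None
    } else {
        Some(name.to_string())
    }
}
```

Because the scheme is compared case-insensitively and the path is never normalized, this variant would not need the `file://` to `file:///` rewrite, at the cost of ignoring the rest of the URI.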

alamb · 2024-06-10T15:03:46Z

datafusion/substrait/src/logical_plan/consumer.rs

+                }
+
+                // we could use the file name to check the original table provider
+                // TODO: currently does not support multiple local files


Should we file a ticket for this feature?

Filed #10864

alamb · 2024-06-10T15:04:14Z

datafusion/substrait/src/logical_plan/consumer.rs

+                                .iter()
+                                .map(|item| item.field as usize)
+                                .collect();
+                            match &t {


I think if you matched on t you could avoid the scan.clone() later on

My bad, I'll fix it.
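The general idea behind this suggestion can be shown with a small standalone sketch. The `Plan` and `Scan` types and the `with_projection` function below are hypothetical stand-ins, not DataFusion's types: matching on the owned value moves the scan out of the enum, so it can be modified and reused without a clone.

```rust
// Hypothetical stand-ins for a logical plan and a table scan node.
#[derive(Debug, Clone, PartialEq)]
struct Scan {
    table: String,
    projection: Option<Vec<usize>>,
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(Scan),
    Other,
}

/// Matching on `plan` (by value) instead of `&plan` gives ownership of the
/// inner `Scan`, so no `scan.clone()` is needed before modifying it.
fn with_projection(plan: Plan, columns: Vec<usize>) -> Plan {
    match plan {
        Plan::Scan(mut scan) => {
            scan.projection = Some(columns);
            Plan::Scan(scan)
        }
        other => other,
    }
}
```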

Blizzara

Thanks, this makes sense to me - mostly left some nits!

Blizzara · 2024-06-10T15:39:07Z

datafusion/substrait/src/logical_plan/consumer.rs

@@ -22,6 +22,9 @@ use datafusion::arrow::datatypes::{
 use datafusion::common::{
    not_impl_err, substrait_datafusion_err, substrait_err, DFSchema, DFSchemaRef,
 };
+use substrait::proto::expression::literal::IntervalDayToSecond;


nit: can we combine/move these into the other substrait imports below?

Blizzara · 2024-06-10T15:39:50Z

datafusion/substrait/src/logical_plan/consumer.rs

+
+                if lf.items.len() > 1 || filename.is_none() {
+                    return not_impl_err!(
+                        "Only NamedTable and VirtualTable reads are supported"


Suggested change

-"Only NamedTable and VirtualTable reads are supported"
+"Only single file reads are supported"

Blizzara · 2024-06-10T15:41:07Z

datafusion/substrait/src/logical_plan/consumer.rs

+                let table_reference = TableReference::Bare { table: name.into() };
+                let t = ctx.table(table_reference).await?;
+                let t = t.into_optimized_plan()?;
+                match &read.projection {


Is this logic the same as for NamedTable? If so, maybe extract it into a function or reuse it in some way?

Blizzara · 2024-06-10T15:43:47Z

datafusion/substrait/src/logical_plan/consumer.rs

+            _ => {
+                not_impl_err!("Only NamedTable and VirtualTable reads are supported")
+            }


Suggested change

-_ => {
-    not_impl_err!("Only NamedTable and VirtualTable reads are supported")
-}
+_ => not_impl_err!("Unsupported ReadType: {:?}", &read.as_ref().read_type),
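A minimal standalone sketch of why the suggested message is more useful: it names the variant that was actually received. The `ReadKind` enum and `plan_read` function are hypothetical stand-ins, and a plain `Result<String, String>` replaces DataFusion's `not_impl_err!` so the example needs only the standard library.

```rust
// Hypothetical stand-in for the Substrait read type (not the real proto enum).
#[derive(Debug)]
enum ReadKind {
    NamedTable(String),
    LocalFiles(Vec<String>),
    ExtensionTable,
}

/// Returns a description of the scan, or an error naming the unsupported input.
fn plan_read(read: &ReadKind) -> Result<String, String> {
    match read {
        ReadKind::NamedTable(name) => Ok(format!("scan table {}", name)),
        ReadKind::LocalFiles(files) if files.len() == 1 => {
            Ok(format!("scan file {}", files[0]))
        }
        ReadKind::LocalFiles(_) => Err("Only single file reads are supported".to_string()),
        // The Debug output tells the reader which variant was not handled.
        other => Err(format!("Unsupported ReadType: {:?}", other)),
    }
}
```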

Blizzara · 2024-06-10T15:46:15Z

datafusion/substrait/src/logical_plan/consumer.rs

@@ -810,14 +885,21 @@ pub async fn from_substrait_agg_func(
            f.function_reference
        );
    };
-
+    let function_name = function_name.split(':').next().unwrap_or(function_name);


I guess this is the same idea as in

datafusion/datafusion/substrait/src/logical_plan/consumer.rs

Line 132 in 5912025

let name = match name.rsplit_once(':') {

? Might be worth consolidating those into a helper function as well.
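One possible shape for such a helper, as a sketch only. The name `base_function_name` is hypothetical, and it assumes that Substrait compound names put the signature after the last colon (for example `sum:dec`).

```rust
/// Hypothetical helper (not part of this PR): strip the signature suffix from
/// a Substrait compound function name, e.g. "sum:dec" -> "sum".
/// Names without a colon are returned unchanged.
fn base_function_name(name: &str) -> &str {
    match name.rsplit_once(':') {
        Some((base, _signature)) => base,
        None => name,
    }
}
```

Note that the two call sites quoted above are not quite equivalent: `split(':').next()` cuts at the first colon, while `rsplit_once(':')` cuts at the last one. They agree for names with a single colon, so a shared helper would have to pick one behavior.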

Blizzara · 2024-06-10T16:02:03Z

datafusion/substrait/tests/cases/tpch.rs

+
+    async fn create_context() -> datafusion::common::Result<SessionContext> {
+        let ctx = SessionContext::new();
+        ctx.register_csv(


The Substrait plan indicates the files would be Parquet. I wonder if that will cause trouble now or later given that we use CSV (and it makes sense to use CSV here, I think).

Yes, the only concern with using a CSV file is that we could not parse the plan if the JSON file contained partitioned Parquet file names. For now I think it is fine.

Lordworms · 2024-06-10T18:07:12Z

I have resolved all the reviews provided. Thanks so much for your help, really appreciate it. @alamb @Blizzara

alamb

Looks great -- thank you so much @Lordworms 🙏

alamb · 2024-06-11T01:24:54Z

datafusion/substrait/src/logical_plan/consumer.rs

+                }
+
+                // we could use the file name to check the original table provider
+                // TODO: currently does not support multiple local files


Filed #10864

github-actions bot added the substrait label Jun 9, 2024

Lordworms force-pushed the issue_10710 branch from 04e8820 to 3c511a2 Compare June 9, 2024 21:59

support tpch_1 consumer_producer_test

076208d

Lordworms force-pushed the issue_10710 branch from 3c511a2 to 076208d Compare June 9, 2024 22:06

Lordworms marked this pull request as ready for review June 9, 2024 22:23

alamb approved these changes Jun 10, 2024

Blizzara reviewed Jun 10, 2024

Lordworms mentioned this pull request Jun 10, 2024

Support for Multiple Local Files in Substrait ReadType::LocalFiles #10857

Closed

refactor and optimize code

27d3f2b

alamb mentioned this pull request Jun 11, 2024

Add substrait support for multiple files in ReadType::LocalFiles #10864

Open

alamb approved these changes Jun 11, 2024

alamb changed the title ~~support tpch_1 consumer_producer_test~~ Implement TPCH substrait integration teset, support tpch_1 Jun 11, 2024

alamb merged commit 76f5110 into apache:main Jun 11, 2024
26 checks passed

alamb mentioned this pull request Jun 17, 2024

DataFusion weekly project plan (Andrew Lamb) - June 17, 2024 #10955

Closed

5 tasks

Blizzara mentioned this pull request Jun 28, 2024

fix: Support Substrait's compound names also for window functions #11163

Merged

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024

Implement TPCH substrait integration teset, support tpch_1 (apache#10842) ae9af3f

* support tpch_1 consumer_producer_test
* refactor and optimize code
