-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement TPCH substrait integration teset, support tpch_1 #10842
Conversation
@@ -1253,6 +1335,8 @@ fn from_substrait_type( | |||
r#type::Kind::Struct(s) => Ok(DataType::Struct(from_substrait_struct_type( | |||
s, dfs_names, name_idx, | |||
)?)), | |||
r#type::Kind::Varchar(_) => Ok(DataType::Utf8), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently directly use Utf8
@@ -0,0 +1,2 @@ | |||
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment | |||
1,1,1,1,17,21168.23,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','egular courts above the' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in order to minimize the repo size, just upload a CSV version of lineitem table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a single line row is fine 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Lordworms -- this is very cool.
I have some readability / organization / documentation suggestions I think would help this PR but we could also make them as a follow on PR too
Once we merge this PR what do you think about creating tickets to track supporting the other queries in TPCH? Or maybe we can just use #10710 🤔
@@ -0,0 +1,2 @@ | |||
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment | |||
1,1,1,1,17,21168.23,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','egular courts above the' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a single line row is fine 👍
@@ -19,3 +19,4 @@ mod logical_plans; | |||
mod roundtrip_logical_plan; | |||
mod roundtrip_physical_plan; | |||
mod serialize; | |||
mod tpch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about renaming this module to consumer_integration
to make it clearer that this is an integration test of existing substrait plans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@@ -0,0 +1,810 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please
- Move this file into a directory that makes it clearer where it came from. Perhaps
datafusion/substrait/tests/testdata/tpch_substrait_plans/query_1.json
- add a README.md file in
datafusion/substrait/tests/testdata/tpch_substrait_plans
that explains the files came from https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans?
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! tests contains in <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool -- thank you . I think the context of this PR may be lost after merge so some more documentation might help
Something like
//! tests contains in <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans> | |
//! TPCH `substrait_consumer` tests | |
//! | |
//! This module tests that substrait plans as json encoded protobuf can be | |
//! correctly read as DataFusion plans. | |
//! | |
//! The input data comes from <https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! TPCH
substrait_consumer
tests
//!
//! This module tests that substrait plans as json encoded protobuf can be
//! correctly read as DataFusion plans.
//!
//! The input data comes from https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans
Got it
fn extract_filename(name: &str) -> Option<String> { | ||
let corrected_url = | ||
if name.starts_with("file://") && !name.starts_with("file:///") { | ||
name.replacen("file://", "file:///", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes all URLs absolute (is that intended)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the file name in those json files are all starts with FILE:// which makes it impossible for url librarary to parse so I did the transformation, or we could try direct string parse otherwise
} | ||
|
||
// we could use the file name to check the original table provider | ||
// TODO: currently does not support multiple local files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we file at ticket for this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #10864
.iter() | ||
.map(|item| item.field as usize) | ||
.collect(); | ||
match &t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you matched on t
you could avoid the scan.clone()
later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad, gotta fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this makes sense to me - mostly left some nits!
@@ -22,6 +22,9 @@ use datafusion::arrow::datatypes::{ | |||
use datafusion::common::{ | |||
not_impl_err, substrait_datafusion_err, substrait_err, DFSchema, DFSchemaRef, | |||
}; | |||
use substrait::proto::expression::literal::IntervalDayToSecond; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we combine/move these into the other substrait imports below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
|
||
if lf.items.len() > 1 || filename.is_none() { | ||
return not_impl_err!( | ||
"Only NamedTable and VirtualTable reads are supported" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Only NamedTable and VirtualTable reads are supported" | |
"Only single file reads are supported" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it!
let table_reference = TableReference::Bare { table: name.into() }; | ||
let t = ctx.table(table_reference).await?; | ||
let t = t.into_optimized_plan()?; | ||
match &read.projection { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this logic same as for NamedTable? If so, maybe extract into a function / reuse in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
_ => { | ||
not_impl_err!("Only NamedTable and VirtualTable reads are supported") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ => { | |
not_impl_err!("Only NamedTable and VirtualTable reads are supported") | |
} | |
_ => not_impl_err!("Unsupported ReadType: {:?}", &read.as_ref().read_type), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
@@ -810,14 +885,21 @@ pub async fn from_substrait_agg_func( | |||
f.function_reference | |||
); | |||
}; | |||
|
|||
let function_name = function_name.split(':').next().unwrap_or(function_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is same idea as in
let name = match name.rsplit_once(':') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absolutely
|
||
async fn create_context() -> datafusion::common::Result<SessionContext> { | ||
let ctx = SessionContext::new(); | ||
ctx.register_csv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the substrait plan indicates the files would be parquet, I wonder if that'll cause trouble now or later given we use csv (and it makes sense to use CSV here, I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the only concern for using csv file is that we could not parse if there are some partitioned parquet filenames in the json file. But right now I think it is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- thank you so much @Lordworms 🙏
} | ||
|
||
// we could use the file name to check the original table provider | ||
// TODO: currently does not support multiple local files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #10864
Which issue does this PR close?
part of #10710
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?