-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9654: [Rust][DataFusion] Add EXPLAIN <SQL>
statement
#7959
Conversation
@@ -784,6 +784,18 @@ pub enum LogicalPlan { | |||
/// Whether the CSV file contains a header | |||
has_header: bool, | |||
}, | |||
/// Produces a relation with string representations of | |||
/// various parts of the plan | |||
Explain { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a subsequent PR, I would like to suggest we consider making this Explain(ExplainNode)
or something, and having a separate struct ExplainNode
with these fields. There is a bunch of boiler plate code for matching and reconstructing LogicalPlanner
nodes that I think would be less awkward (and require fewer changes to add / modify fields) if they were collected into a struct
@@ -45,7 +50,7 @@ impl ProjectionPushDown { | |||
} | |||
|
|||
fn optimize_plan( | |||
&self, | |||
&mut self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to make this mut
so that optimize_explain
can call OptimizerRule::optimize
which requires a mut
reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO that is fine, these optimizers are just implementation structs anyway, so changing state has no impact.
3e077fc
to
037e3f6
Compare
I am sure this PR conflicts with #7880... FYI @jorgecarleitao |
This makes me very happy! I took a quick skim through and LGTM but will review more tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this PR.
Looks really good and it is definitely useful! Thanks for it, @alamb !
I left minor comments, and I have 4 questions:
Physical Plan?
We are declaring EXPLAIN
as a LogicalPlan and then convert it to a Physical. I always understood EXPLAIN as a different type of statement, like the CreateExternalTable, the main reason being that it does not depend on data, just on the plan it explains, options, and the optimizers to considered.
Looking at CreateExternalTable, we do not construct a physical plan out of it. Wouldn't it be simpler if we do not create a physical plan, like we do for CreateExternalTable, since we do not need to multi-thread it nor run through partitions nor batches?
Plans as {:#?}
If I read correctly, we currently return the result of the explain as the debug ({:#?}
) of the plans. Shouldn't we reserve the debug format for our internal representation, and introduce a new format to EXPLAIN
(not necessarily in this PR)?
Other formats?
One thing I dislike about spark is that EXPLAIN is in a not-so-nice format to parse programmatically. One idea is to allow explain
to return the result in json. This is e.g. useful to construct a visual graph representation of the plan. One idea would be to support another string parameter, format
, and make it default to string
(not necessarily in this PR).
Support on table
API?
We should consider adding .explain
on the table.rs
.
rust/datafusion/src/logicalplan.rs
Outdated
#[derive(Debug, Clone, PartialEq)] | ||
pub struct StringifiedPlan { | ||
/// An identifier of what type of plan this string represents | ||
pub plan_type: Arc<String>, // TODO make this an enum? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say so, this is a finite and well defined number of cases.
|
||
/// returns true if this plan should be displayed. Generally | ||
/// `verbose_mode = true` will display all available plans | ||
pub fn should_display(&self, verbose_mode: bool) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would try to be systematic with the name: if LogicalPlan::Explain::verbose
, then verbose_mode
here or vice-versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an enum might be cleaner too -- let me give it a shot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao -- check this out: 7e9e42c
@@ -45,7 +50,7 @@ impl ProjectionPushDown { | |||
} | |||
|
|||
fn optimize_plan( | |||
&self, | |||
&mut self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO that is fine, these optimizers are just implementation structs anyway, so changing state has no impact.
rust/datafusion/tests/sql.rs
Outdated
let expected = "\"logical_plan\"\t\"Projection: #c1\\n Selection: #c2 Gt Int64(10)\\n TableScan: aggregate_test_100 projection=None\"".to_string(); | ||
assert_eq!(expected, actual); | ||
|
||
// Also, expect same result with case explain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with lowercase explain
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch. Will fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 639db5f
register_aggregate_csv_by_sql(&mut ctx); | ||
let sql = "EXPLAIN VERBOSE SELECT c1 FROM aggregate_test_100 where c2 > 10"; | ||
let actual = execute(&mut ctx, sql).join("\n"); | ||
// Don't actually test the contents of the debuging output (as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! :)
Thanks @jorgecarleitao
Yes, you are right that we could special case explain not to actually run a physical plan. I was thinking there are some benefits to using a physical plan, as follows:
I admit these are both forward looking features with likely no practical import in the near term. However, I felt making
I am torn on this and have no strong opinion. I could go either way -- I think that most of the information that is typically in explain plans is useful for developers, and thus there would always be substantial overlap between a Debug format and an explain format. But maybe the differences between them could justify the duplication? I am not sure.
I like this idea. The typical thing I have used in the past for such purposes is https://graphviz.org/ and
Good idea -- I defer to @andygrove on how he sees the DataFusion CLI evolving -- aka if he wants to mirror the SQLite style |
I agree with your answers, the changes, and ready to merge for me. It is being great fun working on this string of PRs from each of us, @alamb and @andygrove ! 👍 |
I agree @jorgecarleitao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb and @jorgecarleitao this looks great. I will have some things to suggest as follow-ups most likely but nothing that should block this from being merged.
In order to help users and developers understand what DataFusion's planner is doing, this PR adds an `"EXPLAIN PLAN"` feature. All other database systems I have worked with have such a feature (e.g. see [MySql](https://dev.mysql.com/doc/refman/8.0/en/explain-output.html)). Example printout (the plans printed are simply the `std::fmt::Debug` representation of the plan structures:) ``` > explain SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status; +--------------+----------------------------------------------------------------------+ | plan_type | plan | +--------------+----------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=None | +--------------+----------------------------------------------------------------------+ 1 rows in set. Query took 0 seconds. ``` and ``` > explain VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status; +-----------------------------------------+----------------------------------------------------------------------+ | plan_type | plan | +-----------------------------------------+----------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=None | | logical_plan after projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=Some([6, 8]) | | logical_plan after type_coercion | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=Some([6, 8]) | | physical_plan | HashAggregateExec { | | | group_expr: [ | | | Column { | | | name: "status", | | | }, | | | ], | | | aggr_expr: [ | | | Count { | | | expr: Literal { | | | value: UInt8( | | | 1, | | | ), | | | }, | | | }, | | | ], | | | input: SelectionExec { | | | expr: BinaryExpr { | | | left: BinaryExpr { | | | left: Column { | | | name: "path", | | | }, | | | op: Eq, | | | right: Literal { | | | value: Utf8( | | | "/api/v2/write", | | | ), | | | }, | | | }, | | | op: And, | | | right: BinaryExpr { | | | left: Column { | | | name: "path", | | | }, | | | op: Eq, | | | right: Literal { | | | value: Utf8( | | | "foo", | | | ), | | | }, | | | }, | | | }, | | | input: DataSourceExec { | | | schema: Schema { | | | fields: [ | | | Field { | | | name: "path", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | Field { | | | name: "status", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | ], | | | metadata: {}, | | | }, | | | partitions.len: 1, | | | }, | | | }, | | | schema: Schema { | | | fields: [ | | | Field { | | | name: "status", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | Field { | | | name: "COUNT(UInt8(1))", | | | data_type: UInt64, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | ], | | | metadata: {}, | | | }, | | | } | +-----------------------------------------+----------------------------------------------------------------------+ 4 row in set. Query took 0 seconds. ``` Closes apache#7959 from alamb/alamb/ARROW-9654-explain Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Andy Grove <andygrove73@gmail.com>
In order to help users and developers understand what DataFusion's planner is doing, this PR adds an `"EXPLAIN PLAN"` feature. All other database systems I have worked with have such a feature (e.g. see [MySql](https://dev.mysql.com/doc/refman/8.0/en/explain-output.html)). Example printout (the plans printed are simply the `std::fmt::Debug` representation of the plan structures:) ``` > explain SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status; +--------------+----------------------------------------------------------------------+ | plan_type | plan | +--------------+----------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=None | +--------------+----------------------------------------------------------------------+ 1 rows in set. Query took 0 seconds. ``` and ``` > explain VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status; +-----------------------------------------+----------------------------------------------------------------------+ | plan_type | plan | +-----------------------------------------+----------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=None | | logical_plan after projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=Some([6, 8]) | | logical_plan after type_coercion | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=Some([6, 8]) | | physical_plan | HashAggregateExec { | | | group_expr: [ | | | Column { | | | name: "status", | | | }, | | | ], | | | aggr_expr: [ | | | Count { | | | expr: Literal { | | | value: UInt8( | | | 1, | | | ), | | | }, | | | }, | | | ], | | | input: SelectionExec { | | | expr: BinaryExpr { | | | left: BinaryExpr { | | | left: Column { | | | name: "path", | | | }, | | | op: Eq, | | | right: Literal { | | | value: Utf8( | | | "/api/v2/write", | | | ), | | | }, | | | }, | | | op: And, | | | right: BinaryExpr { | | | left: Column { | | | name: "path", | | | }, | | | op: Eq, | | | right: Literal { | | | value: Utf8( | | | "foo", | | | ), | | | }, | | | }, | | | }, | | | input: DataSourceExec { | | | schema: Schema { | | | fields: [ | | | Field { | | | name: "path", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | Field { | | | name: "status", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | ], | | | metadata: {}, | | | }, | | | partitions.len: 1, | | | }, | | | }, | | | schema: Schema { | | | fields: [ | | | Field { | | | name: "status", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | Field { | | | name: "COUNT(UInt8(1))", | | | data_type: UInt64, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | ], | | | metadata: {}, | | | }, | | | } | +-----------------------------------------+----------------------------------------------------------------------+ 4 row in set. Query took 0 seconds. ``` Closes apache#7959 from alamb/alamb/ARROW-9654-explain Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Andy Grove <andygrove73@gmail.com>
In order to help users and developers understand what DataFusion's planner is doing, this PR adds an
"EXPLAIN PLAN"
feature. All other database systems I have worked with have such a feature (e.g. see MySql).Example printout (the plans printed are simply the
std::fmt::Debug
representation of the plan structures:)and