ARROW-9654: [Rust][DataFusion] Add `EXPLAIN <SQL>` statement #7959

alamb · 2020-08-13T18:33:58Z

In order to help users and developers understand what DataFusion's planner is doing, this PR adds an "EXPLAIN PLAN" feature. All other database systems I have worked with have such a feature (e.g. see MySql).

Example printout (the plans printed are simply the std::fmt::Debug representation of the plan structures:)

> explain SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;
+--------------+----------------------------------------------------------------------+
| plan_type    | plan                                                                 |
+--------------+----------------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]             |
|              |   Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") |
|              |     TableScan: http_api_requests_total projection=None               |
+--------------+----------------------------------------------------------------------+
1 rows in set. Query took 0 seconds.

and

> explain VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;

+-----------------------------------------+----------------------------------------------------------------------+
| plan_type                               | plan                                                                 |
+-----------------------------------------+----------------------------------------------------------------------+
| logical_plan                            | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]             |
|                                         |   Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") |
|                                         |     TableScan: http_api_requests_total projection=None               |
| logical_plan after projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]             |
|                                         |   Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") |
|                                         |     TableScan: http_api_requests_total projection=Some([6, 8])       |
| logical_plan after type_coercion        | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]             |
|                                         |   Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") |
|                                         |     TableScan: http_api_requests_total projection=Some([6, 8])       |
| physical_plan                           | HashAggregateExec {                                                  |
|                                         |     group_expr: [                                                    |
|                                         |         Column {                                                     |
|                                         |             name: "status",                                          |
|                                         |         },                                                           |
|                                         |     ],                                                               |
|                                         |     aggr_expr: [                                                     |
|                                         |         Count {                                                      |
|                                         |             expr: Literal {                                          |
|                                         |                 value: UInt8(                                        |
|                                         |                     1,                                               |
|                                         |                 ),                                                   |
|                                         |             },                                                       |
|                                         |         },                                                           |
|                                         |     ],                                                               |
|                                         |     input: SelectionExec {                                           |
|                                         |         expr: BinaryExpr {                                           |
|                                         |             left: BinaryExpr {                                       |
|                                         |                 left: Column {                                       |
|                                         |                     name: "path",                                    |
|                                         |                 },                                                   |
|                                         |                 op: Eq,                                              |
|                                         |                 right: Literal {                                     |
|                                         |                     value: Utf8(                                     |
|                                         |                         "/api/v2/write",                             |
|                                         |                     ),                                               |
|                                         |                 },                                                   |
|                                         |             },                                                       |
|                                         |             op: And,                                                 |
|                                         |             right: BinaryExpr {                                      |
|                                         |                 left: Column {                                       |
|                                         |                     name: "path",                                    |
|                                         |                 },                                                   |
|                                         |                 op: Eq,                                              |
|                                         |                 right: Literal {                                     |
|                                         |                     value: Utf8(                                     |
|                                         |                         "foo",                                       |
|                                         |                     ),                                               |
|                                         |                 },                                                   |
|                                         |             },                                                       |
|                                         |         },                                                           |
|                                         |         input: DataSourceExec {                                      |
|                                         |             schema: Schema {                                         |
|                                         |                 fields: [                                            |
|                                         |                     Field {                                          |
|                                         |                         name: "path",                                |
|                                         |                         data_type: Utf8,                             |
|                                         |                         nullable: true,                              |
|                                         |                         dict_id: 0,                                  |
|                                         |                         dict_is_ordered: false,                      |
|                                         |                     },                                               |
|                                         |                     Field {                                          |
|                                         |                         name: "status",                              |
|                                         |                         data_type: Utf8,                             |
|                                         |                         nullable: true,                              |
|                                         |                         dict_id: 0,                                  |
|                                         |                         dict_is_ordered: false,                      |
|                                         |                     },                                               |
|                                         |                 ],                                                   |
|                                         |                 metadata: {},                                        |
|                                         |             },                                                       |
|                                         |             partitions.len: 1,                                       |
|                                         |         },                                                           |
|                                         |     },                                                               |
|                                         |     schema: Schema {                                                 |
|                                         |         fields: [                                                    |
|                                         |             Field {                                                  |
|                                         |                 name: "status",                                      |
|                                         |                 data_type: Utf8,                                     |
|                                         |                 nullable: true,                                      |
|                                         |                 dict_id: 0,                                          |
|                                         |                 dict_is_ordered: false,                              |
|                                         |             },                                                       |
|                                         |             Field {                                                  |
|                                         |                 name: "COUNT(UInt8(1))",                             |
|                                         |                 data_type: UInt64,                                   |
|                                         |                 nullable: true,                                      |
|                                         |                 dict_id: 0,                                          |
|                                         |                 dict_is_ordered: false,                              |
|                                         |             },                                                       |
|                                         |         ],                                                           |
|                                         |         metadata: {},                                                |
|                                         |     },                                                               |
|                                         | }                                                                    |
+-----------------------------------------+----------------------------------------------------------------------+


4 row in set. Query took 0 seconds.

alamb · 2020-08-13T18:36:44Z

rust/datafusion/src/logicalplan.rs

@@ -784,6 +784,18 @@ pub enum LogicalPlan {
        /// Whether the CSV file contains a header
        has_header: bool,
    },
+    /// Produces a relation with string representations of
+    /// various parts of the plan
+    Explain {


In a subsequent PR, I would like to suggest we consider making this Explain(ExplainNode) or something, and having a separate struct ExplainNode with these fields. There is a bunch of boiler plate code for matching and reconstructing LogicalPlanner nodes that I think would be less awkward (and require fewer changes to add / modify fields) if they were collected into a struct

alamb · 2020-08-13T18:37:50Z

rust/datafusion/src/optimizer/projection_push_down.rs

@@ -45,7 +50,7 @@ impl ProjectionPushDown {
    }

    fn optimize_plan(
-        &self,
+        &mut self,


I needed to make this mut so that optimize_explain can call OptimizerRule::optimize which requires a mut reference

IMO that is fine, these optimizers are just implementation structs anyway, so changing state has no impact.

…or same

alamb · 2020-08-13T18:45:24Z

I am sure this PR conflicts with #7880... FYI @jorgecarleitao

github-actions · 2020-08-13T18:46:51Z

https://issues.apache.org/jira/browse/ARROW-9654

andygrove · 2020-08-14T00:10:14Z

This makes me very happy! I took a quick skim through and LGTM but will review more tomorrow.

jorgecarleitao

I reviewed this PR.

Looks really good and it is definitely useful! Thanks for it, @alamb !

I left minor comments, and I have 4 questions:

Physical Plan?

We are declaring EXPLAIN as a LogicalPlan and then convert it to a Physical. I always understood EXPLAIN as a different type of statement, like the CreateExternalTable, the main reason being that it does not depend on data, just on the plan it explains, options, and the optimizers to considered.

Looking at CreateExternalTable, we do not construct a physical plan out of it. Wouldn't it be simpler if we do not create a physical plan, like we do for CreateExternalTable, since we do not need to multi-thread it nor run through partitions nor batches?

Plans as {:#?}

If I read correctly, we currently return the result of the explain as the debug ({:#?}) of the plans. Shouldn't we reserve the debug format for our internal representation, and introduce a new format to EXPLAIN (not necessarily in this PR)?

Other formats?

One thing I dislike about spark is that EXPLAIN is in a not-so-nice format to parse programmatically. One idea is to allow explain to return the result in json. This is e.g. useful to construct a visual graph representation of the plan. One idea would be to support another string parameter, format, and make it default to string (not necessarily in this PR).

Support on `table` API?

We should consider adding .explain on the table.rs.

jorgecarleitao · 2020-08-14T03:54:57Z

rust/datafusion/src/logicalplan.rs

+#[derive(Debug, Clone, PartialEq)]
+pub struct StringifiedPlan {
+    /// An identifier of what type of plan this string represents
+    pub plan_type: Arc<String>, // TODO make this an enum?


I would say so, this is a finite and well defined number of cases.

jorgecarleitao · 2020-08-14T03:56:03Z

rust/datafusion/src/logicalplan.rs

+
+    /// returns true if this plan should be displayed. Generally
+    /// `verbose_mode = true` will display all available plans
+    pub fn should_display(&self, verbose_mode: bool) -> bool {


I would try to be systematic with the name: if LogicalPlan::Explain::verbose, then verbose_mode here or vice-versa.

I think an enum might be cleaner too -- let me give it a shot

@jorgecarleitao -- check this out: 7e9e42c

jorgecarleitao · 2020-08-14T03:57:49Z

rust/datafusion/src/optimizer/projection_push_down.rs

@@ -45,7 +50,7 @@ impl ProjectionPushDown {
    }

    fn optimize_plan(
-        &self,
+        &mut self,


IMO that is fine, these optimizers are just implementation structs anyway, so changing state has no impact.

jorgecarleitao · 2020-08-14T04:00:35Z

rust/datafusion/tests/sql.rs

+    let expected = "\"logical_plan\"\t\"Projection: #c1\\n  Selection: #c2 Gt Int64(10)\\n    TableScan: aggregate_test_100 projection=None\"".to_string();
+    assert_eq!(expected, actual);
+
+    // Also, expect same result with case explain


with lowercase explain?

Yes, good catch. Will fix

fixed in 639db5f

jorgecarleitao · 2020-08-14T04:00:52Z

rust/datafusion/tests/sql.rs

+    register_aggregate_csv_by_sql(&mut ctx);
+    let sql = "EXPLAIN VERBOSE SELECT c1 FROM aggregate_test_100 where c2 > 10";
+    let actual = execute(&mut ctx, sql).join("\n");
+    // Don't actually test the contents of the debuging output (as


alamb · 2020-08-14T18:35:00Z

Thanks @jorgecarleitao

Physical Plan?

Looking at CreateExternalTable, we do not construct a physical plan out of it. Wouldn't it be simpler if we do not create a physical plan, like we do for CreateExternalTable, since we do not need to multi-thread it nor run through partitions nor batches?

Yes, you are right that we could special case explain not to actually run a physical plan. I was thinking there are some benefits to using a physical plan, as follows:

Immediate support for all other input/output channels (not just the CLI) -- so for example if there is ever a client library for DataFusion, it will not need any additional logic to handle EXPLAIN
Use / store the plans themselves as data ( imagine INSERT INTO foo EXPLAIN select * from bar)/

I admit these are both forward looking features with likely no practical import in the near term. However, I felt making Explain a first class part of both logical and physical plans would reduce special cases and make the code "cleaner" (though this is just my opinion)

Plans as {:#?}

If I read correctly, we currently return the result of the explain as the debug ({:#?}) of the plans. Shouldn't we reserve the debug format for our internal representation, and introduce a new format to EXPLAIN (not necessarily in this PR)?

I am torn on this and have no strong opinion. I could go either way -- I think that most of the information that is typically in explain plans is useful for developers, and thus there would always be substantial overlap between a Debug format and an explain format. But maybe the differences between them could justify the duplication? I am not sure.

Other formats?

One thing I dislike about spark is that EXPLAIN is in a not-so-nice format to parse programmatically. One idea is to allow explain to return the result in json. This is e.g. useful to construct a visual graph representation of the plan. One idea would be to support another string parameter, format, and make it default to string (not necessarily in this PR).

I like this idea. The typical thing I have used in the past for such purposes is https://graphviz.org/ and dot -- I have filed https://issues.apache.org/jira/browse/ARROW-9746 to track the improvement.

Support on table API?

We should consider adding .explain on the table.rs.

Good idea -- I defer to @andygrove on how he sees the DataFusion CLI evolving -- aka if he wants to mirror the SQLite style .table or .quit type commands or the current quit command or something else entirely.

jorgecarleitao · 2020-08-14T19:18:05Z

I agree with your answers, the changes, and ready to merge for me.

It is being great fun working on this string of PRs from each of us, @alamb and @andygrove ! 👍

alamb · 2020-08-14T20:19:28Z

It is being great fun working on this string of PRs from each of us, @alamb and @andygrove ! 👍

I agree @jorgecarleitao

andygrove

Thanks @alamb and @jorgecarleitao this looks great. I will have some things to suggest as follow-ups most likely but nothing that should block this from being merged.

In order to help users and developers understand what DataFusion's planner is doing, this PR adds an `"EXPLAIN PLAN"` feature. All other database systems I have worked with have such a feature (e.g. see [MySql](https://dev.mysql.com/doc/refman/8.0/en/explain-output.html)). Example printout (the plans printed are simply the `std::fmt::Debug` representation of the plan structures:) ``` > explain SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status; +--------------+----------------------------------------------------------------------+ | plan_type | plan | +--------------+----------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=None | +--------------+----------------------------------------------------------------------+ 1 rows in set. Query took 0 seconds. ``` and ``` > explain VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status; +-----------------------------------------+----------------------------------------------------------------------+ | plan_type | plan | +-----------------------------------------+----------------------------------------------------------------------+ | logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=None | | logical_plan after projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=Some([6, 8]) | | logical_plan after type_coercion | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] | | | Selection: #path Eq Utf8("/api/v2/write") And #path Eq Utf8("foo") | | | TableScan: http_api_requests_total projection=Some([6, 8]) | | physical_plan | HashAggregateExec { | | | group_expr: [ | | | Column { | | | name: "status", | | | }, | | | ], | | | aggr_expr: [ | | | Count { | | | expr: Literal { | | | value: UInt8( | | | 1, | | | ), | | | }, | | | }, | | | ], | | | input: SelectionExec { | | | expr: BinaryExpr { | | | left: BinaryExpr { | | | left: Column { | | | name: "path", | | | }, | | | op: Eq, | | | right: Literal { | | | value: Utf8( | | | "/api/v2/write", | | | ), | | | }, | | | }, | | | op: And, | | | right: BinaryExpr { | | | left: Column { | | | name: "path", | | | }, | | | op: Eq, | | | right: Literal { | | | value: Utf8( | | | "foo", | | | ), | | | }, | | | }, | | | }, | | | input: DataSourceExec { | | | schema: Schema { | | | fields: [ | | | Field { | | | name: "path", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | Field { | | | name: "status", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | ], | | | metadata: {}, | | | }, | | | partitions.len: 1, | | | }, | | | }, | | | schema: Schema { | | | fields: [ | | | Field { | | | name: "status", | | | data_type: Utf8, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | Field { | | | name: "COUNT(UInt8(1))", | | | data_type: UInt64, | | | nullable: true, | | | dict_id: 0, | | | dict_is_ordered: false, | | | }, | | | ], | | | metadata: {}, | | | }, | | | } | +-----------------------------------------+----------------------------------------------------------------------+ 4 row in set. Query took 0 seconds. ``` Closes apache#7959 from alamb/alamb/ARROW-9654-explain Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Andy Grove <andygrove73@gmail.com>

alamb commented Aug 13, 2020

View reviewed changes

ARROW-9654: [Rust][DataFusion] Add EXPLAIN <SQL> statement, tests f…

037e3f6

…or same

alamb force-pushed the alamb/ARROW-9654-explain branch from 3e077fc to 037e3f6 Compare August 13, 2020 18:43

andygrove added Component: Rust Component: Rust - DataFusion labels Aug 14, 2020

jorgecarleitao reviewed Aug 14, 2020

View reviewed changes

alamb added 3 commits August 14, 2020 14:58

Correct comment

639db5f

Use PlanType enum instead of hard coded string

7e9e42c

Update tests

0158e40

andygrove approved these changes Aug 14, 2020

View reviewed changes

andygrove closed this in 2dcc9a1 Aug 14, 2020

alamb deleted the alamb/ARROW-9654-explain branch October 26, 2020 20:05

This was referenced Apr 26, 2021

[Rust][DataFusion] Add FORMAT to explain plan and an easy to visualize format apache/arrow-rs#94

Closed

Add FORMAT to explain plan and an easy to visualize format apache/datafusion#96

Closed

This was referenced Aug 14, 2020

[Rust][DataFusion] Add an EXPLAIN command to the datafusion CLI #25713

Closed

[Rust][DataFusion] Add FORMAT to explain plan and an easy to visualize format #25798

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-9654: [Rust][DataFusion] Add `EXPLAIN <SQL>` statement #7959

ARROW-9654: [Rust][DataFusion] Add `EXPLAIN <SQL>` statement #7959

alamb commented Aug 13, 2020 •

edited

Loading

alamb Aug 13, 2020

alamb Aug 13, 2020

jorgecarleitao Aug 14, 2020

alamb commented Aug 13, 2020

github-actions bot commented Aug 13, 2020

andygrove commented Aug 14, 2020

jorgecarleitao left a comment

jorgecarleitao Aug 14, 2020

jorgecarleitao Aug 14, 2020

alamb Aug 14, 2020 •

edited

Loading

alamb Aug 14, 2020

jorgecarleitao Aug 14, 2020

jorgecarleitao Aug 14, 2020

alamb Aug 14, 2020

alamb Aug 14, 2020

jorgecarleitao Aug 14, 2020

alamb commented Aug 14, 2020

Physical Plan?

Plans as {:#?}

Other formats?

Support on `table` API?

jorgecarleitao commented Aug 14, 2020

alamb commented Aug 14, 2020

andygrove left a comment

ARROW-9654: [Rust][DataFusion] Add EXPLAIN <SQL> statement #7959

ARROW-9654: [Rust][DataFusion] Add EXPLAIN <SQL> statement #7959

Conversation

alamb commented Aug 13, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Aug 13, 2020

github-actions bot commented Aug 13, 2020

andygrove commented Aug 14, 2020

jorgecarleitao left a comment

Choose a reason for hiding this comment

Physical Plan?

Plans as {:#?}

Other formats?

Support on table API?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb Aug 14, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Aug 14, 2020

Physical Plan?

Plans as {:#?}

Other formats?

Support on table API?

jorgecarleitao commented Aug 14, 2020

alamb commented Aug 14, 2020

andygrove left a comment

Choose a reason for hiding this comment

ARROW-9654: [Rust][DataFusion] Add `EXPLAIN <SQL>` statement #7959

ARROW-9654: [Rust][DataFusion] Add `EXPLAIN <SQL>` statement #7959

alamb commented Aug 13, 2020 •

edited

Loading

Support on `table` API?

alamb Aug 14, 2020 •

edited

Loading

Support on `table` API?