Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show Statistics in explain Verbose without enabling show statistics #8111

Closed
NGA-TRAN opened this issue Nov 9, 2023 · 5 comments
Closed
Assignees
Labels
enhancement New feature or request

Comments

@NGA-TRAN
Copy link
Contributor

NGA-TRAN commented Nov 9, 2023

Is your feature request related to a problem or challenge?

Currently, we can only see statistics in the explain if the flag datafusion.explain.show_statistics is on. In the case of InfluxDB IOx, we disable flag settings to avoids turning on other unexpected features such as DDL & DML. However, we still want to see statistics in explain.

Here is the current behavior

Create a table

create table t1(state string, city string, min_temp float, area int, time timestamp) as values 
    ('MA', 'Boston', 70.4, 1, 50);

Run explain with default settings. No statistics in the plan

explain select * from t1 where time <= to_timestamp(350);
+---------------+----------------------------------------------------------------+
| plan_type     | plan                                                           |
+---------------+----------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)     |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                    |
|               |   FilterExec: time@4 <= 350000000000                           |
|               |     MemoryExec: partitions=1, partition_sizes=[1]              |
|               |                                                                |
+---------------+----------------------------------------------------------------+

Turn show_statistics on, and the explain includes statistics

set datafusion.explain.show_statistics = true;
0 rows in set. Query took 0.002 seconds.

❯ explain select * from t1 where time <= to_timestamp(350);
+---------------+--------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                             |
+---------------+--------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)                                       |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]              |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent]                     |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(1), Bytes=Exact(2896)] |
|               |                                                                                                  |
+---------------+--------------------------------------------------------------------------------------------------+

Describe the solution you'd like

Always showing statistics in explain verbose no matter the values of the settings

Describe alternatives you've considered

No response

Additional context

No response

@NGA-TRAN NGA-TRAN added the enhancement New feature or request label Nov 9, 2023
@NGA-TRAN
Copy link
Contributor Author

NGA-TRAN commented Nov 9, 2023

@alamb I will work on this

@alamb
Copy link
Contributor

alamb commented Nov 10, 2023

I would like to propose a slightly different interface, namely add two new lines -- the initial physical plan with statistics and the final physical plan with statistics

❯ explain verbose select * from t1 where time <= to_timestamp(350);
+---------------+--------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                             |
+---------------+--------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)                                       |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                   |
| initial_physical_plan_with_stats | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]              |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent]                     |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(1), Bytes=Exact(2896)] |
|               |                                                                                                  |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                    |
|               |   FilterExec: time@4 <= 350000000000                           |
|               |     MemoryExec: partitions=1, partition_sizes=[1]              |
|               |                                                                |
...
| final_physical_plan_with_stats | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]              |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent]                     |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(1), Bytes=Exact(2896)] |
|               |                                                                                                  |

+---------------+--------------------------------------------------------------------------------------------------+

(I am sorry for the bad formatting)

@NGA-TRAN
Copy link
Contributor Author

@alamb : can we go with 2 InitialPhysicalPlan and FinalPhysicalPlan? I would like to make sure stats won't get lost during optimization rules

@alamb
Copy link
Contributor

alamb commented Nov 10, 2023

@alamb : can we go with 2 InitialPhysicalPlan and FinalPhysicalPlan? I would like to make sure stats won't get lost during optimization rules

I think the usecase of tracing the flow of statistics through each stage in the plan is valuable, but fairly narrow and we shouldn't subject everyone using EXPLAIN VERBOSE to it.

What would you think about adding the two new lines to EXPLAIN VERBOSE for common usage and then adding something like EXPLAIN VERY VERBOSE that has statistics on every plan in a follow on PR?

Since DataFusion handles the parsing of EXPLAIN itself, it would be fairly quick to add handling for VERY VERBOSE 🤔

https://github.com/apache/arrow-datafusion/blob/7fde76e33dcc26b0816fc8513c396becd431c1ad/datafusion/sql/src/parser.rs#L50-L53

@NGA-TRAN
Copy link
Contributor Author

Yeah, I will add 2 line lines in this PR and implement VERY VERBOSE id=f we need it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants