feat: add column statistics into explain #8112

Merged 3 commits on Nov 12, 2023

Contributor

@NGA-TRAN NGA-TRAN commented Nov 9, 2023

Which issue does this PR close?

Closes #8110

Rationale for this change

Show column statistics in the EXPLAIN output.

What changes are included in this PR?

Explain before this PR

explain select * from t1 where time <= to_timestamp(350);
+---------------+---------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                              |
+---------------+---------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)                                        |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent]               |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent]                      |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(10), Bytes=Exact(2960)] |
|               |                                                                                                   |
+---------------+---------------------------------------------------------------------------------------------------+

Same explain with changes in this PR

explain select * from t1 where time <= to_timestamp(350);
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: t1.time <= TimestampNanosecond(350000000000, None)                                                                                                                                                                                                                                                                                                                                                                                              |
|               |   TableScan: t1 projection=[state, city, min_temp, area, time]                                                                                                                                                                                                                                                                                                                                                                                          |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[3]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[4]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                        |
|               |   FilterExec: time@4 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[3]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[4]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                               |
|               |     MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(1), Bytes=Exact(2896), [(Column[0]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[3]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[4]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent)]] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Select fewer columns

explain select state, min_temp from t1 where time <= to_timestamp(350);
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                              |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.state, t1.min_temp                                                                                                                                                                                                                                                                                 |
|               |   Filter: t1.time <= TimestampNanosecond(350000000000, None)                                                                                                                                                                                                                                                      |
|               |     TableScan: t1 projection=[state, min_temp, time]                                                                                                                                                                                                                                                              |
| physical_plan | ProjectionExec: expr=[state@0 as state, min_temp@1 as min_temp], statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                                                                    |
|               |   CoalesceBatchesExec: target_batch_size=8192, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                    |
|               |     FilterExec: time@2 <= 350000000000, statistics=[Rows=Absent, Bytes=Absent, [(Column[0]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Absent, Distinct=Absent)]]                           |
|               |       MemoryExec: partitions=1, partition_sizes=[1], statistics=[Rows=Exact(1), Bytes=Exact(2896), [(Column[0]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[1]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent),(Column[2]: Min=Absent, Max=Absent, Null=Exact(0), Distinct=Absent)]] |
|               |                                                                                                                                                                                                                                                                                                                   |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 9, 2023
@NGA-TRAN
Contributor Author

NGA-TRAN commented Nov 9, 2023

@alamb and @berkaysynnada: This is the PR for #8110.

@berkaysynnada
Contributor

What do you think of skipping Absent statistics so that the plan lines don't become so long?

@NGA-TRAN
Contributor Author

> What do you think of skipping Absent statistics so that the plan lines don't become so long?

That is a good idea. Let me try it.

@NGA-TRAN
Contributor Author

I have addressed @berkaysynnada's comment so that only non-absent stats are shown. However, I still show Col[i] even when that column has no stats, so that we can see which columns in the query have their stats checked.

Also, it seems we do not compute column stats right now, so they are all empty. I know @alamb is working on getting statistics for them, because we need column min and max in IOx.
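
For readers following along, here is a rough sketch of the display rule described above: skip any statistic that is absent, but still print the Col[i] label for every column. It is not the code from this PR. The `Precision` and `ColumnStats` types, the `format_column` and `format_columns` helpers, and the exact separators in the output are simplified, hypothetical stand-ins chosen for illustration.

```rust
use std::fmt;

// Hypothetical, simplified stand-in for a statistic that may be unknown.
#[derive(Debug, Clone, PartialEq)]
enum Precision {
    Exact(i64),
    Inexact(i64),
    Absent,
}

impl fmt::Display for Precision {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Precision::Exact(v) => write!(f, "Exact({v})"),
            Precision::Inexact(v) => write!(f, "Inexact({v})"),
            Precision::Absent => write!(f, "Absent"),
        }
    }
}

// Hypothetical per-column statistics record.
struct ColumnStats {
    min: Precision,
    max: Precision,
    null_count: Precision,
    distinct_count: Precision,
}

// Format one column: absent fields are skipped, the Col[i] label is always kept.
fn format_column(index: usize, stats: &ColumnStats) -> String {
    let fields = [
        ("Min", &stats.min),
        ("Max", &stats.max),
        ("Null", &stats.null_count),
        ("Distinct", &stats.distinct_count),
    ];
    let shown: Vec<String> = fields
        .iter()
        .filter(|(_, value)| **value != Precision::Absent)
        .map(|(name, value)| format!("{name}={value}"))
        .collect();
    if shown.is_empty() {
        // No known statistics: keep the label so the column is still visible.
        format!("(Col[{index}]:)")
    } else {
        format!("(Col[{index}]: {})", shown.join(", "))
    }
}

// Format all columns of one plan node as a bracketed, comma-separated list.
fn format_columns(columns: &[ColumnStats]) -> String {
    let parts: Vec<String> = columns
        .iter()
        .enumerate()
        .map(|(i, c)| format_column(i, c))
        .collect();
    format!("[{}]", parts.join(","))
}
```

With this sketch, a column whose only known statistic is a null count of zero renders as `(Col[0]: Null=Exact(0))`, and a column with nothing known renders as `(Col[1]:)`, which keeps the plan lines short while still listing every column.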

Contributor

@alamb alamb left a comment

looks good to me -- thank you @NGA-TRAN

cc @berkaysynnada

@alamb alamb merged commit 9e012a6 into apache:main Nov 12, 2023
22 checks passed
@andygrove andygrove added the performance Make DataFusion faster label Nov 12, 2023
Linked issue: Show statistics min and max in explain