Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec, rename output_time -> elapsed_compute #909

Merged
merged 1 commit into from
Aug 26, 2021

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Aug 20, 2021

Which issue does this PR close?

First part of #866

(see #938 and https://github.com/influxdata/influxdb_iox/pull/2387 for examples of how using this API works)

Rationale for this change

I would like to be able get an overall understanding of where time is being spent during query execution via EXPLAIN ANALYZE (see #858) so that I know where to focuse additional performance optimization activities

Additionally, I would like to be able to graph a stacked flamechart such as the following see more details on https://github.com/influxdata/influxdb_iox/issues/2273) that shows when the different operators ran in relation to each other.

Screen Shot 2021-08-12 at 11 14 33 AM

What changes are included in this PR?

Begin adding the following data for each operator, as it makes sense

  1. output_rows: total rows produced at the output of the operator
  2. compute_time: the total time spent (not including any time spent in the input stream or waiting to be scheduled)
  3. start_time: the wall clock time at which execute was run
  4. stop_time: the wall clock time at which the last output record batch was produced

Changes

  1. Renames cpu_time to elapsed_compute and documents clearly what it is and is not (as @mkmik pointed out that traditionally CPU time refer to a specific type of clock that is not what is implemented here)
  2. Adds a BaselineMetrics structure that has common metrics to assist in annotating
  3. Adds a Timestamp metric type and StartTimestamp and EndTimestamp values and associated aggregating code
  4. Print out metrics in deterministic order

Are there any user-facing changes?

Better ANALYZE output.

Using this setup

echo "1,A" > /tmp/foo.csv
echo "1,B" >> /tmp/foo.csv
echo "2,A" >> /tmp/foo.csv

Run CLI

cargo run --bin datafusion-cli

And then run this SQL:

CREATE EXTERNAL TABLE foo(x INT, b VARCHAR) STORED AS CSV LOCATION '/tmp/foo.csv';

SELECT SUM(x) FROM foo GROUP BY b;

EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;

Before this PR

> EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                      |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                                        |
|                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                               |
|                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                       |
|                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                             |
|                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[sendTime=968487, repartitionTime=5686072, fetchTime=110114033] |
|                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                          |
|                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[sendTime=12090, fetchTime=5106669, repartitionTime=0]                             |
|                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                            |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+

After this PR

(note the presence of output_rows, elapsed_compute, start_timestamp, and end_timestamp)

> EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                      |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[output_rows=2, elapsed_compute=NOT RECORDED]                                                                                                                                             |
|                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                                                                                               |
|                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[output_rows=2]                                                                                                                      |
|                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                                                                                             |
|                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[repart_time{inputPartition=0}=6.295923ms, send_time{inputPartition=0}=NOT RECORDED, fetch_time{inputPartition=0}=127.894949ms] |
|                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[output_rows=2]                                                                                                                         |
|                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[send_time{inputPartition=0}=10.394µs, fetch_time{inputPartition=0}=6.053131ms, repart_time{inputPartition=0}=NOT RECORDED]                        |
|                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                                                                                            |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

If you use VERBOSE you can also see the start / end timestamps among many other things

> EXPLAIN ANALYZE VERBOSE SELECT SUM(x) FROM foo GROUP BY b;
...
| Plan with Full Metrics | CoalescePartitionsExec, metrics=[start_timestamp{partition=0}=2021-08-24 17:14:08.746387 UTC, end_timestamp{partition=0}=2021-08-24 17:14:08.756462 UTC, elapsed_compute{partition=0}=NOT RECORDED, output_rows{partition=0}=2] 

Follow on work:

  1. Annotate additional operators
  2. Add additional wrapper as suggested by @houqp in Add "baseline" metrics to all built in operators #866 (comment) to help avoid special code for each operator. The BaselineMetrics is a start in this direction but there is still more to go

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate labels Aug 20, 2021
@alamb alamb force-pushed the alamb/baseline_metrics branch from f4eaac9 to 248d113 Compare August 20, 2021 17:33
@alamb alamb changed the title Add BaselineMetrics, Timestamp metrics, add for CoalscePartitionExec Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec Aug 20, 2021
@alamb alamb force-pushed the alamb/baseline_metrics branch 2 times, most recently from 8bce308 to f6ad883 Compare August 24, 2021 16:08
@@ -43,12 +45,17 @@ use pin_project_lite::pin_project;
pub struct CoalescePartitionsExec {
/// Input execution plan
input: Arc<dyn ExecutionPlan>,
/// Execution metrics
Copy link
Contributor Author

@alamb alamb Aug 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern that instrumented operators now follow:

  1. They have a new metrics: ExecutionPlanMetricsSet field
  2. During execute() they create new BaselineMetrics
  3. They add elapsed_compute timer during their CPU intensive work

@alamb alamb force-pushed the alamb/baseline_metrics branch from b6436ac to 3c7b845 Compare August 24, 2021 17:09
@alamb alamb force-pushed the alamb/baseline_metrics branch from 3c7b845 to e5c5c07 Compare August 24, 2021 18:32
@alamb alamb marked this pull request as ready for review August 24, 2021 20:23
@alamb alamb force-pushed the alamb/baseline_metrics branch from e5c5c07 to 2874261 Compare August 25, 2021 19:02
@alamb alamb changed the title Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec, rename output_time -> elapsed_compute Aug 25, 2021
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can split this PR into smaller parts for easier review (introduce Timestamp rename to ElapsedCompute and add BaselineMetrics but I have them all together in a single PR to show how they work together to record the basic information.


use super::{Count, ExecutionPlanMetricsSet, MetricBuilder, Time, Timestamp};

/// Helper for creating and tracking common "baseline" metrics for
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core new structure / wrapper as suggested by @houqp -- it doesn't quite wrap everything yet but I think it makes annotating basic information pretty simple

@alamb
Copy link
Contributor Author

alamb commented Aug 25, 2021

I have a few PRs backed up on this one (e.g. #938) so if someone has time to review this PR in the near term I would be most appreciative 🙏

@alamb
Copy link
Contributor Author

alamb commented Aug 25, 2021

cc @returnString

@alamb alamb requested review from andygrove, Dandandan and houqp August 25, 2021 19:51
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skimmed over this but LGTM

@alamb
Copy link
Contributor Author

alamb commented Aug 25, 2021

I'll give it until tomorrow in case anyone else wants to chime in or say they are interested in reviewing it . Thanks @andygrove

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a great start 👍

@alamb
Copy link
Contributor Author

alamb commented Aug 26, 2021

Thanks all for the input

@alamb alamb merged commit d31c157 into apache:master Aug 26, 2021
@alamb alamb deleted the alamb/baseline_metrics branch August 26, 2021 12:23
@alamb alamb mentioned this pull request Aug 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants