[SPARK-27482][SQL][WEBUI] Show estimated BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page #24666
```diff
@@ -27,4 +27,5 @@ import org.apache.spark.annotation.DeveloperApi
 class SQLMetricInfo(
     val name: String,
     val accumulatorId: Long,
-    val metricType: String)
+    val metricType: String,
+    val stats: Long = -1)
```
Contributor: not all metrics have corresponding statistics (e.g.

Contributor: or we can put a

Author: @cloud-fan Thanks for your comments. The idea is that each SQL metric can have a statistic value (-1 means not available/not initialized). I set the statistic type to Long because a SQL metric's value is always a Long as well. Putting Option[Statistics] in SQLMetricInfo doesn't sound quite right, though: it would mean that every SQL metric carries an attribute including rowCount, size, and column stats. Let me know your feedback; thanks in advance.
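The "-1 means unavailable" convention discussed above can be illustrated with a small standalone sketch. This is a simplified, hypothetical class for illustration only, not the actual Spark code:

```scala
// Simplified sketch of the sentinel convention: `stats` defaults to -1L,
// meaning "no estimated statistic available" for this metric.
class MetricInfoSketch(
    val name: String,
    val metricType: String,
    val stats: Long = -1L) {

  // A metric only reports an estimate when stats is non-negative.
  def hasEstimate: Boolean = stats >= 0
}

val withEstimate = new MetricInfoSketch("number of output rows", "sum", 1000L)
val withoutEstimate = new MetricInfoSketch("peak memory", "size")
// withEstimate.hasEstimate is true; withoutEstimate.hasEstimate is false
```

Using a Long sentinel rather than Option[Long] keeps the field shape aligned with the metric's own value type, at the cost of reserving -1 as a special value.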
```diff
@@ -33,7 +33,8 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, Utils}
  * the executor side are automatically propagated and shown in the SQL UI through metrics. Updates
  * on the driver side must be explicitly posted using [[SQLMetrics.postDriverMetricUpdates()]].
  */
-class SQLMetric(val metricType: String, initValue: Long = 0L) extends AccumulatorV2[Long, Long] {
+class SQLMetric(val metricType: String, initValue: Long = 0L, val stats: Long = -1L) extends
+  AccumulatorV2[Long, Long] {
   // This is a workaround for SPARK-11013.
   // We may use -1 as initial value of the accumulator, if the accumulator is valid, we will
   // update it at the end of task and the value will be at least 0. Then we can filter out the -1
@@ -42,7 +43,7 @@ class SQLMetric(val metricType: String, initValue: Long = 0L) extends Accumulato
   private var _zeroValue = initValue

   override def copy(): SQLMetric = {
-    val newAcc = new SQLMetric(metricType, _value)
+    val newAcc = new SQLMetric(metricType, _value, stats)
     newAcc._zeroValue = initValue
     newAcc
   }
@@ -96,8 +97,8 @@ object SQLMetrics {
     metric.set((v * baseForAvgMetric).toLong)
   }

-  def createMetric(sc: SparkContext, name: String): SQLMetric = {
-    val acc = new SQLMetric(SUM_METRIC)
+  def createMetric(sc: SparkContext, name: String, stats: Long = -1): SQLMetric = {
+    val acc = new SQLMetric(SUM_METRIC, stats = stats)
    acc.register(sc, name = Some(name), countFailedValues = false)
     acc
   }
```
```diff
@@ -193,6 +194,14 @@ object SQLMetrics {
     }
   }

+  def stringStats(value: Long): String = {
+    if (value < 0) {
+      ""
+    } else {
+      s" est: ${stringValue(SUM_METRIC, Seq(value))}"
+    }
+  }
+
   /**
    * Updates metrics based on the driver side value. This is useful for certain metrics that
    * are only updated on the driver, e.g. subquery execution time, or number of files.
```

Contributor: we should handle stats in

Contributor: one example: we can also display the difference between real row count and estimated row count, e.g.
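In isolation, the formatting behavior of the new stringStats helper can be sketched as follows. This is a standalone approximation: a plain number rendering stands in for Spark's stringValue, which additionally abbreviates large numbers:

```scala
// Sketch of stringStats: a negative estimate (the -1 "unavailable"
// sentinel) renders as the empty string, so nothing extra appears on
// the UI; otherwise the value is shown with an " est: " prefix that is
// appended to the metric's displayed value.
def stringStatsSketch(value: Long): String =
  if (value < 0) "" else s" est: $value"

// stringStatsSketch(-1L) returns ""
// stringStatsSketch(1000L) returns " est: 1000"
```

Returning an empty string for unavailable estimates means call sites can unconditionally concatenate the result without checking the sentinel themselves.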
Contributor: IIRC, for file sources there are usually only sizeInBytes stats at the logical-plan level, so the estimated numOutputRows for the logical plan should be empty for file sources. What is the scenario of this PR?

Contributor: For a file-source table, there will be row-count stats if CBO is enabled.
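The CBO scenario mentioned above can be sketched like this, assuming an existing SparkSession named spark and an already-created file-source table t (both names are assumptions for illustration):

```scala
// Sketch only: enable the cost-based optimizer and collect table-level
// statistics so that the logical plan carries a row-count estimate.
spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")
// After ANALYZE TABLE, the table's row count is available to the
// optimizer and can feed the estimated numOutputRows shown on the UI.
```

Without the ANALYZE step, a file-source table typically exposes only sizeInBytes, which matches the concern raised in the comment above.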