[SPARK-24050][SS] Calculate input / processing rates correctly for DataSourceV2 streaming sources #21126
Conversation
Test build #89702 has finished for PR 21126 at commit
jenkins retest this please.
Test build #89707 has finished for PR 21126 at commit
What is the behavior if there's a self join? Will we double count the numInputRows?
// Check whether the streaming query's logical plan has only V2 data sources
val allStreamingLeaves =
  logicalPlan.collect { case s: StreamingExecutionRelation => s }
allStreamingLeaves.forall { _.source.isInstanceOf[MicroBatchReader] }
we don't have a way to track these for ContinuousProcessing at the moment?
Maybe I can make it work for continuous as well with a small tweak.
A point fix here won't be sufficient - right now the row count metrics don't make it to the driver at all in continuous processing.
Yeah. This code path is not used by continuous processing.
val streamInput2 = MemoryStream[Int]
val staticInputDF2 = staticInputDF.union(staticInputDF).cache()
nit: unpersist later?
really doesn't matter, as the test suite will shut down the SparkContext anyway.
q.recentProgress.filter(_.numInputRows > 0).lastOption
}
nit: extra line
val streamInput2 = MemoryStream[Int]
val staticInputDF2 = staticInputDF.union(staticInputDF).cache()

testStream(streamInput2.toDF().join(staticInputDF2, "value"), useV2Sink = true)(
what if you do a stream-stream join?
e.g. self-join?
then there will be two DataSourceV2ScanExecs reading from the same location. So we will be reading the data twice, and the counts will reflect that. But yes, I should add a test for that.
Turns out things were broken for self-joins and self-unions. Updated the logic and added tests for those for v2 sources.
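To see why self-joins and self-unions need the updated logic, here is a minimal, self-contained sketch of grouping per-scan row counts by the identity of the backing source. The names (`ScanNode`, `sourceId`, `inputRowsPerSource`) are illustrative, not Spark's actual `DataSourceV2ScanExec` internals:

```scala
// Illustrative model only: a source that appears several times in the
// executed plan (self-join / self-union) produces several scan nodes.
case class ScanNode(sourceId: Int, numOutputRows: Long)

// Group scan nodes by their source's identity so each source gets one
// combined entry. Since the data really is read once per scan node,
// every read counts toward that source's total.
def inputRowsPerSource(scans: Seq[ScanNode]): Map[Int, Long] =
  scans.groupBy(_.sourceId).map { case (id, nodes) =>
    id -> nodes.map(_.numOutputRows).sum
  }

// A self-union of one source that emitted 3 rows appears as two scan
// nodes; the data is read twice, and the count reflects that.
val perSource = inputRowsPerSource(Seq(ScanNode(0, 3L), ScanNode(0, 3L)))
```

Without the `groupBy` on source identity, the two scan nodes would be attributed to two distinct sources, which is the kind of mismatch the updated logic avoids.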
assert(lastProgress.get.sources(0).numInputRows == 3)
assert(lastProgress.get.sources(1).numInputRows == 0)
true
}
nit: I'd suggest doing an AddData() for the other stream after, to make sure there's not some weird order dependence.
Test build #89766 has finished for PR 21126 at commit
Test build #89777 has finished for PR 21126 at commit
"min" -> stats.min,
"avg" -> stats.avg.toLong).mapValues(formatTimestamp)
}.headOption.getOrElse(Map.empty) ++ watermarkTimestamp
The above code stayed the same. The diff is pretty dumb.
lgtm
LGTM. Thanks for all the tests. I'm trusting the unit tests. I don't see a better way of figuring out unique sources.
s" of the execution plan:\n" +
s"logical plan leaves: ${toString(allLogicalPlanLeaves)}\n" +
s"execution plan leaves: ${toString(allExecPlanLeaves)}\n")
existing nit: maybe we should've just used
"""
|
|
"""
Test build #4159 has finished for PR 21126 at commit
Test build #89810 has started for PR 21126 at commit
jenkins retest this please
Test build #89811 has finished for PR 21126 at commit
jenkins retest this please
Test build #89825 has finished for PR 21126 at commit
jenkins retest this please
Test build #89834 has finished for PR 21126 at commit
What changes were proposed in this pull request?
In some streaming queries, the input and processing rates are not calculated at all (they show up as zero) because MicroBatchExecution fails to associate metrics from the executed plan of a trigger with the sources in the logical plan of the trigger. The executed-plan-leaf-to-logical-source attribution works as follows. With V1 sources, there was no way to identify which execution plan leaves were generated by a streaming source, so we made a best-effort attempt to match logical and execution plan leaves when the number of leaves was the same. In cases where the number of leaves differs, we just give up and report zero rates. An example where this may happen is as follows.
In this case, the cachedStaticDF has multiple logical leaves, but in the trigger's execution plan it has only one leaf, because a cached subplan is represented as a single InMemoryTableScanExec leaf. This leads to a mismatch in the number of leaves, causing the input rates to be computed as zero.

With DataSourceV2, all inputs are represented in the executed plan using DataSourceV2ScanExec, each of which has a reference to the associated logical DataSource and DataSourceReader. So it is easy to associate the metrics to the original streaming sources.

In this PR, the solution is as follows. If all the streaming sources in a streaming query are v2 sources, then a new code path is used where the execution-metrics-to-source mapping is done directly. Otherwise we fall back to the existing mapping logic.
How was this patch tested?