
[SPARK-24050][SS] Calculate input / processing rates correctly for DataSourceV2 streaming sources #21126

Closed
wants to merge 6 commits into from

Conversation

tdas
Contributor

@tdas tdas commented Apr 23, 2018

What changes were proposed in this pull request?

In some streaming queries, the input and processing rates are not calculated at all (they show up as zero) because MicroBatchExecution fails to associate metrics from the executed plan of a trigger with the sources in the logical plan of the trigger. This executed-plan-leaf-to-logical-source attribution works as follows. With V1 sources, there was no way to identify which execution plan leaves were generated by a streaming source, so we made a best-effort attempt to match logical and execution plan leaves when the number of leaves was the same. In cases where the number of leaves is different, we just give up and report zero rates. An example where this may happen is as follows.

val cachedStaticDF = someStaticDF.union(anotherStaticDF).cache()
val streamingInputDF = ...

val query = streamingInputDF.join(cachedStaticDF).writeStream....

In this case, cachedStaticDF has multiple logical leaves, but in the trigger's execution plan it has only one leaf, because a cached subplan is represented as a single InMemoryTableScanExec leaf. This leads to a mismatch in the number of leaves, causing the input rates to be computed as zero.
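
For illustration, a rough sketch of that best-effort matching (simplified; the real logic lives in ProgressReporter/MicroBatchExecution, and the helper name and parameters here are made up):

// Simplified sketch of the V1 best-effort attribution, not the actual Spark
// code: logical leaves can only be zipped against execution-plan leaves when
// the counts line up.
def attributeInputRowsV1(
    logicalLeaves: Seq[LogicalPlan],
    execLeaves: Seq[SparkPlan]): Map[BaseStreamingSource, Long] = {
  if (logicalLeaves.size == execLeaves.size) {
    logicalLeaves.zip(execLeaves).collect {
      case (StreamingExecutionRelation(source, _), execLeaf) =>
        source -> execLeaf.metrics("numOutputRows").value
    }.toMap
  } else {
    // Leaf counts differ (e.g. a cached subplan collapsed into a single
    // InMemoryTableScanExec), so give up and report zero rates.
    Map.empty
  }
}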

With DataSourceV2, all inputs are represented in the executed plan by DataSourceV2ScanExec nodes, each of which has a reference to the associated logical DataSource and DataSourceReader. So it's easy to associate the metrics with the original streaming sources.

In this PR, the solution is as follows: if all the streaming sources in a streaming query are v2 sources, then use a new code path where the execution-metrics-to-source mapping is done directly (sketched below). Otherwise, fall back to the existing matching logic.
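
A minimal sketch of the direct V2 mapping (simplified; the actual change is in ProgressReporter/MicroBatchExecution):

// Sketch of the new V2 code path, not the exact PR code: every
// DataSourceV2ScanExec leaf knows its reader, so metrics attribute directly.
// Grouping and summing handles self-joins/self-unions, where the same source
// appears as several scan leaves.
val sourceToInputRows: Map[MicroBatchReader, Long] =
  executedPlan.collect {
    case scan: DataSourceV2ScanExec if scan.reader.isInstanceOf[MicroBatchReader] =>
      scan.reader.asInstanceOf[MicroBatchReader] -> scan.metrics("numOutputRows").value
  }.groupBy(_._1).mapValues(_.map(_._2).sum).toMap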

How was this patch tested?

  • New unit tests using V2 memory source
  • Existing unit tests using V1 source

@SparkQA

SparkQA commented Apr 23, 2018

Test build #89702 has finished for PR 21126 at commit d485db8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Apr 23, 2018

jenkins retest this please.

@SparkQA

SparkQA commented Apr 23, 2018

Test build #89707 has finished for PR 21126 at commit d485db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Apr 23, 2018

@brkyvz @jose-torres

Contributor

@brkyvz brkyvz left a comment

What is the behavior if there's a self join? Will we double count the numInputRows?

// Check whether the streaming query's logical plan has only V2 data sources
val allStreamingLeaves =
  logicalPlan.collect { case s: StreamingExecutionRelation => s }
allStreamingLeaves.forall { _.source.isInstanceOf[MicroBatchReader] }
Contributor

we don't have a way to track these for ContinuousProcessing at the moment?

Contributor Author

Maybe I can make it work for continuous as well with a small tweak.

Contributor

@jose-torres jose-torres Apr 24, 2018

A point fix here won't be sufficient - right now the row count metrics don't make it to the driver at all in continuous processing.

Contributor Author

Yeah. This code path is not used by continuous processing.

)

val streamInput2 = MemoryStream[Int]
val staticInputDF2 = staticInputDF.union(staticInputDF).cache()
Contributor

nit: unpersist later?

Contributor Author

It really doesn't matter, as the test suite will shut down the SparkContext anyway.

q.recentProgress.filter(_.numInputRows > 0).lastOption
}


Contributor

nit: extra line

val streamInput2 = MemoryStream[Int]
val staticInputDF2 = staticInputDF.union(staticInputDF).cache()

testStream(streamInput2.toDF().join(staticInputDF2, "value"), useV2Sink = true)(
Contributor

what if you do a stream-stream join?

Contributor

e.g. self-join?

Contributor Author

Then there will be two DataSourceV2ScanExecs reading from the same location. So we will be reading the data twice, and the counts will reflect that. But yes, I should add a test for that.
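
For instance, a rough sketch of a self-union test in the StreamTest DSL (hypothetical; the tests actually added in this PR may differ):

// Hypothetical self-union test: the same V2 memory source is scanned twice,
// so numInputRows should be double the rows added.
val input = MemoryStream[Int]
val df = input.toDF()
testStream(df.union(df), useV2Sink = true)(
  AddData(input, 1, 2, 3),
  CheckLastBatch(1, 1, 2, 2, 3, 3),
  AssertOnQuery { q =>
    val p = q.recentProgress.filter(_.numInputRows > 0).lastOption
    p.get.numInputRows == 6 // 3 rows, read twice
  }
)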

Contributor Author

Turns out things were broken for self-joins and self-unions. Updated the logic and added tests for those for v2 sources.

assert(lastProgress.get.sources(0).numInputRows == 3)
assert(lastProgress.get.sources(1).numInputRows == 0)
true
}
Contributor

nit: I'd suggest doing an AddData() for the other stream after, to make sure there's not some weird order dependence.
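
Something along these lines, say (a hypothetical continuation of the test; the expected output rows depend on the actual query):

// Hypothetical follow-up: feed the other stream afterwards and verify the
// per-source counts swap, ruling out any order dependence.
AddData(streamInput2, 4, 5),
CheckLastBatch(4, 5), // placeholder expectation; depends on the join/static data
AssertOnQuery { q =>
  val lastProgress = q.recentProgress.filter(_.numInputRows > 0).lastOption
  assert(lastProgress.get.sources(0).numInputRows == 0)
  assert(lastProgress.get.sources(1).numInputRows == 2)
  true
}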

@SparkQA

SparkQA commented Apr 24, 2018

Test build #89766 has finished for PR 21126 at commit 855e24d.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 24, 2018

Test build #89777 has finished for PR 21126 at commit 6d80cfd.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

"min" -> stats.min,
"avg" -> stats.avg.toLong).mapValues(formatTimestamp)
}.headOption.getOrElse(Map.empty) ++ watermarkTimestamp

Contributor Author

The code above stayed the same; the diff is pretty dumb.

@jose-torres
Contributor

lgtm

Contributor

@brkyvz brkyvz left a comment

LGTM. Thanks for all the tests. I'm trusting the unit tests. I don't see a better way of figuring out unique sources.

s" of the execution plan:\n" +
s"logical plan leaves: ${toString(allLogicalPlanLeaves)}\n" +
s"execution plan leaves: ${toString(allExecPlanLeaves)}\n")
s" of the execution plan:\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

existing nit: maybe we should've just used

"""
   |
   |
"""

@SparkQA

SparkQA commented Apr 25, 2018

Test build #4159 has finished for PR 21126 at commit 6d80cfd.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89810 has started for PR 21126 at commit 7d79dce.

@tdas
Contributor Author

tdas commented Apr 25, 2018

jenkins retest this please

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89811 has finished for PR 21126 at commit a56707f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Apr 25, 2018

jenkins retest this please

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89825 has finished for PR 21126 at commit a56707f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Apr 25, 2018

jenkins retest this please

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89834 has finished for PR 21126 at commit a56707f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 396938e Apr 25, 2018