
Conversation

@zsxwing
Member

@zsxwing zsxwing commented Feb 17, 2017

What changes were proposed in this pull request?

This PR adds a special streaming deduplication operator to support dropDuplicates with aggregation and watermark. It reuses the dropDuplicates API but adds a new logical plan, Deduplication, and a new physical plan, DeduplicationExec.

The following cases are supported:

  • one or multiple dropDuplicates() without aggregation (with or without watermark)
  • dropDuplicates before aggregation

Not supported cases:

  • dropDuplicates after aggregation

Breaking changes:

  • dropDuplicates without aggregation doesn't work with complete or update mode.

How was this patch tested?

The new unit tests.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73028 has finished for PR 16970 at commit 63a7f4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Deduplication(
  • trait WatermarkSupport extends SparkPlan
  • case class DeduplicationExec(

outputMode = Update,
expectedMsgs = Seq("multiple streaming aggregations"))

assertSupportedInStreamingPlan(
Member Author

Added some missing tests.

Contributor

great!!

val key = getKey(row)
val value = store.get(key)
if (value.isEmpty) {
store.put(key.copy(), row.copy())
Member Author

I don't know how to create an empty UnsafeRow. Right now the value is not needed, but storing the row as the value doubles the size of the state store.

Contributor

can you store a null?

Member Author

No. StateStore assumes value is not null.

Contributor

Nah, the HDFSBackedStateStore can't handle nulls. How about using UnsafeRow.createFromByteArray(0, 0)? We can reuse this immutable object.

Member Author

That doesn't work :(

java.lang.AssertionError: index (0) should < 0
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:133)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:352)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:113)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Contributor

@tdas tdas Feb 22, 2017

How about this.

val row = InternalRow.apply(null)
val unsafeRow = UnsafeProjection.create(Array[DataType](NullType)).apply(row)

This is a valid generated UnsafeRow that can be created once and reused. Found this in UnsafeRowSuite.
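
Putting the suggestion together, a minimal sketch (names like EMPTY_ROW and getKey follow the snippets above and are illustrative, not verbatim from the PR):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{DataType, NullType}

// A one-field null row: the smallest UnsafeRow that the generated code paths
// accept. Built once and reused as the dummy value for every deduplication key.
val EMPTY_ROW: UnsafeRow =
  UnsafeProjection.create(Array[DataType](NullType)).apply(InternalRow.apply(null))

// In the per-row loop only the key matters; the stored value is the shared dummy:
//   val key = getKey(row)
//   if (store.get(key).isEmpty) store.put(key.copy(), EMPTY_ROW)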

Member Author

Cool! Updated.

Contributor

@brkyvz brkyvz left a comment

Very excited about this. Left some comments about some extra things that we can support

throwError("Queries with streaming sources must be executed with writeStream.start()")(p)

case p: Deduplication =>
throwError("Batch queries should not use Deduplication")(p)
Contributor

Because of this, I would prefer the naming to imply that as well. Maybe rename Deduplication to StreamingDeduplication or something.

Contributor

Why is deduplication exclusive to streaming? Even if we don't want to implement a batch operator, I'd still allow it in the logical plan and just translate it to normal aggregation.

"streaming DataFrames/Datasets")(plan)
}

// Disallow multiple streaming deduplications
Contributor

We should support these. Example use case: I dedup on some higher-level columns to gain exactly-once semantics (infrastructure/application specific), then I do data transformations, then I dedup on some more specific data, e.g. region (query specific).
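
A sketch of that use case, with hypothetical column names:

// First dedup on coarse, infrastructure-level columns for exactly-once
// ingestion, transform, then dedup again on query-specific columns.
val cleaned = events
  .dropDuplicates("sourceId", "offset")              // infrastructure-specific
  .withColumn("region", substring($"zipCode", 1, 2)) // some transformation
  .dropDuplicates("region", "eventId")               // query-specific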

throwErrorIf(
outputMode == InternalOutputModes.Complete
&& collectStreamingDeduplications(subPlan).nonEmpty,
"Aggregation on dropDuplicates DataFrame/Dataset in Complete output mode " +
Contributor

why not?

}

/** Streaming dropDuplicates */
case class Deduplication(
Contributor

Ditto: IMHO the name should reflect that it is streaming.

val inputData = MemoryStream[(String, Int)]
val result = inputData.toDS().dropDuplicates("_1")

testStream(result, Append)(
Contributor

I know the semantics are the same for Append and Update, but just so that no one breaks it in the future, should we wrap these tests with:
Seq(Append, Update).foreach { mode =>
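
A sketch of that wrapper, filled out with a hypothetical test body:

// Run the same test under both output modes so a future change cannot
// silently break one of them.
Seq(Append, Update).foreach { mode =>
  test(s"deduplicate with all columns - $mode mode") {
    val inputData = MemoryStream[(String, Int)]
    val result = inputData.toDS().dropDuplicates("_1")

    testStream(result, mode)(
      AddData(inputData, "a" -> 1),
      CheckLastBatch("a" -> 1),
      AddData(inputData, "a" -> 2), // same "_1" key, so the row is dropped
      CheckLastBatch()
    )
  }
}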

@zsxwing
Member Author

zsxwing commented Feb 17, 2017

@brkyvz looks like you were looking at my old changes. I pushed a new commit and updated the PR description to reflect the latest supported queries.

@brkyvz
Contributor

brkyvz commented Feb 17, 2017

aw man. I should always refresh before starting a review

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73064 has finished for PR 16970 at commit 5a6af8b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (groupColExprIds.contains(attr.exprId)) {
attr
} else {
Alias(new First(attr).toAggregateExpression(), attr.name)()
Member Author

@marmbrus I tried to move this to SparkPlanner but failed because Alias(new First(attr).toAggregateExpression(), attr.name)() needs to be resolved before planning. Thoughts?

Contributor

You could do this construction at planning time if you preserve the attribute ids?

Member Author

Yeah, it works.

.format("memory")
.queryName("testquery")
.outputMode("complete")
.outputMode("append")
Member Author

@zsxwing zsxwing Feb 17, 2017

This is a behavior change: users can no longer use dropDuplicates with Complete mode without aggregation, because dropDuplicates is no longer an aggregation.
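
Concretely, a query like the one in this test now fails at analysis time (a sketch based on the snippet above):

// Throws AnalysisException: Complete output mode is only supported when the
// query contains streaming aggregations, and dropDuplicates alone no longer counts.
val query = inputData.toDS()
  .dropDuplicates("_1")
  .writeStream
  .format("memory")
  .queryName("testquery")
  .outputMode("complete")
  .start()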

Contributor

@tdas tdas Feb 18, 2017

I see. This was allowed earlier because dropDuplicates was an aggregate, but not any more. I think this is now consistent with the fact that we don't allow Complete mode in map-like queries.

@marmbrus any thoughts?


/** A logical plan for `dropDuplicates`. */
case class Deduplication(
keys: Seq[Attribute],
Contributor

Nit: indentation; these can be on the same line, I think.

*/
object Aggregation extends Strategy {
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
case Deduplication(keys, child) =>
Contributor

Shouldn't there be a new strategy? After all, dropping duplicates is not conceptually an aggregation; it just so happens that it can be implemented as an aggregation.

"numUpdatedStateRows" -> SQLMetrics.createMetric(sparkContext, "number of updated state rows"))
}

trait WatermarkSupport extends SparkPlan {
Contributor

docs.


trait WatermarkSupport extends SparkPlan {

def keyExpressions: Seq[Attribute]
Contributor

docs

val numTotalStateRows = longMetric("numTotalStateRows")
val numUpdatedStateRows = longMetric("numUpdatedStateRows")


Contributor

extra line


child.execute().mapPartitionsWithStateStore(
getStateId.checkpointLocation,
operatorId = getStateId.operatorId,
Contributor

nit: why are these two specified with param names?

@SparkQA

SparkQA commented Feb 18, 2017

Test build #73076 has finished for PR 16970 at commit ba58e2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StreamingDeduplicationExec(

)
}

private def assertNumStateRows(total: Seq[Long], updated: Seq[Long]): AssertOnQuery =
Contributor

These methods seem to be common across StreamingAggregationSuite, MapGroupsWithStateSuite, and this one. Can you make a trait?
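
A sketch of such a trait (the name is illustrative; it assumes the state metrics exposed through StreamingQueryProgress):

trait StateStoreMetricsTest extends StreamTest {
  // Assert the per-operator state row counts of the last progress report.
  def assertNumStateRows(total: Seq[Long], updated: Seq[Long]): AssertOnQuery =
    AssertOnQuery { q =>
      val stateOperators = q.lastProgress.stateOperators
      assert(stateOperators.map(_.numRowsTotal).toSeq === total)
      assert(stateOperators.map(_.numRowsUpdated).toSeq === updated)
      true
    }
}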

StateStore.stop()
}

test("deduplication") {
Contributor

nit: deduplication with all columns

)
}

test("deduplication with columns") {
Contributor

nit: deduplication with some columns

)
}

test("deduplication with aggregation - update") {
Contributor

nit: update mode

val inputData = MemoryStream[(String, Int)]
val result = inputData.toDS()
.dropDuplicates()
.groupBy($"_1")
Contributor

nit: why not name the columns!?

val inputData = MemoryStream[(String, Int)]
val result = inputData.toDS()
.dropDuplicates()
.groupBy($"_1")
Contributor

nit: same as above.

@tdas
Contributor

tdas commented Feb 18, 2017

overall looks good. just a bunch of nits.

outputMode = Complete,
expectedMsgs = Seq("(map/flatMap)GroupsWithState"))

// Deduplication: Not supported after a streaming aggregation
Contributor

@tdas tdas Feb 18, 2017

Actually, can you add a test for both dropDuplicates and mapGroupsWithState that checks that these operations are allowed on a batch subplan inside a streaming plan? That is,

assertSupportedInStreamingPlan(
     "Deduplication - Deduplication on batch relation",
      Deduplication(Seq(att), batchRelation),
      outputMode = Append
)

rewrittenResultExpressions,
planLater(child))

case Deduplication(keys, child) =>
Contributor

Same thought as below: this is not really aggregation, so it should be a different strategy.

val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
- val groupCols = colNames.flatMap { colName =>
+ val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>
Member Author

@zsxwing zsxwing Feb 21, 2017

Fixed an issue where groupCols may contain duplicate columns. The fix isn't strictly necessary because the optimizer will remove duplicate columns, but it's better to make fewer assumptions.

Contributor

was this a bug with batch queries as well? and what would the result be without this fix?

Member Author

The results will be the same; it's just pretty weird to depend on the optimizer to remove duplicate columns.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73225 has finished for PR 16970 at commit 0e72217.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73236 has finished for PR 16970 at commit b2e9cb0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member Author

zsxwing commented Feb 22, 2017

@tdas I created https://issues.apache.org/jira/browse/SPARK-19690 to track the issue when joining a batch DataFrame with a streaming DataFrame. I will fix it in a separate PR to unblock this one as it touches many files.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73247 has finished for PR 16970 at commit 78dfdfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Deduplication(

}

/** A logical plan for `dropDuplicates`. */
case class Deduplication(
Contributor

Most names are verbs: aggregate, project, intersect. I think it's best to name this "Deduplicate".


/**
* Replaces logical [[Deduplication]] operator with an [[Aggregate]] operator.
*/
Contributor

ReplaceDeduplicateWithAggregate. see comment below.
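
For reference, a sketch of what that rule could look like, assuming a Deduplicate(keys, child) logical node as discussed above (the merged version may differ in details):

import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.catalyst.expressions.aggregate.First
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Rewrite a batch Deduplicate into an Aggregate: group by the dedup keys and
// keep First() of every non-key column, preserving the output attribute ids
// so downstream operators still resolve against this plan's output.
object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Deduplicate(keys, child) if !child.isStreaming =>
      val keyExprIds = keys.map(_.exprId)
      val aggCols = child.output.map { attr =>
        if (keyExprIds.contains(attr.exprId)) {
          attr // key columns pass through unchanged
        } else {
          Alias(new First(attr).toAggregateExpression(), attr.name)(exprId = attr.exprId)
        }
      }
      Aggregate(keys, aggCols, child)
  }
}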

outputMode = Append
)

// Deduplication: Not supported after a streaming aggregation
Contributor

nit: Change this comment to just // Deduplication to reflect the whole subsection

expectedMsgs = Seq("dropDuplicates"))

assertSupportedInStreamingPlan(
"Deduplication - Deduplication on batch relation inside streaming relation",
Contributor

nit: inside a streaming query.
sounds weird otherwise.


AddData(inputData, 40), // Emit items less than watermark and drop their state
CheckLastBatch((15 -> 1), (25 -> 1)),
// states in aggregation in [40, 45)
Contributor

nit: indent.

* @group typedrel
* @since 2.0.0
*/
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
Contributor

You have to add more documentation for streaming usage! Especially, you have to document that this will keep all past data as intermediate state, and that you can use withWatermark to limit how late duplicate data can be, so the system will limit the state accordingly.

Also, double-check the docs on withWatermark and make sure they're consistent.

Member Author

Done

Contributor

You have not updated the docs for dropDuplicates! You should at least point to withWatermark to limit the state, and mention its semantics (all data later than the watermark will be ignored).

Member Author

Done
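
For reference, the documented usage pattern ends up along these lines (hypothetical column names):

// Deduplicate a streaming Dataset using the event-time column to bound state:
// duplicates arriving more than 10 minutes late (by eventTime) are ignored,
// and state older than the watermark is dropped.
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("guid", "eventTime")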

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73265 has started for PR 16970 at commit 7a7c0c7.

@zsxwing
Member Author

zsxwing commented Feb 22, 2017

retest this please

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73285 has finished for PR 16970 at commit 7a7c0c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Deduplicate(
  • case class StreamingDeduplicateExec(

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73297 has finished for PR 16970 at commit d0b7b77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 9bf4e2b Feb 23, 2017
@zsxwing zsxwing deleted the dedup branch February 23, 2017 19:48
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#16970 from zsxwing/dedup.
@uncleGen
Contributor

uncleGen commented Feb 28, 2017

Without aggregation, how does it drop duplicates between partitions?

@lw-lin
Contributor

lw-lin commented Mar 3, 2017

@uncleGen I think requiredChildDistribution = ClusteredDistribution(keyExpressions) :: Nil (please see here) takes care of it.
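
That is, the physical operator declares a clustered distribution on its key expressions, so the planner shuffles all rows with the same keys into a single partition before the state store lookup:

// In StreamingDeduplicateExec: rows sharing dedup keys must be co-located,
// so duplicates from different input partitions meet at the same state store.
override def requiredChildDistribution: Seq[Distribution] =
  ClusteredDistribution(keyExpressions) :: Nil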

@uncleGen
Contributor

uncleGen commented Mar 3, 2017

@lw-lin Thanks, got it
