
Conversation

@zsxwing
Member

@zsxwing zsxwing commented Feb 17, 2017

What changes were proposed in this pull request?

This PR adds a special streaming deduplication operator to support dropDuplicates with aggregation and watermark. It reuses the dropDuplicates API but adds a new logical plan, Deduplication, and a new physical plan, DeduplicationExec.

The following cases are supported:

  • one or multiple dropDuplicates() without aggregation (with or without watermark)
  • dropDuplicates before aggregation

Not supported cases:

  • dropDuplicates after aggregation

Breaking changes:

  • dropDuplicates without aggregation doesn't work with complete or update mode.

How was this patch tested?

The new unit tests.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73028 has finished for PR 16970 at commit 63a7f4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Deduplication(
  • trait WatermarkSupport extends SparkPlan
  • case class DeduplicationExec(

outputMode = Update,
expectedMsgs = Seq("multiple streaming aggregations"))

assertSupportedInStreamingPlan(
Member Author

Added some missing tests.

Contributor

great!!

val key = getKey(row)
val value = store.get(key)
if (value.isEmpty) {
store.put(key.copy(), row.copy())
Member Author

I don't know how to create an empty UnsafeRow. Right now the value is not needed, but storing the row as the value doubles the size of the state store.

Contributor

can you store a null?

Member Author

No. StateStore assumes value is not null.

Contributor

Nah, the HDFSBackedStateStore can't handle nulls. How about using UnsafeRow.createFromByteArray(0, 0)? We can reuse this immutable object.

Member Author

That doesn't work :(

java.lang.AssertionError: index (0) should < 0
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:133)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:352)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:113)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Contributor

@tdas tdas Feb 22, 2017

How about this.

val row = InternalRow.apply(null)
val unsafeRow = UnsafeProjection.create(Array[DataType](NullType)).apply(row)

This is a valid generated UnsafeRow that can be created once and reused. Found this in UnsafeRowSuite.
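
Putting the suggestion together, a minimal sketch (names like EMPTY_ROW and getKey follow the snippets above and are illustrative, not verbatim from the PR):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{DataType, NullType}

// A one-field null row: the smallest UnsafeRow that the generated code paths
// accept. Built once and reused as the dummy value for every deduplication key.
val EMPTY_ROW: UnsafeRow =
  UnsafeProjection.create(Array[DataType](NullType)).apply(InternalRow.apply(null))

// In the per-row loop only the key matters; the stored value is the shared dummy:
//   val key = getKey(row)
//   if (store.get(key).isEmpty) store.put(key.copy(), EMPTY_ROW)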

Member Author

Cool! Updated.

Contributor

@brkyvz brkyvz left a comment

Very excited about this. Left some comments about some extra things that we can support

throwError("Queries with streaming sources must be executed with writeStream.start()")(p)

case p: Deduplication =>
throwError("Batch queries should not use Deduplication")(p)
Contributor

Because of this, I would prefer the naming to imply that as well. Maybe rename Deduplication to StreamingDeduplication or something.

Contributor

Why is deduplication exclusive to streaming? Even if we don't want to implement a batch operator, I'd still allow it in the logical plan and just translate it to normal aggregation.

"streaming DataFrames/Datasets")(plan)
}

// Disallow multiple streaming deduplications
Contributor

We should support these. Example use case: I dedup on some higher-level columns to gain exactly-once semantics (infrastructure/application specific), then I do data transformations, then I dedup on some more specific data, e.g. region (query specific).
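
A sketch of that use case, with hypothetical column names:

// First dedup on coarse, infrastructure-level columns for exactly-once
// ingestion, transform, then dedup again on query-specific columns.
val cleaned = events
  .dropDuplicates("sourceId", "offset")              // infrastructure-specific
  .withColumn("region", substring($"zipCode", 1, 2)) // some transformation
  .dropDuplicates("region", "eventId")               // query-specific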

throwErrorIf(
outputMode == InternalOutputModes.Complete
&& collectStreamingDeduplications(subPlan).nonEmpty,
"Aggregation on dropDuplicates DataFrame/Dataset in Complete output mode " +
Contributor

why not?

}

/** Streaming dropDuplicates */
case class Deduplication(
Contributor

Ditto: IMHO the name should reflect that it is streaming.

val inputData = MemoryStream[(String, Int)]
val result = inputData.toDS().dropDuplicates("_1")

testStream(result, Append)(
Contributor

I know the semantics are the same for Append and Update, but just so that no one breaks it in the future, should we wrap these tests with:
Seq(Append, Update).foreach { mode =>
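
A sketch of that wrapper, filled out with a hypothetical test body:

// Run the same test under both output modes so a future change cannot
// silently break one of them.
Seq(Append, Update).foreach { mode =>
  test(s"deduplicate with all columns - $mode mode") {
    val inputData = MemoryStream[(String, Int)]
    val result = inputData.toDS().dropDuplicates("_1")

    testStream(result, mode)(
      AddData(inputData, "a" -> 1),
      CheckLastBatch("a" -> 1),
      AddData(inputData, "a" -> 2), // same "_1" key, so the row is dropped
      CheckLastBatch()
    )
  }
}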

@zsxwing
Member Author

zsxwing commented Feb 17, 2017

@brkyvz looks like you were looking at my old changes. I pushed a new commit and updated the PR description to reflect the latest supported queries.

@brkyvz
Contributor

brkyvz commented Feb 17, 2017

aw man. I should always refresh before starting a review

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73064 has finished for PR 16970 at commit 5a6af8b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (groupColExprIds.contains(attr.exprId)) {
attr
} else {
Alias(new First(attr).toAggregateExpression(), attr.name)()
Member Author

@marmbrus I tried to move this to SparkPlanner but failed because Alias(new First(attr).toAggregateExpression(), attr.name)() needs to be resolved before planning. Thoughts?

Contributor

You could do this construction at planning time if you preserve the attribute ids?

Member Author

Yeah, it works.

.format("memory")
.queryName("testquery")
.outputMode("complete")
.outputMode("append")
Member Author

@zsxwing zsxwing Feb 17, 2017

This is a behavior change: users can no longer use dropDuplicates with Complete mode without aggregation, because dropDuplicates is no longer an aggregation.
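
Concretely, a query like the one in this test now fails at analysis time (a sketch based on the snippet above):

// Throws AnalysisException: Complete output mode is only supported when the
// query contains streaming aggregations, and dropDuplicates alone no longer counts.
val query = inputData.toDS()
  .dropDuplicates("_1")
  .writeStream
  .format("memory")
  .queryName("testquery")
  .outputMode("complete")
  .start()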

Contributor

@tdas tdas Feb 18, 2017

I see. This was allowed earlier because dropDuplicates was an aggregate, but not any more. I think this is now consistent with the fact that we don't allow Complete mode in map-like queries.

@marmbrus any thoughts?


/** A logical plan for `dropDuplicates`. */
case class Deduplication(
keys: Seq[Attribute],
Contributor

Nit: indentation; these can be on the same line, I think.

*/
object Aggregation extends Strategy {
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
case Deduplication(keys, child) =>
Contributor

Shouldn't there be a new strategy? After all, dropping duplicates is not conceptually an aggregation; it just so happens that it can be implemented as an aggregation.

"numUpdatedStateRows" -> SQLMetrics.createMetric(sparkContext, "number of updated state rows"))
}

trait WatermarkSupport extends SparkPlan {
Contributor

docs.


trait WatermarkSupport extends SparkPlan {

def keyExpressions: Seq[Attribute]
Contributor

docs

val numTotalStateRows = longMetric("numTotalStateRows")
val numUpdatedStateRows = longMetric("numUpdatedStateRows")


Contributor

extra line


child.execute().mapPartitionsWithStateStore(
getStateId.checkpointLocation,
operatorId = getStateId.operatorId,
Contributor

nit: why are these two specified with param names?

@SparkQA

SparkQA commented Feb 18, 2017

Test build #73076 has finished for PR 16970 at commit ba58e2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StreamingDeduplicationExec(

)
}

private def assertNumStateRows(total: Seq[Long], updated: Seq[Long]): AssertOnQuery =
Contributor

These methods seem to be common across StreamingAggregationSuite, MapGroupsWithStateSuite, and this one. Can you make a trait?
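
A sketch of such a trait (the name is illustrative; it assumes the state metrics exposed through StreamingQueryProgress):

trait StateStoreMetricsTest extends StreamTest {
  // Assert the per-operator state row counts of the last progress report.
  def assertNumStateRows(total: Seq[Long], updated: Seq[Long]): AssertOnQuery =
    AssertOnQuery { q =>
      val stateOperators = q.lastProgress.stateOperators
      assert(stateOperators.map(_.numRowsTotal).toSeq === total)
      assert(stateOperators.map(_.numRowsUpdated).toSeq === updated)
      true
    }
}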

StateStore.stop()
}

test("deduplication") {
Contributor

nit: deduplication with all columns

)
}

test("deduplication with columns") {
Contributor

nit: deduplication with some columns

)
}

test("deduplication with aggregation - update") {
Contributor

nit: update mode

val inputData = MemoryStream[(String, Int)]
val result = inputData.toDS()
.dropDuplicates()
.groupBy($"_1")
Contributor

nit: why not name the columns!?

val inputData = MemoryStream[(String, Int)]
val result = inputData.toDS()
.dropDuplicates()
.groupBy($"_1")
Contributor

nit: same as above.

@tdas
Contributor

tdas commented Feb 18, 2017

overall looks good. just a bunch of nits.

outputMode = Complete,
expectedMsgs = Seq("(map/flatMap)GroupsWithState"))

// Deduplication: Not supported after a streaming aggregation
Contributor

@tdas tdas Feb 18, 2017

Actually, can you add a test for both dropDuplicates and mapGroupsWithState that checks that these operations are allowed on a batch subplan inside a streaming plan? That is,

assertSupportedInStreamingPlan(
     "Deduplication - Deduplication on batch relation",
      Deduplication(Seq(att), batchRelation),
      outputMode = Append
)

rewrittenResultExpressions,
planLater(child))

case Deduplication(keys, child) =>
Contributor

Same thought as below: this is not really aggregation, so it should be a different strategy.

val resolver = sparkSession.sessionState.analyzer.resolver
val allColumns = queryExecution.analyzed.output
- val groupCols = colNames.flatMap { colName =>
+ val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>
Member Author

@zsxwing zsxwing Feb 21, 2017

Fixed an issue where groupCols may contain duplicate columns. The fix isn't strictly necessary because the optimizer will remove duplicate columns, but it's better to make fewer assumptions.

Contributor

was this a bug with batch queries as well? and what would the result be without this fix?

Member Author

The results will be the same; it's just pretty weird to depend on the optimizer to remove duplicate columns.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73225 has finished for PR 16970 at commit 0e72217.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2017

Test build #73236 has finished for PR 16970 at commit b2e9cb0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member Author

zsxwing commented Feb 22, 2017

@tdas I created https://issues.apache.org/jira/browse/SPARK-19690 to track the issue when joining a batch DataFrame with a streaming DataFrame. I will fix it in a separate PR to unblock this one as it touches many files.

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73247 has finished for PR 16970 at commit 78dfdfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Deduplication(

}

/** A logical plan for `dropDuplicates`. */
case class Deduplication(
Contributor

Most names are verbs: aggregate, project, intersect. I think it's best to name this "Deduplicate".


/**
* Replaces logical [[Deduplication]] operator with an [[Aggregate]] operator.
*/
Contributor

ReplaceDeduplicateWithAggregate. see comment below.
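
For reference, a sketch of what that rule could look like, assuming a Deduplicate(keys, child) logical node as discussed above (the merged version may differ in details):

import org.apache.spark.sql.catalyst.expressions.Alias
import org.apache.spark.sql.catalyst.expressions.aggregate.First
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Rewrite a batch Deduplicate into an Aggregate: group by the dedup keys and
// keep First() of every non-key column, preserving the output attribute ids
// so downstream operators still resolve against this plan's output.
object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Deduplicate(keys, child) if !child.isStreaming =>
      val keyExprIds = keys.map(_.exprId)
      val aggCols = child.output.map { attr =>
        if (keyExprIds.contains(attr.exprId)) {
          attr // key columns pass through unchanged
        } else {
          Alias(new First(attr).toAggregateExpression(), attr.name)(exprId = attr.exprId)
        }
      }
      Aggregate(keys, aggCols, child)
  }
}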

outputMode = Append
)

// Deduplication: Not supported after a streaming aggregation
Contributor

nit: Change this comment to just // Deduplication to reflect the whole subsection

expectedMsgs = Seq("dropDuplicates"))

assertSupportedInStreamingPlan(
"Deduplication - Deduplication on batch relation inside streaming relation",
Contributor

nit: inside a streaming query.
sounds weird otherwise.


AddData(inputData, 40), // Emit items less than watermark and drop their state
CheckLastBatch((15 -> 1), (25 -> 1)),
// states in aggregation in [40, 45)
Contributor

nit: indent.

* @group typedrel
* @since 2.0.0
*/
def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
Contributor

You have to add more documentation for streaming usage! Especially, you have to document that this will keep all past data as intermediate state, and that you can use withWatermark to limit how late duplicate data can be, so the system will limit the state accordingly.

Also, double-check the docs on withWatermark and make sure they're consistent.

Member Author

Done

Contributor

You have not updated the docs for dropDuplicates! You should at least point to withWatermark to limit the state, and mention its semantics (all data later than the watermark will be ignored).

Member Author

Done
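
For reference, the documented usage pattern ends up along these lines (hypothetical column names):

// Deduplicate a streaming Dataset using the event-time column to bound state:
// duplicates arriving more than 10 minutes late (by eventTime) are ignored,
// and state older than the watermark is dropped.
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("guid", "eventTime")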

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73265 has started for PR 16970 at commit 7a7c0c7.

@zsxwing
Member Author

zsxwing commented Feb 22, 2017

retest this please

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73285 has finished for PR 16970 at commit 7a7c0c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Deduplicate(
  • case class StreamingDeduplicateExec(

@SparkQA

SparkQA commented Feb 22, 2017

Test build #73297 has finished for PR 16970 at commit d0b7b77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 9bf4e2b Feb 23, 2017
@zsxwing zsxwing deleted the dedup branch February 23, 2017 19:48
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#16970 from zsxwing/dedup.
@uncleGen
Contributor

uncleGen commented Feb 28, 2017

Without aggregation, how does it drop duplicates between partitions?

@lw-lin
Contributor

lw-lin commented Mar 3, 2017

@uncleGen I think requiredChildDistribution = ClusteredDistribution(keyExpressions) :: Nil (please see here) takes care of it.
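
That is, the physical operator declares a clustered distribution on its key expressions, so the planner shuffles all rows with the same keys into a single partition before the state store lookup:

// In StreamingDeduplicateExec: rows sharing dedup keys must be co-located,
// so duplicates from different input partitions meet at the same state store.
override def requiredChildDistribution: Seq[Distribution] =
  ClusteredDistribution(keyExpressions) :: Nil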

@uncleGen
Contributor

uncleGen commented Mar 3, 2017

@lw-lin Thanks, got it
