Conversation

@ghost ghost commented Oct 25, 2016

What changes were proposed in this pull request?

Currently we use Java serialization for the WAL that stores the offsets contained in each batch. This has two main issues:
It can break across Spark releases (though this is not the only thing preventing us from upgrading a running query).
It is unnecessarily opaque to the user.
I'd propose we require offsets to provide a user-readable serialization and use that instead. JSON is probably a good option.
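
As a rough sketch of the proposed shape (illustrative only; the class and method names follow the ones touched in this PR, not a final API):

```scala
// Offsets render themselves as stable, human-readable JSON instead of relying
// on Java serialization of the concrete class.
abstract class Offset {
  /** A user-readable JSON representation of this offset, written to the offset log. */
  def json: String
}

// A trivial source offset: its JSON form can be as simple as the long value itself.
case class LongOffset(offset: Long) extends Offset {
  override def json: String = offset.toString
}
```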

How was this patch tested?

Tests were added for KafkaSourceOffset in KafkaSourceOffsetSuite and for LongOffset in OffsetSuite

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

@zsxwing @marmbrus

Member

zsxwing commented Oct 25, 2016

jenkins, add to whitelist


SparkQA commented Oct 25, 2016

Test build #67524 has finished for PR 15626 at commit bdd7fa7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AssignStrategy(partitions: Array[TopicPartition], kafkaParams: ju.Map[String, Object])
    • class SingularMatrixException(message: String, cause: Throwable)
    • case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])
    • abstract class FileStatusCache

Contributor

@tdas tdas left a comment

Overall, this is a good start, but I think there is one critical piece missing (or maybe I am not understanding the PR correctly).

The OffsetSeq is finally written to the log file by HDFSMetadataLog. Currently, HDFSMetadataLog serializes the offset using Java serialization and writes the bytes to the file. The goal of this PR is to ultimately not use Java serialization and instead use a JSON text representation in the file. In this PR I don't see any changes to HDFSMetadataLog, so the serialization format hasn't changed. All it's doing is Java-serializing OffsetSeq instead of CompositeOffset.

I strongly suggest writing a unit test where you take an OffsetSeq, use HDFSMetadataLog to write it to a file, and check whether the contents of the file are valid JSON.
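
A rough sketch of that kind of round-trip test, assuming the OffsetSeqLog and OffsetSeq.fill signatures shown later in this PR, plus the usual Spark test harness helpers (withTempDir, spark):

```scala
test("offset log is written as readable JSON, not Java-serialized bytes") {
  withTempDir { dir =>
    val log = new OffsetSeqLog("v1", spark, dir.getAbsolutePath)
    val batch = OffsetSeq.fill(LongOffset(0), LongOffset(1))
    assert(log.add(0, batch))

    // Batch 0 is stored in a file named after its id; its contents should be plain
    // text containing the JSON of each offset, not Java-serialized bytes.
    val content = scala.io.Source.fromFile(new java.io.File(dir, "0")).mkString
    assert(content.contains(LongOffset(0).json))
    assert(content.contains(LongOffset(1).json))
  }
}
```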

*/
private[kafka010]
case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends Offset {
import org.json4s.jackson.Serialization.{write}
Contributor

we usually put imports at the top, especially important ones like this.
And when it is just one component, it's just import org.json4s.jackson.Serialization.write

Contributor

Actually, it's easier to read the code later if you import Serialization and then use Serialization.write. Same for read.
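
For illustration, the suggested style reads roughly like this (a standalone sketch, not code from the PR; the PartitionOffset case class is made up for the example):

```scala
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

object JsonExample {
  private implicit val formats = Serialization.formats(NoTypeHints)

  case class PartitionOffset(topic: String, partition: Int, offset: Long)

  // Qualified calls make it obvious at the call site that json4s is doing the work.
  def toJson(po: PartitionOffset): String = Serialization.write(po)
  def fromJson(s: String): PartitionOffset = Serialization.read[PartitionOffset](s)
}
```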

Author

Sounds good.


/** Companion object of the [[KafkaSourceOffset]] */
private[kafka010] object KafkaSourceOffset {
import org.json4s.jackson.Serialization.{read}
Contributor

same as above

}

object LongOffset {

Contributor

nit: extra line


def apply(serialized: Offset) : LongOffset = new LongOffset(serialized.json.toLong)

}
Contributor

nit: extra lines above and below.

Author

Can you please clarify best practices w.r.t. extra lines? I didn't see anything in the code style guide.

Contributor

Not a strict guideline, I guess; it's a matter of the reviewer's taste quite a bit. For me, where there is no need to visually introduce a gap, there is no need for an empty line. For example, between def apply and } there is no need to introduce a separation.


def apply(json: String): SerializedOffset = new SerializedOffset(json)

}
Contributor

nit: extra lines.

* vector clock that must progress linearly forward.
*/
case class CompositeOffset(offsets: Seq[Option[Offset]]) extends Offset {
case class OffsetSeq(offsets: Seq[Option[Offset]]) {
Contributor

Why was this renamed?

Author

I renamed it to avoid the class being confused with a type of Offset. Does that make sense, or should I take an alternative approach?

Contributor

Aah right. Of course. I didn't notice that. That makes sense.

KafkaSourceOffset(offsetTuples.map { case(t, p, o) => (new TopicPartition(t, p), o) }.toMap)
}

def apply(serialized: Offset): KafkaSourceOffset = {
Contributor

If I get this right, this can be called only on SerializedOffset. In that case, best to type the param as SerializedOffset.
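
i.e., something along these lines (a sketch; the body assumes the JsonUtils-based parsing that the PR adopts later):

```scala
def apply(offset: SerializedOffset): KafkaSourceOffset =
  KafkaSourceOffset(JsonUtils.partitionOffsets(offset.json))
```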

Author

Sounds good.

}

/** Used when loading */
class SerializedOffset(override val json: String) extends Offset
Contributor

This can be a case class, in which case the apply below won't be needed.
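
i.e., roughly this, which is what the PR ends up with:

```scala
// The compiler-generated companion apply makes SerializedOffset(json) work directly.
case class SerializedOffset(override val json: String) extends Offset
```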

Author

Makes sense.

def json: String
}

/** Used when loading */
Contributor

Improve the docs to something like "A JSON-serialized representation of an Offset that is used for saving offsets to the offset log"

Author

Got it.


object LongOffset {

def apply(serialized: Offset) : LongOffset = new LongOffset(serialized.json.toLong)
Contributor

same comment as with SerializedOffset.apply, make the param SerializedOffset instead of Offset


SparkQA commented Oct 26, 2016

Test build #67541 has finished for PR 15626 at commit 2fcf251.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SerializedOffset(override val json: String) extends Offset

@koeninger
Contributor

So it looks like this is the json format being used for kafka offsets:

[{"_1":{"hash":0,"partition":0,"topic":"t"},"_2":2},{"_1":{"hash":0,"partition":1,"topic":"t"},"_2":2}]

My main concerns are that it is unnecessarily verbose and completely different from the format being used to specify per-TopicPartition offsets in org.apache.spark.sql.kafka010.JsonUtils.

I see no reason why those formats should be different - a user should be able to take offsets directly from the checkpoint and use them to start a job, for instance.
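
For reference, reusing those helpers would look roughly like this (JsonUtils is package-private to org.apache.spark.sql.kafka010, so this sketch only compiles inside that package; the partitionOffsets overloads are the existing ones):

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.kafka010.JsonUtils

object OffsetJsonExample {
  val offsets = Map(
    new TopicPartition("t", 0) -> 2L,
    new TopicPartition("t", 1) -> 2L)

  // Compact, human-editable form, e.g. {"t":{"0":2,"1":2}} (key order may vary).
  val json: String = JsonUtils.partitionOffsets(offsets)

  // The same helper family parses it back, e.g. for restarting a query from a checkpoint.
  val restored: Map[TopicPartition, Long] = JsonUtils.partitionOffsets(json)
}
```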

Author

ghost commented Oct 26, 2016

@koeninger Thanks for pointing this out. I have revised the serialization procedure to use org.apache.spark.sql.kafka010.JsonUtils.


SparkQA commented Oct 27, 2016

Test build #67605 has finished for PR 15626 at commit 3cf14fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 28, 2016

Test build #67685 has finished for PR 15626 at commit f1010ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 28, 2016

Test build #67684 has finished for PR 15626 at commit c7a8bd3.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path: String)
    • class OffsetStreamSourceLog(offsetLogVersion: String, sparkSession: SparkSession, path: String)


SparkQA commented Oct 28, 2016

Test build #67690 has finished for PR 15626 at commit 1390711.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](

one = KafkaSourceOffset(("t", 0, 1L)),
two = KafkaSourceOffset(("t", 0, 2L), ("t", 1, 1L)))

compare(
Contributor

I would make this a separate test("json serialization") { ... } just to verify whether json ser --> deser gives back the same results.
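
A sketch of such a test, using the factories already exercised in this suite (exact assertions up to the author):

```scala
test("json serialization round trip") {
  val offset = KafkaSourceOffset(("t", 0, 1L), ("t", 1, 2L))
  // Re-parsing the JSON form should give back an equal KafkaSourceOffset.
  assert(KafkaSourceOffset(SerializedOffset(offset.json)) === offset)
}
```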

Author

I have extended the KafkaSourceOffsetSuite to include more thorough evaluations of json serialization/deserialization, including to/from OffsetSeqLog.


SparkQA commented Oct 28, 2016

Test build #67711 has finished for PR 15626 at commit b80b3ce.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@tdas tdas left a comment

This is looking good, pending some public API issues. In addition, can you also add to the Scala docs of Source.getBatch that the user should not assume the start and end offsets to be of a particular type, and that the safe way to read the offset details is to parse Offset.json.

* doing a compaction, it will read all old log files and merge them with the new batch.
*/
abstract class CompactibleFileStreamLog[T: ClassTag](
abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
Contributor

Why is this AnyRef needed?

Author

It's required by json4s that T be an AnyRef.

private implicit val formats = Serialization.formats(NoTypeHints)

/** Needed to serialize type T into JSON */
private implicit val manifest = Manifest.classType[T](implicitly[ClassTag[T]].runtimeClass)
Contributor

Why is this needed now? This wasn't needed earlier when we were using JSON serialization in FileStreamSinkLog, so why is the manifest needed now?

Contributor

Ah, I see why this is needed.

Contributor

Can you add "Needed by Jackson to serialize/deserialize... "?

Author

Done.
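
For the record, the reason both the bound and the manifest are needed: json4s only defines write for reference types (its signature is roughly def write[A <: AnyRef](a: A)(implicit formats: Formats): String), and read[T] needs an implicit Manifest[T]. A minimal sketch of the pattern, mirroring the formats/manifest lines in this diff (the class name here is made up):

```scala
import scala.reflect.ClassTag

import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

abstract class JsonEntryLog[T <: AnyRef : ClassTag] {
  private implicit val formats = Serialization.formats(NoTypeHints)

  /** Needed by Jackson/json4s to serialize/deserialize type T into/from JSON. */
  private implicit val manifest = Manifest.classType[T](implicitly[ClassTag[T]].runtimeClass)

  def serializeEntry(entry: T): String = Serialization.write(entry)
  def deserializeEntry(line: String): T = Serialization.read[T](line)
}
```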

def json: String
}

/** Used when loading */
Contributor

Needs better docs. I think I had commented on that earlier.

Author

Added.


override protected def serialize(metadata: OffsetSeq, out: OutputStream): Unit = {
// called inside a try-finally where the underlying stream is closed in the caller

Contributor

nit: extra line

Author

right.


override protected def deserialize(in: InputStream): OffsetSeq = {
// called inside a try-finally where the underlying stream is closed in the caller

Contributor

nit: extra line.

Author

right.

assert(metadataLog.get(1) === Some(batch1Serialized))
assert(metadataLog.getLatest() === Some(1 -> batch1Serialized))
assert(metadataLog.get(None, Some(1)) ===
Array(0 -> batch0Serialized, 1 -> batch1Serialized))
Contributor

Incorrect continuation indent; should be 2, not 8.

Author

Got it.

assert(metadataLog.get(1) === Some(batch1Serialized))
assert(metadataLog.getLatest() === Some(1 -> batch1Serialized))
assert(metadataLog.get(None, Some(1)) ===
Array(0 -> batch0Serialized, 1 -> batch1Serialized))
Contributor

Incorrect continuation indent; should be 2, not 8.

Author

Got it.

val batch0Serialized = OffsetSeq.fill(batch0.offsets.map(_.map(o =>
SerializedOffset(o.json))).flatten: _*)

val batch1Serialized = OffsetSeq.fill(batch1.offsets.map(_.map(o =>
Contributor

Simpler to read if you use flatMap instead of map...map...flatten.
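
e.g. (equivalent, just flatter):

```scala
val batch0Serialized =
  OffsetSeq.fill(batch0.offsets.flatMap(_.map(o => SerializedOffset(o.json))): _*)
```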

Author

indeed!

val dir = new File(temp, "dir") // use non-existent directory to test whether log make the dir
val metadataLog = new OffsetSeqLog("v1", spark, dir.getAbsolutePath)
val batch0 = OffsetSeq.fill(LongOffset(0))
val batch1 = OffsetSeq.fill(LongOffset(1), LongOffset(2))
Contributor

I would like to see this test with mixed offset types. How about defining a custom offset and using that?

Author

I added a StringOffset type. Let me know if you'd like to see something more complex.

/**
* Returns [[KafkaSourceOffset]] from a JSON [[SerializedOffset]]
*/
def apply(serialized: SerializedOffset): KafkaSourceOffset = {
Contributor

This does not seem to be used anywhere. I think the code in KafkaSourceOffset.getPartitionOffsets needs to change because it does not take into account that the param offset can be a SerializedOffset.
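
A sketch of the kind of handling being asked for (the exact shape is up to the PR; KafkaSourceOffset.apply(SerializedOffset) is the factory shown above):

```scala
def getPartitionOffsets(offset: Offset): Map[TopicPartition, Long] = offset match {
  case o: KafkaSourceOffset => o.partitionToOffsets
  case so: SerializedOffset => KafkaSourceOffset(so).partitionToOffsets
  case _ =>
    throw new IllegalArgumentException(
      s"Invalid conversion from offset of ${offset.getClass} to KafkaSourceOffset")
}
```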

Author

Great catch, thanks! I also added a test for this case.

@marmbrus
Contributor

Thanks for working on this! Could you include examples of the various logs, since we are committing to this specific JSON format?

Author

ghost commented Oct 28, 2016

All logs now uniformly marshal entries to JSON. Each log has a particular way of organizing its JSON entries:
HDFSMetadataLog:
Each Batch ID is a separate file.
All log entries within a batch ID are serialized on one line.

OffsetSeqLog:
Each Batch ID is a separate file.
All log entries (i.e., Offsets) are separated by a newline character, i.e., one offset entry per line.

CompactibleFileStreamLog, FileStreamSinkLog, and FileStreamSourceLog:
No changes to the log file format, i.e., log entries are separated by a newline character.
Log entry -> JSON serialization is now done in CompactibleFileStreamLog directly.

Offsets are meant to be implemented by the Source. We currently have two Offset implementations:

  • LongOffset: contains a single long value, which is also its JSON string representation
    Examples:
    2L ---> "{2}"
  • KafkaSourceOffset: contains a Map[TopicPartition, Long], i.e., TopicPartition -> Long offset. Its JSON representation is extracted from org.apache.spark.sql.kafka010.JsonUtils
    Examples:
    KafkaSourceOffset(("t", 0, 1L)) ---> {"t":{"0":1}}
    KafkaSourceOffset(("t", 0, 2L), ("t", 1, 3L)) ----> {"t":{"1":3,"0":2}}
    KafkaSourceOffset(("t", 0, 2L), ("t", 1, 3L), ("t", 1, 4L)) ---> {"t":{"1":4,"0":2}}


SparkQA commented Oct 28, 2016

Test build #67725 has finished for PR 15626 at commit bbc5a26.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 29, 2016

Test build #67739 has finished for PR 15626 at commit 9298022.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 31, 2016

Test build #67822 has finished for PR 15626 at commit 8a2028b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 31, 2016

Test build #67829 has finished for PR 15626 at commit 7beedcc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 1, 2016

Test build #67831 has finished for PR 15626 at commit d6fec94.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 1, 2016

Test build #67845 has finished for PR 15626 at commit 8ee336d.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 2, 2016

Test build #67940 has finished for PR 15626 at commit acab31e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 2, 2016

Test build #67994 has finished for PR 15626 at commit 81f0c01.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@zsxwing zsxwing left a comment

Looks pretty good. Just some minor issues.

case _ => false
}

override def hashCode(): Int = this.json.hashCode
Member

need to add final

* new data has arrived.
*/
trait Offset extends Serializable {}
abstract class Offset {
Member

nit: could you document that we only compare the json string, please?

Member

We should also document in Source.getBatch that start and end may be semantically the same but have different JSON representations, and the Source should handle this case.
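
i.e., equality and hashing pinned to the JSON form, with final as requested (a sketch consistent with the hashCode shown above):

```scala
abstract class Offset {
  /** The only state an Offset is compared on: its JSON representation. */
  def json: String

  final override def equals(obj: Any): Boolean = obj match {
    case o: Offset => this.json == o.json
    case _ => false
  }

  final override def hashCode(): Int = this.json.hashCode
}
```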

// called inside a try-finally where the underlying stream is closed in the caller
val inStream = serializer.deserializeStream(in)
inStream.readObject[T]()
val reader = new InputStreamReader(in)
Member

nit: should specify the charset like new InputStreamReader(in, StandardCharsets.UTF_8)

/**
* Returns the data that is between the offsets (`start`, `end`].
*/
override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
Contributor

Given this change, the start and end Offset will not be the custom Offset class defined by the source, so I think it is important to document in Source.getBatch that the user should not make that assumption.


SparkQA commented Nov 2, 2016

Test build #68008 has finished for PR 15626 at commit cb54aa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 2, 2016

Test build #68010 has finished for PR 15626 at commit dabb628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

zsxwing commented Nov 9, 2016

retest this please

Member

zsxwing commented Nov 9, 2016

LGTM. Pending tests.


SparkQA commented Nov 9, 2016

Test build #68417 has finished for PR 15626 at commit dabb628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

zsxwing commented Nov 9, 2016

Thanks! Merging to master and 2.1.

asfgit pushed a commit that referenced this pull request Nov 9, 2016
## What changes were proposed in this pull request?

Currently we use java serialization for the WAL that stores the offsets contained in each batch. This has two main issues:
It can break across spark releases (though this is not the only thing preventing us from upgrading a running query)
It is unnecessarily opaque to the user.
I'd propose we require offsets to provide a user readable serialization and use that instead. JSON is probably a good option.
## How was this patch tested?

Tests were added for KafkaSourceOffset in [KafkaSourceOffsetSuite](external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceOffsetSuite.scala) and for LongOffset in [OffsetSuite](sql/core/src/test/scala/org/apache/spark/sql/streaming/OffsetSuite.scala)

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

zsxwing marmbrus

Author: Tyson Condie <tcondie@gmail.com>
Author: Tyson Condie <tcondie@clash.local>

Closes #15626 from tcondie/spark-8360.

(cherry picked from commit 3f62e1b)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit asfgit closed this in 3f62e1b Nov 9, 2016
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 15, 2016
This PR adds a new method `withWatermark` to the `Dataset` API, which can be used specify an _event time watermark_.  An event time watermark allows the streaming engine to reason about the point in time after which we no longer expect to see late data.  This PR also has augmented `StreamExecution` to use this watermark for several purposes:
  - To know when a given time window aggregation is finalized and thus results can be emitted when using output modes that do not allow updates (e.g. `Append` mode).
  - To minimize the amount of state that we need to keep for on-going aggregations, by evicting state for groups that are no longer expected to change.  Although, we do still maintain all state if the query requires (i.e. if the event time is not present in the `groupBy` or when running in `Complete` mode).

An example that emits windowed counts of records, waiting up to 5 minutes for late data to arrive.
```scala
df.withWatermark("eventTime", "5 minutes")
  .groupBy(window($"eventTime", "1 minute") as 'window)
  .count()
  .writeStream
  .format("console")
  .mode("append") // In append mode, we only output finalized aggregations.
  .start()
```

### Calculating the watermark.
The current event time is computed by looking at the `MAX(eventTime)` seen this epoch across all of the partitions in the query minus some user defined _delayThreshold_.  An additional constraint is that the watermark must increase monotonically.

Note that since we must coordinate this value across partitions occasionally, the actual watermark used is only guaranteed to be at least `delay` behind the actual event time.  In some cases we may still process records that arrive more than delay late.

This mechanism was chosen for the initial implementation over processing time for two reasons:
  - it is robust to downtime that could affect processing delay
  - it does not require syncing of time or timezones between the producer and the processing engine.

### Other notable implementation details
 - A new trigger metric `eventTimeWatermark` outputs the current value of the watermark.
 - We mark the event time column in the `Attribute` metadata using the key `spark.watermarkDelay`.  This allows downstream operations to know which column holds the event time.  Operations like `window` propagate this metadata.
 - `explain()` marks the watermark with a suffix of `-T${delayMs}` to ease debugging of how this information is propagated.
 - Currently, we don't filter out late records, but instead rely on the state store to avoid emitting records that are both added and filtered in the same epoch.

### Remaining in this PR
 - [ ] The test for recovery is currently failing as we don't record the watermark used in the offset log.  We will need to do so to ensure determinism, but this is deferred until apache#15626 is merged.

### Other follow-ups
There are some natural additional features that we should consider for future work:
 - Ability to write records that arrive too late to some external store in case any out-of-band remediation is required.
 - `Update` mode so you can get partial results before a group is evicted.
 - Other mechanisms for calculating the watermark.  In particular a watermark based on quantiles would be more robust to outliers.

Author: Michael Armbrust <michael@databricks.com>

Closes apache#15702 from marmbrus/watermarks.
asfgit pushed a commit that referenced this pull request Nov 15, 2016
Author: Michael Armbrust <michael@databricks.com>

Closes #15702 from marmbrus/watermarks.

(cherry picked from commit c071878)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Tyson Condie <tcondie@gmail.com>
Author: Tyson Condie <tcondie@clash.local>

Closes apache#15626 from tcondie/spark-8360.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Michael Armbrust <michael@databricks.com>

Closes apache#15702 from marmbrus/watermarks.