[SPARK-49411][SS] Communicate State Store Checkpoint ID between driver and stateful operators #47895

siying · 2024-08-27T18:32:56Z

What changes were proposed in this pull request?

This is an incremental step to implement RocksDB state store checkpoint format V2.

Once the conf STATE_STORE_CHECKPOINT_FORMAT_VERSION is set to version 2 or higher, the executor returns the checkpointID to the driver (only done for RocksDB). The driver stores it locally. For the next batch, the State Store Checkpoint ID is sent to the executor to be used to load the state store. If the state loaded locally on the executor doesn't match the uniqueID, it will reload from the checkpoint.

There is no behavior change if the default checkpoint format is used.
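The round trip described above can be sketched as a small simulation. This is only an illustration under stated assumptions: `FakeStore` and `Driver` are hypothetical stand-ins, not classes from this PR, and the reload rule is a simplified version of the condition in the RocksDB diff.

```scala
import java.util.UUID
import scala.collection.mutable

object CheckpointIdRoundTrip {
  // Hypothetical stand-in for an executor-side state store instance.
  final class FakeStore {
    private var loadedVersion: Long = -1L
    private var loadedId: Option[String] = None
    var reloads: Int = 0

    // Reload from the checkpoint if the version or the expected ID differs
    // from what is currently loaded locally.
    def load(version: Long, expectedId: Option[String]): Unit = {
      val idMismatch = expectedId.isDefined && expectedId != loadedId
      if (loadedVersion != version || idMismatch) {
        reloads += 1
        loadedVersion = version
        loadedId = expectedId
      }
    }

    // Commit the next version and return the fresh ID reported to the driver.
    def commit(): String = {
      val id = UUID.randomUUID().toString
      loadedVersion += 1
      loadedId = Some(id)
      id
    }
  }

  // Hypothetical driver-side bookkeeping: version -> ID it was committed with.
  final class Driver {
    private val idsByVersion = mutable.Map[Long, String]()
    def record(version: Long, id: String): Unit = { idsByVersion(version) = id }
    def idFor(version: Long): Option[String] = idsByVersion.get(version)
  }
}
```

In this sketch, a store that already holds the version and ID the driver asks for skips the reload, while an ID coming from a different task attempt forces one.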

Why are the changes needed?

This is an incremental step in the project to introduce a new RocksDB State Store checkpoint format. The new format is meant to simplify the checkpoint mechanism, making it less bug-prone, and fix some unexpected query results in rare queries.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test is added to cover the format version. Another unit test is added to validate that the uniqueID is passed back and forth as expected.

Was this patch authored or co-authored using generative AI tooling?

No

WweiL · 2024-08-29T18:42:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

-    val isFirstBatch: Boolean)
+    val isFirstBatch: Boolean,
+    val currentCheckpointUniqueId:
+      MutableMap[Long, Array[String]] = MutableMap[Long, Array[String]]())


Can we add comments on what these unique IDs map to? I believe the key is the operator ID?

also better name it currentStateUniqueId as it is only related to state store not general checkpoint

I'm also confused by this. When I sketched an implementation of your proposal in my head, my assumption would be that IncrementalExecution would get just an ID, perhaps a single Long, that would correspond to the ID that it would bake into the physical plan sent to executors. So why is a map needed?

I'll add a comment, but it is basically operatorID->partitionID->checkpointID
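A minimal sketch of the layout described in the reply, assuming the `MutableMap[Long, Array[String]]` type from the diff above, with the array index playing the role of the partition ID. The `lookup` helper is hypothetical and only shows how the two levels would be addressed.

```scala
import scala.collection.mutable.{Map => MutableMap}

object CheckpointIdLayout {
  // operatorId -> (array index = partitionId -> checkpoint ID)
  val ids: MutableMap[Long, Array[String]] = MutableMap[Long, Array[String]]()

  // Hypothetical helper: ID for one (operator, partition) pair, if known.
  def lookup(operatorId: Long, partitionId: Int): Option[String] =
    ids.get(operatorId).flatMap(_.lift(partitionId))
}
```

The outer map is keyed by operator ID, and the inner array relies on partition IDs being dense, which is why no second map is needed.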

WweiL · 2024-08-29T18:48:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+  private def updateCheckpointId(
+      execCtx: MicroBatchExecutionContext,
+      latestExecPlan: SparkPlan): Unit = {
+    // This function cannot handle MBP now.


unnecessary comment

WweiL · 2024-08-29T22:25:41Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

-      if (loadedVersion != version) {
+      if (loadedVersion != version ||
+        (checkpointFormatVersion >= 2 && checkpointUniqueId.isDefined &&
+        (!loadedCheckpointId.isDefined || checkpointUniqueId.get != loadedCheckpointId.get))) {


nit: loadedCheckpointId.isEmpty
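For illustration, the ID part of the condition can be written in a few equivalent ways. The sketch below uses plain `Option[String]` parameters instead of the fields in `RocksDB.scala`; the function names are hypothetical.

```scala
object ReloadCondition {
  // ID part of the condition as written in the diff.
  def asWritten(expected: Option[String], loaded: Option[String]): Boolean =
    expected.isDefined && (!loaded.isDefined || expected.get != loaded.get)

  // Same condition with the suggested `isEmpty`.
  def withIsEmpty(expected: Option[String], loaded: Option[String]): Boolean =
    expected.isDefined && (loaded.isEmpty || expected.get != loaded.get)

  // A more compact equivalent that avoids calling `.get`.
  def compact(expected: Option[String], loaded: Option[String]): Boolean =
    expected.exists(id => !loaded.contains(id))
}
```

All three return true exactly when an ID is expected and the loaded ID is either missing or different.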

WweiL · 2024-08-30T18:01:57Z

...scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala

+          .agg(count("*"))
+          .as[(Int, Long)]
+
+      // Run the stream with changelog checkpointing disabled.


WweiL · 2024-09-01T18:32:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+  // Store checkpointIDs for state store checkpoints to be committed or have been committed to
+  // the commit log.
+  // operatorID -> (partitionID -> uniqueID)
+  private val currentCheckpointUniqueId = MutableMap[Long, Array[String]]()


Maybe this is better to be put into the stream execution context

operatorID -> (partitionID -> uniqueID), is this supposed to mean a map of maps? If so, then why is the type of currentCheckpointUniqueId just a single map?

I also don't fully understand why we would need a unique map for every operator X partition. Why is it not sufficient to have the following protocol, where we have one unique ID for every batch:

For the first batch, an ID is created and sent to all executors. When all tasks finish, that ID is persisted into the commit log. It is also kept in memory for the subsequent batch.

For any other batch, if there does not exist an ID in memory from the previous batch, then it must be read from the commit log and brought into memory. (This is the restart case.)

Then, using the ID in memory from the previous batch (call that prevId), this is sent to all executors in the physical plan, as well as a new ID for the current batch (call this currId). Before any processing starts, executors must load and use the state for prevId to process the current batch. Then, they can start processing, and they upload their state as <state file name>_currId.<changelog|snapshot>.

What's wrong with that?

Right now, the uniqueID is generated in the executor. As a potential optimization, the driver can send a uniqueID to all executors, but executors still need to modify it to make it unique among all attempts of the same task. After doing that, there is no longer a single ID shared by all partitions, so we need different IDs per partition.
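A small sketch of the point made in this reply, under the assumption that an executor would derive its ID by appending a random suffix to a driver-provided one. `attemptId` is a hypothetical helper; the PR itself generates the ID entirely on the executor.

```scala
import java.util.UUID

object PerAttemptIds {
  // Hypothetical: make a driver-provided ID unique per task attempt.
  def attemptId(driverId: String): String =
    s"$driverId-${UUID.randomUUID()}"
}
```

Two attempts of the same task end up with different IDs even though they share the driver's prefix, so the driver still has to learn which ID won for each partition.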

WweiL · 2024-09-03T21:13:51Z

...rc/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala

    try {
      if (version < 0) {
        throw QueryExecutionErrors.unexpectedStateStoreVersion(version)
      }
-      rocksDB.load(version, true)
+      rocksDB.load(version, uniqueId, true)


rocksDB.load(
  version,
  if (storeConf.stateStoreCheckpointFormatVersion >= 2) uniqueId else None)

WweiL · 2024-09-08T19:23:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

+  @volatile private var LastCommitBasedCheckpointId: Option[String] = None
+  @volatile private var lastCommittedCheckpointId: Option[String] = None
+  @volatile private var loadedCheckpointId: Option[String] = None
+  @volatile private var sessionCheckpointId: Option[String] = None


Should reset these to None in rollback()

neilramaswamy · 2024-09-10T01:53:36Z

fix some unexpected query results in rare queries

@siying can you provide some context about which situations these are specifically?

(Edit, seems to be here in the design doc.)

neilramaswamy

Going to stop reviewing since I have a few fundamental questions regarding the protocol.

neilramaswamy · 2024-09-11T18:40:55Z

...g/apache/spark/sql/execution/datasources/v2/state/StreamStreamJoinStatePartitionReader.scala

@@ -105,7 +105,7 @@ class StreamStreamJoinStatePartitionReader(
      val stateInfo = StatefulOperatorStateInfo(
        partition.sourceOptions.stateCheckpointLocation.toString,
        partition.queryId, partition.sourceOptions.operatorId,
-        partition.sourceOptions.batchId + 1, -1)
+        partition.sourceOptions.batchId + 1, -1, None)


Why is this None? I would imagine that users of the state data source reader now have to specify the id that they would like to read, given that state stores are now not uniquely identified by operator/partition/name, but by id/operator/partition/name?

Good point. Will check.

Any update here?

Any update here?

@neilramaswamy here, we don't know the checkpointID. We would know the ID after we persist to the commit log. But now it is just like the first time we restart the query -- we don't know it. I can leave a TODO.

how do we load the previous state store correctly in this case then in a stream restart?

The code needs to change after we persist the ID to the commit log. The ID needs to be read from the commit log and passed in here. For now, we can say the state store reader isn't supported in this new mode (it's likely working accidentally, but it's not worth even testing it). There is already a TODO comment above.

neilramaswamy · 2024-09-11T18:43:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

-    val isFirstBatch: Boolean)
+    val isFirstBatch: Boolean,
+    val currentCheckpointUniqueId:
+      MutableMap[Long, Array[String]] = MutableMap[Long, Array[String]]())


I'm also confused by this. When I sketched an implementation of your proposal in my head, my assumption would be that IncrementalExecution would get just an ID, perhaps a single Long, that would correspond to the ID that it would bake into the physical plan sent to executors. So why is a map needed?

neilramaswamy · 2024-09-11T18:49:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+  // Store checkpointIDs for state store checkpoints to be committed or have been committed to
+  // the commit log.
+  // operatorID -> (partitionID -> uniqueID)
+  private val currentCheckpointUniqueId = MutableMap[Long, Array[String]]()


operatorID -> (partitionID -> uniqueID), is this supposed to mean a map of maps? If so, then why is the type of currentCheckpointUniqueId just a single map?

I also don't fully understand why we would need a unique map for every operator X partition. Why is it not sufficient to have the following protocol, where we have one unique ID for every batch:

For the first batch, an ID is created and sent to all executors. When all tasks finish, that ID is persisted into the commit log. It is also kept in memory for the subsequent batch.

For any other batch, if there does not exist an ID in memory from the previous batch, then it must be read from the commit log and brought into memory. (This is the restart case.)

Then, using the ID in memory from the previous batch (call that prevId), this is sent to all executors in the physical plan, as well as a new ID for the current batch (call this currId). Before any processing starts, executors must load and use the state for prevId to process the current batch. Then, they can start processing, and they upload their state as <state file name>_currId.<changelog|snapshot>.

What's wrong with that?
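The one-ID-per-batch protocol proposed above could be sketched roughly as follows. This only illustrates the reviewer's proposal, not what the PR implements; `uploadName` and `resolvePrevId` are hypothetical names.

```scala
object BatchIdProtocol {
  sealed trait Kind { def ext: String }
  case object Changelog extends Kind { val ext = "changelog" }
  case object Snapshot extends Kind { val ext = "snapshot" }

  // <state file name>_currId.<changelog|snapshot>
  def uploadName(stateFileName: String, currId: String, kind: Kind): String =
    s"${stateFileName}_$currId.${kind.ext}"

  // prevId comes from memory when available; otherwise it is read from the
  // commit log, which corresponds to the restart case.
  def resolvePrevId(
      inMemory: Option[String],
      fromCommitLog: => Option[String]): Option[String] =
    inMemory.orElse(fromCommitLog)
}
```

The commit log is consulted lazily here, so it would only be read when no in-memory ID exists.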

neilramaswamy · 2024-09-16T22:26:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

+    val ret = StatefulOperatorStateInfo(
      checkpointLocation,
      runId,
-      statefulOperatorId.getAndIncrement(),
+      operatorId,
      currentBatchId,
-      numStateStores)
+      numStateStores,
+      currentCheckpointUniqueId.get(operatorId))
+    ret


ret is not needed

neilramaswamy · 2024-09-16T22:42:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+      case e: StreamingDeduplicateWithinWatermarkExec =>
+        assert(e.stateInfo.isDefined)
+        updateCheckpointIdForOperator(execCtx, e.stateInfo.get.operatorId, e.getCheckpointInfo())
+      // TODO Need to deal with FlatMapGroupsWithStateExec, TransformWithStateExec,


Why not?

And I also don't see why we need to enumerate all of these here. Can we leverage the StatefulOperator trait and use that to get the state info? It should clean this up quite a bit.

You will, though, probably have to do some work to make sure that getCheckpointInfo can be called for any stateful operator.
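As a rough sketch of the suggestion, matching on a shared trait removes the need to list each operator. The plan types below are simplified, hypothetical stand-ins and not Spark's real `SparkPlan` or `StateStoreWriter`.

```scala
object StatefulOperatorCollect {
  // Hypothetical, simplified plan nodes.
  sealed trait PlanNode { def children: Seq[PlanNode] }

  trait StateStoreWriterLike extends PlanNode {
    def operatorId: Long
    def getCheckpointInfo(): Seq[String]
  }

  final case class Dedup(operatorId: Long, ids: Seq[String], children: Seq[PlanNode])
    extends StateStoreWriterLike { def getCheckpointInfo(): Seq[String] = ids }

  final case class Agg(operatorId: Long, ids: Seq[String], children: Seq[PlanNode])
    extends StateStoreWriterLike { def getCheckpointInfo(): Seq[String] = ids }

  final case class Project(children: Seq[PlanNode]) extends PlanNode

  // One case covers every stateful operator that implements the trait.
  def collectIds(plan: PlanNode): Map[Long, Seq[String]] = {
    val here = plan match {
      case w: StateStoreWriterLike => Map(w.operatorId -> w.getCheckpointInfo())
      case _ => Map.empty[Long, Seq[String]]
    }
    plan.children.foldLeft(here)((acc, child) => acc ++ collectIds(child))
  }
}
```

A new stateful operator would be picked up automatically as long as it implements the trait.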

neilramaswamy · 2024-09-16T22:44:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

-    watermarkTracker.updateWatermark(execCtx.executionPlan.executedPlan)
+    val latestExecPlan = execCtx.executionPlan.executedPlan
+    watermarkTracker.updateWatermark(latestExecPlan)
+    if (sparkSession.sessionState.conf.stateStoreCheckpointFormatVersion >= 2) {


I don't really like the >= 2 sprinkled everywhere. Can you define a constant somewhere, and then have a utility method that you call

+1.

I'd introduce a StreamingCheckpointProtocolVersion object or something and then add utility methods like:

def supportsStateCheckpointIds

I created StatefulOperatorStateInfo.enableStateStoreCheckpointIds() after Neil's comment. This is a leftover. Will switch.
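A minimal sketch of the idea of keeping the version check in one place. The method in the PR takes a `SQLConf`; this self-contained version takes the format version as an `Int`, and the constant name is hypothetical.

```scala
object CheckpointFormat {
  // Hypothetical constant: first format version that uses checkpoint IDs.
  val MinVersionWithCheckpointIds: Int = 2

  // Single place that encodes the `>= 2` rule.
  def enableStateStoreCheckpointIds(formatVersion: Int): Boolean =
    formatVersion >= MinVersionWithCheckpointIds
}
```

Call sites then ask a named question instead of repeating the numeric comparison.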

neilramaswamy · 2024-09-16T22:48:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

-    val isFirstBatch: Boolean)
+    val isFirstBatch: Boolean,
+    val currentCheckpointUniqueId:
+      MutableMap[Long, Array[String]] = MutableMap[Long, Array[String]]())


Is it always true that partition IDs are always [0, numPartitions)?

Yes it is true.

neilramaswamy · 2024-09-16T23:19:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+    })
+  }
+
+  private def updateCheckpointId(


Let me make sure I understand the flow here:

Micro-batch ends, we call updateCheckpointId

This goes through every stateful operator and calls updateCheckpointIdForOperator

For each operator, we call into its getCheckpointInfo method

That method will access the checkpointInfoAccumulator

The checkpointInfoAccumulator is appended to using the unique ID from the state store after processing all data on the task

In the future, we'll write this to the commit log.

Is this right?

That's right. I should write a comment somewhere.
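The flow confirmed above can be sketched with an in-memory buffer standing in for the accumulator. All names here are hypothetical, and the "first report wins" rule mirrors the `values.head` logic in the PR only loosely.

```scala
import scala.collection.mutable

object CheckpointIdFlow {
  final case class CkptInfo(partitionId: Int, batchVersion: Long, ckptId: String)

  // Stand-in for the checkpointInfoAccumulator.
  val accumulator: mutable.ArrayBuffer[CkptInfo] = mutable.ArrayBuffer[CkptInfo]()

  // Executor side: report the ID once a task has processed all its data.
  def taskFinished(info: CkptInfo): Unit = accumulator += info

  // Driver side, at the end of the micro-batch: one ID per partition,
  // keeping the first report for each partition.
  def idsByPartition(numPartitions: Int): Array[Option[String]] = {
    val first = accumulator
      .groupBy(_.partitionId)
      .map { case (p, infos) => p -> infos.head.ckptId }
    Array.tabulate(numPartitions)(p => first.get(p))
  }
}
```

Partitions that never reported show up as `None`, which is the state the driver would be in before the first batch commits.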

neilramaswamy · 2024-09-16T23:25:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

@@ -803,6 +843,14 @@ class RocksDB(
  /** Get the write buffer manager and cache */
  def getWriteBufferManagerAndCache(): (WriteBufferManager, Cache) = (writeBufferManager, lruCache)

+  def getLatestCheckpointInfo(partitionId: Int): StateStoreCheckpointInfo = {


Will this ever be called if lastCommittedCheckpointId is None or LastCommitBasedCheckpointId is None?

This will always be called. The caller has no knowledge on what's going on there.

can you add scaladocs please

neilramaswamy · 2024-09-16T23:28:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

+  // variables to manage checkpoint ID. Once a checkpoingting finishes, it nees to return
+  // the `lastCommittedCheckpointId` as the committed checkpointID, as well as
+  // `LastCommitBasedCheckpointId` as the checkpontID of the previous version that is based on.
+  // `loadedCheckpointId` is the checkpointID for the current live DB. After the batch finishes
+  // and checkpoint finishes, it will turn into `LastCommitBasedCheckpointId`.
+  // `sessionCheckpointId` store an ID to be used for future checkpoints. It is kept being used
+  // until we have to use a new one. We don't need to reuse any uniqueID, but reusing when possible
+  // can help debug problems.
+  @volatile private var LastCommitBasedCheckpointId: Option[String] = None
+  @volatile private var lastCommittedCheckpointId: Option[String] = None
+  @volatile private var loadedCheckpointId: Option[String] = None
+  @volatile private var sessionCheckpointId: Option[String] = None


We never read sessionCheckpointId and the comment doesn't really help me. What is it being used for?

Is there a reason LastCommitBasedCheckpointId is capitalized? And LastCommitBasedCheckpointId isn't even used in this PR since there is another TODO that says // TODO validate baseCheckpointId? Is that right?

neilramaswamy · 2024-09-16T23:37:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

+  @volatile private var LastCommitBasedCheckpointId: Option[String] = None
+  @volatile private var lastCommittedCheckpointId: Option[String] = None
+  @volatile private var loadedCheckpointId: Option[String] = None
+  @volatile private var sessionCheckpointId: Option[String] = None


Can you comment specifically why these are marked as volatile? From what I can tell, these are only read/written to by the query execution thread.

neilramaswamy · 2024-09-16T23:37:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala

+    partitionId: Int,
+    batchVersion: Long,
+    checkpointId: Option[String],
+    baseCheckpointId: Option[String])


We call this checkpointId in some places and baseCheckpointId in others? Can you clarify which is which, and what specifically it should be here?

neilramaswamy · 2024-09-16T23:50:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

+      .map {
+        case (key, values) => key -> values.head
+      }


This list would be non-zero only if there was a task retry/speculative execution, right?

And as discussed earlier today offline, this has the issue of not working if the same partition has multiple state stores, e.g. in a stream-stream join, which is actually a very serious issue.

if there was a task retry/speculative execution

Also if there is a fan-out in foreachBatch, i.e. df.write.save() executed twice
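To make the concern concrete, the sketch below shows how grouping by partition and taking `head` drops IDs when one partition reports several state stores. `Report` and the store names are hypothetical.

```scala
object HeadPerPartition {
  final case class Report(partitionId: Int, storeName: String, ckptId: String)

  // Mirrors the `key -> values.head` shape: one ID per partition.
  def headPerPartition(reports: Seq[Report]): Map[Int, String] =
    reports.groupBy(_.partitionId).map { case (p, rs) => p -> rs.head.ckptId }

  // Keying by (partition, store) keeps one ID per state store instead.
  def perStore(reports: Seq[Report]): Map[(Int, String), String] =
    reports
      .groupBy(r => (r.partitionId, r.storeName))
      .map { case (k, rs) => k -> rs.head.ckptId }
}
```

With four stores reporting for partition 0, the first function keeps one ID and silently loses the other three.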

neilramaswamy

High-level ideas look good, nothing super fundamental. Some clarity, testing, and question comments.

neilramaswamy · 2024-09-27T22:21:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala

@@ -190,6 +190,11 @@ trait StateStore extends ReadStateStore {
  /** Current metrics of the state store */
  def metrics: StateStoreMetrics

+  /** Return information on recently generated checkpoints */
+  def getCheckpointInfo: StateStoreCheckpointInfo = {
+    StateStoreCheckpointInfo(-1, -1, None, None)


Why default implementation? If all the sub-classes are overriding it, let's just make it required with no default.

+1, it could lead to bugs if this is incorrect, right? I'd remove a default implementation in such a case (it may require changes in tests I guess, but that can be handled with a trait or something)

neilramaswamy · 2024-09-27T22:28:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala

+    // The checkpoint ID for a checkpoint at `batchVersion`. This is used to identify the checkpoint
+    checkpointId: Option[String],


If we use a String, we need to mention that it's not necessarily one checkpoint ID. It could be many, comma-separated.

But to be honest, I don't think we should be using String here, because it's ambiguous. Is it 1 checkpoint? 4 checkpoints? You cannot simply tell by looking at the code. The naming is also off in the case of multiple checkpoints; it's StateStore*s*CheckpointInfo.

I think it makes more sense for us to return, all the way through the accumulator, a Seq[String]. Then, the only place that the merging should happen is inside of def getCheckpointInfo inside of StateStoreWriter. This saves us from awkwardly having one-off merging logic inside of the s/s join, even though I know it's the only place.
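A sketch of the difference between the two representations, with hypothetical names. With a comma-joined `String` the type does not say how many IDs are inside; with `Seq[String]` it does, and merging becomes plain concatenation.

```scala
object CkptIdEncoding {
  // Comma-joined encoding: the number of IDs is hidden inside the string.
  def encode(ids: Seq[String]): String = ids.mkString(",")

  final case class Info(partitionId: Int, batchVersion: Long, ckptIds: Seq[String])

  // Merge infos of several state stores of the same partition and version.
  def merge(infos: Seq[Info]): Info = {
    require(infos.nonEmpty, "nothing to merge")
    require(
      infos.map(i => (i.partitionId, i.batchVersion)).distinct.size == 1,
      "all infos must share partition ID and batch version")
    Info(infos.head.partitionId, infos.head.batchVersion, infos.flatMap(_.ckptIds))
  }
}
```

Keeping a single `merge` like this in one place is roughly what the comment asks for, as opposed to join-specific merging logic.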

neilramaswamy · 2024-09-27T22:30:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+        execCtx.batchId == -1 || v == execCtx.batchId + 1,
+        s"version $v doesn't match current Batch ID ${execCtx.batchId}")


I don't understand the assertion here. We say v == batchId + 1 and then assert that v must match batchId?

I can rephrase it, but batch n commits to state store version n+1.
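One way the assertion could be rephrased so that the message states the off-by-one relationship explicitly. This is a sketch with hypothetical helper names, using plain `Long` values in place of `execCtx.batchId`.

```scala
object VersionCheck {
  // Batch n commits to state store version n + 1.
  def expectedStateVersion(batchId: Long): Long = batchId + 1

  // A batch ID of -1 means no batch has run yet, so any version is accepted.
  def matches(batchId: Long, version: Long): Boolean =
    batchId == -1 || version == expectedStateVersion(batchId)

  def check(batchId: Long, version: Long): Unit =
    assert(
      matches(batchId, version),
      s"state store version $version should be batch ID $batchId + 1")
}
```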

neilramaswamy · 2024-09-27T22:31:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

@@ -72,7 +72,8 @@ class RocksDB(
    localRootDir: File = Utils.createTempDir(),
    hadoopConf: Configuration = new Configuration,
    loggingId: String = "",
-    useColumnFamilies: Boolean = false) extends Logging {
+    useColumnFamilies: Boolean = false,
+    ifEnableCheckpointId: Boolean = false) extends Logging {


More consistent to call it enableStateStoreCheckpointIds.

I also think that the term "checkpoint ID" is very confusing. The term makes it feel like it's an ID for an entire checkpoint, when really it's an ID for a particular state store that has been checkpointed.

I know it's a tedious modification to make. I would be happy to alleviate some of this work by creating a branch with that change and putting up a PR that you can merge back in this branch.

Do you have a suggestion for a better name? I can definitely change it.

State store checkpoint ID?

neilramaswamy · 2024-09-27T22:33:09Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

@@ -808,6 +824,45 @@ object SymmetricHashJoinStateManager {
    result
  }

+  def mergeStateStoreCheckpointInfo(


I already commented about this elsewhere (that it shouldn't be in the symmetric hash join state manager), but this was confusing to read because it is used in two places:

To merge the key with index state store with the key with index to value state store

To merge the results from step (1) for both the left and the right into one result

neilramaswamy · 2024-09-27T22:38:52Z

...scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala

+  testWithChangelogCheckpointingEnabled(
+    s"checkpointFormatVersion2 validate ID with dedup and groupBy") {
+    val providerClassName = classOf[TestStateStoreProviderWrapper].getCanonicalName
+    TestStateStoreWrapper.clear()


All of these can be refactored into a beforeEach in the class

neilramaswamy · 2024-09-27T22:39:08Z

...scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala

+      (SQLConf.STATE_STORE_CHECKPOINT_FORMAT_VERSION.key -> "2"),
+      (SQLConf.SHUFFLE_PARTITIONS.key, "2")) {
+      val checkpointDir = Utils.createTempDir().getCanonicalFile
+      checkpointDir.delete()


Why do you need to delete this? And why not use withTempDir?

That's a good point. I'll do it. I copy&pasted from a previous test without thinking.

neilramaswamy · 2024-09-27T22:42:16Z

...scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala

@@ -222,6 +375,456 @@ class RocksDBStateStoreIntegrationSuite extends StreamTest
    }
  }

+  testWithChangelogCheckpointingEnabled(s"checkpointFormatVersion2") {


From what I can tell, none of the new suites that were added cover the edge case in the design doc, right? There's no speculative execution here.

I think what you could do is create new manual StateStores that simulate the race here, without needing to write a query that does this. Right?

neilramaswamy · 2024-09-27T22:44:08Z

...scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala

+    assert(checkpointInfoList.size == 12)
+    checkpointInfoList.foreach { l =>
+      assert(l.checkpointId.isDefined)
+      if (l.batchVersion == 2 || l.batchVersion == 4 || l.batchVersion == 5) {


Sorry, I don't follow this. Why are we just checking these specific batchVersions? Shouldn't all of them, 0 to 5 inclusive, be present?

neilramaswamy · 2024-09-27T22:45:34Z

...scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala

+    for {
+      a <- checkpointInfoList
+      b <- checkpointInfoList
+      if a.partitionId == b.partitionId && a.batchVersion == b.batchVersion + 1
+    } {
+      // if batch version exists, it should be the same as the checkpoint ID of the previous batch
+      assert(!a.baseCheckpointId.isDefined || b.checkpointId == a.baseCheckpointId)
+    }


This can definitely be refactored; you're using the same code snippet in all tests? Seems like a StateStoreCheckpointIdTestUtils could be good.
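A possible shape for such a shared helper, reusing the name suggested in the comment. `Info` is a simplified stand-in for `StateStoreCheckpointInfo`, and `brokenLinks` is a hypothetical function that returns the offending pairs rather than asserting inline.

```scala
object StateStoreCheckpointIdTestUtils {
  final case class Info(
      partitionId: Int,
      batchVersion: Long,
      checkpointId: Option[String],
      baseCheckpointId: Option[String])

  // Pairs (previous, current) where current's base ID is defined but does
  // not equal the previous version's checkpoint ID for the same partition.
  def brokenLinks(infos: Seq[Info]): Seq[(Info, Info)] =
    for {
      a <- infos
      b <- infos
      if a.partitionId == b.partitionId && a.batchVersion == b.batchVersion + 1
      if a.baseCheckpointId.isDefined && a.baseCheckpointId != b.checkpointId
    } yield (b, a)
}
```

Each test could then assert that `brokenLinks(checkpointInfoList)` is empty.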

neilramaswamy

Assuming we have much stronger testing in the future, I'd be ok merging this.

neilramaswamy · 2024-10-07T17:26:19Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

+      stateStoreCkptIds(0), stateStoreCkptIds(1), skippedNullValueCount)
    val rightSideJoiner = new OneSideHashJoiner(
      RightSide, right.output, rightKeys, rightInputIter,
      condition.rightSideOnly, postJoinFilter, stateWatermarkPredicates.right, partitionId,
-      skippedNullValueCount)
+      stateStoreCkptIds(2), stateStoreCkptIds(3), skippedNullValueCount)


This seems fragile, but I guess it's not a merge blocker.

I can definitely first deserialize to a case class, if you think that's better. Or do you think we should serialize into the checkpointInfo itself? I feel like it might be over-engineering, considering that the long term direction is probably to merge the 4 state stores into one.
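The case-class option mentioned in this reply might look like the sketch below. The type names are hypothetical; the ordering follows the diff above, where indices 0 and 1 go to the left joiner and 2 and 3 to the right.

```scala
object JoinCkptIds {
  final case class SideIds(keyToNumValues: String, keyWithIndexToValue: String)
  final case class JoinIds(left: SideIds, right: SideIds)

  // Decode the positional array once, so call sites use names, not indices.
  def fromArray(ids: Array[String]): JoinIds = {
    require(ids.length == 4, s"expected 4 state store checkpoint IDs, got ${ids.length}")
    JoinIds(
      left = SideIds(keyToNumValues = ids(0), keyWithIndexToValue = ids(1)),
      right = SideIds(keyToNumValues = ids(2), keyWithIndexToValue = ids(3)))
  }
}
```

The length check also turns a silent mis-indexing into an immediate failure.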

neilramaswamy · 2024-10-07T17:26:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

+  // we have to use a new one. We have to update `sessionStateStoreCkptId` if we reload a previous
+  // batch version, because we have to use a new checkpointID for re-committing a version.
+  // The reusing is to help debugging but is not required for the algorithm to work.
+  private var LastCommitBasedStateStoreCkptId: Option[String] = None


I still don't understand why this is capitalized. I think we also ought to write down the threading model here. Who can read these? Who can write them? If there is concurrent access, what synchronizes access?

Also this comment has several typos, e.g. checkpoingting and nees, etc.

Good catch. I'll fix them.

brkyvz

Left mostly small comments. I would recommend leaving out the accumulator usage from this PR, as the correctness of that code path is somewhat dubious.

brkyvz · 2024-10-08T20:02:17Z

...ore/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala

@@ -243,6 +243,7 @@ trait FlatMapGroupsWithStateExecBase
            stateManager.stateSchema,
            NoPrefixKeyStateEncoderSpec(groupingAttributes.toStructType),
            stateInfo.get.storeVersion,
+            stateInfo.get.getStateStoreCkptId(partitionId).map(_(0)),


uber nit: this would seem too magical for new Scala learners. Can we write this out as _.get(0) if this is an array or Seq?

I am a newbie in Scala. I checked, but there is no get() on a Scala array. There is .apply(), but it is even more confusing to me. I'll replace those (0) with .head. For the J&J case, I think (0), (1), etc. look OK.
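For reference, the spellings discussed here behave the same on a non-empty array. This is a standalone sketch with a made-up value.

```scala
object ApplyVsHead {
  val ids: Option[Array[String]] = Some(Array("id-0"))

  val viaPlaceholder: Option[String] = ids.map(_(0))      // sugar for apply(0)
  val viaApply: Option[String] = ids.map(_.apply(0))      // explicit apply
  val viaHead: Option[String] = ids.map(_.head)           // first element
  val viaHeadOption: Option[String] = ids.flatMap(_.headOption) // None if empty
}
```

The first three all throw on an empty array, while `headOption` returns `None` instead.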

brkyvz · 2024-10-08T20:06:47Z

...g/apache/spark/sql/execution/datasources/v2/state/StreamStreamJoinStatePartitionReader.scala

@@ -105,7 +105,7 @@ class StreamStreamJoinStatePartitionReader(
      val stateInfo = StatefulOperatorStateInfo(
        partition.sourceOptions.stateCheckpointLocation.toString,
        partition.queryId, partition.sourceOptions.operatorId,
-        partition.sourceOptions.batchId + 1, -1)
+        partition.sourceOptions.batchId + 1, -1, None)


how do we load the previous state store correctly in this case then in a stream restart?

brkyvz · 2024-10-08T20:08:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

      currentBatchId,
-      numStateStores)
+      numStateStores,
+      currentStateStoreCkptId.get(operatorId))


are there any assertions we can add that this isn't empty for a batch after version 0?

The assertion will be more straightforward after we add support for persisting the ID to the commit log. For now, it is also empty when the query has just started. I can leave a comment here, saying we should add an assertion once only batch 0 can be empty.

brkyvz · 2024-10-08T20:11:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

@@ -900,12 +906,46 @@ class MicroBatchExecution(
   */
  protected def markMicroBatchExecutionStart(execCtx: MicroBatchExecutionContext): Unit = {}

+  private def updateStateStoreCkptIdForOperator(


can you please add scala docs for the methods below?

brkyvz · 2024-10-08T20:13:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+    currentStateStoreCkptId.put(opId, checkpointInfo.map { c =>
+      assert(c.stateStoreCkptId.isDefined)
+      c.stateStoreCkptId.get
+    })


nit: inlining like the put side makes the code a bit harder to read. Can you please move this out into a variable?

brkyvz · 2024-10-08T22:32:01Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+      joinCkptInfo.left.keyToNumValues.stateStoreCkptId.map(
+        Array(
+          _,
+          joinCkptInfo.left.valueToNumKeys.stateStoreCkptId.get,
+          joinCkptInfo.right.keyToNumValues.stateStoreCkptId.get,
+          joinCkptInfo.right.valueToNumKeys.stateStoreCkptId.get)),


can you not inline these please?

brkyvz · 2024-10-08T22:32:15Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+    assert(
+      joinCkptInfo.left.keyToNumValues.partitionId == joinCkptInfo.right.keyToNumValues.partitionId)
+    assert(joinCkptInfo.left.keyToNumValues.batchVersion ==
+      joinCkptInfo.right.keyToNumValues.batchVersion)
+    assert(joinCkptInfo.left.keyToNumValues.stateStoreCkptId.isDefined ==
+      joinCkptInfo.right.keyToNumValues.stateStoreCkptId.isDefined)


messages for the assertions please

brkyvz · 2024-10-08T22:32:30Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+    // Stream-stream join has 4 state stores instead of one. So it will generate 4 different
+    // checkpoint IDs. They are translated from each joiners' state store into an array
+    // through mergeStateStoreCheckpointInfo(). This function is used to read it back into
+    // individual state store checkpoint IDs.


can you make it the method scaladoc please

brkyvz · 2024-10-08T22:35:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

+  /**
+   * Aggregator used for the executors to pass new state store checkpoints' IDs to driver.
+   * For the general checkpoint ID workflow, see comments of
+   * class class [[StatefulOperatorStateInfo]]
+   */
+  val checkpointInfoAccumulator: CollectionAccumulator[StatefulOpStateStoreCheckpointInfo] = {
+    SparkContext.getActive.map(_.collectionAccumulator[StatefulOpStateStoreCheckpointInfo]).get


please don't use accumulators for this but prefer an RPC channel. Accumulators can cause some havoc with failed or speculative tasks. Is it possible to remove this part from this PR and have that be a separate PR?

brkyvz · 2024-10-08T22:38:51Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

+      keyToNumValuesStateStoreCkptId,
+      keyWithIndexToValueStateStoreCkptId,


mind using parameter names for these two to prevent accidental ordering issues?

brkyvz

Wanted to leave quick feedback. Still halfway through my pass.

brkyvz · 2024-10-10T18:21:45Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+   * @param execCtx
+   * @param latestExecPlan


you can remove these

@siying missed this comment

brkyvz · 2024-10-10T18:21:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+      latestExecPlan: SparkPlan): Unit = {
+    latestExecPlan.collect {
+      case e: StateStoreWriter =>
+        assert(e.stateInfo.isDefined)


did you forget addressing this?

brkyvz · 2024-10-10T18:26:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

-    watermarkTracker.updateWatermark(execCtx.executionPlan.executedPlan)
+    val latestExecPlan = execCtx.executionPlan.executedPlan
+    watermarkTracker.updateWatermark(latestExecPlan)
+    if (StatefulOperatorStateInfo.enableStateStoreCheckpointIds(sparkSession.sessionState.conf)) {


should you be using sparkSessionForStream here? Otherwise this can change from microbatch to microbatch, which is risky

brkyvz · 2024-10-10T18:30:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala

+case class StateStoreCheckpointInfo(
+    partitionId: Int,
+    batchVersion: Long,
+    // The checkpoint ID for a checkpoint at `batchVersion`. This is used to identify the checkpoint


can you move these above to @param lines in the scaladoc?

brkyvz · 2024-10-10T18:30:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDD.scala

@@ -90,6 +91,7 @@ class ReadStateStoreRDD[T: ClassTag, U: ClassTag](
    val inputIter = dataRDD.iterator(partition, ctxt)
    val store = StateStore.getReadOnly(
      storeProviderId, keySchema, valueSchema, keyStateEncoderSpec, storeVersion,
+      stateStoreCkptIds.map(_(partition.index).head),


nit: _.apply(...).head

brkyvz · 2024-10-10T18:30:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDD.scala

@@ -126,6 +129,7 @@ class StateStoreRDD[T: ClassTag, U: ClassTag](
    val inputIter = dataRDD.iterator(partition, ctxt)
    val store = StateStore.get(
      storeProviderId, keySchema, valueSchema, keyStateEncoderSpec, storeVersion,
+      uniqueId.map(_(partition.index).head),


brkyvz

LGTM!

brkyvz · 2024-10-17T21:33:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

+   * @param execCtx
+   * @param latestExecPlan


@siying missed this comment

brkyvz · 2024-10-17T21:34:57Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

-  private val sparkSessionForStream = sparkSession.cloneSession()
+  protected val sparkSessionForStream = sparkSession.cloneSession()


for the future - I feel like we should refactor these abstractions a bit so that developers cannot misuse the session in the same way again. Today it's too subtle and easy to hit

brkyvz · 2024-10-17T21:37:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

+}
+
+object StatefulOperatorStateInfo {
+  def enableStateStoreCheckpointIds(conf: SQLConf): Boolean = {


docs please
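
A sketch of what the requested documentation could look like. `ConfSketch` is a hypothetical stand-in for `SQLConf`, and the `>= 2` threshold is an assumption based on the "consolidate >=2" commit title in this PR; the description's wording ("higher than version 2") leaves the exact boundary unclear.

```scala
// Hypothetical stand-in for SQLConf, reduced to the one setting used here.
final case class ConfSketch(stateStoreCheckpointFormatVersion: Int)

object StatefulOperatorStateInfoSketch {
  /**
   * Whether state store checkpoint IDs should be exchanged between the driver
   * and stateful operators.
   *
   * Assumption: the feature is on for checkpoint format version 2 and above;
   * the exact threshold in the PR may differ.
   *
   * @param conf the configuration to read the checkpoint format version from
   * @return true if checkpoint IDs are enabled under the assumed threshold
   */
  def enableStateStoreCheckpointIds(conf: ConfSketch): Boolean =
    conf.stateStoreCheckpointFormatVersion >= 2
}
```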

HeartSaVioR · 2024-10-18T04:49:51Z

e4863d6 has passed CI and I don't think 1c380ee would break the CI in any way - even CI for 1c380ee passed the streaming tests.

HeartSaVioR · 2024-10-18T04:51:32Z

I'm merging the PR on behalf of @brkyvz, as he asked me personally. It's also approved by two more contributors (from my team), so I feel OK merging this. Leaving this note as a disclaimer.

HeartSaVioR · 2024-10-18T04:52:30Z

Thanks! Merging to master.

siying · 2024-10-18T21:07:43Z

@HeartSaVioR thank you for your help!

…r and stateful operators

### What changes were proposed in this pull request?

This is an incremental step to implement RocksDB state store checkpoint format V2.

Once conf STATE_STORE_CHECKPOINT_FORMAT_VERSION is set to be higher than version 2, the executor returns checkpointID to the driver (only done for RocksDB). The driver stores it locally. For the next batch, the State Store Checkpoint ID is sent to the executor to be used to load the state store. If the local version of the executor doesn't match the uniqueID, it will reload from the checkpoint.

There is no behavior change if the default checkpoint format is used.

### Why are the changes needed?

This is an incremental step of the project of a new RocksDB State Store checkpoint format. The new format is to simplify the checkpoint mechanism to make it less bug-prone, and fix some unexpected query results in rare queries.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test is added to cover format version. And another unit test is added to validate the uniqueID is passed back and forth as expected.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47895 from siying/unique_id2.

Authored-by: Siying Dong <siying.dong@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>

github-actions bot added SQL STRUCTURED STREAMING labels Aug 27, 2024

siying marked this pull request as draft August 27, 2024 18:33

siying changed the title ~~[WIP] Communicate CheckpointID between driver and stateful operators~~ [SPARK-49411][SS] Communicate CheckpointID between driver and stateful operators Aug 27, 2024

siying marked this pull request as ready for review August 27, 2024 22:22

WweiL reviewed Aug 29, 2024

WweiL reviewed Aug 30, 2024

WweiL reviewed Sep 1, 2024

WweiL reviewed Sep 3, 2024

WweiL reviewed Sep 8, 2024

neilramaswamy reviewed Sep 11, 2024

siying force-pushed the unique_id2 branch from cb9a5ab to 3f5509e Compare September 16, 2024 23:11

neilramaswamy reviewed Sep 16, 2024

siying force-pushed the unique_id2 branch 2 times, most recently from 8dab4a8 to efe3ab5 Compare September 23, 2024 23:13

neilramaswamy reviewed Sep 27, 2024

siying force-pushed the unique_id2 branch from 3355a1f to af9d774 Compare September 30, 2024 16:42

siying added 8 commits September 30, 2024 10:14

checkpoint

2ac3a20

comments

bc96b99

address comments

655b080

Support stream-stream join

b2451c4

minor change

c93c952

consolidate >=2

352d0ce

add comment

052910a

comments

5d86901

siying added 7 commits September 30, 2024 10:14

Rename checkpointUniqueID to checkpointID

f5c7d45

more unit test

e672cce

add another unit test

fad6384

Multiple stateful

0783c41

change data structure

5f63dda

rename

a899c3a

update

f75ca29

siying changed the title ~~[SPARK-49411][SS] Communicate CheckpointID between driver and stateful operators~~ [SPARK-49411][SS] Communicate State Store Checkpoint ID between driver and stateful operators Sep 30, 2024

siying force-pushed the unique_id2 branch from b912343 to f75ca29 Compare September 30, 2024 21:35

siying added 3 commits September 30, 2024 15:00

Rename

1afa8b7

add comment

1488252

add test

3dc4817

WweiL approved these changes Oct 4, 2024

WweiL mentioned this pull request Oct 4, 2024

[SPARK-49883][SS] State Store Checkpoint Structure V2 Integration with RocksDB and RocksDBFileManager #48355

Open

neilramaswamy approved these changes Oct 7, 2024

Address comments

f008b99

brkyvz suggested changes Oct 8, 2024

address comments

f4d30f9

brkyvz suggested changes Oct 10, 2024

siying added 2 commits October 14, 2024 13:49

address comments

fa1522c

fix and extra comment

e4863d6

brkyvz approved these changes Oct 17, 2024

Address comments

1c380ee

HeartSaVioR closed this in 5697df7 Oct 18, 2024

		// The checkpoint ID for a checkpoint at `batchVersion`. This is used to identify the checkpoint
		checkpointId: Option[String],

		execCtx.batchId == -1 || v == execCtx.batchId + 1,
		s"version $v doesn't match current Batch ID ${execCtx.batchId}")

		keyToNumValuesStateStoreCkptId,
		keyWithIndexToValueStateStoreCkptId,

		private val sparkSessionForStream = sparkSession.cloneSession()
		protected val sparkSessionForStream = sparkSession.cloneSession()
