Conversation

@tdas
Contributor

@tdas tdas commented Jun 19, 2017

What changes were proposed in this pull request?

StateStoreProvider instances are loaded on demand in an executor when a query is started. When a query is restarted, the loaded provider instance gets reused. There is a non-trivial chance that a task from the previous query run is still running while the tasks of the restarted run have started, so for a given stateful partition there may be two concurrent tasks operating on the same partition, and therefore using the same provider instance. This can lead to inconsistent results and possibly random failures, as state store implementations are not designed to be thread-safe.

To fix this, I have introduced a StateStoreProviderId that uniquely identifies a provider loaded in an executor. It includes the query run id, ensuring that restarted queries force the executor to load a new provider instance, and thereby preventing two concurrent tasks (from two different runs) from reusing the same provider instance.

Additional minor bug fixes

  • All state stores related to a query run are marked as deactivated in the StateStoreCoordinator so that the executors can unload them and free their resources.
  • Moved the code that determines the checkpoint directory of a state store from implementation-specific code (HDFSBackedStateStoreProvider) to implementation-agnostic code (StateStoreId), so that implementations do not accidentally get it wrong.
    • Also added the store name to the path, to support multiple stores per SQL operator partition.
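
The checkpoint-directory move described in the bullet above can be sketched as follows. This is an illustrative assumption, not Spark's exact code: the class name `SketchStateStoreId` and the exact path layout are hypothetical, but the point stands — the id type itself derives the location, and the store name appears in the path so one operator partition can own several stores.

```scala
// Hedged sketch: checkpoint location derived in the id type itself,
// rather than inside a specific provider implementation.
// Names and path layout are assumptions for illustration only.
case class SketchStateStoreId(
    checkpointRootLocation: String,
    operatorId: Long,
    partitionId: Int,
    storeName: String = "default") {

  // root/<operatorId>/<partitionId>/<storeName>; including storeName
  // allows multiple stores per SQL operator partition
  def storeCheckpointLocation: String =
    s"$checkpointRootLocation/$operatorId/$partitionId/$storeName"
}
```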

Note: This change does not address the scenario where two tasks of the same run (e.g. speculative tasks) run concurrently in the same executor. The chance of this is very small because, ideally, speculative tasks should never run in the same executor.
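
The effect of keying on the run id can be sketched with a minimal stand-in cache. The actual class added by this PR is `case class StateStoreProviderId(storeId: StateStoreId, queryRunId: UUID)` (quoted in the test-build summary below); the `ProviderCache` and `Provider` types here are illustrative stand-ins, not Spark's executor internals.

```scala
import java.util.UUID
import scala.collection.mutable

// Stand-in types: same partition, different run id => different cache key.
final case class StoreId(operatorId: Long, partitionId: Int)
final case class ProviderId(storeId: StoreId, queryRunId: UUID)

class Provider // stand-in for a StateStoreProvider instance

object ProviderCache {
  private val loaded = mutable.HashMap.empty[ProviderId, Provider]

  // Load on demand; a restarted query carries a new runId, so it can
  // never collide with a provider still in use by the previous run.
  def getOrLoad(id: ProviderId): Provider =
    loaded.getOrElseUpdate(id, new Provider)
}
```

Because the run id is part of the key, a lingering task from the old run and a task from the restarted run resolve to two distinct provider instances instead of sharing one.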

How was this patch tested?

Existing unit tests + new unit test.


/** Used to identify the state store for a given operator. */
-case class OperatorStateId(
+case class StatefulOperatorStateInfo(
Contributor Author

Renamed to ***Info, so that there is less confusion with StateStoreId

// Stop and verify whether the stores are deactivated in the coordinator
query.stop()
assert(coordRef.getLocation(providerId).isEmpty)

Contributor Author

remove this line.

}
awaitTerminationLock.notifyAll()
}
stateStoreCoordinator.deactivateInstances(terminatedQuery.runId)
Contributor Author

@tdas tdas Jun 19, 2017

this is the change that deactivates all the state stores related to the query, thus enabling the executors to lazily unload all the related provider instances.
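
The deactivation step explained above can be sketched as a coordinator that drops every registered store whose provider id carries the terminated query's run id. All names here (`SketchCoordinator`, `SketchProviderId`, `reportActive`) are hypothetical stand-ins for illustration, not the real StateStoreCoordinator API.

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical provider id: (operator, partition) plus the query run id.
final case class SketchProviderId(operatorId: Long, partitionId: Int, queryRunId: UUID)

class SketchCoordinator {
  // provider id -> executor currently hosting it
  private val instances = mutable.HashMap.empty[SketchProviderId, String]

  def reportActive(id: SketchProviderId, executor: String): Unit =
    instances(id) = executor

  def getLocation(id: SketchProviderId): Option[String] =
    instances.get(id)

  // Drop every store belonging to the terminated run; executors that later
  // see no active location for their provider can unload it lazily.
  def deactivateInstances(runId: UUID): Unit = {
    val stale = instances.keys.filter(_.queryRunId == runId).toList
    stale.foreach(instances.remove)
  }
}
```

This mirrors the check in the test snippet earlier in the thread, where `getLocation(providerId).isEmpty` is asserted after the query stops.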

@tdas tdas changed the title Added StateStoreProviderId with queryRunId to reload StateStoreProviders when query is restarted [SPARK-21145][SS] Added StateStoreProviderId with queryRunId to reload StateStoreProviders when query is restarted Jun 19, 2017
@SparkQA

SparkQA commented Jun 20, 2017

Test build #78266 has finished for PR 18355 at commit 3da6b0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StateStoreProviderId(storeId: StateStoreId, queryRunId: UUID)
  • case class StatefulOperatorStateInfo(

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78270 has finished for PR 18355 at commit d2f1676.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78317 has finished for PR 18355 at commit 35b1bb6.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2017

Test build #3804 has finished for PR 18355 at commit 35b1bb6.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78349 has finished for PR 18355 at commit a5fefab.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

test this please

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78358 has finished for PR 18355 at commit a5fefab.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.


// Both providers should have the same StateStoreId, but they should be different objects
assert(loadedProvidersAfterRun2(0).stateStoreId === loadedProvidersAfterRun2(1).stateStoreId)
assert(loadedProvidersAfterRun2(0).hashCode !== loadedProvidersAfterRun2(1).hashCode)
Member

nit: assert(loadedProvidersAfterRun2(0) ne loadedProvidersAfterRun2(1))

@zsxwing
Member

zsxwing commented Jun 21, 2017

LGTM. Just one nit.

@zsxwing
Member

zsxwing commented Jun 21, 2017

retest this please

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78408 has finished for PR 18355 at commit 0ad5a5c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Jun 21, 2017

retest this please.

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78410 has finished for PR 18355 at commit 0ad5a5c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 22, 2017

Test build #3806 has finished for PR 18355 at commit 0ad5a5c.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 23, 2017

Test build #78492 has finished for PR 18355 at commit 762fe60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented Jun 23, 2017

Merging this to master. Thank you @zsxwing for reviewing and @HyukjinKwon for suggesting the workaround to the unidoc issue.

@asfgit asfgit closed this in fe24634 Jun 23, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…d StateStoreProviders when query is restarted


Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes apache#18355 from tdas/SPARK-21145.
val env = SparkEnv.get
if (env != null) {
if (_coordRef == null) {
logInfo("Env is not null")
Contributor

@tdas could you explain the reason for adding this message at the INFO level?
