Conversation

@brkyvz (Contributor) commented Oct 23, 2019

What changes were proposed in this pull request?

This PR adds a SQL conf, spark.sql.streaming.stopActiveRunOnRestart. When this conf is true (the default), an already running stream will be stopped if a new copy is launched on the same checkpoint location.

Why are the changes needed?

In multi-tenant environments where you have multiple SparkSessions, you can accidentally start multiple copies of the same stream (i.e. streams using the same checkpoint location). This will cause all new instantiations of the stream to fail. However, sometimes you may want to turn off the old stream, as it may have turned into a zombie (you no longer have access to the query handle or SparkSession).

It would be nice to have a SQL flag that allows the stopping of the old stream for such zombie cases.

Does this PR introduce any user-facing change?

Yes. Now by default, if you launch a new copy of an already running stream on a multi-tenant cluster, the existing stream will be stopped.
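For illustration, a hypothetical usage sketch (assuming an active SparkSession bound to `spark`; the conf name follows the description above):

```scala
// Hypothetical sketch: opt back into the Spark 2.4 behavior, where
// starting a duplicate stream fails instead of stopping the running one.
spark.conf.set("spark.sql.streaming.stopActiveRunOnRestart", "false")
```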

How was this patch tested?

Unit tests in StreamingQueryManagerSuite

SparkQA commented Oct 23, 2019

Test build #112529 has finished for PR 26225 at commit 1d4167f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

val queryManager = activeOption.getOrElse(this)
logInfo(s"Stopping existing streaming query [id=${query.id}], as a new run is being " +
"started.")
queryManager.get(query.id).stop()

If the existing stream is a "zombie", can it happen that it does not respond to stop() and then this will block forever?

Contributor Author

Great question. I can add some safeguards against this, but in most cases when we say the stream is a "zombie", we mean that we lost all references to it, not that it is uninterruptible.

Member

I think this should be fine. If stop returns, the query should already be stopped, because stop waits until the streaming thread dies.

@zsxwing (Member) Nov 5, 2019

Ah, this has a deadlock. We wait for the query to stop while holding a lock that the query itself needs in order to remove itself from the active queries.
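The deadlock @zsxwing describes, and its eventual fix, can be sketched in miniature (all names below are stand-ins, not the actual Spark internals): look up the duplicate while holding the lock, but invoke stop() only after releasing it.

```scala
import scala.collection.mutable

object StopOutsideLockSketch {
  private val lock = new Object
  // query id -> stop handle. In the real code, StreamExecution.stop()
  // blocks until the query thread dies, and that thread deregisters the
  // query under the same lock -- hence stopping inside the lock hangs.
  private val activeQueries = mutable.Map.empty[Long, () => Unit]

  private def deregister(id: Long): Unit = lock.synchronized {
    activeQueries.remove(id)
  }

  def register(id: Long): Unit = lock.synchronized {
    activeQueries(id) = () => deregister(id)
  }

  // Find the duplicate under the lock, but return it instead of stopping
  // it here, so the caller can stop outside the critical section.
  def findDuplicate(id: Long): Option[() => Unit] = lock.synchronized {
    activeQueries.get(id)
  }

  def demo(): Boolean = {
    register(42L)
    findDuplicate(42L).foreach(stop => stop()) // stop called without the lock held
    lock.synchronized(activeQueries.isEmpty)
  }
}
```

This mirrors the shape of the eventual fix: the synchronized block only collects the Option of the active run, and the blocking stop happens outside it.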

SparkQA commented Oct 23, 2019

Test build #112530 has finished for PR 26225 at commit e30ec9a.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

SparkQA commented Oct 24, 2019

Test build #112590 has finished for PR 26225 at commit e30ec9a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

sparkSession.sessionState.conf.getConf(SQLConf.STOP_RUNNING_DUPLICATE_STREAM)
if (streamAlreadyActive && turnOffOldStream) {
val queryManager = activeOption.getOrElse(this)
logInfo(s"Stopping existing streaming query [id=${query.id}], as a new run is being " +
Contributor

nit: whether giving a warning is better?

Contributor

I agree, make this a warning, and add the previous runId and new runId to make it easier to debug.

@dongjoon-hyun (Member)

Retest this please.

"older stream's SparkSession may not be possible, and the stream may have turned into a " +
"zombie stream. When this flag is true, we will stop the old stream to start the new one.")
.booleanConf
.createWithDefault(true)
@dongjoon-hyun (Member) Oct 24, 2019

Shall we have false by default to avoid the behavior changes?
cc @gatorsmile

Contributor Author

Great question. Here's my argument why we should change it:

  1. This change is going into Spark 3.0, a release where we can actually break existing behavior (unless it is critical behavior that people depend on).
  2. The existing behavior was that any new start of a stream would fail because an existing stream was already running. This is a programming error on the user's part.
  3. However, there are legitimate cases where a user would like to restart a new instance of the stream (because they upgraded the code, for instance), but they have no way of stopping the existing stream, because it has turned into a zombie.

I would argue that 3 is more common than 2, and given 1, this is where we can change behavior and mention it in the release notes.


Member

+1 for the release notes.

Contributor

nit: I think the docs can be better. Here are the confusing parts:

  • it seems that this will work only when the stream is restarted in a different session. but is it s
  • the term stream is confusing here. Does it refer to a streaming query or a query run? We should try to be clear by saying "streaming query" instead of "stream" in the explanation, and depending on what is consistent with other confs.

@tdas (Contributor) Nov 13, 2019

+1 on the name now. I like it.

.checkValue(v => Set(1, 2).contains(v), "Valid versions are 1 and 2")
.createWithDefault(2)

val STOP_RUNNING_DUPLICATE_STREAM = buildConf("spark.sql.streaming.stopExistingDuplicateStream")
@dongjoon-hyun (Member) Oct 24, 2019

stopExistingDuplicateStream -> stopExistingDuplicatedStream?

SparkQA commented Oct 25, 2019

Test build #112621 has finished for PR 26225 at commit e30ec9a.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

retest this please

SparkQA commented Oct 29, 2019

Test build #112847 has finished for PR 26225 at commit e30ec9a.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Seems like it affects testing time considerably.

@HyukjinKwon (Member)

retest this please

SparkQA commented Oct 30, 2019

Test build #112874 has finished for PR 26225 at commit e30ec9a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen (Contributor)

retest this please

@gaborgsomogyi (Contributor)

Seems like it affects testing time considerably.

+1 on this

SparkQA commented Oct 30, 2019

Test build #112900 has finished for PR 26225 at commit e30ec9a.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) left a comment

One minor comment about the error message. Otherwise looks good to me.

"Cannot start query with id ${query.id} as another query with same id is " +
"already active. Perhaps you are attempting to restart a query from checkpoint " +
"that is already active. You may stop the old query by setting the SQL " +
s"""configuration: spark.conf.set("${SQLConf.STOP_RUNNING_DUPLICATE_STREAM}", true).""")
Member

SQLConf.STOP_RUNNING_DUPLICATE_STREAM => SQLConf.STOP_RUNNING_DUPLICATE_STREAM.key
And

You may stop the old query by setting the SQL ... and retry.

@brkyvz (Contributor Author) commented Nov 7, 2019

There was a deadlock causing tests to fail. cc @zsxwing @tdas @dongjoon-hyun addressed your comments. Can you ptal?

SparkQA commented Nov 7, 2019

Test build #113349 has finished for PR 26225 at commit ff14e95.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 7, 2019

Test build #113351 has finished for PR 26225 at commit d8d4e8f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 7, 2019

Test build #113394 has finished for PR 26225 at commit d999fb7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 7, 2019

Test build #113404 has finished for PR 26225 at commit 2216fe2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor Author) commented Nov 8, 2019

retest this please

SparkQA commented Nov 8, 2019

Test build #113468 has finished for PR 26225 at commit 2216fe2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Thank you for updating, @brkyvz !

.checkValue(v => Set(1, 2).contains(v), "Valid versions are 1 and 2")
.createWithDefault(2)

val STOP_RUNNING_DUPLICATE_STREAM = buildConf("spark.sql.streaming.stopExistingDuplicatedStream")
Member

STOP_RUNNING_DUPLICATE_STREAM -> STOP_RUNNING_DUPLICATED_STREAM?

// The following code block checks if a stream with the same name or id is running. Then it
// returns an Option of an already active stream to stop outside of the lock
// to avoid a deadlock.
val activeDuplicateQuery = activeQueriesLock.synchronized {
Member

activeDuplicateQuery -> activeDuplicatedQuery?

@brkyvz (Contributor Author) commented Nov 10, 2019 via email

@dongjoon-hyun (Member)

Oops. I overlooked the above comment. Thanks, @brkyvz !

@tdas (Contributor) left a comment

Almost LGTM. I am mostly grumbling about the names.

throw new IllegalArgumentException(
s"Cannot start query with name $name as a query with that name is already active")
}
}
Contributor

super nit: probably update this to "is already active in this SparkSession." to be clearer

// The following code block checks if a stream with the same name or id is running. Then it
// returns an Option of an already active stream to stop outside of the lock
// to avoid a deadlock.
val activeDuplicateQuery = activeQueriesLock.synchronized {
Contributor

maybe rename this to activeQuerySharedLock to indicate this is shared across sessions.

Contributor

or directly use sharedState.activeQueryLock

val activeOption = Option(sparkSession.sharedState.activeStreamingQueries.get(query.id))
.orElse(activeQueries.get(query.id))

val turnOffOldStream =
Contributor

TuneOff?? make it same as the conf.


.checkValue(v => Set(1, 2).contains(v), "Valid versions are 1 and 2")
.createWithDefault(2)

val STOP_RUNNING_DUPLICATE_STREAM = buildConf("spark.sql.streaming.stopExistingDuplicatedStream")
Contributor

make the conf CAPS name consistent with the actual string.

Option(sparkSession.sharedState.activeStreamingQueries.putIfAbsent(query.id, this))
if (activeOption.isDefined || activeQueries.values.exists(_.id == query.id)) {
val activeOption = Option(sparkSession.sharedState.activeStreamingQueries.get(query.id))
.orElse(activeQueries.get(query.id))
Contributor

why do we need to check both? can it be in the shared state and but not in the local one?

Contributor Author

paranoia

}
assert(e.getMessage.contains("same id"))
} finally {
query1.stop()
Contributor

stop all active streams

assert(!query1.isActive,
"First query should have stopped before starting the second query")
} finally {
query2.stop()
Contributor

stop all active streams


SparkQA commented Nov 12, 2019

Test build #113650 has finished for PR 26225 at commit 9fbf56a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

query.id, query.streamingQuery) // we need to put the StreamExecution, not the wrapper
if (oldActiveQuery != null) {
throw new ConcurrentModificationException(
"Another instance of this query was just started by a concurrent session.")
Contributor

This is not the correct error message when stopActiveRunOnRestart is false.
If the active run was stopped, then this error message is correct.
If the active run was not stopped, then this error will be thrown and should simply say that there is an active run (run id ...).

In other words, this can stay the same message as it was in Spark 2.4; it may be improved by adding the run id.

activeRunOpt.foreach(_.stop())

activeQueriesSharedLock.synchronized {
// We still can have a race condition when two concurrent instances try to start the same
@tdas (Contributor) Nov 13, 2019

nit: This comment is true only if the active run was stopped. So qualify the comment accordingly.

}
}

activeRunOpt.foreach(_.stop())
@tdas (Contributor) Nov 13, 2019

nit: Please document here that stop() will automatically clear the activeStreamingQueries. Without this implicit easy-to-miss information, it is hard to reason about this code.
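A hedged sketch of the flow being discussed, with hypothetical stand-in types (not the actual StreamingQueryManager API): stop() deregisters the query from the shared map, so after stopping the old run the code retries putIfAbsent and treats a non-null result as a concurrent start.

```scala
import java.util.concurrent.ConcurrentHashMap

object RegistrySketch {
  final case class Query(id: String) {
    // stop() clears this query's registry entry, mirroring the implicit
    // easy-to-miss behavior the comment asks to document.
    def stop(): Unit = RegistrySketch.active.remove(id, this)
  }

  val active = new ConcurrentHashMap[String, Query]()

  def start(q: Query, stopActiveRun: Boolean): Either[String, Query] =
    Option(active.putIfAbsent(q.id, q)) match {
      case None => Right(q)           // no duplicate; q is now registered
      case Some(existing) if stopActiveRun =>
        existing.stop()               // deregisters the old run
        if (active.putIfAbsent(q.id, q) == null) Right(q)
        else Left(s"Another instance of query ${q.id} was just started by a concurrent session")
      case Some(_) =>
        Left(s"Query ${q.id} is already active")
    }
}
```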

@tdas (Contributor) commented Nov 13, 2019

The implementation looks good, but please fix the error message before merging.

@brkyvz (Contributor Author) commented Nov 13, 2019 via email

@tdas (Contributor) commented Nov 13, 2019

Aah. I was confused. LGTM then.

@tdas (Contributor) commented Nov 13, 2019

Update the description with the new conf name.

SparkQA commented Nov 13, 2019

Test build #113657 has finished for PR 26225 at commit 3cea936.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 13, 2019

Test build #113667 has finished for PR 26225 at commit bff9162.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor Author) commented Nov 13, 2019

Thanks! Merging to master

@asfgit asfgit closed this in 363af16 Nov 13, 2019