[SPARK-23491][SS] Remove explicit job cancellation from ContinuousExecution reconfiguring #20622
Conversation
cc @mgaido91 - this should completely resolve the other symptom you posted in SPARK-23416
Thanks for pinging me @jose-torres! Unfortunately I don't know the structured streaming codebase well enough yet to give feedback. Thanks anyway for looking at it!
Test build #87486 has finished for PR 20622 at commit
The StreamingOuterJoinSuite failure is a known flakiness issue.
retest this please
@zsxwing pointed out that the original behavior was more subtly wrong than I expected. What we want to do is cancel the Spark job, and then cleanly restart it from the last checkpoint. But in fact, this was not working, since cancelling a Spark job throws an opaque SparkException which we didn't anticipate. The reason things seemed to work was that the interrupt() call would almost always (but was not guaranteed to) interrupt the job cancellation, thus preventing the SparkException. So I've updated the PR to anticipate that SparkException, and filed SPARK-23444 to ask for a better handle for job cancellations. Note that the continuous processing reconfiguration tests will always deterministically fail if they don't properly catch this exception. So the checking logic isn't really fragile despite being weird, and I think it's a significant upgrade over the flakiness.
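The problem described above - having to recognize a cancellation by inspecting an opaque exception - can be sketched as follows. This is a minimal Java illustration, not Spark's actual Scala code; the `isCancellation` helper and its message-matching heuristic are assumptions standing in for the pattern-match on the opaque SparkException (SPARK-23444 asks for a proper handle instead):

```java
public class CancellationCheck {
    // Hypothetical helper: a cancelled job surfaces as an exception whose
    // message mentions the cancellation; with no dedicated exception type,
    // callers can only recognize it by inspecting the message text.
    static boolean isCancellation(Throwable t) {
        String msg = t.getMessage();
        return msg != null && msg.contains("cancelled");
    }

    public static void main(String[] args) {
        Throwable fromCancel = new RuntimeException(
            "Job 3 cancelled as part of cancelled job group run-42");
        Throwable unrelated = new RuntimeException("executor lost");
        System.out.println(isCancellation(fromCancel)); // true
        System.out.println(isCancellation(unrelated));  // false
    }
}
```

Matching on message text is exactly the fragility being complained about here: any change to the message breaks the check silently, which is why a dedicated exception type or status handle would be a better API.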
Test build #87491 has finished for PR 20622 at commit
Test build #87495 has finished for PR 20622 at commit
Test build #87499 has finished for PR 20622 at commit
Can you add comments on when this exception can happen and what this code segment is doing?
done
Test build #87517 has finished for PR 20622 at commit
@jose-torres I had a long offline chat with @zsxwing; kudos to him for catching a corner case in the current solution. The following sequence of events may occur.
Fundamentally, it's not a great setup that one thread is starting the jobs and another thread is canceling them. Because of the async nature, we have no way of reasoning about which attempt wins, starting or cancelling. Rather, let's make sure that we start and cancel in the same thread (then we can do some reasoning). Here is an alternate solution.
Race conditions that end up not canceling the Spark job are less likely when a single thread (the query thread) is responsible for all Spark state management.
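The single-owner-thread idea above can be sketched in miniature. This is a hypothetical Java illustration (class and field names are made up, not Spark code): the query thread alone starts and tears down the work, while other threads merely flip a state flag, so start-vs-cancel ordering is well-defined:

```java
import java.util.concurrent.atomic.AtomicReference;

public class SingleOwnerSketch {
    enum State { ACTIVE, RECONFIGURING, TERMINATED }

    static final AtomicReference<State> state = new AtomicReference<>(State.ACTIVE);
    static volatile String lastAction = "";

    static void queryLoop() {
        while (state.get() == State.ACTIVE) {
            // ... run one epoch of the continuous job ...
        }
        // Tear-down runs on the same thread that started the work, so cancel
        // always happens after the loop has stopped launching new work.
        lastAction = "cleaned up after " + state.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread query = new Thread(SingleOwnerSketch::queryLoop);
        query.start();
        state.set(State.RECONFIGURING); // another thread requests reconfiguration
        query.join();
        System.out.println(lastAction);
    }
}
```

Because only the query thread performs the start and the cleanup, there is no window in which a cancel can race ahead of (or lose to) a concurrent job start, which is the corner case being discussed.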
The alternate solution doesn't solve the problem SPARK-23441 sets out to address. Using a thread interrupt to kill the stream appears to be a fundamentally fragile solution, requiring us to maintain a whitelist of exceptions we think Spark execution might surface in response to an interrupt. If you don't think there's a simple way to remove this interrupt currently, I can close this PR and write a new one that simply addresses the test flakiness using the alternate solution.
I think the original issue is it interrupts the thread and also cancels the Spark job. Then |
The issue that caused the test flakiness is as you say. But I had understood that interrupting is itself problematic, requiring us to maintain an awkward whitelist of possible exceptions in
Unfortunately, we haven't figured out any good way to avoid that for MicroBatchExecution till now. The streaming thread is doing a whole lot of different things, and interrupting is the only reliable way to stop it immediately. And since different pieces of code can react to interrupts differently, it can ultimately manifest as a small set of interrupt-related exceptions. This whitelist of exceptions has been sufficient for a while now for MicroBatchExecution, so I don't think this will grow much more. I am convinced that ContinuousExecution has the same set of problems (the thread needs to be interrupted from whatever it is doing) and therefore needs to be solved in a similar way. The only difference is that besides stopping, there is an additional way to interrupt the currently active query (i.e. while reconfiguring). And we need to catch the same set of exceptions as stop, except the expected state will be RECONFIGURING instead of TERMINATED. So we can reuse the method
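The whitelist of interrupt-related exceptions mentioned above looks roughly like the following. This is a simplified Java sketch in the spirit of StreamExecution.isInterruptionException, not the actual Scala implementation (which also unwraps Spark-specific wrapper exceptions, omitted here):

```java
import java.io.InterruptedIOException;
import java.nio.channels.ClosedByInterruptException;

public class InterruptWhitelist {
    // Different pieces of code react to a thread interrupt differently, so a
    // whitelist of interrupt-shaped exception types is needed to recognize
    // "this failure was just the interrupt we sent".
    static boolean isInterruption(Throwable t) {
        if (t instanceof InterruptedException
                || t instanceof InterruptedIOException
                || t instanceof ClosedByInterruptException) {
            return true;
        }
        // Interrupts often surface re-wrapped, so walk the cause chain too.
        Throwable cause = t.getCause();
        return cause != null && isInterruption(cause);
    }

    public static void main(String[] args) {
        System.out.println(isInterruption(new InterruptedException()));
        System.out.println(isInterruption(
            new RuntimeException(new ClosedByInterruptException())));
        System.out.println(isInterruption(new IllegalStateException("boom")));
    }
}
```

The fragility discussed in this thread is that the set of types in this check is open-ended: any library on the thread's call stack may translate an interrupt into yet another exception type.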
The difference in ContinuousExecution is that the thread isn't doing any metadata work like looking for new batches - it's either running the Spark job or cleaning up after finishing it. In any case, this sounds like something we aren't plausibly going to solve here, so I'll update the PR to just resolve the flakiness in the way you suggest.
3b56232 to 0e5e52f
case t: Throwable
    if StreamExecution.isInterruptionException(t) && state.get() == RECONFIGURING =>
  // interrupted by reconfiguration - swallow exception so we can restart the query
} finally {
Add cancel job here.
Test build #87620 has finished for PR 20622 at commit
Test build #4128 has finished for PR 20622 at commit
Test build #4135 has finished for PR 20622 at commit
Test build #87623 has finished for PR 20622 at commit
Test build #4127 has finished for PR 20622 at commit
Test build #4130 has finished for PR 20622 at commit
Test build #4133 has finished for PR 20622 at commit
Test build #4131 has finished for PR 20622 at commit
Test build #4132 has finished for PR 20622 at commit
Test build #4134 has finished for PR 20622 at commit
Test build #4129 has finished for PR 20622 at commit
case t: Throwable
    if StreamExecution.isInterruptionException(t) && state.get() == RECONFIGURING =>
  stopSources()
  sparkSession.sparkContext.cancelJobGroup(runId.toString)
I think it's cleaner to put this in the finally block, since this is the invariant we want when this entire method terminates.
So we only swallow the exception when we are reconfiguring (btw, always add logging when swallowing exceptions to leave a trail for debugging), and stopSources() and cancelJobGroup() can be in the finally block, as we want that as an invariant no matter what happens in this runContinuous method.
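The swallow-only-while-reconfiguring plus cleanup-in-finally structure suggested here can be sketched as follows. This is a hypothetical Java illustration (names invented, not Spark code); the println calls stand in for real logging and for stopSources()/cancelJobGroup():

```java
public class FinallyInvariantSketch {
    // Simulates the query run being torn down by an interrupt mid-flight.
    static void simulateInterruptedRun() {
        throw new RuntimeException("query interrupted");
    }

    static void runContinuous(boolean reconfiguring) {
        try {
            simulateInterruptedRun();
        } catch (RuntimeException e) {
            if (reconfiguring) {
                // Swallow only in the reconfiguring state, and always log
                // when swallowing, to leave a trail for debugging.
                System.out.println(
                    "swallowing interrupt during reconfiguration: " + e.getMessage());
            } else {
                throw e; // unexpected failure: propagate as usual
            }
        } finally {
            // The invariant: cleanup runs whether we returned, swallowed, or
            // re-threw, mirroring stopSources() + cancelJobGroup() in finally.
            System.out.println("stopSources + cancelJobGroup");
        }
    }

    public static void main(String[] args) {
        runContinuous(true);
    }
}
```

Putting the cleanup in finally rather than in the catch arm means it also fires on a normal return and on any exception type not in the whitelist, which is exactly the "no matter what happens" property the reviewer asks for.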
LGTM, assuming tests pass.
Test build #87636 has finished for PR 20622 at commit
Test build #87637 has finished for PR 20622 at commit
What changes were proposed in this pull request?
Remove queryExecutionThread.interrupt() from ContinuousExecution. As detailed in the JIRA, interrupting the thread is only relevant in the microbatch case; for continuous processing the query execution can quickly clean itself up without it.
How was this patch tested?
Existing tests.
Author: Jose Torres <jose@databricks.com>
Closes #20622 from jose-torres/SPARK-23441.