
Conversation

@jose-torres
Contributor

What changes were proposed in this pull request?

Remove queryExecutionThread.interrupt() from ContinuousExecution. As detailed in the JIRA, interrupting the thread is only relevant in the microbatch case; for continuous processing the query execution can quickly clean itself up without it.

How was this patch tested?

existing tests

@jose-torres
Contributor Author

cc @mgaido91 - this should completely resolve the other symptom you posted in SPARK-23416

cc @zsxwing @tdas

@mgaido91
Contributor

Thanks for pinging me @jose-torres! Unfortunately I don't know the Structured Streaming codebase well enough yet to give feedback. Thanks anyway for looking at it!

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87486 has finished for PR 20622 at commit 3d8acd2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

StreamingOuterJoinSuite failure is a known flakiness issue.

@jose-torres
Contributor Author

retest this please

@jose-torres
Contributor Author

jose-torres commented Feb 15, 2018

@zsxwing pointed out that the original behavior was more subtly wrong than I expected.

What we want to do is cancel the Spark job, and then cleanly restart it from the last checkpoint. But in fact, this was not working, since cancelling a Spark job throws an opaque SparkException which we didn't anticipate.

The reason things seemed to work was that the interrupt() call would almost always (but was not guaranteed to) interrupt the job cancellation, thus preventing the SparkException. So I've updated the PR to anticipate that SparkException, and filed SPARK-23444 to ask for a better handle for job cancellations.

Note that the continuous processing reconfiguration tests will always deterministically fail if they don't properly catch this exception. So the checking logic isn't really fragile despite being weird, and I think it's a significant upgrade over the flakiness.
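
(For illustration, a minimal self-contained Scala sketch of the pattern described above, with hypothetical names; this is not the actual ContinuousExecution patch. The job group of a running job is cancelled, and the resulting SparkException is swallowed only when a shared flag says the cancellation was a deliberate reconfiguration.)

import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.{SparkContext, SparkException}

object ReconfigurationSketch {
  // Hypothetical flag standing in for ContinuousExecution's RECONFIGURING state.
  val reconfiguring = new AtomicBoolean(false)

  def runOnce(sc: SparkContext, runId: String): Unit = {
    sc.setJobGroup(runId, "continuous processing sketch", interruptOnCancel = false)
    try {
      // Stand-in for the long-running continuous job; another thread may call
      // sc.cancelJobGroup(runId) while this is running.
      sc.parallelize(1 to 1000000).map(_ * 2).count()
    } catch {
      case e: SparkException if reconfiguring.get() =>
        // Job cancellation surfaces as an opaque SparkException; swallow it so
        // the caller can restart the query from the last checkpoint.
        println(s"Run $runId cancelled for reconfiguration: ${e.getMessage}")
    }
  }
}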

@SparkQA

SparkQA commented Feb 15, 2018

Test build #87491 has finished for PR 20622 at commit 3d8acd2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2018

Test build #87495 has finished for PR 20622 at commit 3ad7b3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2018

Test build #87499 has finished for PR 20622 at commit 35f5b4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@tdas Feb 17, 2018


Can you add comments on when this exception can happen and what this code segment is doing?

Contributor Author


done

@SparkQA

SparkQA commented Feb 17, 2018

Test build #87517 has finished for PR 20622 at commit 3b56232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Feb 21, 2018

@jose-torres I had a long offline chat with @zsxwing, kudos to him for catching a corner case in the current solution. The following sequence of events may occur.

  • In the query thread, the epoch tracking thread is started
  • Before the query thread actually starts the Spark job, the epoch tracking thread may detect some sort of reconfiguration and attempt to cancel the job, even though no Spark jobs have started yet.
  • The query thread then starts the Spark job, gets blocked, and never terminates.

Fundamentally, it's not a great setup to have one thread starting the jobs and another thread canceling them. Because of the async nature, we have no way of reasoning about which attempt wins, starting or cancelling. Rather, let's make sure that we start and cancel in the same thread (then we can do some reasoning). Here is an alternate solution.

  • The epoch thread ONLY interrupts the query thread. It's not responsible for any Spark state management (other than the enum state).
  • The query thread cancels jobs and stops sources in the finally clause.

Race conditions that end up not canceling the Spark job are less likely, since a single thread (the query thread) is responsible for all Spark state management.
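
(A rough sketch of this division of labor, with hypothetical names; not the eventual patch. The epoch thread only flips a flag and interrupts, while the query thread alone starts, cancels, and cleans up Spark state.)

import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.SparkContext

object SingleOwnerCleanupSketch {
  private val reconfiguring = new AtomicBoolean(false)

  // Epoch tracking thread: only flips the state flag and interrupts the query
  // thread; it makes no Spark calls of its own.
  def signalReconfiguration(queryThread: Thread): Unit = {
    reconfiguring.set(true)
    queryThread.interrupt()
  }

  // Query thread: starts the job and owns all Spark state management.
  def runQuery(sc: SparkContext, runId: String): Unit = {
    sc.setJobGroup(runId, "continuous query sketch", interruptOnCancel = false)
    try {
      sc.parallelize(1 to 1000000).count() // stand-in for the continuous job
    } finally {
      // Cancel in the same thread that started the job, so there is no race
      // between starting and cancelling.
      sc.cancelJobGroup(runId)
    }
  }
}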

@jose-torres
Contributor Author

jose-torres commented Feb 21, 2018

The alternate solution doesn't solve the problem SPARK-23441 sets out to address. Using a thread interrupt to kill the stream appears to be a fundamentally fragile solution, requiring us to maintain a whitelist of exceptions we think Spark execution might surface in response to an interrupt.

If you don't think there's a simple way to remove this interrupt currently, I can close this PR and write a new one that simply addresses the test flakiness using the alternate solution.

@zsxwing
Member

zsxwing commented Feb 21, 2018

Using a thread interrupt to kill the stream appears to be a fundamentally fragile solution, requiring us to maintain a whitelist of exceptions we think Spark execution might surface in response to an interrupt.

I think the original issue is that it interrupts the thread and also cancels the Spark job; runJob may then throw either InterruptedException or SparkException, which causes the test flakiness.

@jose-torres
Contributor Author

The issue that caused the test flakiness is as you say. But I had understood that interrupting is itself problematic, requiring us to maintain an awkward whitelist of possible exceptions in isInterruptedByStop.

@tdas
Contributor

tdas commented Feb 21, 2018

Unfortunately, we haven't figured out any good way to avoid that for MicroBatchExecution so far. The streaming thread is doing a whole lot of different things, and interrupting is the only reliable way to stop it immediately. And since different pieces of code can react to interrupts differently, an interrupt can ultimately manifest as a small set of interrupt-related exceptions. This whitelist of exceptions has been sufficient for a while now for MicroBatchExecution, so I don't think it will grow much more.

I am convinced that ContinuousExecution has the same set of problems (the thread needs to be interrupted from whatever it is doing), and they therefore need to be solved in a similar way. The only difference is that, besides stopping, there is an additional reason to interrupt the currently active query (i.e. while reconfiguring). We need to catch the same set of exceptions as for stop, except that the expected state will be RECONFIGURING instead of TERMINATED, so we can reuse the method StreamExecution.isInterruptionException.
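
(For context, a rough sketch of what such an interrupt-exception whitelist check can look like; illustrative only, and the real StreamExecution.isInterruptionException may differ in detail.)

import java.io.InterruptedIOException
import java.nio.channels.ClosedByInterruptException
import org.apache.spark.SparkException

object InterruptWhitelistSketch {
  // Returns true if the throwable plausibly originated from a thread interrupt.
  def looksLikeInterrupt(t: Throwable): Boolean = t match {
    case _: InterruptedException | _: InterruptedIOException | _: ClosedByInterruptException =>
      true
    case e: SparkException if e.getCause != null =>
      // Interrupts inside Spark jobs are often wrapped in a SparkException.
      looksLikeInterrupt(e.getCause)
    case _ =>
      false
  }
}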

@jose-torres
Contributor Author

The difference in ContinuousExecution is that the thread isn't doing any metadata work like looking for new batches - it's either running the Spark job or cleaning up after finishing it.

In any case, this sounds like something we aren't plausibly going to solve here, so I'll update the PR to just resolve the flakiness in the way you suggest.

@jose-torres changed the title from "[SPARK-23441][SS] Remove queryExecutionThread.interrupt() from ContinuousExecution" to "[SPARK-23491][SS] Remove explicit job cancellation from ContinuousExecution reconfiguring" on Feb 22, 2018
case t: Throwable
    if StreamExecution.isInterruptionException(t) && state.get() == RECONFIGURING =>
  // interrupted by reconfiguration - swallow exception so we can restart the query
} finally {
Contributor


Add cancel job here.

@SparkQA

SparkQA commented Feb 22, 2018

Test build #87620 has finished for PR 20622 at commit 0e5e52f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2018

Test build #4128 has finished for PR 20622 at commit ae2d853.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2018

Test build #4135 has finished for PR 20622 at commit ae2d853.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #87623 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #4127 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #4130 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #4133 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #4131 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #4132 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #4134 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 23, 2018

Test build #4129 has finished for PR 20622 at commit ae2d853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case t: Throwable
    if StreamExecution.isInterruptionException(t) && state.get() == RECONFIGURING =>
  stopSources()
  sparkSession.sparkContext.cancelJobGroup(runId.toString)
Contributor

@tdas Feb 23, 2018


I think it's cleaner to put this in the finally, since this is the invariant we want when this entire method terminates.

Contributor


So we only swallow the exception when we are reconfiguring (btw, always add logging when swallowing exceptions to leave a trail for debugging), and stopSources() and cancelJobGroup() can go in the finally, as we want that as an invariant no matter what happens in this runContinuous method.
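
(Putting those two suggestions together, the shape being asked for is roughly the following: a sketch reusing the names visible in the diff above. runContinuousJob() and logInfo are stand-ins, and the merged code may differ.)

try {
  runContinuousJob() // hypothetical stand-in for launching the continuous job
} catch {
  case t: Throwable
      if StreamExecution.isInterruptionException(t) && state.get() == RECONFIGURING =>
    // Leave a trail for debugging whenever an exception is swallowed.
    logInfo(s"Query $runId was interrupted for reconfiguration; restarting from checkpoint")
} finally {
  // Invariant on every exit from runContinuous: sources stopped, job group cancelled.
  stopSources()
  sparkSession.sparkContext.cancelJobGroup(runId.toString)
}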

@tdas
Contributor

tdas commented Feb 23, 2018

LGTM, assuming tests pass.

@SparkQA

SparkQA commented Feb 24, 2018

Test build #87636 has finished for PR 20622 at commit d404baf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 24, 2018

Test build #87637 has finished for PR 20622 at commit d3b16c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 7ec8365 Feb 26, 2018
asfgit pushed a commit that referenced this pull request Feb 13, 2019
[SPARK-23491][SS] Remove explicit job cancellation from ContinuousExecution reconfiguring

## What changes were proposed in this pull request?

Remove queryExecutionThread.interrupt() from ContinuousExecution. As detailed in the JIRA, interrupting the thread is only relevant in the microbatch case; for continuous processing the query execution can quickly clean itself up without it.

## How was this patch tested?

existing tests

Author: Jose Torres <jose@databricks.com>

Closes #20622 from jose-torres/SPARK-23441.