[SPARK-19617][SS]Fix the race condition when starting and stopping a query quickly #16947

zsxwing · 2017-02-16T01:04:14Z

What changes were proposed in this pull request?

The streaming thread in StreamExecution uses the following ways to check if it should exit:

Catch an InterruptedException.
StreamExecution.state is TERMINATED.

When starting and stopping a query quickly, the above two checks may both fail:

Hit HADOOP-14084 and swallow the InterruptedException.
StreamExecution.stop is called before state becomes ACTIVE. Then runBatches changes the state from TERMINATED to ACTIVE.

If the above cases both happen, the query will hang forever.

This PR changes state to an AtomicReference and uses compareAndSet to make sure we only change the state from INITIALIZING to ACTIVE. It also removes the runUninterruptibly hack from HDFSMetadataLog, because HADOOP-14084 won't cause any problem once the race condition is fixed.
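The compare-and-set idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual StreamExecution code: the class QuerySketch, the currentState accessor, and the Boolean return value of runBatches are hypothetical names introduced for the example, and the real batch loop is omitted.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical, simplified stand-ins for the states named in the PR description.
sealed trait State
case object INITIALIZING extends State
case object ACTIVE extends State
case object TERMINATED extends State

// Illustrative class only; it is not the real StreamExecution.
class QuerySketch {
  private val state = new AtomicReference[State](INITIALIZING)

  /** Called by the stopping thread; marks the query as terminated. */
  def stop(): Unit = state.set(TERMINATED)

  /** Called by the streaming thread. Returns true only if the batch loop was entered. */
  def runBatches(): Boolean = {
    if (state.compareAndSet(INITIALIZING, ACTIVE)) {
      // Normal path: the real code would run batches here while the state is ACTIVE.
      true
    } else {
      // stop() already ran; leave the state as TERMINATED and exit.
      false
    }
  }

  def currentState: State = state.get()
}
```

With a plain assignment to ACTIVE, a stop() that arrives first would be overwritten and the thread would keep running. With compareAndSet, the transition only succeeds from INITIALIZING, so a query that was stopped before it became active stays TERMINATED.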

How was this patch tested?

Jenkins

SparkQA · 2017-02-16T03:39:53Z

Test build #72972 has finished for PR 16947 at commit d52ac13.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-16T07:32:48Z

Test build #72988 has started for PR 16947 at commit fb27a97.

zsxwing · 2017-02-16T18:29:40Z

retest this please

zsxwing · 2017-02-16T19:13:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

-                logDebug(s"Stream running from $committedOffsets to $availableOffsets")
-              } else {
-                constructNextBatch()
+      if (state.compareAndSet(INITIALIZING, ACTIVE)) {


Most changes here are whitespace changes. You can use https://github.com/apache/spark/pull/16947/files?w=1 to review them.

zsxwing · 2017-02-16T19:16:28Z

It also removes the runUninterruptibly hack from HDFSMetadataLog

I will submit a separate backport PR for 2.1 that does not include this change, because the hack is still needed in 2.1 due to HADOOP-10622 (master only supports Hadoop 2.6+, where HADOOP-10622 is already fixed).

SparkQA · 2017-02-16T21:41:55Z

Test build #3575 has finished for PR 16947 at commit 7317b0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2017-02-17T22:42:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala

-        // the query fast.
-        writeBatch(batchId, metadata)
-      }
+      writeBatch(batchId, metadata)


Didn't we disable interrupts because, with local files, Hadoop used shell commands to do file manipulation, which could hang when interrupted? Are we removing this now because that has been fixed in Hadoop?

Are we removing this now because that has been fixed in Hadoop?

Yes. We dropped support for Hadoop 2.5 and earlier versions.

tdas · 2017-02-17T23:40:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

+        })
+        updateStatusMessage("Stopped")
+      } else {
+        // `stop()` is already called. Let `finally` finish the rest work.


finish the cleanup

tdas · 2017-02-17T23:43:34Z

minor grammar issue in the comment, otherwise LGTM.

tdas · 2017-02-18T00:02:37Z

LGTM. Merge when tests finish to master and 2.1

zsxwing · 2017-02-18T00:04:28Z

@tdas we need another PR for 2.1 since this PR assumes Hadoop 2.6+. I'm doing it now.

SparkQA · 2017-02-18T01:39:11Z

Test build #73078 has finished for PR 16947 at commit 13f76f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2017-02-18T03:04:14Z

Thanks! Merging to master.

zsxwing · 2017-02-18T03:16:10Z

#16979 is the backport for branch-2.1.

… query quickly (branch-2.1)

## What changes were proposed in this pull request?

Backport #16947 to branch 2.1.

Note: we still need to support old Hadoop versions in 2.1.*.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16979 from zsxwing/SPARK-19617-branch-2.1.

don't interrupt 'mkdirs' to workaround HADOOP-14084

d52ac13

zsxwing changed the title ~~[SPARK-19617][SS]Don't interrupt 'mkdirs' to workaround HADOOP-14084~~ [SPARK-19617][SS][WIP]Don't interrupt 'mkdirs' to workaround HADOOP-14084 Feb 16, 2017

Fix

fb27a97

zsxwing changed the title ~~[SPARK-19617][SS][WIP]Don't interrupt 'mkdirs' to workaround HADOOP-14084~~ [SPARK-19617][SS]Fix the race condition when starting and stopping a query quickly Feb 16, 2017

Fix comment

7317b0f

Address

13f76f6

asfgit closed this in 15b144d Feb 18, 2017

zsxwing deleted the SPARK-19617 branch February 18, 2017 03:08

zsxwing mentioned this pull request Feb 18, 2017

[SPARK-19617][SS]Fix the race condition when starting and stopping a query quickly (branch-2.1) #16979

Closed

zsxwing mentioned this pull request Feb 24, 2017

[SPARK-19633][SS] FileSource read from FileSink #16987

Closed
