-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19617][SS]Fix the race condition when starting and stopping a query quickly #16947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #72972 has finished for PR 16947 at commit
|
|
Test build #72988 has started for PR 16947 at commit |
|
retest this please |
| logDebug(s"Stream running from $committedOffsets to $availableOffsets") | ||
| } else { | ||
| constructNextBatch() | ||
| if (state.compareAndSet(INITIALIZING, ACTIVE)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most changes here are space changes. You can use https://github.com/apache/spark/pull/16947/files?w=1 to review it.
I will submit a backport PR for 2.1 to not include this change because this is needed for 2.1 due to HADOOP-10622 (Master only support Hadoop 2.6+, which already fixed HADOOP-10622). |
|
Test build #3575 has finished for PR 16947 at commit
|
| // the query fast. | ||
| writeBatch(batchId, metadata) | ||
| } | ||
| writeBatch(batchId, metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didnt we disable interrupt because with local files, hadoop used shell commands to do file manipulation which could hang when interrupted? Are we removing this now because that has been fixed in hadoop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we removing this now because that has been fixed in hadoop?
Yes. We dropped the support to Hadoop 2.5 and earlier versions.
| }) | ||
| updateStatusMessage("Stopped") | ||
| } else { | ||
| // `stop()` is already called. Let `finally` finish the rest work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finish the cleanup
|
minor grammar issue in the comment, otherwise LGTM. |
|
LGTM. Merge when tests finish to master and 2.1 |
|
@tdas we need another PR for 2.1 since this PR assumes Hadoop 2.6+. I'm doing it now. |
|
Test build #73078 has finished for PR 16947 at commit
|
|
Thanks! Merging to master. |
|
#16979 is the backport for branch-2.1. |
… query quickly (branch-2.1) ## What changes were proposed in this pull request? Backport #16947 to branch 2.1. Note: we still need to support old Hadoop versions in 2.1.*. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16979 from zsxwing/SPARK-19617-branch-2.1.
What changes were proposed in this pull request?
The streaming thread in StreamExecution uses the following ways to check if it should exit:
StreamExecution.stateis TERMINATED.When starting and stopping a query quickly, the above two checks may both fail:
statebecomesACTIVE. Then runBatches changes the state fromTERMINATEDtoACTIVE.If the above cases both happen, the query will hang forever.
This PR changes
statetoAtomicReferenceand usescompareAndSetto make sure we only change the state fromINITIALIZINGtoACTIVE. It also removes therunUninterruptiblyhack from ``HDFSMetadata`, because HADOOP-14084 won't cause any problem after we fix the race condition.How was this patch tested?
Jenkins