Conversation

@Sephiroth-Lin
Contributor

Currently, when we kill an application on YARN, sc.stop() is called from the YARN application state monitor thread. YarnClientSchedulerBackend.stop() then calls interrupt on that same thread, which prevents the SparkContext from stopping fully because we are still waiting for the executors to exit.

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39318 has finished for PR 7846 at commit 243d2c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 1, 2015

This feels too hacky to be a good solution, relying on a flag to pass around who should interrupt a thread. Why not close the sc in the finally block of the monitor thread and be done?

@Sephiroth-Lin
Contributor Author

@srowen We need to call interrupt in YarnClientSchedulerBackend.stop() (for details see PR #5305 and PR #3143), so even if we call sc.stop() in the finally block of the monitor thread, it still cannot stop successfully.

@srowen
Member

srowen commented Aug 1, 2015

Is the sequence that sc.stop causes the backend to stop which may interrupt the monitor thread, which may be the thing causing it to stop? This change doesn't stop this sequence from happening though; there's still a race condition. Why would the thread interrupt itself as the last thing it does?

This is cleaner if it's entirely local to the monitor thread. The backend doesn't need a new field for this. The thread can have a "stop" method that interrupts it only if it's blocked in monitorApplication.

@Sephiroth-Lin
Contributor Author

Yes, this change doesn't stop this sequence from happening. As the monitor thread is a daemon thread, we don't need to call interrupt after sc.stop().
I am not very clear on two points below:

  1. there's still a race condition
  2. the thread can have a "stop" method that interrupts it only if it's blocked in monitorApplication

Thank you!

@srowen
Member

srowen commented Aug 1, 2015

If you're asking what I mean, I mean that the monitor thread itself can have the flag, like isMonitoring, that is true when it starts the blocking call to monitorApplication and false immediately after. Then expose another method like stop() or something, that interrupts the thread only if isMonitoring is true. This means that if the thread itself initiates sc.stop(), it won't get interrupted, but can still be interrupted in the blocking call to the library method.
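
A minimal Java sketch of the pattern described above (not Spark's actual Scala code; names like MonitorThread and stopMonitor are hypothetical stand-ins): the monitor thread owns the isMonitoring flag, and its stop method interrupts the thread only while it is blocked in the monitoring call.

```java
// Sketch of srowen's suggestion: the flag lives entirely inside the
// monitor thread, so the backend needs no new field.
class MonitorThread extends Thread {
    private volatile boolean isMonitoring = false;

    MonitorThread() {
        // as a named class it can set its own name and daemon status
        setName("YARN application state monitor");
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            isMonitoring = true;
            monitorApplication();   // blocking call; may be interrupted
        } catch (InterruptedException e) {
            // interrupted via stopMonitor(): sc.stop() is already in progress
        } finally {
            isMonitoring = false;
        }
        // if we get here uninterrupted, the YARN app finished first and
        // this thread would initiate sc.stop() itself without being
        // interrupted by the backend
    }

    // Interrupt only while blocked in monitorApplication(); once the
    // thread is past that point, this becomes a no-op.
    public void stopMonitor() {
        if (isMonitoring) {
            interrupt();
        }
    }

    // Stand-in for Client.monitorApplication(): blocks until interrupted.
    private void monitorApplication() throws InterruptedException {
        Thread.sleep(60_000);
    }
}
```

The key property is that the interrupt can only land while the thread is inside the blocking monitoring call, so a thread that has moved on to initiating sc.stop() is never interrupted by its own shutdown path.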

@SparkQA

SparkQA commented Aug 1, 2015

Test build #39354 has finished for PR 7846 at commit ad0e23b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Yeah I think that's tidier. Now that it's its own named class, name and daemon status can be set by the class itself I think.

@SparkQA

SparkQA commented Aug 2, 2015

Test build #1278 has finished for PR 7846 at commit ad0e23b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 4, 2015

Any more thoughts on this one? Matching the keyword 'yarn', I will reference @sryza and @vanzin.

Contributor

I'd call this allowInterrupt.

So it took me a bit to understand why this code is like this. Basically, when you interrupt, it's because the SparkContext is being shut down (sc.stop() called by user code), and you do not want sc.stop() to be called again here. Now if monitorApplication() returns, it means the YARN app finished before sc.stop() was called, which means this code should call sc.stop(). Could you write a small comment explaining that, so that in the future people know what's going on here?
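
The two exit paths being described can be sketched as follows (a hypothetical Java stand-in, not Spark's real code; stopSparkContext is a placeholder for sc.stop()):

```java
// Sketch of the two cases the requested comment should explain.
class AppStateMonitor extends Thread {
    volatile boolean stoppedContext = false;

    @Override
    public void run() {
        try {
            monitorApplication();
        } catch (InterruptedException e) {
            // Case 1: interrupted because user code already called sc.stop(),
            // which stopped the backend; do NOT call sc.stop() again here.
            return;
        }
        // Case 2: monitorApplication() returned, meaning the YARN app
        // finished (e.g. it was killed) before sc.stop() was called, so
        // the monitor thread must initiate the shutdown itself.
        stopSparkContext();
    }

    // Stand-in that returns immediately, simulating the app finishing.
    void monitorApplication() throws InterruptedException { }

    // Stand-in for sc.stop().
    void stopSparkContext() { stoppedContext = true; }
}
```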

@vanzin
Contributor

vanzin commented Aug 4, 2015

Looks good, I just think we need a comment explaining the code for future readers.

@Sephiroth-Lin
Contributor Author

@vanzin @srowen Updated, thank you!

Contributor

nit: "for SPARK-9519".

@vanzin
Contributor

vanzin commented Aug 5, 2015

LGTM, I'll leave it here to see if anyone else has comments, otherwise I'll merge in the morning.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39798 has finished for PR 7846 at commit 2e8e365.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39809 has finished for PR 7846 at commit 1ae736d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 5, 2015
… killed

Currently, when we kill an application on YARN, sc.stop() is called from the YARN application state monitor thread. YarnClientSchedulerBackend.stop() then calls interrupt on that same thread, which prevents the SparkContext from stopping fully because we are still waiting for the executors to exit.

Author: linweizhong <linweizhong@huawei.com>

Closes #7846 from Sephiroth-Lin/SPARK-9519 and squashes the following commits:

1ae736d [linweizhong] Update comments
2e8e365 [linweizhong] Add comment explaining the code
ad0e23b [linweizhong] Update
243d2c7 [linweizhong] Confirm stop sc successfully when application was killed

(cherry picked from commit 7a969a6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@vanzin
Contributor

vanzin commented Aug 5, 2015

Merged to master and 1.5, thanks!

@asfgit asfgit closed this in 7a969a6 Aug 5, 2015
@Sephiroth-Lin Sephiroth-Lin deleted the SPARK-9519 branch May 15, 2016 10:10