[SPARK-22760][CORE][YARN] When sc.stop() is called, set stopped is true before removing executors #19951

KaiXinXiaoLei · 2017-12-12T11:45:01Z

What changes were proposed in this pull request?

When there are many executors and YarnSchedulerBackend.stop() is running, an executor may stop before YarnSchedulerBackend.stopped is set to true. In that case YarnSchedulerBackend.onDisconnected() is called, which produces the following error:

17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

So I changed the code: when removing an executor, YarnSchedulerBackend.onDisconnected() checks sc.isStopped. If sc.isStopped is true, the message is not sent.
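The guard described above can be modelled in a few lines. This is only a minimal sketch in Python, not the actual Scala change: `Scheduler`, `Backend`, `on_disconnected`, and the `("ReviveOffers", executor_id)` message shape are hypothetical stand-ins chosen for illustration.

```python
class Scheduler:
    """Hypothetical stand-in for the driver-side scheduler endpoint."""

    def __init__(self):
        self.registered = True
        self.sent = []

    def send(self, msg):
        # Mirrors the error in the log above: sending to an endpoint
        # that is no longer registered fails.
        if not self.registered:
            raise RuntimeError(
                "Could not find CoarseGrainedScheduler or it has been stopped."
            )
        self.sent.append(msg)


class Backend:
    """Hypothetical stand-in for the YARN scheduler backend."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.stopped = False

    def stop(self):
        # Set the flag first, then tear the endpoint down.
        self.stopped = True
        self.scheduler.registered = False

    def on_disconnected(self, executor_id):
        # The guard: once stopping has begun, do not message the scheduler.
        if self.stopped:
            return False
        self.scheduler.send(("ReviveOffers", executor_id))
        return True


backend = Backend(Scheduler())
backend.on_disconnected("63")   # sends normally while running
backend.stop()
backend.on_disconnected("64")   # skipped, so no exception is raised
```

The sketch only shows the ordering idea (flag first, teardown second); it says nothing about events that were already being handled when stop() began.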

How was this patch tested?

Run "spark-sql --master yarn -f query.sql" many times; the problem appears.


SparkQA · 2017-12-12T11:49:12Z

Test build #84766 has finished for PR 19951 at commit c4dcc19.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-13T01:36:53Z

Test build #84814 has finished for PR 19951 at commit 40bb11f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

KaiXinXiaoLei · 2017-12-13T01:52:10Z

@devaraj-kavali @vanzin, using #19741, I still see the problem "Could not find CoarseGrainedScheduler". I changed the code; please review. Thanks.

srowen · 2017-12-13T16:48:34Z

@yoonlee95, or maybe @tgravescs: does this make sense?

tgravescs · 2017-12-13T19:21:55Z

Overall this seems to make sense, but I would like a few more details.

There were changes to the dispatcher to try to ignore some of these errors:
https://github.com/apache/spark/pull/18547/files

These look like different messages than that one, as it handled mostly rpcenvstopped.

So when these errors come out, the dispatcher is not stopped yet?
Do you see lots of these errors or just a single one? Do they cause job failure or just clutter the logs?

vanzin · 2017-12-13T21:25:31Z

This seems like a race during shutdown:

1. An executor disconnects, which causes an "on disconnect" event to be queued.
2. At the same time, the stop() thread ends up calling Dispatcher.stop(), which unregisters all endpoints and enqueues a message that stops each endpoint receiver.
3. The driver endpoint inbox is drained; the "on disconnect" callback is called, and the driver tries to send a message to itself, but because it has been unregistered above, it fails.

You could argue that what the RpcEnv is doing above is sort of fishy (delivering messages to the endpoint after it's already been unregistered), but this looks like an ok workaround.
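The sequence above can be replayed deterministically in a small model. This is a simplified sketch in Python, not Spark's real Dispatcher or Inbox: the `Dispatcher` class, the `drain` helper, and the string messages are all hypothetical, and the interleaving is forced by call order rather than by threads.

```python
from collections import deque


class EndpointStopped(Exception):
    """Raised when posting to an endpoint that is no longer registered."""


class Dispatcher:
    """Hypothetical, heavily simplified model of an RPC dispatcher."""

    def __init__(self):
        self.endpoints = {}

    def register(self, name):
        self.endpoints[name] = deque()

    def post(self, name, msg):
        if name not in self.endpoints:
            raise EndpointStopped(
                f"Could not find {name} or it has been stopped."
            )
        self.endpoints[name].append(msg)

    def stop(self, name):
        # Unregister first, then enqueue a stop marker on the inbox,
        # which may still hold earlier messages.
        inbox = self.endpoints.pop(name)
        inbox.append("OnStop")
        return inbox


def drain(dispatcher, name, inbox):
    """Process queued messages; collect errors raised by handlers."""
    errors = []
    while inbox:
        msg = inbox.popleft()
        if msg == "OnStop":
            break
        if msg == "OnDisconnected":
            try:
                # The handler tries to send a message to its own endpoint.
                dispatcher.post(name, "ReviveOffers")
            except EndpointStopped as exc:
                errors.append(str(exc))
    return errors


name = "CoarseGrainedScheduler"
dispatcher = Dispatcher()
dispatcher.register(name)
dispatcher.post(name, "OnDisconnected")   # step 1: event is queued
inbox = dispatcher.stop(name)             # step 2: endpoint unregistered
errors = drain(dispatcher, name, inbox)   # step 3: callback fails to send
```

After the last line, `errors` holds the single "Could not find ... or it has been stopped." message, because the queued event is delivered after the endpoint was unregistered.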

vanzin · 2017-12-13T22:00:59Z

BTW, it might still be possible to hit that race even with the changes here. I'm not sure there's a way to completely get rid of it, though, so perhaps catching the exception and not logging it if things are stopping might be a more sure way to get rid of the logs.
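As a rough illustration of that suggestion, the sketch below swallows the failure only when shutdown is in progress and lets it propagate otherwise. It is written in Python as a model, not as Spark code: `send_quietly`, `EndpointStopped`, and the `is_stopping` callback are hypothetical names introduced here.

```python
class EndpointStopped(Exception):
    """Hypothetical error for sending to an endpoint that has been stopped."""


def send_quietly(send, msg, is_stopping):
    """Call send(msg); suppress EndpointStopped only during shutdown.

    Returns True if the message was sent and False if the failure was
    suppressed. Outside shutdown the exception is re-raised so that the
    usual error handling and logging still apply.
    """
    try:
        send(msg)
        return True
    except EndpointStopped:
        if is_stopping():
            return False  # expected during shutdown: nothing to log
        raise


def failing_send(msg):
    raise EndpointStopped(
        "Could not find CoarseGrainedScheduler or it has been stopped."
    )


# During shutdown the failure is swallowed silently.
send_quietly(failing_send, "ReviveOffers", is_stopping=lambda: True)
```

Unlike a flag check before sending, this handles the failure at the point where it occurs, so it does not depend on winning the race.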

KaiXinXiaoLei · 2017-12-14T01:43:00Z

@tgravescs The job has already ended when the error log appears. I think this error log misleads people into believing the task failed.

@vanzin Your analysis is right. I think my code can get rid of this exception, because when things are stopping, sc.stopped is set to true first, so the driver endpoint does not need to send messages to itself.

vanzin · 2017-12-14T01:47:36Z

Yeah, but if an executor died right before the context was stopped, the "on disconnect" event might be in the queue when the stop call happens, and trigger the same code path that will throw the exception.

KaiXinXiaoLei · 2017-12-14T01:58:31Z

What I mean is: once CoarseGrainedSchedulerBackend.stopExecutors() is called, some executors exit. The driver does not need to treat these executors as disconnected and send a message; otherwise the exception appears.

KaiXinXiaoLei · 2017-12-14T02:00:24Z

I have another way to fix this problem:

When CoarseGrainedSchedulerBackend.stopExecutors() is called, use executorsPendingToRemove to record the stopped executors. Then, when the driver sees these executors disconnect, it will not send the message, because CoarseGrainedSchedulerBackend.disableExecutor() returns false.

vanzin · 2017-12-14T02:06:57Z

I know what you're saying. I'm saying that you're not considering another race that also exists in this code.

KaiXinXiaoLei · 2017-12-14T02:12:03Z

@vanzin Yeah, it is difficult to consider all the races. I continued to analyze the source code, and I think my other approach solves the problem better.

rezasafi · 2017-12-14T20:55:21Z

This is interesting. Thanks for working on it. It seems to me that the race here is benign, and removing all the race cases would require extra communication that may introduce extra overhead.

SparkQA · 2018-03-12T13:43:41Z

Test build #88177 has finished for PR 19951 at commit 40bb11f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2018-03-13T08:17:37Z

It's not possible to differentiate the two race conditions @vanzin described in this code path without adding extra communication load. Since this should be a minor issue and there is no simple fix for it, I'd suggest we close this PR for now and revisit when someone comes up with a better solution. WDYT @vanzin @tgravescs @rezasafi @srowen @KaiXinXiaoLei?

vanzin · 2018-03-13T16:49:13Z

Since the target of the fix is silencing a misleading exception, handling that exception as I suggested before would be a feasible solution. But anything more complicated than that is overkill.

Closes apache#20458 Closes apache#20530 Closes apache#20557 Closes apache#20966 Closes apache#20857 Closes apache#19694 Closes apache#18227 Closes apache#20683 Closes apache#20881 Closes apache#20347 Closes apache#20825 Closes apache#20078 Closes apache#21281 Closes apache#19951 Closes apache#20905 Closes apache#20635 Author: Sean Owen <srowen@gmail.com> Closes apache#21303 from srowen/ClosePRs.

change code

c4dcc19

code style

40bb11f

vanzin mentioned this pull request May 11, 2018

[BUILD] Close stale PRs #21303

Closed

asfgit closed this in 348ddfd May 12, 2018
