Skip to content

Conversation

@KaiXinXiaoLei
Copy link

@KaiXinXiaoLei KaiXinXiaoLei commented Dec 12, 2017

What changes were proposed in this pull request?

When the number of executors is big, and YarnSchedulerBackend.stop() is running,
before YarnSchedulerBackend.stopped=true, if some executor is stoped, then YarnSchedulerBackend.onDisconnected() will be called. There is a problem as follows:

17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down
17/12/12 15:34:45 INFO YarnClientSchedulerBackend: Disabling executor 63.
17/12/12 15:34:45 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:163)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:133)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:356)
at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:497)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:301)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:217)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

. So i change the code, when removing executor, check sc.isStopped in YarnSchedulerBackend.onDisconnected(). if sc.isStopped=true, the message will not be sent.

How was this patch tested?

Run "spark-sql --master yarn -f query.sql" many times, the problem will be exists.

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA
Copy link

SparkQA commented Dec 12, 2017

Test build #84766 has finished for PR 19951 at commit c4dcc19.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84814 has finished for PR 19951 at commit 40bb11f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@KaiXinXiaoLei
Copy link
Author

@devaraj-kavali @vanzin ,using #19741, i still find the problem "Could not find CoarseGrainedScheduler", i change the code ,please review, thanks

@srowen
Copy link
Member

srowen commented Dec 13, 2017

@yoonlee95 maybe @tgravescs does this make sense?

@tgravescs
Copy link
Contributor

Overall seems to make sense would like a few more details.

there were changes to the dispatcher to try to ignore some of the these errors:
https://github.com/apache/spark/pull/18547/files

these look like different messages then that one as it handled mostly rpcenvstopped.

So when these errors come out the dispatcher is not stopped yet?
you see lots of these errors or just a single one? Do these cause job failure or just clutter the logs?

@vanzin
Copy link
Contributor

vanzin commented Dec 13, 2017

This seems like a race during shutdown:

  • executor disconnects, which causes "on disconnect" event to be queued
  • at the same time, the stop() thread ends up calling Dispatcher.stop() which unregisters all endpoints and enqueues a message that stops each endpoint receiver
  • driver endpoint inbox is drained; "on disconnect" callback is called, driver tries to send a message to itself, but because it has been unregistered above, it fails.

You could argue that what the RpcEnv is doing above is sort of fishy (delivering messages to the endpoint after it's already been unregistered), but this looks like an ok workaround.

@vanzin
Copy link
Contributor

vanzin commented Dec 13, 2017

BTW, it might still be possible to hit that race even with the changes here. I'm not sure there's a way to completely get rid of it, though, so perhaps catching the exception and not logging it if things are stopping might be a more sure way to get rid of the logs.

@KaiXinXiaoLei
Copy link
Author

@tgravescs the job is end, then error log apper. i think this error log will cause illusions to believe the failure of the task.

@vanzin Your analysis is right. And i think my code can get rid of this exception. Because when things are stopping, the sc.stopped is set true firstly. So driver endpoint does not need to send messages to itself.

@vanzin
Copy link
Contributor

vanzin commented Dec 14, 2017

Yeah, but if an executor died right before the context was stopped, the "on disconnect" event might be in the queue when the stop call happens, and trigger the same code path that will throw the exception.

@KaiXinXiaoLei
Copy link
Author

My mean is, CoarseGrainedSchedulerBackend.stopExecutors() is called, then same executors is exited. The driver does not need feel these executor is disconnected and send message, otherwise the exception will be appear.

@KaiXinXiaoLei
Copy link
Author

KaiXinXiaoLei commented Dec 14, 2017

I have another way to modify this problem:
change coarsegrainedschedulerbackend

when CoarseGrainedSchedulerBackend.stopExecutors() is called, and use executorsPendingToRemove to log the stopped executors, and when driver feel these executor disconnected, will not send message because CoarseGrainedSchedulerBackend.disableExecutor() is set false

@vanzin
Copy link
Contributor

vanzin commented Dec 14, 2017

I know what you're saying. I'm saying that you're not considering another race that also exists in this code.

@KaiXinXiaoLei
Copy link
Author

@vanzin yeah, it is difficult to consider all the race. So i continue to analyze the source code, and i think my another way to solve the problem better.

@rezasafi
Copy link
Contributor

This is interesting. Thanks for working on it. It seems to me that the race situation is benign here and removing all the race cases will cause some extra communications that may introduce extra over-head.

@SparkQA
Copy link

SparkQA commented Mar 12, 2018

Test build #88177 has finished for PR 19951 at commit 40bb11f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

It's not possible to differentiate the two race conditions @vanzin described in this code path without adding extra communication loads, since this should be a minor issue and there is no simple fix for it, I'd suggest we close this PR for now and revisit when someone come off a better solution, WDYT @vanzin @tgravescs @rezasafi @srowen @KaiXinXiaoLei ?

@vanzin
Copy link
Contributor

vanzin commented Mar 13, 2018

Since the target of the fix is silencing a misleading exception, handling that exception as I suggested before would be a feasible solution. But anything more complicated than that is overkill.

@vanzin vanzin mentioned this pull request May 11, 2018
@asfgit asfgit closed this in 348ddfd May 12, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#20458
Closes apache#20530
Closes apache#20557
Closes apache#20966
Closes apache#20857
Closes apache#19694
Closes apache#18227
Closes apache#20683
Closes apache#20881
Closes apache#20347
Closes apache#20825
Closes apache#20078

Closes apache#21281
Closes apache#19951
Closes apache#20905
Closes apache#20635

Author: Sean Owen <srowen@gmail.com>

Closes apache#21303 from srowen/ClosePRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants