
Conversation

@andrewor14
Contributor

This is based on @KaiXinXiaoLei's changes in #7716.

The issue is that if someone calls sc.killExecutor("1") on the same executor twice in quick succession, the executor target is adjusted downwards by 2 instead of 1, even though only one executor is actually being killed. In cases where we don't adjust the target back upwards quickly enough, jobs can end up hanging.

This is a common danger because there are many places where this is called:

  • HeartbeatReceiver kills an executor that has not been sending heartbeats
  • ExecutorAllocationManager kills an executor that has been idle
  • The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.
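
The fix amounts to not decrementing the target again for an executor that is already pending removal. Below is a minimal sketch of that idea, using hypothetical names rather than the actual CoarseGrainedSchedulerBackend internals:

import scala.collection.mutable

// Hypothetical stand-in for the backend's kill bookkeeping. The point is that
// a second kill request for the same executor must not lower the target again.
class KillTracker(initialTarget: Int) {
  private val pendingToRemove = mutable.HashSet[String]()
  private var executorTarget = initialTarget

  def killExecutors(ids: Seq[String]): Unit = {
    // Only executors not already pending removal count towards the adjustment
    val newKills = ids.filterNot(pendingToRemove.contains)
    pendingToRemove ++= newKills
    executorTarget -= newKills.size
  }

  def target: Int = executorTarget
}

object KillTrackerDemo extends App {
  val tracker = new KillTracker(initialTarget = 10)
  tracker.killExecutors(Seq("1"))
  tracker.killExecutors(Seq("1"))  // duplicate request: not counted again
  assert(tracker.target == 9)      // without the filter this would be 8
}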

@andrewor14
Contributor Author

@vanzin @srowen could you have a look?

@andrewor14 andrewor14 changed the title [SPARK-9795] Dynamic allocation: avoid double counting when killing same executor [SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice Aug 10, 2015
Contributor

Nit: It looks like unknownExecutors is useful only for that one log statement. If we don't need that log line, we could reduce the number of set copies and traversals. How about something like this (this is less scala-like, but reduces the number of traversals and copies):

import scala.collection.mutable.HashSet

// Single pass: collect only the executor ids the backend actually knows about
val knownExecutors = new HashSet[String]
executorIds.foreach { id =>
  if (executorDataMap.contains(id)) {
    knownExecutors += id
  }
}

This also makes the other changes in this file unnecessary.
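
For context, the more scala-like shape this nit is comparing against would be a single partition call; this is a rough sketch, with the warning message wording assumed rather than taken from the patch:

val (knownExecutors, unknownExecutors) =
  executorIds.partition(id => executorDataMap.contains(id))
unknownExecutors.foreach { id =>
  logWarning(s"Executor to kill $id does not exist!")
}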

Contributor Author

That looks fine, but in this patch I wanted to limit the scope of the changes so I'm going to leave this as is.

@harishreedharan
Contributor

LGTM

@vanzin
Contributor

vanzin commented Aug 10, 2015

LGTM

@SparkQA

SparkQA commented Aug 10, 2015

Test build #40318 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #1430 timed out for PR 8078 at commit fb149da after a configured wait of 175m.

@SparkQA

SparkQA commented Aug 11, 2015

Test build #1429 timed out for PR 8078 at commit fb149da after a configured wait of 175m.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40479 has finished for PR 8078 at commit fb149da.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40508 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2015

Test build #1460 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@andrewor14
Contributor Author

retest this please, just in case

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40593 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

retest this please. Pretty sure I didn't change any SQL...

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40596 has finished for PR 8078 at commit fb149da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40608 has finished for PR 8078 at commit fb149da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor Author

Alright, the latest commit actually passed tests, so I'm going to merge this into master and 1.5.

asfgit pushed a commit that referenced this pull request Aug 12, 2015
[SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice

This is based on KaiXinXiaoLei's changes in #7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>

Closes #8078 from andrewor14/da-double-kill.

(cherry picked from commit be5d191)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in be5d191 Aug 12, 2015
@andrewor14 andrewor14 deleted the da-double-kill branch August 12, 2015 16:26
CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
[SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice

This is based on KaiXinXiaoLei's changes in apache#7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>

Closes apache#8078 from andrewor14/da-double-kill.