
Conversation

@KaiXinXiaoLei

In the code:
if (!replace) {
  doRequestTotalExecutors(numExistingExecutors + numPendingExecutors
    - executorsPendingToRemove.size - knownExecutors.size)
}

The value of `numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size - knownExecutors.size` may be negative if the same executorId appears in both `knownExecutors` and `executorsPendingToRemove`. `knownExecutors` and `executorsPendingToRemove` should not share an executorId.
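A minimal standalone sketch of the arithmetic (simplified variables, not the actual `CoarseGrainedSchedulerBackend` state): when an executorId appears in both sets, it is subtracted twice, and filtering out already-pending executors avoids that.

```scala
// Simplified model of the target recalculation; not the real Spark code.
object TargetCount {
  def main(args: Array[String]): Unit = {
    val numExistingExecutors = 1
    val numPendingExecutors  = 0
    val executorsPendingToRemove = Set("1") // "1" already marked for removal
    val knownExecutors           = Set("1") // killExecutor("1") called again

    // Naive total: executor "1" is subtracted twice, giving -1.
    val naive = numExistingExecutors + numPendingExecutors -
      executorsPendingToRemove.size - knownExecutors.size
    println(naive)

    // Guard: only count executors not already pending removal.
    val newlyKilled = knownExecutors -- executorsPendingToRemove
    val guarded = numExistingExecutors + numPendingExecutors -
      executorsPendingToRemove.size - newlyKilled.size
    println(guarded)
  }
}
```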

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38653 has finished for PR 7716 at commit a6fb8aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@KaiXinXiaoLei
Author

I think the failed unit tests are unrelated to this patch, and the test passes on my machine. Please retest.

@KaiXinXiaoLei
Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #133 has finished for PR 7716 at commit a6fb8aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38684 has finished for PR 7716 at commit a6fb8aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 28, 2015

OK, that looks possibly legitimate -- CC @andrewor14 for a look at this.

@sryza
Contributor

sryza commented Jul 28, 2015

@KaiXinXiaoLei we fixed a couple of issues that had this symptom in 1.3. Are you definitely running with a version that's 1.4 or later?

@KaiXinXiaoLei
Author

I'm running the latest version.

@sryza
Contributor

sryza commented Aug 3, 2015

@andrewor14, the code that @KaiXinXiaoLei suggests fixing seems to be code most recently updated in SPARK-8119.

@KaiXinXiaoLei
Author

I think it's not the same. The same executorId may appear in both `knownExecutors` and `executorsPendingToRemove`.

@andrewor14
Contributor

I've bumped the priority of this issue in case it is a regression.

@KaiXinXiaoLei I'm trying to understand how it's possible for an executor ID to be in both `knownExecutors` and `executorsPendingToRemove`. In the driver logs of the application that had this negative number, can you search for the message about no recent heartbeats? I wonder whether this is actually related to the heartbeat receiver mechanism.

@andrewor14
Contributor

I guess one way to reproduce this would be to call `sc.killExecutor("1")` twice. Since it's the same executor both times, we may double count the number of executors to remove, which could result in the -1 you saw.

However, I believe this problem already existed in 1.4 well before SPARK-8119 and related changes. The old code didn't guard against that either:

val newTotal = numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size
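The double-kill scenario above can be sketched as follows (hypothetical names, not the actual scheduler backend internals): making the removal idempotent means the second kill of the same executor no longer lowers the target.

```scala
// Hypothetical sketch of a backend that guards against killing the
// same executor twice; not the actual Spark implementation.
object KillGuard {
  var executorTarget = 2
  val pendingToRemove = scala.collection.mutable.Set[String]()

  def killExecutor(id: String): Unit = {
    // mutable.Set.add returns false if the id was already present,
    // so the target is only adjusted for newly killed executors.
    if (pendingToRemove.add(id)) {
      executorTarget -= 1
    }
  }

  def main(args: Array[String]): Unit = {
    killExecutor("1")
    killExecutor("1") // second call is a no-op for the target
    println(executorTarget)
  }
}
```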

@KaiXinXiaoLei
Author

@andrewor14 I was using code from before SPARK-8119. I've now tested again with the latest code and could not reproduce the problem, so I'm closing the PR. Thanks.

@andrewor14
Contributor

@KaiXinXiaoLei I think this patch is good to have, however. It does guard against the scenario I discussed earlier, where the user calls `sc.killExecutor` twice on the same executor. There are potentially many places, even within Spark, where we might do this, and the result is that we lower the executor target too much. I even suspect this may be the cause of SPARK-9745.

I would like to merge a patch with this change + a regression test. Would you mind re-opening this? If not I can also do it myself.

@andrewor14
Contributor

I've opened #8078, which should be functionally the same as this patch + has a regression test.

asfgit pushed a commit that referenced this pull request Aug 12, 2015
…ame executor twice

This is based on KaiXinXiaoLei's changes in #7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>

Closes #8078 from andrewor14/da-double-kill.

(cherry picked from commit be5d191)
Signed-off-by: Andrew Or <andrew@databricks.com>
asfgit pushed a commit that referenced this pull request Aug 12, 2015
…ame executor twice

CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
…ame executor twice
