
Conversation

@KaiXinXiaoLei

In the code:
if (!replace) {
  doRequestTotalExecutors(numExistingExecutors + numPendingExecutors
    - executorsPendingToRemove.size - knownExecutors.size)
}

The value of `numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size - knownExecutors.size` may be negative if the same executorId appears in both `knownExecutors` and `executorsPendingToRemove`. `knownExecutors` and `executorsPendingToRemove` should not share an executorId.
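A minimal standalone sketch of the arithmetic (simplified variables, not the actual `CoarseGrainedSchedulerBackend` state): when an executorId appears in both sets, it is subtracted twice, and filtering out already-pending executors avoids that.

```scala
// Simplified model of the target recalculation; not the real Spark code.
object TargetCount {
  def main(args: Array[String]): Unit = {
    val numExistingExecutors = 1
    val numPendingExecutors  = 0
    val executorsPendingToRemove = Set("1") // "1" already marked for removal
    val knownExecutors           = Set("1") // killExecutor("1") called again

    // Naive total: executor "1" is subtracted twice, giving -1.
    val naive = numExistingExecutors + numPendingExecutors -
      executorsPendingToRemove.size - knownExecutors.size
    println(naive)

    // Guard: only count executors not already pending removal.
    val newlyKilled = knownExecutors -- executorsPendingToRemove
    val guarded = numExistingExecutors + numPendingExecutors -
      executorsPendingToRemove.size - newlyKilled.size
    println(guarded)
  }
}
```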

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38653 has finished for PR 7716 at commit a6fb8aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@KaiXinXiaoLei
Author

I think the failed unit tests are unrelated to this patch, and the test passes on my machine. Please retest.

@KaiXinXiaoLei
Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #133 has finished for PR 7716 at commit a6fb8aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #38684 has finished for PR 7716 at commit a6fb8aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 28, 2015

OK, that looks possibly legitimate -- CC @andrewor14 for a look at this.

@sryza
Contributor

sryza commented Jul 28, 2015

@KaiXinXiaoLei we fixed a couple of issues that had this symptom in 1.3. Are you definitely running with a version that's 1.4 or later?

@KaiXinXiaoLei
Author

I'm running the latest version.

@sryza
Contributor

sryza commented Aug 3, 2015

@andrewor14, the code that @KaiXinXiaoLei suggests fixing seems to be code most recently updated in SPARK-8119.

@KaiXinXiaoLei
Author

I think it's not the same. The same executorId may appear in both `knownExecutors` and `executorsPendingToRemove`.

@andrewor14
Contributor

I've bumped the priority of this issue in case it is a regression.

@KaiXinXiaoLei I'm trying to understand how it's possible for an executor ID to be in both `knownExecutors` and `executorsPendingToRemove`. In the driver logs of the application that had this negative number, can you search for the message about no recent heartbeats? I wonder whether this is actually related to the heartbeat receiver mechanism.

@andrewor14
Contributor

I guess one way to reproduce this would be to call `sc.killExecutor("1")` twice. Since it's the same executor both times, we may double count the number of executors to remove, which could result in the -1 you saw.

However, I believe this problem already existed in 1.4 well before SPARK-8119 and related changes. The old code didn't guard against that either:

val newTotal = numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size
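The double-kill scenario above can be sketched as follows (hypothetical names, not the actual scheduler backend internals): making the removal idempotent means the second kill of the same executor no longer lowers the target.

```scala
// Hypothetical sketch of a backend that guards against killing the
// same executor twice; not the actual Spark implementation.
object KillGuard {
  var executorTarget = 2
  val pendingToRemove = scala.collection.mutable.Set[String]()

  def killExecutor(id: String): Unit = {
    // mutable.Set.add returns false if the id was already present,
    // so the target is only adjusted for newly killed executors.
    if (pendingToRemove.add(id)) {
      executorTarget -= 1
    }
  }

  def main(args: Array[String]): Unit = {
    killExecutor("1")
    killExecutor("1") // second call is a no-op for the target
    println(executorTarget)
  }
}
```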

@KaiXinXiaoLei
Author

@andrewor14 I was using code from before SPARK-8119. I've now tested again with the latest code and could not reproduce the problem, so I'm closing the PR. Thanks.

@andrewor14
Contributor

@KaiXinXiaoLei I think this patch is good to have, however. It does guard against the scenario I discussed earlier, where the user calls `sc.killExecutor` twice on the same executor. There are potentially many places, even within Spark, where we might do this, and the result is that we lower the executor target too much. I even suspect this may be the cause of SPARK-9745.

I would like to merge a patch with this change + a regression test. Would you mind re-opening this? If not I can also do it myself.

@andrewor14
Contributor

I've opened #8078, which should be functionally the same as this patch + has a regression test.

asfgit pushed a commit that referenced this pull request Aug 12, 2015
…ame executor twice

This is based on KaiXinXiaoLei's changes in #7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>

Closes #8078 from andrewor14/da-double-kill.

(cherry picked from commit be5d191)
Signed-off-by: Andrew Or <andrew@databricks.com>
asfgit pushed a commit that referenced this pull request Aug 12, 2015
…ame executor twice

CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
…ame executor twice
