[SPARK-9375] Make sure the total number of executor(s) requested by the driver is not negative #7716
Conversation
Test build #38653 has finished for PR 7716 at commit
I think this patch has no association with the failed unit tests, and on my machine this test passes. Please retest.
Jenkins, retest this please.
Test build #133 has finished for PR 7716 at commit
Test build #38684 has finished for PR 7716 at commit
OK, that looks possibly legitimate -- CC @andrewor14 for a look at this.
@KaiXinXiaoLei we fixed a couple of issues that had this symptom in 1.3. Are you definitely running with a version that's 1.4 or later?
I am running the latest version.
@andrewor14, the code that @KaiXinXiaoLei suggests fixing seems to be code most recently updated in SPARK-8119.
I think it's not the same. The same executorId may appear in both `knownExecutors` and `executorsPendingToRemove`.
I've bumped the priority of this issue in case it is a regression. @KaiXinXiaoLei I'm trying to understand how it's possible for an executor ID to be in both `knownExecutors` and `executorsPendingToRemove`.
I guess one way to reproduce this would be to call `sc.killExecutor` on the same executor twice. However, I believe this problem already existed in 1.4 well before SPARK-8119 and related changes. The old code didn't guard against that either: spark/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala, line 391 in 4b5bbc5.
@andrewor14 I was using the code from before SPARK-8119. Now I'm using the latest code; I tested again and did not find this problem. I'm closing this PR now, thanks.
@KaiXinXiaoLei I think this patch is good to have, however. It does guard against the scenario I discussed earlier, where the user calls `sc.killExecutor` on the same executor twice. I would like to merge a patch with this change + a regression test. Would you mind re-opening this? If not I can also do it myself.
I've opened #8078, which should be functionally the same as this patch + has a regression test. |
…ame executor twice

This is based on KaiXinXiaoLei's changes in #7716. The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging. This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>
Closes #8078 from andrewor14/da-double-kill.
(cherry picked from commit be5d191)
Signed-off-by: Andrew Or <andrew@databricks.com>
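For illustration, here is a minimal Scala sketch of the deduplication idea that commit describes; `ExecutorTracker` is a hypothetical stand-in, not the actual Spark scheduler code. The point is that a repeated kill request for the same executor id should not lower the requested target a second time.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the scheduler backend's bookkeeping (not Spark code):
// killing the same executor twice must only decrement the target once.
class ExecutorTracker(initialTarget: Int) {
  private val executorsPendingToRemove = mutable.HashSet.empty[String]
  private var requestedTotal = initialTarget

  /** Returns the new requested total; a duplicate kill of the same id is a no-op. */
  def killExecutor(id: String): Int = synchronized {
    // HashSet.add returns false if the id was already pending removal,
    // so the target is only adjusted for the first kill request.
    if (executorsPendingToRemove.add(id)) {
      requestedTotal = math.max(requestedTotal - 1, 0)
    }
    requestedTotal
  }
}

object ExecutorTrackerDemo extends App {
  val tracker = new ExecutorTracker(initialTarget = 2)
  println(tracker.killExecutor("1")) // 1
  println(tracker.killExecutor("1")) // still 1: the duplicate kill does not double-count
}
```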
In the code:
```scala
if (!replace) {
  doRequestTotalExecutors(numExistingExecutors + numPendingExecutors
    - executorsPendingToRemove.size - knownExecutors.size)
}
```
The value of `numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size - knownExecutors.size` may be negative if the same executorId appears in both `knownExecutors` and `executorsPendingToRemove`. `knownExecutors` and `executorsPendingToRemove` should not contain the same executorId.
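To make the failure mode concrete, here is a small sketch (assumed names; `newTotal` is a hypothetical helper, not the actual `CoarseGrainedSchedulerBackend` code) of the kind of guard this PR is after: ids already pending removal are not subtracted twice, and the requested total is clamped at zero so it can never go negative.

```scala
import scala.collection.mutable

object RequestTotalSketch {
  /** Hypothetical helper mirroring the expression quoted above, with two guards added. */
  def newTotal(
      numExistingExecutors: Int,
      numPendingExecutors: Int,
      executorsPendingToRemove: mutable.HashSet[String],
      knownExecutors: Seq[String]): Int = {
    // Only count executors that are not already pending removal, so an id that
    // appears in both collections is not subtracted twice.
    val newlyKilled = knownExecutors.count(id => !executorsPendingToRemove.contains(id))
    val total = numExistingExecutors + numPendingExecutors -
      executorsPendingToRemove.size - newlyKilled
    // Never ask the cluster manager for a negative number of executors.
    math.max(total, 0)
  }

  def main(args: Array[String]): Unit = {
    // Executor "1" is already pending removal, so it is not subtracted a second time.
    println(newTotal(2, 0, mutable.HashSet("1"), Seq("1"))) // 1, not 0
  }
}
```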