[SPARK-9795] Dynamic allocation: avoid double counting when killing same executor twice #8078
Nit: It looks like `unknownExecutors` is useful only for that one log statement. If we don't need that log line, we could reduce the number of set copies and traversals. How about something like this (less Scala-like, but with fewer traversals and copies):
    val knownExecutors = new HashSet[String]
    executorsIds.foreach { id =>
      if (executorDataMap.contains(id)) {
        knownExecutors += id
      }
    }
This also makes the other changes in this file unnecessary.
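A self-contained sketch of the single-pass loop suggested above, for readers who want to run it. The `KnownExecutors` object, the `knownExecutorIds` name, and the plain `Map` standing in for `executorDataMap` are all assumptions made for illustration; none of them is Spark's actual code.

```scala
import scala.collection.mutable

// Hypothetical wrapper so the snippet compiles on its own; not part of Spark.
object KnownExecutors {
  // `executorDataMap` is modelled here as a plain immutable Map (an assumption).
  def knownExecutorIds(
      requested: Seq[String],
      executorDataMap: Map[String, Any]): mutable.HashSet[String] = {
    val known = mutable.HashSet.empty[String]
    // One traversal of the requested ids, one set built, no intermediate copies.
    requested.foreach { id =>
      if (executorDataMap.contains(id)) {
        known += id
      }
    }
    known
  }
}
```

If the unknown ids are also needed (for example, for a warning log), `requested.partition(executorDataMap.contains)` returns both groups in one pass, at the cost of building a second collection.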
That looks fine, but in this patch I wanted to limit the scope of the changes so I'm going to leave this as is.
LGTM
LGTM
Test build #40318 has finished for PR 8078 at commit
retest this please
Test build #1430 timed out for PR 8078 at commit
Test build #1429 timed out for PR 8078 at commit
retest this please
Test build #40479 has finished for PR 8078 at commit
retest this please
retest this please
Test build #40508 has finished for PR 8078 at commit
Test build #1460 has finished for PR 8078 at commit
Jenkins, retest this please.
retest this please, just in case
Test build #40593 has finished for PR 8078 at commit
retest this please. Pretty sure I didn't change any SQL...
Test build #40596 has finished for PR 8078 at commit
Test build #40608 has finished for PR 8078 at commit
Alright, the latest commit actually passed tests so I'm going to merge this into master and 1.5.
…ame executor twice

This is based on KaiXinXiaoLei's changes in #7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.

Author: Andrew Or <andrew@databricks.com>

Closes #8078 from andrewor14/da-double-kill.

(cherry picked from commit be5d191)
Signed-off-by: Andrew Or <andrew@databricks.com>
This is based on @KaiXinXiaoLei's changes in #7716.

The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.

This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers

While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.
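The double counting described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyAllocator`, `killNaive`, and `killGuarded` are hypothetical names, and the guard shown (skipping executors already pending removal) is one plausible way to avoid the problem, not necessarily the exact change made in this patch.

```scala
import scala.collection.mutable

// Hypothetical model of the bookkeeping; the names are illustrative, not Spark's API.
final class ToyAllocator(executors: Set[String]) {
  private val pendingToRemove = mutable.HashSet.empty[String]
  private var target: Int = executors.size

  def executorTarget: Int = target

  // Naive: every kill request for a known executor lowers the target,
  // even if that executor is already on its way out.
  def killNaive(id: String): Boolean =
    if (executors.contains(id)) {
      pendingToRemove += id
      target -= 1
      true
    } else false

  // Guarded: a repeated request for an executor already pending removal is ignored,
  // so the target drops once per executor actually being killed.
  def killGuarded(id: String): Boolean =
    if (executors.contains(id) && !pendingToRemove.contains(id)) {
      pendingToRemove += id
      target -= 1
      true
    } else false
}
```

With three executors, calling `killNaive("1")` twice leaves the target at 1 although only one executor is being killed, while calling `killGuarded("1")` twice leaves it at 2.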