[SPARK-8119] HeartbeatReceiver should replace executors, not kill #7107
Conversation
This is an alternative to #6662. @vanzin @SaintBacchus
(Force-pushed from d9c9336 to 18b110e.)
Test build #36067 has finished for PR 7107 at commit
retest this please
Test build #36070 has finished for PR 7107 at commit
retest this please |
Perhaps instead of adding this method, add a boolean flag to killExecutors above?
(Or, because of ExecutorAllocationClient, create a private method that takes the list of executors and a boolean saying whether to update the target executor count, and call it from both of these methods.)
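For readers following along, that second suggestion could take roughly this shape. A minimal sketch with stubbed internals — the helper's visibility, the `replace` flag, and all the names here are assumptions for illustration, not the merged patch:

```scala
import scala.collection.mutable

// Sketch of the suggested refactor (illustrative names, stubbed internals).
object SchedulerBackendSketch {
  private val executorsPendingToRemove = mutable.HashSet[String]()
  private var targetNumExecutors = 10

  // Stand-in for the cluster-manager-specific kill (doKillExecutors in Spark).
  private def doKillExecutors(ids: Seq[String]): Boolean = ids.nonEmpty

  // One private helper decides whether the target executor count shrinks.
  private def killExecutors(ids: Seq[String], replace: Boolean): Boolean = {
    if (!replace) {
      // Caller wants fewer executors overall: lower the target so the
      // cluster manager does not spin up replacements.
      targetNumExecutors -= ids.size
    }
    executorsPendingToRemove ++= ids
    doKillExecutors(ids)
  }

  // Public API keeps its signature (it is exposed through
  // ExecutorAllocationClient via SparkContext, so it cannot change).
  def killExecutors(ids: Seq[String]): Boolean =
    killExecutors(ids, replace = false)

  // New entry point for HeartbeatReceiver: kill but expect a replacement.
  def killAndReplaceExecutor(id: String): Boolean =
    killExecutors(Seq(id), replace = true)
}
```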
I was planning to do that, but it technically changes the public API. The methods in ExecutorAllocationClient are public because SparkContext extends it, so we can't change the signature there.
Yes, but you can follow the second suggestion (private method called by the other two).
OK, sounds good. I didn't do it initially because there isn't really a great name for the private method (we already have doKillExecutors). I'll figure something out.
In case you haven't done it yet, you might be able to just call this method from killExecutors.
Approach looks sane; it just feels like both implementations could share more code.
Test build #36183 timed out for PR 7107 at commit
This is awaiting the cleanup in
retest this please
Test build #37276 has finished for PR 7107 at commit
@andrewor14 If CoarseGrainedSchedulerBackend.removeExecutor happens before HeartbeatReceiver.killExecutorThread runs, executorsPendingToRemove will contain a zombie executorId and newTotal will be wrong. Would you like to add a check for that here?
If we already called removeExecutor then executorDataMap will not contain executorId, so we won't reach this if case. I believe the scenario you described cannot happen.
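For context, the guard being discussed follows roughly this shape. A simplified sketch, assuming invented names for the scaffolding — not the exact Spark source:

```scala
import scala.collection.mutable

// Simplified sketch: executors already removed via removeExecutor are absent
// from executorDataMap, so a late kill request for them is dropped here and
// never leaves a zombie id in executorsPendingToRemove or skews the target.
object KillGuardSketch {
  private val executorDataMap = mutable.HashMap[String, AnyRef]()
  private val executorsPendingToRemove = mutable.HashSet[String]()
  private var targetNumExecutors = 10

  def killExecutors(ids: Seq[String], replace: Boolean): Boolean = synchronized {
    val known = ids.filter(executorDataMap.contains) // filter out zombie ids
    if (!replace) {
      targetNumExecutors -= known.size // new total derived from known ids only
    }
    executorsPendingToRemove ++= known
    known.nonEmpty
  }
}
```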
(Force-pushed from 18b110e to 342cde2.)
This issue is actually somewhat difficult to test. There are many components in the picture, and each one has to be mocked out. Many of the executor allocation requests are also done asynchronously, so we need to find ways to block on these requests.
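One generic way to block on such asynchronous requests in a test is a latch the async path counts down. A sketch of a common pattern, not necessarily what HeartbeatReceiverSuite ended up doing:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Generic sketch: the async path counts the latch down once the allocation
// request has actually been issued, so the test can block until then
// instead of sleeping and hoping.
class RequestTracker {
  private val latch = new CountDownLatch(1)

  // Called by the asynchronous request path when the request goes out.
  def onRequestIssued(): Unit = latch.countDown()

  // Called by the test; returns false if the request never arrived in time.
  def awaitRequest(timeoutSeconds: Long): Boolean =
    latch.await(timeoutSeconds, TimeUnit.SECONDS)
}
```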
(Force-pushed from 342cde2 to 1cd2cd7.)
@vanzin I updated this with your suggestion.
LGTM.
Test build #37542 has finished for PR 7107 at commit
Test build #1088 has finished for PR 7107 at commit
Test build #1092 has finished for PR 7107 at commit
Test build #1090 has finished for PR 7107 at commit
+1 LGTM
**Symptom.** If an executor in an application times out, `HeartbeatReceiver` attempts to kill it. After this happens, however, the application never gets an executor back even when there are cluster resources available.

**Cause.** The issue is that `sc.killExecutor` automatically assumes that the application wishes to adjust its resource requirements permanently downwards. This is not the intention in `HeartbeatReceiver`, however, which simply wants a replacement for the expired executor.

**Fix.** Differentiate between the intention to kill and the intention to replace an executor with a fresh one. More details can be found in the commit message.

Author: Andrew Or <andrew@databricks.com>

Closes apache#7107 from andrewor14/heartbeat-no-kill and squashes the following commits:

1cd2cd7 [Andrew Or] Add regression test for SPARK-8119
25a347d [Andrew Or] Reuse more code in scheduler backend
31ebd40 [Andrew Or] Differentiate between kill and replace

Conflicts:
core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala

MKIM: It was not trivial to resolve the conflict.
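To make the kill-versus-replace distinction concrete, here is a minimal sketch. The `killAndReplaceExecutor` name follows the fix description above; the surrounding scaffolding is invented for illustration:

```scala
// Illustrative contrast between the two intents; only the kill-vs-replace
// semantics mirror the fix description, the scaffolding is made up.
class AllocationSketch {
  private var targetNumExecutors = 10

  // Kill: the application permanently wants fewer executors,
  // so the target count drops and no replacement is requested.
  def killExecutor(executorId: String): Boolean = {
    targetNumExecutors -= 1
    doKill(executorId)
  }

  // Replace: the target stays unchanged, so the cluster manager can grant
  // a fresh executor in place of the expired one.
  def killAndReplaceExecutor(executorId: String): Boolean = doKill(executorId)

  private def doKill(executorId: String): Boolean = true // stub for the real kill
}
```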