[SPARK-33799][CORE] Handle excluded executors/nodes in ExecutorMonitor #30795
Conversation
cc @tgravescs @mridulm @jiangxb1987 @attilapiros Could you please take a look?

Kubernetes integration test starting

Kubernetes integration test status success

Test build #132865 has finished for PR 30795 at commit

Retest this please
This issue is registered as an Improvement, but the content seems to be needed in branch-3.1. Are you targeting Apache Spark 3.1.0? If so, please adjust the JIRA accordingly, @Ngone51.
cc @HyukjinKwon since he is the release manager of Apache Spark 3.1.0.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132904 has finished for PR 30795 at commit

I don't think we should port this to

@dongjoon-hyun @HyukjinKwon Thanks for correcting.
So there were changes that attempted to help with this, dealing with unschedulableTaskSets, that went in a while ago. It seems like they are trying to solve similar things, so we should see overlap. I don't know if I'll have time to review today, but will try tomorrow.
Thanks for working on this @Ngone51!
Also, +CC @tgravescs
Yeah, I also noticed that change (#28287). This PR can reduce the chances of getting into the bad situation mentioned in #28287, since it essentially replaces those excluded executors with new healthy executors, so a TaskSet gets more opportunities to launch tasks. Besides, with this PR we'd launch the new healthy executors earlier compared to the solution in #28287, which helps improve scheduling efficiency. However, this PR cannot 100% replace #28287, because we don't handle
cc @venkata91 FYI
@mridulm Thanks for the review. I've addressed the comments.

Kubernetes integration test starting

Kubernetes integration test status success

Test build #132991 has finished for PR 30795 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #133004 has finished for PR 30795 at commit
tgravescs left a comment
At some point I would really like to see the dynamic allocation manager put into the scheduler itself. We keep adding things like this, where we track more and more things twice, disconnected via message passing that could be dropped. This is one reason I said, in the previous PR that helps with this, that it should really have more knowledge of the excluded listings. That is obviously a bunch of work, though.
Maybe I'm missing something here, but I think this also has an issue with the removal. Since you changed the definition of executorCountWithResourceProfile to not include the excluded nodes, if the excluded nodes hit the idle timeout we could end up keeping those excluded nodes around. I.e., min=3, we have 5 active and 2 are excluded; we think we only have 3, so the 2 are never removed. I think we want to take the excluded nodes into account here and remove them if idle. See removeExecutors in ExecutorAllocationManager.
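To make the accounting concern concrete, here is a minimal sketch in plain Scala. The types and helper are illustrative stand-ins, not the actual ExecutorMonitor/ExecutorAllocationManager internals:

```scala
// Illustrative model only: not the real Spark types. Shows removal accounting
// that counts excluded executors against the total, so idle excluded
// executors can still be reclaimed.
case class ExecState(id: String, idle: Boolean, excluded: Boolean)

def idleExecutorsToRemove(execs: Seq[ExecState], minExecutors: Int): Seq[String] = {
  // Count every live executor, excluded or not, against minExecutors;
  // otherwise the min=3, 5 active, 2 excluded case never frees the 2.
  val surplus = math.max(0, execs.size - minExecutors)
  execs.filter(_.idle)
    .sortBy(e => !e.excluded) // false sorts first, so excluded executors go first
    .take(surplus)
    .map(_.id)
}
```

With 5 executors, 2 of them idle and excluded, and minExecutors = 3, this yields the two excluded IDs rather than keeping them around.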
```scala
// Increase the maxNumExecutors by adding the excluded executors so that manager can
// launch new executors to replace the excluded executors.
val exclude = executorMonitor.excludedExecutorCount
val maxOverheadExecutors = maxNumExecutors + exclude
```
So I don't agree with this, at least not how it's defined. The user defined the maximum number of executors to use, and this requests more than that. I realize that some are excluded, but this also comes down to a resource utilization question. If I'm in a multi-tenant environment, I want to make sure one job doesn't take over the entire cluster, and max is one way to do this. I think we would either need to redefine max, which isn't great for backwards compatibility and could result in unexpected behavior, or add another config around the excluded nodes. That config would either simply allow going over, or allow going over by X. The downside is that the default would be 0 or false, so you would have to configure it if you set max and want to use this feature. But I don't see a lot of jobs setting max unless they are trying to be nice in multi-tenant environments, so it seems OK as long as it's in the release notes, etc.
You will notice the other logic for unschedulableTaskSets does not increase this; it just increases the number we ask for.
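If the extra-config route were taken, it could look something like the sketch below. The key name, wording, and default are purely hypothetical; nothing like this exists in Spark or in this PR (and `ConfigBuilder` is private to Spark, so this would live inside Spark's config package):

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config: cap on how far dynamic allocation may exceed
// maxExecutors in order to replace excluded executors. Default 0 keeps
// today's behavior, so users must opt in.
val DYN_ALLOCATION_MAX_EXCLUDED_OVERHEAD =
  ConfigBuilder("spark.dynamicAllocation.maxExcludedExecutorOverhead")
    .doc("Extra executors, beyond spark.dynamicAllocation.maxExecutors, that " +
      "may be requested to replace excluded executors. 0 disables the overhead.")
    .intConf
    .createWithDefault(0)
```

The cap would then be maxNumExecutors plus this configured value, instead of unconditionally adding every excluded executor.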
Makes sense to me. Adding an extra conf would be a good choice.
Although, I'm rethinking this change. It only takes effect when users set the max explicitly and the cluster reaches the max. (By default, max is Int.MaxValue, so we won't normally reach it.) However, we still want to replace those excluded executors even if the cluster doesn't reach the max. For example, max/2 executors may be enough for task scheduling, and the TaskScheduler also thinks there are max/2 executors without realizing X of them are actually excluded.
So I think what we actually need here is to forcibly replace excluded executors when dynamic allocation and exclusion (but not kill) are both enabled. And it should not be related to the max value.
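A rough sketch of that gating condition, using plain SparkConf lookups. The first two keys are real Spark configs and the third is the one named in the PR description, but wiring them up as a single guard is my reading of the comment, not code from the PR:

```scala
import org.apache.spark.SparkConf

// Replace excluded executors only when dynamic allocation and exclusion are
// both enabled but excluded executors are NOT killed; when they are killed,
// new executors are requested anyway to take their place.
def shouldReplaceExcluded(conf: SparkConf): Boolean =
  conf.getBoolean("spark.dynamicAllocation.enabled", false) &&
  conf.getBoolean("spark.excludeOnFailure.enabled", false) &&
  !conf.getBoolean("spark.excludeOnFailure.killExcludedExecutors", false)
```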
```scala
var pendingRemoval: Boolean = false
var decommissioning: Boolean = false
var hasActiveShuffle: Boolean = false
// whether the executor is temporarily excluded by the `HealthTracker`
```
We should expand this comment to state that the executor is excluded for the entire application (I realize HealthTracker implies this, but I would like to be more explicit), and that this does not include executors excluded at the stage level.
I completely agree, we should look towards merging DRA into the scheduler - the async eventing is not helping matters, and frankly has outlived its usefulness.

Yea, I have mentioned this issue in the PR description. In this case, the better way is to remove the excluded executors first. I thought it could be a follow-up if this PR gets approved.

+1.

Catching up from vacation: I think this still needs the comments addressed, so just ping me once it's ready.

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR proposes to handle exclusion events in `ExecutorMonitor` so it doesn't count excluded executors as available executors for running tasks. The main changes include:

- Handle `onExecutorExcluded`/`onExecutorUnexcluded`/`onNodeExcluded`/`onNodeUnexcluded` inside `ExecutorMonitor` (see the sketch below).
- Change `ExecutorAllocationManager` to request at most (maxNumExecutors + excludedExecutors) executors.

Note that this improvement only takes effect when both the dynamic allocation and exclusion features are enabled, but with `spark.excludeOnFailure.killExcludedExecutors=false`. We don't want to handle the excluded executors specifically when we do kill excluded executors, because in that case we assume that new executors will be launched later to replace those killed executors.
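As a rough illustration of the first bullet, the monitor's exclusion bookkeeping could look like the following. The method names mirror the listener events named above, but the class, state, and helpers are simplified stand-ins, not the actual `ExecutorMonitor` implementation:

```scala
import scala.collection.mutable

// Simplified sketch of exclusion tracking; the real ExecutorMonitor keeps
// far more per-executor state (idle timers, shuffle info, etc.).
class ExclusionTracker {
  private val excludedExecutors = mutable.Set[String]()
  private val excludedNodes = mutable.Set[String]()

  def onExecutorExcluded(executorId: String): Unit = excludedExecutors += executorId
  def onExecutorUnexcluded(executorId: String): Unit = excludedExecutors -= executorId
  def onNodeExcluded(host: String): Unit = excludedNodes += host
  def onNodeUnexcluded(host: String): Unit = excludedNodes -= host

  // An executor on an excluded node is effectively excluded as well.
  def isExcluded(executorId: String, host: String): Boolean =
    excludedExecutors.contains(executorId) || excludedNodes.contains(host)

  // What ExecutorAllocationManager would add on top of maxNumExecutors.
  def excludedExecutorCount: Int = excludedExecutors.size
}
```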
Why are the changes needed?
Currently, the excluded executors are counted as available executors for running tasks. But that's not correct, since the `TaskScheduler` never schedules tasks on those excluded executors. As a result, they can lower the scheduling efficiency of the `TaskScheduler`. In the worst case, a TaskSet cannot be scheduled anywhere and then has to go through the `getCompletelyExcludedTaskIfAny(...)` path, which is inefficient.
This PR makes Spark aware of the lack of executors at the dynamic allocation level, so we can launch new executors before the `TaskScheduler` realizes the problem, which eases the worst case and improves scheduling efficiency.
Besides, this also prevents the `ExecutorAllocationManager` from getting into a fake `minExecutors` state when removing idle executors. For example, suppose we have 5 executors (2 excluded) and minExecutors=3, and we need to remove 2 idle but not excluded executors. We'd then end up with 3 executors, 2 of them excluded, so only one executor can actually launch tasks. (This is worth a follow-up to kill the idle excluded executors first if this PR gets approved.) This PR avoids the problem since we'd remove the excluded executors in the first place.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests.