[SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources #29332
Conversation
ping @tgravescs @jiangxb1987 @mridulm Please take a look, thanks!

Test build #126958 has finished for PR 29332 at commit

retest this please

Test build #126961 has started for PR 29332 at commit
    val taskLimit = resourceProfile.taskResources.get(limitingResource).map(_.amount)
      .getOrElse {
        val errorMsg = "limitingResource returns from ResourceProfile " +
          s"$resourceProfile doesn't actually contain that task resource!"
@tgravescs This actually should not happen, right? According to:
spark/core/src/main/scala/org/apache/spark/resource/ResourceProfile.scala
Lines 194 to 197 in 998086c
    if (taskResourcesToCheck.nonEmpty) {
      throw new SparkException("No executor resource configs were not specified for the " +
        s"following task configs: ${taskResourcesToCheck.keys.mkString(",")}")
    }
Correct, I believe it was there just as a double check, to make sure nothing breaks in the future.
tgravescs left a comment
Ah, thanks for fixing. I think this got lost in the change where cores no longer had to be the limiting resource, plus I was thinking barrier scheduling didn't work with dynamic allocation. I assumed it would be covered when dynamic allocation support was added, but it obviously affects the default profile.
    val cpusPerTask = ResourceProfile.getTaskCpusOrDefaultForProfile(rp, conf)
    val executorsWithResourceProfile = executorDataMap.values.filter(_.resourceProfileId == rp.id)
    executorsWithResourceProfile.map(_.totalCores / cpusPerTask).sum
    val (rpIds, cpus, resources) = executorDataMap.values.toArray.map { executor =>
I think this should first filter executorDataMap for active executors.
We are doing a lot of calculation here even when not in a barrier stage. I think out in DAGScheduler.checkBarrierStageWithNumSlots we should move the call to this after the check for rdd.isBarrier (or make it last) so everyone doesn't have to pay for it. This is definitely doing more work than it was before, and we want to keep scheduler code as fast as possible.
Makes sense.
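A minimal sketch of the reordering suggested above, so that only barrier stages pay for the slot calculation. The signature is simplified to keep the example self-contained; the real `DAGScheduler.checkBarrierStageWithNumSlots` takes the RDD and resource profile and throws a dedicated barrier-check exception, so treat this purely as an illustration of the ordering:

```scala
import org.apache.spark.SparkException

// Sketch only: the expensive cluster-wide slot count is passed by name, so it is
// evaluated only when the stage is actually a barrier stage (the && short-circuits).
def checkBarrierStageWithNumSlots(
    isBarrier: Boolean,
    numPartitions: Int,
    maxNumConcurrentTasks: => Int): Unit = {
  if (isBarrier && numPartitions > maxNumConcurrentTasks) {
    throw new SparkException(
      s"Barrier stage requires $numPartitions slots but only $maxNumConcurrentTasks are available")
  }
}
```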
    executor.resourceProfileId,
    executor.totalCores,
    executor.resourcesInfo.map { case (name, rInfo) =>
      (name, rInfo.resourceAddresses.length * rInfo.slotsPerAddress)
To be clear, here we are trying to calculate the total concurrent tasks that could run, even if some of them are busy at the moment, so I think we should add some documentation there just to clarify.
Also, can we add a function to ExecutorResourceInfo to get this (rInfo.resourceAddresses.length * rInfo.slotsPerAddress)?
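A hedged sketch of that helper; the class body is trimmed to the two members the comment refers to, and the method name `totalAddressAmount` is an assumption rather than the name that was actually added:

```scala
// Sketch only: a cut-down stand-in for ExecutorResourceInfo that exposes the
// repeated "addresses * slots per address" product as a single method.
class ExecutorResourceInfo(
    val resourceAddresses: Seq[String],
    val slotsPerAddress: Int) {

  // Total number of task slots this resource provides on the executor,
  // whether or not some of them are currently occupied.
  def totalAddressAmount: Int = resourceAddresses.length * slotsPerAddress
}
```

The map above would then become `(name, rInfo.totalAddressAmount)` instead of repeating the multiplication at the call site.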
Test build #127022 has finished for PR 29332 at commit
tgravescs left a comment
lgtm, pending Jenkins
test this please

Test build #127053 has started for PR 29332 at commit

Test build #127074 has finished for PR 29332 at commit

thanks, merging to master!
thanks all! BTW, shall we backport this to branch-3.0? Though, it would be a slightly different fix since branch-3.0 doesn't have

Yea, let's create a new PR for 3.0.

branch-3.0 doesn't need it because cores are forced to be the limiting resource.
I think that's the problem. Think about the case where a barrier stage requires 2 CPUs and 2 GPUs but the cluster only has 2 CPUs and 1 GPU. In 3.0, since the limiting resource is cores/CPUs, the barrier stage would have a chance to launch tasks. However, it could only launch one task because the real limiting resource is the GPU. In this case, the barrier stage will fail because of the partial task launch, but the error message is quite confusing for users, as it suggests disabling delay scheduling while the real cause is insufficient (custom) resources. If we backport this fix to 3.0, the barrier stage would fail early, before it is able to launch any task.
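To make the arithmetic concrete, here is an illustrative reading of that example, assuming each of the two barrier tasks needs 1 CPU and 1 GPU (the per-task split is an assumption; the min-over-all-resources rule is the idea behind the fix, not the literal scheduler code):

```scala
// Two barrier tasks, each needing 1 CPU and 1 GPU; the cluster has 2 CPUs, 1 GPU.
val (taskCpus, taskGpus) = (1, 1)
val (clusterCpus, clusterGpus) = (2, 1)

// Counting CPUs only (branch-3.0 behaviour): 2 slots, so the 2-task stage is submitted...
val cpuOnlySlots = clusterCpus / taskCpus                                        // 2

// ...but counting every resource, the GPU is the real limit: only 1 slot, so only
// one of the two barrier tasks can launch and the stage fails after a partial launch.
val allResourceSlots = math.min(clusterCpus / taskCpus, clusterGpus / taskGpus)  // 1
```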
This change is checking to see what your executors have. Spark will not let you start the application if your task requirements are such that cpus is not your limiting resource. So the only way you can get an executor with 2 cores and 1 gpu is if each task requires 2 cores and 1 gpu, and at that point gpu is not your limiting resource; they are equivalent.
That's not true when we don't need to specify cores in some clusters, for example, Standalone?
Oh right, if you didn't specify the executor cores in standalone mode, you get them all by default, which honestly causes lots of issues - I filed a jira for this for other things. That is left up to the user, to properly configure their cluster and job. We should remove the isDynamicAllocationCheck here so that the warning gets printed for standalone as well: If you want to go through and make sure everything is updated to schedule based on resources, I'm fine with it, but I'm pretty sure it needs to be more than this.
I am not sure about the other things (probably you could link the specific JIRA?). At least, I'd like to backport the fix to schedule based on the resources: #29395.
What changes were proposed in this pull request?
Make `CoarseGrainedSchedulerBackend.maxNumConcurrentTasks()` consider all kinds of resources when calculating the max concurrent tasks.

Refactor `calculateAvailableSlots()` so that it can be used by both `CoarseGrainedSchedulerBackend` and `TaskSchedulerImpl`.

Why are the changes needed?
Currently, `CoarseGrainedSchedulerBackend.maxNumConcurrentTasks()` only considers CPUs when calculating the max concurrent tasks. This can cause the application to hang when a barrier stage requires extra custom resources but the cluster doesn't have enough of them. Without the check for other custom resources in `maxNumConcurrentTasks`, the barrier stage can be submitted to the `TaskSchedulerImpl`. But the `TaskSchedulerImpl` won't launch tasks for the barrier stage due to the insufficient task slots calculated by `TaskSchedulerImpl.calculateAvailableSlots` (which does check all kinds of resources).

The application hang issue can be reproduced by the added unit test.
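A minimal sketch of the min-over-resources idea behind the fix; the names, the simplified integer amounts, and the signature are assumptions for illustration, not the actual `calculateAvailableSlots` code:

```scala
// Per-executor free capacity, keyed by resource name (e.g. "gpu").
case class ExecutorCapacity(cores: Int, resources: Map[String, Int])

// An executor can host only as many tasks as its scarcest resource allows;
// the cluster-wide slot count is the sum of that minimum over all executors.
def availableSlots(
    executors: Seq[ExecutorCapacity],
    cpusPerTask: Int,
    resourcesPerTask: Map[String, Int]): Int = {
  executors.map { exec =>
    val cpuSlots = exec.cores / cpusPerTask
    val resourceSlots = resourcesPerTask.map { case (name, perTask) =>
      exec.resources.getOrElse(name, 0) / perTask
    }
    (cpuSlots +: resourceSlots.toSeq).min
  }.sum
}
```

With one executor of 2 cores and 1 GPU and a task requirement of 1 CPU plus 1 GPU, this returns 1, so a 2-partition barrier stage is rejected up front instead of hanging.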
Does this PR introduce any user-facing change?
Yes. In case a barrier stage requires more custom resources than the cluster has, the application could hang before this PR; after this PR, it fails due to insufficient resources instead.
How was this patch tested?
Added a unit test.