[3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources #29395
```diff
@@ -19,9 +19,12 @@ package org.apache.spark

 import scala.concurrent.duration._

+import org.apache.spark.TestUtils.createTempScriptWithExpectedOutput
 import org.apache.spark.internal.config._
 import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
+import org.apache.spark.resource.TestResourceIDs.{EXECUTOR_GPU_ID, TASK_GPU_ID, WORKER_GPU_ID}
 import org.apache.spark.scheduler.BarrierJobAllocationFailed._
+import org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed
 import org.apache.spark.util.ThreadUtils

 /**
@@ -259,4 +262,37 @@ class BarrierStageOnSubmittedSuite extends SparkFunSuite with LocalSparkContext
     testSubmitJob(sc, rdd,
       message = ERROR_MESSAGE_BARRIER_REQUIRE_MORE_SLOTS_THAN_CURRENT_TOTAL_NUMBER)
   }
+
+  test("SPARK-32518: CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should " +
+    "consider all kinds of resources for the barrier stage") {
+    withTempDir { dir =>
+      val discoveryScript = createTempScriptWithExpectedOutput(
+        dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0"]}""")
+
+      val conf = new SparkConf()
+        .setMaster("local-cluster[1, 2, 1024]")
+        .setAppName("test-cluster")
+        .set(WORKER_GPU_ID.amountConf, "1")
+        .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
+        .set(EXECUTOR_GPU_ID.amountConf, "1")
+        .set(TASK_GPU_ID.amountConf, "1")
+        // disable barrier stage retry to fail the application as soon as possible
+        .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
+        // disable the check to simulate the behavior of Standalone in order to
+        // reproduce the issue
+        .set(Tests.SKIP_VALIDATE_CORES_TESTING, true)
+      sc = new SparkContext(conf)
+      // set up an executor which will have 2 CPUs and 1 GPU
+      TestUtils.waitUntilExecutorsUp(sc, 1, 60000)
+
+      val exception = intercept[BarrierJobSlotsNumberCheckFailed] {
+        sc.parallelize(Range(1, 10), 2)
+          .barrier()
+          .mapPartitions { iter => iter }
+          .collect()
+      }
+      assert(exception.getMessage.contains("[SPARK-24819]: Barrier execution " +
+        "mode does not allow run a barrier stage that requires more slots"))
```
Contributor: I'm not sure if it's worth it, but it would be nice to perhaps print what the limiting resource is. If it's too much change or work to track, we may just skip it.

Member (Author): This is actually a good idea. But as you mentioned, I'm afraid this needs many more changes, so I'd like to skip it here.

Contributor: OK, we can revisit if it becomes an issue later.
```diff
+    }
+  }
 }
```
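To see why the test expects `BarrierJobSlotsNumberCheckFailed`: the single executor has 2 CPUs and 1 GPU, and each task needs 1 CPU and 1 GPU, so only one task can run at a time, while the barrier stage submits 2 partitions. A minimal sketch of the arithmetic (variable names are illustrative, not Spark's):

```python
# One executor from the test: 2 CPUs, 1 GPU; each task needs 1 CPU and 1 GPU.
executor_cpus, executor_gpus = 2, 1
cpus_per_task, gpus_per_task = 1, 1

# Before SPARK-32518, only CPU cores bounded the estimate.
slots_cpu_only = executor_cpus // cpus_per_task  # overestimates: 2

# After the fix, every resource kind bounds the slot count.
slots_all = min(executor_cpus // cpus_per_task,
                executor_gpus // gpus_per_task)  # correct: 1

# The barrier stage has 2 partitions (parallelize(Range(1, 10), 2)),
# which exceeds the true slot count, so the check fails fast.
barrier_partitions = 2
print(slots_cpu_only, slots_all, barrier_partitions > slots_all)
```

Under the old CPU-only estimate the stage would appear schedulable and hang waiting for a second GPU slot; the fixed check rejects it immediately.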
These two configs are backported from the master branch.
ditto. This should be `3.0.1` when it comes to `branch-3.0`, @Ngone51. Also, after merging this, please update the `master` branch consistently.
Thank you @dongjoon-hyun for letting me know. I was wondering about it previously.