Conversation

@Ngone51 (Member) commented Aug 10, 2020

What changes were proposed in this pull request?

  1. Make CoarseGrainedSchedulerBackend.maxNumConcurrentTasks() consider all kinds of resources when calculating the max concurrent tasks

  2. Refactor calculateAvailableSlots() so that it can be used by both CoarseGrainedSchedulerBackend and TaskSchedulerImpl

Why are the changes needed?

Currently, CoarseGrainedSchedulerBackend.maxNumConcurrentTasks() only considers CPUs when calculating the max concurrent tasks. This can cause the application to hang when a barrier stage requires extra custom resources but the cluster doesn't have enough of them: without the check for other custom resources in maxNumConcurrentTasks, the barrier stage can be submitted to the TaskSchedulerImpl, but the TaskSchedulerImpl won't launch its tasks because of the insufficient task slots calculated by TaskSchedulerImpl.calculateAvailableSlots (which does check all kinds of resources).

If the barrier stage doesn't launch all of its tasks in one round, the application fails and suggests that the user disable delay scheduling. However, this is a misleading suggestion, since the real root cause is insufficient resources; see the sketch below for how the slot count depends on every resource.
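
For illustration, here is a minimal Scala sketch of the kind of slot calculation involved, where a task slot exists only if the CPU and every custom resource requirement can be satisfied (`ExecutorInfo`, `availableSlots`, and the parameter names are hypothetical, not Spark's actual code):

```scala
// Hypothetical sketch: count task slots per executor while honoring
// CPUs *and* custom resources such as GPUs. Not Spark's actual code.
case class ExecutorInfo(cores: Int, resources: Map[String, Int])

def availableSlots(
    executors: Seq[ExecutorInfo],
    cpusPerTask: Int,
    resourcesPerTask: Map[String, Int]): Int = {
  executors.map { exec =>
    val cpuSlots = exec.cores / cpusPerTask
    // Each custom resource caps the slot count as well.
    val resourceSlots = resourcesPerTask.map { case (name, perTask) =>
      exec.resources.getOrElse(name, 0) / perTask
    }
    (cpuSlots +: resourceSlots.toSeq).min
  }.sum
}
```

With 8 cores and 1 GPU per executor and tasks needing 1 core and 1 GPU each, this yields 1 slot per executor, which is why a CPU-only check overestimates the max concurrent tasks.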

Does this PR introduce any user-facing change?

Yes. If a barrier stage requires more custom resources than the cluster has, the application previously failed with a misleading suggestion to disable delay scheduling. After this PR, the application fails with an error message saying there are not enough resources.

How was this patch tested?

Added a unit test.

@Ngone51 (Member, Author) commented Aug 10, 2020

ping @cloud-fan @tgravescs

ConfigBuilder("spark.testing.skipValidateCores")
.version("3.1.0")
.booleanConf
.createWithDefault(false)
@Ngone51 (Member, Author):

These two configs are backported from the master branch.

@dongjoon-hyun (Member):

ditto. This should be 3.0.1 when it comes to branch-3.0, @Ngone51 .
Also, after merging this, please update master branch consistently.

@Ngone51 (Member, Author):

Thank you @dongjoon-hyun for letting me know. I was wondering about it previously.

@SparkQA commented Aug 10, 2020

Test build #127268 has finished for PR 29395 at commit c980996.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

.collect()
}
assert(exception.getMessage.contains("[SPARK-24819]: Barrier execution " +
"mode does not allow run a barrier stage that requires more slots"))
@tgravescs (Contributor):

I'm not sure if it's worth it, but it would be nice to print what the limiting resource is. If it's too much change or work to track, we may just skip it.

@Ngone51 (Member, Author):

This is actually a good idea. But as you mentioned, I'm afraid it needs many more changes, so I'd like to skip it here.

@tgravescs (Contributor):

ok, we can revisit if it becomes an issue later.
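
To make that concrete, here is a hypothetical sketch of how the limiting resource could be derived (the PR does not implement this; `limitingResource` and its parameters are illustrative only):

```scala
// Hypothetical sketch only -- not implemented by this PR.
// Returns the resource that yields the fewest task slots, i.e. the one
// that bounds concurrency and would be worth naming in the error message.
def limitingResource(
    available: Map[String, Int],
    perTask: Map[String, Int]): Option[String] = {
  val slotsByResource = perTask.collect {
    case (name, need) if need > 0 =>
      name -> (available.getOrElse(name, 0) / need)
  }
  if (slotsByResource.isEmpty) None
  else Some(slotsByResource.minBy(_._2)._1)
}
```

For example, `limitingResource(Map("cpu" -> 8, "gpu" -> 1), Map("cpu" -> 1, "gpu" -> 1))` returns `Some("gpu")`.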

@SparkQA commented Aug 11, 2020

Test build #127324 has finished for PR 29395 at commit f635c18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Aug 12, 2020

@tgravescs @clockfly Does it look OK now?

@tgravescs (Contributor) left a comment:

lgtm

.createWithDefault(2)

val RESOURCES_WARNING_TESTING = ConfigBuilder("spark.resources.warnings.testing")
.version("3.1.0")
@dongjoon-hyun (Member):

This should be 3.0.1 when it comes to branch-3.0, @Ngone51 .

@dongjoon-hyun (Member) left a comment:

Please set the config version properly.

@SparkQA commented Aug 14, 2020

Test build #127436 has finished for PR 29395 at commit 9c18479.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Aug 14, 2020

retest this please.

@SparkQA commented Aug 14, 2020

Test build #127442 has finished for PR 29395 at commit 9c18479.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

@SparkQA commented Aug 14, 2020

Test build #127450 has finished for PR 29395 at commit 9c18479.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor):

test this please

@SparkQA commented Aug 14, 2020

Test build #127458 has finished for PR 29395 at commit 9c18479.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Aug 15, 2020

retest this please

@SparkQA commented Aug 15, 2020

Test build #127473 has finished for PR 29395 at commit 9c18479.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Aug 17, 2020

retest this please.

@SparkQA commented Aug 17, 2020

Test build #127496 has finished for PR 29395 at commit 9c18479.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Aug 17, 2020

The test org.apache.spark.sql.DataFrameSuite "SPARK-28224: Aggregate sum big decimal overflow" is quite flaky, but it passes on my laptop.

@Ngone51 (Member, Author) commented Aug 17, 2020

retest this please.

@Ngone51 (Member, Author) commented Aug 17, 2020

Seems like we need to wait for this fix: #29448

@SparkQA commented Aug 17, 2020

Test build #127498 has finished for PR 29395 at commit 9c18479.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 force-pushed the backport-spark-32518 branch from 9c18479 to daa205d on August 18, 2020 01:49
@SparkQA commented Aug 18, 2020

Test build #127519 has finished for PR 29395 at commit daa205d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented Aug 18, 2020

thanks, merging to 3.0!

cloud-fan pushed a commit that referenced this pull request Aug 18, 2020
…ntTasks should consider all kinds of resources

Closes #29395 from Ngone51/backport-spark-32518.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan closed this on Aug 18, 2020