
Conversation

@mgummelt mgummelt commented Feb 23, 2017

What changes were proposed in this pull request?

Increase default refuse_seconds timeout, and make it configurable. See JIRA for details on how this reduces the risk of starvation.

How was this patch tested?

Unit tests, Manual testing, and Mesos/Spark integration test suite

cc @susanxhuynh @skonto @jmlvanre

Author

This is the meat of the functionality change. We call this whenever the state of queuedDrivers or pendingRetryDrivers has changed.
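
Purely as an illustration of the idea described above (the PR's actual code is not shown on this page, and the object and method names here are hypothetical), such a state-change hook could look roughly like this:

```scala
import org.apache.mesos.SchedulerDriver

object SuppressReviveSketch {
  // Called whenever the set of queued or pending-retry drivers changes:
  // suppress offers while there is nothing left to schedule, and revive
  // as soon as there is work again.
  def stateChanged(driver: SchedulerDriver, hasWork: Boolean): Unit = {
    if (hasWork) driver.reviveOffers() else driver.suppressOffers()
  }
}
```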

SparkQA commented Feb 23, 2017

Test build #73303 has finished for PR 17031 at commit a16a429.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 23, 2017

Test build #73307 has finished for PR 17031 at commit 42636b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto skonto Feb 23, 2017

Do we still support fine-grained mode?

Author

It's deprecated, but I had to make some changes to it just to compile. I hope to remove it completely by 2.2.

Contributor

ok cool!

skonto commented Feb 23, 2017

@mgummelt How could the operator be informed about starvation? Shouldn't this be transparent? It could be challenging to watch for corner cases, even those with low probability. Are you planning to add the timer-monitoring stuff after this PR? Still reviewing it... need some time...

@mgummelt (Author)

@skonto It should be clear in the logs. As long as you have at least INFO logs enabled, you'll see "Suppressing offers." in the logs, and little or nothing after, since the offer cycles stop. Unfortunately, Mesos doesn't expose the suppressed state of frameworks, so you can't glean this from state.json.

@mgummelt mgummelt changed the title [SPARK-19702] Add suppress/revive support to the Mesos Spark Dispatcher [SPARK-19702][MESOS] Add suppress/revive support to the Mesos Spark Dispatcher Feb 23, 2017
@susanxhuynh (Contributor)

The suppress / revive logic LGTM. I didn't look that closely at the refactoring changes. Where are the Mesos/Spark integration tests that you mentioned? @mgummelt

@susanxhuynh (Contributor)

If we're concerned about the lost reviveOffer() and don't want to handle that corner case, do we want to document it somewhere for operators? "If jobs aren't running and you see [...] in the logs, do this".

@mgummelt (Author)

@susanxhuynh I don't think it's worth documenting. It should be clear in the logs, which should be where an operator turns if they notice no jobs are launching.

skonto commented Feb 27, 2017

@mgummelt Yes, they should look at the logs, but how do they know this is something that requires action on their side and not a cluster issue or something else? It should be documented, since it requires manual intervention. It also makes it harder to build recovery logic for monitoring systems if they have to dig into the logs; I would have preferred this to be advertised somewhere, like a REST API.

Is the general problem of resource starvation solved for all the other frameworks in the Universe? I see this solution https://dcosjira.atlassian.net/browse/CASSANDRA-17 for Cassandra; shouldn't we have a unified approach? In a SMACK stack, which is a pretty common use case, I expect there will be problems.

Here https://issues.apache.org/jira/browse/MESOS-6112 it is mentioned that we should see the issue with > 5 frameworks, while the duplicate https://issues.apache.org/jira/browse/MESOS-3202 refers to a smaller number. What is the minimum setup to reproduce this (where does the 5 come from?), and are there any integration tests covering this for Spark?

Logic for suppress & revive LGTM.

skonto commented Feb 27, 2017

The only way to fix this generally is to implement some periodic timer that calls reviveOffers() if there are queued/pending drivers to be scheduled. This can be chatty and complicates the code, so I haven't implemented it here.

Shouldn't we just check whether we have actually received any offers from the master recently, and call reviveOffers() only if not? We could use a backoff approach here...
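
To illustrate the alternative being discussed here (this is only a sketch under assumed names and intervals, not code from this PR), a periodic revive with backoff might look roughly like this:

```scala
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.mesos.SchedulerDriver

// Revives offers only when there is pending work and no offers have been
// seen for a while; the 60s base interval and 600s cap are assumptions.
class PeriodicReviver(driver: SchedulerDriver, hasPendingWork: () => Boolean) {
  private val executor = Executors.newSingleThreadScheduledExecutor()
  @volatile private var lastOfferNanos = System.nanoTime()
  @volatile private var backoffSeconds = 60L

  // Call from resourceOffers() so the reviver knows offers are still flowing.
  def offerReceived(): Unit = {
    lastOfferNanos = System.nanoTime()
    backoffSeconds = 60L
  }

  def start(): Unit = {
    executor.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = {
        val quietSeconds =
          TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - lastOfferNanos)
        if (hasPendingWork() && quietSeconds >= backoffSeconds) {
          driver.reviveOffers()
          // Exponential backoff so the revive calls don't get too chatty.
          backoffSeconds = math.min(backoffSeconds * 2, 600L)
        }
      }
    }, 60, 60, TimeUnit.SECONDS)
  }

  def stop(): Unit = executor.shutdownNow()
}
```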

Contributor

Since you are refactoring the code, s/url/_.

Contributor

Parentheses are redundant.

Author

see above

Contributor

return is redundant.

Contributor

brackets are redundant.

Author

The brackets are consistent with our other format strings. I'm not trying to refactor all the code in this PR, btw. I just touched the code whose poor style was hindering my ability to solve the problem related to this PR.

@mgummelt (Author)

Given the concerns about the dispatcher being stuck in a suppressed state, I'm going to solve this a different way. I'm going to increase the default offer decline timeout to 120s and make it configurable, just like it is in the driver. This way, a declined offer will be offered to 120 other frameworks before circling back to the dispatcher, rather than the default 5. I'll also keep the explicit revive calls when a new driver is submitted or an existing one fails, which immediately causes offers to be re-offered to the dispatcher.

This removes the risk that the dispatcher gets stuck in a suppressed state, because it never suppresses itself.
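
A minimal sketch of this flow against the Mesos Java API (the object, the method names, and the constant below are illustrative placeholders, not the actual Spark code):

```scala
import org.apache.mesos.{Protos, SchedulerDriver}

object RefuseSecondsSketch {
  // Assumed default; the PR makes the real value configurable.
  val DefaultRefuseSeconds = 120.0

  // Decline an offer we cannot use right now, asking the master not to
  // re-offer it to this framework for `refuseSeconds`. In the meantime the
  // resources cycle through the other registered frameworks.
  def decline(driver: SchedulerDriver, offerId: Protos.OfferID,
              refuseSeconds: Double = DefaultRefuseSeconds): Unit = {
    val filters = Protos.Filters.newBuilder().setRefuseSeconds(refuseSeconds).build()
    driver.declineOffer(offerId, filters)
  }

  // When a new driver is submitted, or an existing one fails and is queued
  // for retry, revive explicitly so previously declined offers come back
  // immediately instead of waiting out the refuse timeout.
  def onDriverQueued(driver: SchedulerDriver): Unit = {
    driver.reviveOffers()
  }
}
```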

skonto commented Feb 27, 2017

OK, you mean like the Cassandra case, right?

@mgummelt (Author)

@skonto Cassandra supports suppress/revive: https://github.com/mesosphere/dcos-cassandra-service/blob/master/cassandra-scheduler/src/main/java/com/mesosphere/dcos/cassandra/scheduler/CassandraScheduler.java#L423

I can't speak for all the frameworks in the Universe, but Cassandra and Kafka both support suppress/revive, and everything built with the DefaultScheduler in dcos-commons gets it for free: https://github.com/mesosphere/dcos-commons/blob/master/sdk/scheduler/src/main/java/com/mesosphere/sdk/scheduler/DefaultScheduler.java#L838

@mgummelt (Author)

@skonto @susanxhuynh I've updated the solution to use a longer (120s) default refuse timeout, instead of suppressing offers. Please re-review. Just as the previous refuse_seconds settings were undocumented, I've left this one undocumented; users should almost never need to customize it.

SparkQA commented Feb 27, 2017

Test build #73531 has finished for PR 17031 at commit b6e3205.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

skonto commented Feb 27, 2017

OK, I see. Cassandra uses 30s. What is a reasonable timeout?

@mgummelt (Author)

It depends on the application. It's the amount of time you have to wait before having the opportunity to use those resources again. But if you explicitly revive, which we do here whenever we need more resources, then it doesn't matter. We could set it to infinity and still never be starved, because we'll always get another shot at the resources when we revive.

@mgummelt mgummelt changed the title [SPARK-19702][MESOS] Add suppress/revive support to the Mesos Spark Dispatcher [SPARK-19702][MESOS] Increase default refuse_seconds timeout in the Mesos Spark Dispatcher Feb 27, 2017
skonto commented Mar 1, 2017

@mgummelt Here is my rationale about the refuse time. As stated in https://issues.apache.org/jira/browse/MESOS-3202, and given Cassandra's timeout, some other framework in the framework list has at most 30 seconds to accept the resources before the first framework is asked again. So, implicitly, together with the master's delay in making offers, this value limits the number of frameworks that will be asked about an offer declined by Cassandra (assuming Cassandra is the first framework in the list). So if you have many frameworks in that list, at least some of them will starve. refuse_seconds should therefore have a large value, to give more frameworks the opportunity to be asked about the offer. We need to break that loop, right? Am I missing something here?

mgummelt commented Mar 1, 2017

Your understanding is correct. You must set refuse_seconds for all your frameworks to some value N, such that N >= #frameworks. So for this change, if some operator is running >120 frameworks, they may need to configure this value. However, I'm not aware of any Mesos cluster on Earth running that many frameworks.
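
As a rough illustration of that bound (assuming the Mesos master's default 1-second allocation interval, which is a property of the cluster configuration, not of this PR):

```latex
N_{\text{frameworks reachable}} \approx \frac{\texttt{refuse\_seconds}}{\texttt{allocation\_interval}},
\qquad \frac{5\,\mathrm{s}}{1\,\mathrm{s}} = 5
\quad \text{vs.} \quad
\frac{120\,\mathrm{s}}{1\,\mathrm{s}} = 120
```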

mgummelt commented Mar 2, 2017

@skonto Any other concerns? Can I get a LGTM?

skonto commented Mar 3, 2017

@mgummelt LGTM. Thanks for the clarifications. Btw, I would expect N to be a Mesos cluster config option, because this is a global issue/workaround. @srowen could we get a merge, please?

@srowen srowen left a comment

This looks like a large amount of change relative to the description. Is this intentional, all the incidental code changes?

skonto commented Mar 3, 2017

@srowen Some parts are there purely for refactoring, to improve code quality.

mgummelt commented Mar 3, 2017

@srowen Yes, most of the code is refactoring that I came across when solving this. If that's going to delay this being merged, please let me know and I can remove the refactoring.

mgummelt commented Mar 3, 2017

@skonto I completely agree that this is a cluster-wide issue, but unfortunately that's the state of things. In the long-term, optimistic offers in Mesos should fix this.

@mgummelt mgummelt force-pushed the SPARK-19702-suppress-revive branch from b6e3205 to ba864d0 on March 4, 2017 01:53
mgummelt commented Mar 4, 2017

@srowen Just to move things along, I removed everything not directly relevant to this JIRA.

SparkQA commented Mar 4, 2017

Test build #73883 has finished for PR 17031 at commit ba864d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen commented Mar 4, 2017

Compared to the title, this still looks like a significant change. Is the intent something different from the JIRA? This doesn't just increase a default. I don't have any opinion on the changes; I'm just commenting on the consistency of the change vs. the discussion and paper trail.

skonto commented Mar 4, 2017

@mgummelt Do we want to keep the suppress/revive technique; is the timeout increase not enough on its own? I think that is the added code here, compared to what someone expects from the title.
In the JIRA it says:

We must increase the refuse_seconds timeout to solve this problem. Another option would have been to implement suppress/revive, but that can cause starvation due to the unreliability of Mesos RPC calls.

Also the description must be updated IMHO.

mgummelt commented Mar 6, 2017

@srowen To support increasing the default, I've had to:

  • make refuse_seconds configurable
  • factor out declineOffer so the dispatcher can use it in addition to the coarse-grained scheduler.
  • persist the schedulerDriver in both the dispatcher scheduler and the coarse-grained scheduler, so we can access it in callbacks that aren't passed the driver object (see the sketch below).
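
A minimal sketch of that last point, under hypothetical names (the real Spark scheduler classes differ in detail):

```scala
import org.apache.mesos.{Protos, Scheduler, SchedulerDriver}

abstract class DriverHoldingScheduler extends Scheduler {
  // The Mesos Scheduler callbacks receive the driver as an argument, but
  // other code paths (for example, handling a newly submitted driver) do
  // not, so keep the reference obtained at registration time.
  @volatile protected var schedulerDriver: SchedulerDriver = _

  override def registered(driver: SchedulerDriver,
                          frameworkId: Protos.FrameworkID,
                          masterInfo: Protos.MasterInfo): Unit = {
    schedulerDriver = driver
  }
}
```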

mgummelt commented Mar 6, 2017

@skonto I updated the description.

SparkQA commented Mar 6, 2017

Test build #74020 has finished for PR 17031 at commit b5fb61e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mgummelt commented Mar 7, 2017

@srowen ping

srowen commented Mar 7, 2017

Merged to master

@asfgit asfgit closed this in 2e30c0b Mar 7, 2017
mgummelt commented Mar 7, 2017

Thanks!

@mgummelt mgummelt deleted the SPARK-19702-suppress-revive branch March 7, 2017 21:39
mgummelt pushed a commit to d2iq-archive/spark that referenced this pull request Mar 7, 2017
…esos Spark Dispatcher

Increase default refuse_seconds timeout, and make it configurable.  See JIRA for details on how this reduces the risk of starvation.

Unit tests, Manual testing, and Mesos/Spark integration test suite

cc susanxhuynh skonto jmlvanre

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes apache#17031 from mgummelt/SPARK-19702-suppress-revive.
mgummelt pushed a commit to d2iq-archive/spark that referenced this pull request Jun 8, 2017