[SPARK-19702][MESOS] Increase default refuse_seconds timeout in the Mesos Spark Dispatcher #17031
Conversation
This is the meat of the functionality change. We call this whenever the state of queuedDrivers or pendingRetryDrivers has changed.
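For illustration, here is a minimal Scala sketch of such a hook, assuming suppress/revive toggles on queue emptiness. `OfferDriver` and the surrounding class are hypothetical stand-ins for the Mesos `SchedulerDriver` and the dispatcher's scheduler; this is not the PR's actual code:

```scala
// Illustrative sketch only. OfferDriver stands in for the Mesos
// SchedulerDriver; the two queues mirror the dispatcher state named above.
trait OfferDriver {
  def suppressOffers(): Unit
  def reviveOffers(): Unit
}

class DispatcherQueues(driver: OfferDriver) {
  private var suppressed = false
  var queuedDrivers = List.empty[String]        // new submissions awaiting launch
  var pendingRetryDrivers = List.empty[String]  // failed drivers awaiting retry

  // Called whenever queuedDrivers or pendingRetryDrivers changes.
  def onQueueStateChanged(): Unit = {
    val hasWork = queuedDrivers.nonEmpty || pendingRetryDrivers.nonEmpty
    if (hasWork && suppressed) {
      driver.reviveOffers()    // work appeared: resume the offer stream
      suppressed = false
    } else if (!hasWork && !suppressed) {
      driver.suppressOffers()  // nothing to place: stop the offer cycle
      suppressed = true
    }
  }
}
```

The `suppressed` flag guards against redundant suppress/revive round-trips to the master when the queue state changes without crossing the empty/non-empty boundary.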
Test build #73303 has finished for PR 17031 at commit

Test build #73307 has finished for PR 17031 at commit
Do we still support fine-grained mode?
It's deprecated, but I had to make some changes to it just to compile. I hope to remove it completely by 2.2.
ok cool!
@mgummelt How would the operator be informed about starvation? Shouldn't this be transparent? It could be challenging to watch for corner cases, even low-probability ones. Are you planning to add the timer-monitoring work after this PR? Still reviewing; I need some time.

@skonto It should be clear in the logs. As long as you have at least INFO logging enabled, you'll see "Suppressing offers." in the logs, and little or nothing after that, since the offer cycle stops. Unfortunately, Mesos doesn't expose the suppressed state of frameworks, so you can't glean this from state.json.
The suppress/revive logic LGTM. I didn't look that closely at the refactoring changes. Where are the Mesos/Spark integration tests that you mentioned? @mgummelt

If we're concerned about a lost reviveOffers() call and don't want to handle that corner case, do we want to document it somewhere for operators? "If jobs aren't running and you see [...] in the logs, do this."

@susanxhuynh Mesos/Spark integration tests: https://github.com/typesafehub/mesos-spark-integration-tests. We run them as a subset of the DC/OS Spark integration tests: https://github.com/mesosphere/spark-build/blob/master/tests/test.py#L89

@susanxhuynh I don't think it's worth documenting. It should be clear in the logs, which is where an operator will turn if they notice no jobs are launching.

@mgummelt Yes, they should look at the logs, but how do they know this is something that requires action on their side and not a cluster issue or anything else? It should be documented, since it requires manual intervention. It also makes it harder to build recovery logic for monitoring systems if they have to dig into logs; I would prefer this to be advertised somewhere, like a REST API. Is the general problem of resource starvation solved for all the other frameworks in the Universe? Here https://issues.apache.org/jira/browse/MESOS-6112 it is mentioned that we should see the issue with > 5 frameworks, while the duplicate https://issues.apache.org/jira/browse/MESOS-3202 refers to a smaller number. What is the minimum setup to reproduce this (where does 5 come from?), and are there any integration tests covering this for Spark? The suppress & revive logic LGTM.
Shouldn't we check whether we have actually received any offers from the master lately, and call reviveOffers() only if not? We could use a backoff approach here...
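For illustration, a minimal sketch of the backoff idea suggested above; this is purely hypothetical and not part of the PR. All names are invented:

```scala
// Illustrative only: revive when no offer has arrived within the current
// backoff window, and double the window after each revive.
class ReviveBackoff(revive: () => Unit, maxBackoffMs: Long = 60000L) {
  private var lastOfferMs = System.currentTimeMillis()
  private var backoffMs = 1000L

  def onOfferReceived(): Unit = {
    lastOfferMs = System.currentTimeMillis()
    backoffMs = 1000L  // offers are flowing again; reset the backoff
  }

  // Call periodically (e.g. from a scheduled task) while work is queued.
  def maybeRevive(): Unit = {
    if (System.currentTimeMillis() - lastOfferMs >= backoffMs) {
      revive()
      lastOfferMs = System.currentTimeMillis()           // avoid revive storms
      backoffMs = math.min(backoffMs * 2, maxBackoffMs)  // exponential backoff
    }
  }
}
```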
Since you are refactoring the code anyway: s/url/_.
Parentheses are redundant.
see above
return is redundant.
brackets are redundant.
The brackets are consistent with our other format strings. I'm not trying to refactor all the code in this PR, btw. I just touched the code whose poor style was hindering my ability to solve the problem related to this PR.
Given the concerns about the dispatcher being stuck in a suppressed state, I'm going to solve this a different way. I'm going to increase the default offer decline timeout to 120s and make it configurable, just as it is in the driver. This will make it so that the offer is offered to up to 120 other frameworks before circling back to the dispatcher, rather than the default 5. I'll also keep the explicit revive calls when a new driver is submitted or an existing one fails, which immediately causes offers to be re-offered to the dispatcher. This removes the risk that the dispatcher gets stuck in a suppressed state, because it never suppresses itself.
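For concreteness, here is a sketch of the decline-with-filter mechanics using the real Mesos Java API (`Filters.refuse_seconds`, `declineOffer`, `reviveOffers`); the object and method names around those calls are illustrative, not the PR's actual code:

```scala
import org.apache.mesos.Protos.{Filters, OfferID}
import org.apache.mesos.SchedulerDriver

object DeclineAndRevive {
  // Bumped default: a declined offer stays away from the dispatcher for
  // 120s instead of Mesos's default of 5s.
  val DefaultRefuseSeconds = 120.0

  def declineUnused(driver: SchedulerDriver, offers: Seq[OfferID],
                    refuseSeconds: Double = DefaultRefuseSeconds): Unit = {
    val filters = Filters.newBuilder().setRefuseSeconds(refuseSeconds).build()
    offers.foreach(id => driver.declineOffer(id, filters))
  }

  // An explicit revive clears previously installed decline filters, so
  // newly submitted or retried drivers never wait out the 120s.
  def onNewWork(driver: SchedulerDriver): Unit = driver.reviveOffers()
}
```

Because reviveOffers() clears the framework's decline filters, the long refuse timeout costs nothing when new work arrives.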
OK, like the Cassandra case, you mean, right?

@skonto Cassandra supports suppress/revive: https://github.com/mesosphere/dcos-cassandra-service/blob/master/cassandra-scheduler/src/main/java/com/mesosphere/dcos/cassandra/scheduler/CassandraScheduler.java#L423. I can't speak for all the frameworks in the Universe; Cassandra and Kafka both support suppress/revive, and everything built with the

@skonto @susanxhuynh I've updated the solution to use a longer (120s) default refuse timeout instead of suppressing offers. Please re-review. Just as the previous refuse_seconds settings were undocumented, I've left this one undocumented; users should almost never need to customize it.
Test build #73531 has finished for PR 17031 at commit
OK, I see. Cassandra uses 30. What is a reasonable timeout?

It depends on the application: it's the amount of time you have to wait before having another opportunity to use those resources. But if you explicitly revive, which we do here whenever we need more resources, then it doesn't matter. We could set it to infinity and still never be starved, because we'll always get another shot at the resources when we revive.

@mgummelt Here is my rationale about the refuse time. As stated here:

Your understanding is correct. You must set refuse_seconds for all your frameworks to some value N, such that N >= #frameworks. So for this change, if an operator is running more than 120 frameworks, they may need to configure this value. However, I'm not aware of any Mesos cluster on Earth running that many frameworks.

@skonto Any other concerns? Can I get a LGTM?
srowen left a comment:

This looks like a large amount of change relative to the description. Are all the incidental code changes intentional?
@srowen Some parts are refactoring only, to improve quality.

@srowen Yes, most of the code is refactoring that I came across while solving this. If that's going to delay the merge, please let me know and I can remove the refactoring.

@skonto I completely agree that this is a cluster-wide issue, but unfortunately that's the state of things. In the long term, optimistic offers in Mesos should fix this.
Force-pushed from b6e3205 to ba864d0.
@srowen Just to move things along, I removed everything not directly relevant to this JIRA.

Test build #73883 has finished for PR 17031 at commit
Compared to the title, this still looks like a significant change. Is the intent something different from the JIRA? This doesn't just increase a default. I don't have an opinion on the changes; I'm just commenting on the consistency of the change vs. the discussion and paper trail.

@mgummelt Do we want to keep the suppress/revive technique? Is the timeout increase not enough? I think that's the added code here, compared to what someone would expect from the title.
Also, the description should be updated, IMHO.

@srowen To support increasing the default, I've had to:

@skonto Updated the description.
Test build #74020 has finished for PR 17031 at commit
@srowen ping

Merged to master

Thanks!
[SPARK-19702][MESOS] Increase default refuse_seconds timeout in the Mesos Spark Dispatcher

Increase default refuse_seconds timeout, and make it configurable. See JIRA for details on how this reduces the risk of starvation.

Unit tests, manual testing, and the Mesos/Spark integration test suite.

cc susanxhuynh skonto jmlvanre

Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes apache#17031 from mgummelt/SPARK-19702-suppress-revive.
What changes were proposed in this pull request?
Increase default refuse_seconds timeout, and make it configurable. See JIRA for details on how this reduces the risk of starvation.
How was this patch tested?
Unit tests, manual testing, and the Mesos/Spark integration test suite.
cc @susanxhuynh @skonto @jmlvanre