Add scheduler version enforcement #1872

Merged
merged 2 commits into from
Oct 27, 2016
dadgar (Contributor) commented Oct 26, 2016

@armon For review

armon (Member) left a comment

The major feedback is that the worker needs to detect this and do a very long or permanent back off until the leader changes. While the same leader is running, we will never be able to dequeue work.
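To make the suggestion concrete, here is a rough sketch of a worker parking itself until the leader changes instead of retrying in a loop. Everything in it is hypothetical: `waitForLeaderChange`, the `currentLeader` callback, the polling interval, and the example addresses are illustration only and are not taken from the Nomad code in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// waitForLeaderChange blocks until currentLeader() reports something other
// than stale, or until stop is closed. It returns true if the leader changed.
// All names here are hypothetical; this is not Nomad's API.
func waitForLeaderChange(stale string, currentLeader func() string, interval time.Duration, stop <-chan struct{}) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if currentLeader() != stale {
			return true
		}
		select {
		case <-ticker.C:
			// Poll again on the next tick.
		case <-stop:
			return false
		}
	}
}

func main() {
	// Simulated leader lookups: the leader changes on the third call.
	leaders := []string{"10.0.0.1:4647", "10.0.0.1:4647", "10.0.0.2:4647"}
	i := 0
	current := func() string {
		l := leaders[i]
		if i < len(leaders)-1 {
			i++
		}
		return l
	}
	changed := waitForLeaderChange("10.0.0.1:4647", current, 10*time.Millisecond, make(chan struct{}))
	fmt.Println("leader changed:", changed)
}
```

The trade-off is that a parked worker is quiet, so an operator would only see a single log line rather than a repeated error.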

Timeout time.Duration
Schedulers []string
Timeout time.Duration
SchedulerVersion uint16
Member

Something about the uint16 feels dirty...

Contributor Author

It gives us plenty of versions while keeping the overhead low. We will never reach 65,536 version bumps.
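For readers following the thread, a minimal sketch of what version enforcement on dequeue could look like. Only the three request fields come from the diff above; `DequeueRequest`, `checkVersion`, `ErrVersionMismatch`, the constant's value, and the strict-equality rule are assumptions for illustration, not the code merged in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// SchedulerVersion is a hypothetical value; a uint16 allows 65,536
// distinct versions (0 through 65535).
const SchedulerVersion uint16 = 1

// DequeueRequest mirrors the fields shown in the diff; the type name is
// an assumption.
type DequeueRequest struct {
	Schedulers       []string
	Timeout          time.Duration
	SchedulerVersion uint16
}

// ErrVersionMismatch is a hypothetical sentinel error.
var ErrVersionMismatch = errors.New("scheduler version mismatch")

// checkVersion rejects a request whose scheduler version differs from the
// one the leader expects. Strict equality is an assumption of this sketch.
func checkVersion(req *DequeueRequest, leaderVersion uint16) error {
	if req.SchedulerVersion != leaderVersion {
		return fmt.Errorf("%w: worker has %d, leader requires %d",
			ErrVersionMismatch, req.SchedulerVersion, leaderVersion)
	}
	return nil
}

func main() {
	req := &DequeueRequest{
		Schedulers:       []string{"service"},
		Timeout:          500 * time.Millisecond,
		SchedulerVersion: 0, // an older worker
	}
	if err := checkVersion(req, SchedulerVersion); err != nil {
		fmt.Println("dequeue rejected:", err)
	}
}
```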

@@ -134,8 +134,9 @@ func (w *Worker) run() {
func (w *Worker) dequeueEvaluation(timeout time.Duration) (*structs.Evaluation, string, bool) {
// Setup the request
req := structs.EvalDequeueRequest{
Member

The worker should probably not sit in a hot loop calling dequeue on this error, since the failure is essentially semi-permanent.

dadgar (Contributor Author) commented Oct 27, 2016

So I was thinking we want it to spam the logs so the operator knows they are doing something bad. On the error we start backing off:

nomad/nomad/worker.go

Lines 146 to 163 in 962f4d4

REQ:
	// Check if we are paused
	w.checkPaused()
	// Make a blocking RPC
	start := time.Now()
	err := w.srv.RPC("Eval.Dequeue", &req, &resp)
	metrics.MeasureSince([]string{"nomad", "worker", "dequeue_eval"}, start)
	if err != nil {
		if time.Since(w.start) > dequeueErrGrace && !w.srv.IsShutdown() {
			w.logger.Printf("[ERR] worker: failed to dequeue evaluation: %v", err)
		}
		if w.backoffErr(backoffBaselineSlow, backoffLimitSlow) {
			return nil, "", true
		}
		goto REQ
	}
	w.backoffReset()
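The snippet above relies on `backoffErr` and `backoffReset`, whose bodies are not shown in this thread. As a rough illustration of the general idea (a delay that grows on each consecutive failure up to a limit, and resets on success), here is a standalone sketch. The `backoff` type, its doubling policy, and the example durations are assumptions, not Nomad's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// backoff tracks consecutive failures. Hypothetical type for illustration.
type backoff struct {
	failures uint
}

// next returns the delay for the current failure: base doubled once per
// previous failure, capped at limit. It then records the failure.
func (b *backoff) next(base, limit time.Duration) time.Duration {
	delay := base
	// Stop doubling once the limit is reached so the value cannot overflow.
	for i := uint(0); i < b.failures && delay < limit; i++ {
		delay *= 2
	}
	if delay > limit {
		delay = limit
	}
	b.failures++
	return delay
}

// reset clears the failure count after a successful call.
func (b *backoff) reset() {
	b.failures = 0
}

func main() {
	b := &backoff{}
	// Example durations are arbitrary, not Nomad's constants.
	for i := 0; i < 8; i++ {
		fmt.Println("retry in", b.next(20*time.Millisecond, time.Second))
	}
	b.reset()
	fmt.Println("after reset:", b.next(20*time.Millisecond, time.Second))
}
```

With a capped backoff the error is still logged on every retry, so the operator keeps seeing it, but the retry rate stays bounded by the limit.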

@dadgar dadgar merged commit 42fb115 into master Oct 27, 2016
@dadgar dadgar deleted the f-dequeue branch October 27, 2016 18:42
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023