Add scheduler version enforcement #1872
The major feedback is that the worker needs to detect this and do a very long or permanent back off until the leader changes. While the same leader is running, we will never be able to dequeue work.
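A minimal sketch of the behavior requested here, assuming the worker can tell a version rejection apart from other errors and knows which server is the leader. The names `versionGate`, `errVersionMismatch`, `errTransient`, `shortWait`, and `longWait` are hypothetical and do not come from this PR:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical errors and waits, chosen only for this sketch.
var (
	errVersionMismatch = errors.New("scheduler version mismatch")
	errTransient       = errors.New("rpc timeout")
)

const (
	shortWait = time.Second     // ordinary, transient failures
	longWait  = 5 * time.Minute // failures that persist until the leader changes
)

// versionGate remembers which leader rejected this worker's scheduler version.
type versionGate struct {
	rejectedBy string
}

// wait returns how long to pause after a failed dequeue against leader.
func (g *versionGate) wait(err error, leader string) time.Duration {
	if errors.Is(err, errVersionMismatch) {
		g.rejectedBy = leader
		return longWait
	}
	g.rejectedBy = ""
	return shortWait
}

// leaderChanged reports whether a rejection is recorded and the current
// leader differs from the one that issued it, so a retry is worthwhile.
func (g *versionGate) leaderChanged(current string) bool {
	return g.rejectedBy != "" && g.rejectedBy != current
}

func main() {
	g := &versionGate{}
	fmt.Println(g.wait(errVersionMismatch, "server-1")) // long wait
	fmt.Println(g.leaderChanged("server-1"))            // same leader: keep waiting
	fmt.Println(g.leaderChanged("server-2"))            // new leader: retry
	fmt.Println(g.wait(errTransient, "server-2"))       // short wait
}
```

The point of recording the rejecting leader is that a long wait can be cut short as soon as leadership moves, rather than waiting out the full interval.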
Timeout time.Duration
Schedulers []string
Timeout time.Duration
SchedulerVersion uint16
Something about the uint16 feels dirty...
It gives us a lot of versions while keeping the overhead low. We will never pass 65536 version bumps.
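For context, a uint16 holds values from 0 to 65535. A rough sketch of what a server-side check on that field could look like follows. `dequeueRequest` and `checkSchedulerVersion` are hypothetical names, and strict equality between the worker and leader versions is an assumption of this sketch, not something the PR confirms:

```go
package main

import (
	"fmt"
	"math"
)

// maxVersion is the largest value a uint16 version field can hold.
const maxVersion = math.MaxUint16

// dequeueRequest is a hypothetical stand-in for the real request struct.
type dequeueRequest struct {
	Schedulers       []string
	SchedulerVersion uint16
}

// checkSchedulerVersion rejects a request whose scheduler version differs
// from the leader's. Requiring equality is an assumption of this sketch.
func checkSchedulerVersion(req dequeueRequest, leaderVersion uint16) error {
	if req.SchedulerVersion != leaderVersion {
		return fmt.Errorf("dequeue disallowed: worker scheduler version %d, leader version %d",
			req.SchedulerVersion, leaderVersion)
	}
	return nil
}

func main() {
	fmt.Println(maxVersion)
	fmt.Println(checkSchedulerVersion(dequeueRequest{SchedulerVersion: 1}, 1))
	fmt.Println(checkSchedulerVersion(dequeueRequest{SchedulerVersion: 0}, 1))
}
```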
@@ -134,8 +134,9 @@ func (w *Worker) run() {
func (w *Worker) dequeueEvaluation(timeout time.Duration) (*structs.Evaluation, string, bool) {
	// Setup the request
	req := structs.EvalDequeueRequest{
The worker should probably not sit in a hot loop calling dequeue on this error, since the failure is basically semi-permanent.
So I was thinking we want it to spam the logs so the operator knows they are doing something bad. On the error we start backing off:
Lines 146 to 163 in 962f4d4
REQ:
	// Check if we are paused
	w.checkPaused()
	// Make a blocking RPC
	start := time.Now()
	err := w.srv.RPC("Eval.Dequeue", &req, &resp)
	metrics.MeasureSince([]string{"nomad", "worker", "dequeue_eval"}, start)
	if err != nil {
		if time.Since(w.start) > dequeueErrGrace && !w.srv.IsShutdown() {
			w.logger.Printf("[ERR] worker: failed to dequeue evaluation: %v", err)
		}
		if w.backoffErr(backoffBaselineSlow, backoffLimitSlow) {
			return nil, "", true
		}
		goto REQ
	}
	w.backoffReset()
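The body of `backoffErr` is not shown in the snippet above. As an illustration only, a capped exponential backoff driven by a baseline and a limit might be computed as below. `backoffDuration`, `baseline`, and `limit` are hypothetical names and values, and doubling per failure is an assumption:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical values; the real baseline and limit are not shown in the PR.
const (
	baseline = 500 * time.Millisecond
	limit    = 10 * time.Second
)

// backoffDuration doubles base once per prior failure and caps the result
// at max. Doubling step by step avoids overflow for large failure counts.
func backoffDuration(failures int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	for failures := 0; failures <= 6; failures++ {
		fmt.Println(failures, backoffDuration(failures, baseline, limit))
	}
}
```

With a cap in place the worker still logs on every retry, so the operator sees the error, but the retry rate stays bounded.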
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
@armon For review