core: allow pausing and un-pausing of leader broker routine #13045
OK, I think I have this satisfied in my head and will check internally to ensure my logic is correct.
Great job on a difficult task! I left some non-blocking comments and questions. I'd personally wait on schmichael's review to merge, but you may have already been through that and it's just not updated.
Nice work!
Lots of minor comments, but one thing that may be worth looking into is to disable deployment monitoring for the nomad job run command if the eval broker is disabled, otherwise it will get stuck:
$ nomad run example.nomad
==> 2022-06-10T11:11:17-04:00: Monitoring evaluation "076115a0"
2022-06-10T11:11:17-04:00: Evaluation triggered by job "countdash"
==> 2022-06-10T11:11:18-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:19-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:20-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:21-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:22-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:23-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:24-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:25-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:26-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:27-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:28-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:29-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:30-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:31-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:32-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:33-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:34-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:35-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:36-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:37-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:38-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:39-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:40-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:41-04:00: Monitoring evaluation "076115a0"
Also missing a CHANGELOG entry.
Changelog has been added. I quite like the UX of the job submission in the situation where the eval broker is paused. It shows the evaluation is making no progress, which is exactly correct, and it will unblock once the eval broker is enabled and the eval being watched changes state. To avoid this we would need to make an extra call to the API for this and any other command that monitors evals, such as deployment promotion, job stop, and others. Those commands also have the detach flag, which can be used to avoid this.
I think the problem is that the person submitting the job may not know that the eval broker is paused (or even what an eval broker is 😄) and this type of output is not normal for Nomad so, from their perspective, the This section also updates quite often, so you get one of those lines every second.
I think they all use the An API call is an option (it doesn't have to happen at every iteration), another one could be to piggy-back in the Another option would be to improve how this information is displayed. If this monitor gets "glinterized" like the deployment monitor it would look less like a problem with the command, but that's a lot of work. A simpler option is to have a counter for that loop that either breaks out of it entirely or prints a message that the broker may be paused and that this behaviour is expected. Or maybe even This is also an exceptional scenario, so the simplest option is probably the best. Printing a message every, say, 10th iteration may be good enough for now?
That the "Monitoring evaluation" message falls inside the loop instead of at the top feels almost accidental; in most cases we'd only ever see it once. If we move it up so that it's outside the loop, we still get the effect of "hey we're waiting here" without making a bunch of noise in the terminal?
Good point, I think that would work as well 👍
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Test failure: percy-ui and unrelated
Related to #11638
Commits: split to make review easier
This change allows operators to control whether or not the eval broker is running on the leader server. This is useful in outage situations where administrators wish to stop any work being available to the scheduler workers. It is also prerequisite work for future changes which will allow deletion of evaluations which have not been processed.
Alongside pausing the eval broker, the blocked evals process is also disabled at the same time. This sub-process pushes evaluations into the eval broker and is also restored from the state store using the same process as the eval broker. It therefore makes sense, considering the end goal, to disable both processes for state consistency.
To ensure that operators pausing or un-pausing the broker do not conflict with leadership transitions, a mutex is used to control access to the eval broker and blocked evals processes. This is used along with a leadership check when changing the broker status, which requires taking operator configuration into account.
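As a rough illustration of that coordination, the following Go sketch guards a single enabled flag with a mutex and recomputes it from both leadership and the operator's pause setting. It is not the code in this PR: brokerController and its fields and methods are hypothetical names, and the real eval broker and blocked evals processes are reduced here to one boolean.

```go
package main

import (
	"fmt"
	"sync"
)

// brokerController is a hypothetical stand-in for the leader server state.
// None of these names come from the Nomad code base.
type brokerController struct {
	mu            sync.Mutex
	isLeader      bool // whether this server currently holds leadership
	operatorPause bool // pause requested through operator configuration
	enabled       bool // shared state for the eval broker and blocked evals
}

// reconcileLocked recomputes the enabled state. Callers must hold mu.
func (c *brokerController) reconcileLocked() {
	// The broker only runs on the leader, and only if the operator has not paused it.
	c.enabled = c.isLeader && !c.operatorPause
}

// SetPaused records the operator's choice and applies it under the lock.
func (c *brokerController) SetPaused(paused bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.operatorPause = paused
	c.reconcileLocked()
}

// SetLeader handles a leadership transition under the same lock, so it
// cannot interleave with an operator pause or un-pause.
func (c *brokerController) SetLeader(leader bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.isLeader = leader
	c.reconcileLocked()
}

// Enabled reports whether the broker is currently running.
func (c *brokerController) Enabled() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.enabled
}

func main() {
	c := &brokerController{}
	c.SetPaused(true)
	c.SetLeader(true)
	fmt.Println("enabled after gaining leadership while paused:", c.Enabled())
	c.SetPaused(false)
	fmt.Println("enabled after un-pausing:", c.Enabled())
}
```

Funnelling both code paths through one reconcile step means neither a leadership change nor an operator request can enable the broker without consulting the other condition.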
The issue mentions disabling the reapFailedEvaluations process; however, I don't believe this is required for this change as disabling the eval broker also flushes all stored evaluation data. It therefore seemed safer to leave this alone and not require additional coordination.
The new operator scheduler commands allow inspecting and modifying the scheduler config. This is useful as it doesn't require you to supply the full payload object or remember the exact curl command to run.