Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: allow pausing and un-pausing of leader broker routine #13045

Merged
merged 6 commits into from
Jul 6, 2022
Merged

Conversation

jrasell
Copy link
Member

@jrasell jrasell commented May 17, 2022

Test failure: percy-ui and unrelated
Related to #11638
Commits: split to make review easier

This changes allows operators to control whether or not the eval broker is running on the leader server. This is useful in outage situations where administrators wish to stop any work being available to the scheduler workers. It is also requisite work for future work which will allow deletion of evaluations which have not been processed.

Alongside pausing the eval broker, the blocked evals process is also disabled at the same time. This sub-process pushes evaluations into the eval broker and is also restored from the state store using the same process as the eval broker. It therefore makes sense, considering the end goal, to disable both processes for state consistency.

To ensure operators pausing/un-pausing the broker doesn't conflict with leadership transitions a mutex is used to control access to the eval broker and blocked evals processes. This is used along with a leadership check when changing the broker status which requires taking into account operator configuration.

The issue mentions disabling the reapFailedEvaluations process, however, I don't believe this is required for this change as disabling the eval broker also flushes all stored evaluation data. It therefore seemed safer to leave this alone and not require additional coordination.

The new operator scheduler commands allow inspecting and modifying the scheduler config. This is useful as it doesn't require you to supply the full payload object or remember the exact curl command to run.

@jrasell jrasell self-assigned this May 17, 2022
command/operator_scheduler_set_config.go Outdated Show resolved Hide resolved
nomad/eval_endpoint.go Outdated Show resolved Hide resolved
@jrasell
Copy link
Member Author

jrasell commented May 19, 2022

I am also wondering if we need evalBroker enabled checks on other RPC handlers such as Ack and Nack? Specifically I am figuring out what exactly is going to happen when we pause the broker after an eval has been dequeued, but before it has been processed by the scheduler.

I think I need to spend a little more time thinking about what exactly pausing and flushing the evalBroker will do to anything in-flight.

OK, I think I have this satisfied in my head and will check internally to ensure my logic is correct.

Copy link
Contributor

@DerekStrickland DerekStrickland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job on a difficult task! I left some non-blocking comments and questions. I'd personally wait on schmichael's review to merge, but you may have already been through that and it's just not updated.

nomad/leader_test.go Show resolved Hide resolved
api/operator_test.go Show resolved Hide resolved
command/operator_scheduler_get_config.go Outdated Show resolved Hide resolved
command/operator_scheduler_get_config_test.go Outdated Show resolved Hide resolved
command/operator_scheduler_get_config_test.go Outdated Show resolved Hide resolved
command/operator_scheduler_set_config.go Outdated Show resolved Hide resolved
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Lots of minor comments, but one thing that may be worth looking into is to disable deployment monitoring for the nomad job run command if the eval broker is disabled, otherwise it will get stuck:

$ nomad run example.nomad
==> 2022-06-10T11:11:17-04:00: Monitoring evaluation "076115a0"
    2022-06-10T11:11:17-04:00: Evaluation triggered by job "countdash"
==> 2022-06-10T11:11:18-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:19-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:20-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:21-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:22-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:23-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:24-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:25-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:26-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:27-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:28-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:29-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:30-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:31-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:32-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:33-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:34-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:35-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:36-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:37-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:38-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:39-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:40-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:41-04:00: Monitoring evaluation "076115a0"

Also missing a CHANGELOG entry.

command/operator_scheduler_set_config.go Outdated Show resolved Hide resolved
command/operator_scheduler_set_config.go Show resolved Hide resolved
helper/broker/notify.go Outdated Show resolved Hide resolved
helper/broker/notify.go Outdated Show resolved Hide resolved
helper/broker/notify.go Outdated Show resolved Hide resolved
helper/broker/notify.go Outdated Show resolved Hide resolved
helper/broker/notify.go Show resolved Hide resolved
nomad/eval_endpoint.go Show resolved Hide resolved
nomad/operator_endpoint.go Outdated Show resolved Hide resolved
helper/broker/notify.go Outdated Show resolved Hide resolved
e2e/operator_scheduler/input/basic.nomad Outdated Show resolved Hide resolved
helper/broker/notify.go Show resolved Hide resolved
@jrasell
Copy link
Member Author

jrasell commented Jun 14, 2022

Nice work!

Lots of minor comments, but one thing that may be worth looking into is to disable deployment monitoring for the nomad job run command if the eval broker is disabled, otherwise it will get stuck:

$ nomad run example.nomad
==> 2022-06-10T11:11:17-04:00: Monitoring evaluation "076115a0"
    2022-06-10T11:11:17-04:00: Evaluation triggered by job "countdash"
==> 2022-06-10T11:11:18-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:19-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:20-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:21-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:22-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:23-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:24-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:25-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:26-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:27-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:28-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:29-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:30-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:31-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:32-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:33-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:34-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:35-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:36-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:37-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:38-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:39-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:40-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:41-04:00: Monitoring evaluation "076115a0"

Also missing a CHANGELOG entry.

Changelog has been added.

I quite like the UX of the job submission in situation when the eval broker is paused. It shows the evaluation is making no progress which is exactly correct and will unblock if the eval broker is enabled and the eval being watched changes state. In order to avoid this we would need to make an extra call to the API for another that monitors evals such as deployments promotion, job stop, and others. That commands also have the detach flag which can be used to avoid this.

@lgfa29
Copy link
Contributor

lgfa29 commented Jun 14, 2022

I quite like the UX of the job submission in situation when the eval broker is paused. It shows the evaluation is making no progress which is exactly correct and will unblock if the eval broker is enabled and the eval being watched changes state.

I think the problem is that the person submitting the job may not know that the eval broker is paused (or even what an eval broker is 😄) and this type of output is not normal for Nomad so, from their perspective, the nomad run command is misbehaving, which may result in them retrying multiple times, queuing up more evals.

This section also updates quite often, so you get one of those lines every second.

In order to avoid this we would need to make an extra call to the API for another that monitors evals such as deployments promotion, job stop, and others.

I think they all use the monitor? So what would need to happen is breaking this loop if the CLI detects that the eval broker is paused, so no progress will be made.

An API call is an option (it doesn't have to happen at every iteration), another one could be to piggy-back in the Eval.Info response, like in a QueryMeta field, though this may not work for stale queries, where the leader is not contacted.

Another option would be to improve how this information is displayed. If this monitor gets "glinterized" like the deployment monitor it would look less like a problem with the command, but that's a lot of work.

A simpler option is to have a counter for that loop and either breaks it entirely or prints a message that the broker may be paused and that this behaviour is expected. Or maybe even Ask the user if they want to break the monitoring?

This is also an exceptional scenario, so the simplest option is probably the best. Printing a message every, say 10th, iteration may be good enough for now?

@tgross
Copy link
Member

tgross commented Jun 14, 2022

That the "Monitoring evaluation" message falls inside the loop instead of at the top feels almost accidental; in most cases we'd only ever see it once. If we move it up so that it's outside the loop we still the effect of "hey we're waiting here" without making a bunch of noise in the terminal?

@lgfa29
Copy link
Contributor

lgfa29 commented Jun 14, 2022

That the "Monitoring evaluation" message falls inside the loop instead of at the top feels almost accidental; in most cases we'd only ever see it once. If we move it up so that it's outside the loop we still the effect of "hey we're waiting here" without making a bunch of noise in the terminal?

Good point, I think that would work as well 👍

@github-actions
Copy link

github-actions bot commented Nov 4, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.3.x backport to 1.3.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants