core: allow pausing and un-pausing of leader broker routine #13045

jrasell · 2022-05-17T15:35:21Z

Test failure: percy-ui and unrelated
Related to #11638
Commits: split to make review easier

This changes allows operators to control whether or not the eval broker is running on the leader server. This is useful in outage situations where administrators wish to stop any work being available to the scheduler workers. It is also requisite work for future work which will allow deletion of evaluations which have not been processed.

Alongside pausing the eval broker, the blocked evals process is also disabled at the same time. This sub-process pushes evaluations into the eval broker and is also restored from the state store using the same process as the eval broker. It therefore makes sense, considering the end goal, to disable both processes for state consistency.

To ensure operators pausing/un-pausing the broker doesn't conflict with leadership transitions a mutex is used to control access to the eval broker and blocked evals processes. This is used along with a leadership check when changing the broker status which requires taking into account operator configuration.

The issue mentions disabling the reapFailedEvaluations process, however, I don't believe this is required for this change as disabling the eval broker also flushes all stored evaluation data. It therefore seemed safer to leave this alone and not require additional coordination.

The new operator scheduler commands allow inspecting and modifying the scheduler config. This is useful as it doesn't require you to supply the full payload object or remember the exact curl command to run.

command/operator_scheduler_set_config.go

nomad/eval_endpoint.go

jrasell · 2022-05-19T09:45:43Z

I am also wondering if we need evalBroker enabled checks on other RPC handlers such as Ack and Nack? Specifically I am figuring out what exactly is going to happen when we pause the broker after an eval has been dequeued, but before it has been processed by the scheduler.

~~I think I need to spend a little more time thinking about what exactly pausing and flushing the evalBroker will do to anything in-flight.~~

OK, I think I have this satisfied in my head and will check internally to ensure my logic is correct.

DerekStrickland

Great job on a difficult task! I left some non-blocking comments and questions. I'd personally wait on schmichael's review to merge, but you may have already been through that and it's just not updated.

nomad/leader_test.go

api/operator_test.go

command/operator_scheduler_get_config.go

command/operator_scheduler_get_config_test.go

command/operator_scheduler_set_config.go

lgfa29

Nice work!

Lots of minor comments, but one thing that may be worth looking into is to disable deployment monitoring for the nomad job run command if the eval broker is disabled, otherwise it will get stuck:

$ nomad run example.nomad
==> 2022-06-10T11:11:17-04:00: Monitoring evaluation "076115a0"
    2022-06-10T11:11:17-04:00: Evaluation triggered by job "countdash"
==> 2022-06-10T11:11:18-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:19-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:20-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:21-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:22-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:23-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:24-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:25-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:26-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:27-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:28-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:29-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:30-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:31-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:32-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:33-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:34-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:35-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:36-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:37-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:38-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:39-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:40-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:41-04:00: Monitoring evaluation "076115a0"

Also missing a CHANGELOG entry.

command/operator_scheduler_set_config.go

helper/broker/notify.go

nomad/eval_endpoint.go

nomad/operator_endpoint.go

helper/broker/notify.go

e2e/operator_scheduler/input/basic.nomad

helper/broker/notify.go

website/content/docs/commands/operator/scheduler-set-config.mdx

jrasell · 2022-06-14T12:55:00Z

Nice work!

Lots of minor comments, but one thing that may be worth looking into is to disable deployment monitoring for the nomad job run command if the eval broker is disabled, otherwise it will get stuck:

$ nomad run example.nomad
==> 2022-06-10T11:11:17-04:00: Monitoring evaluation "076115a0"
    2022-06-10T11:11:17-04:00: Evaluation triggered by job "countdash"
==> 2022-06-10T11:11:18-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:19-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:20-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:21-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:22-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:23-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:24-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:25-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:26-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:27-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:28-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:29-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:30-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:31-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:32-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:33-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:34-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:35-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:36-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:37-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:38-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:39-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:40-04:00: Monitoring evaluation "076115a0"
==> 2022-06-10T11:11:41-04:00: Monitoring evaluation "076115a0"

Also missing a CHANGELOG entry.

Changelog has been added.

I quite like the UX of the job submission in situation when the eval broker is paused. It shows the evaluation is making no progress which is exactly correct and will unblock if the eval broker is enabled and the eval being watched changes state. In order to avoid this we would need to make an extra call to the API for another that monitors evals such as deployments promotion, job stop, and others. That commands also have the detach flag which can be used to avoid this.

lgfa29 · 2022-06-14T15:03:23Z

I quite like the UX of the job submission in situation when the eval broker is paused. It shows the evaluation is making no progress which is exactly correct and will unblock if the eval broker is enabled and the eval being watched changes state.

I think the problem is that the person submitting the job may not know that the eval broker is paused (or even what an eval broker is 😄) and this type of output is not normal for Nomad so, from their perspective, the nomad run command is misbehaving, which may result in them retrying multiple times, queuing up more evals.

This section also updates quite often, so you get one of those lines every second.

In order to avoid this we would need to make an extra call to the API for another that monitors evals such as deployments promotion, job stop, and others.

I think they all use the monitor? So what would need to happen is breaking this loop if the CLI detects that the eval broker is paused, so no progress will be made.

An API call is an option (it doesn't have to happen at every iteration), another one could be to piggy-back in the Eval.Info response, like in a QueryMeta field, though this may not work for stale queries, where the leader is not contacted.

Another option would be to improve how this information is displayed. If this monitor gets "glinterized" like the deployment monitor it would look less like a problem with the command, but that's a lot of work.

A simpler option is to have a counter for that loop and either breaks it entirely or prints a message that the broker may be paused and that this behaviour is expected. Or maybe even Ask the user if they want to break the monitoring?

This is also an exceptional scenario, so the simplest option is probably the best. Printing a message every, say 10th, iteration may be good enough for now?

tgross · 2022-06-14T15:21:54Z

That the "Monitoring evaluation" message falls inside the loop instead of at the top feels almost accidental; in most cases we'd only ever see it once. If we move it up so that it's outside the loop we still the effect of "hey we're waiting here" without making a bunch of noise in the terminal?

lgfa29 · 2022-06-14T19:24:15Z

That the "Monitoring evaluation" message falls inside the loop instead of at the top feels almost accidental; in most cases we'd only ever see it once. If we move it up so that it's outside the loop we still the effect of "hey we're waiting here" without making a bunch of noise in the terminal?

Good point, I think that would work as well 👍

command/operator_scheduler_get_config.go

command/operator_scheduler_set_config.go

github-actions · 2022-11-04T02:39:19Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

jrasell requested review from schmichael, lgfa29 and jazzyfresh May 17, 2022 15:35

jrasell self-assigned this May 17, 2022

schmichael reviewed May 17, 2022

View reviewed changes

command/operator_scheduler_set_config.go Outdated Show resolved Hide resolved

nomad/eval_endpoint.go Outdated Show resolved Hide resolved

vercel bot deployed to Preview – nomad-storybook-and-ui May 18, 2022 07:10 View deployment

jrasell requested a review from schmichael May 19, 2022 08:57

vercel bot deployed to Preview – nomad-storybook-and-ui May 19, 2022 09:00 View deployment

jrasell force-pushed the f-pause-eval branch from de0766c to 8a3df7f Compare May 30, 2022 07:56

vercel bot deployed to Preview – nomad-storybook-and-ui May 30, 2022 08:00 View deployment

jrasell added this to the 1.3.2 milestone Jun 6, 2022

DerekStrickland approved these changes Jun 10, 2022

View reviewed changes

lgfa29 approved these changes Jun 10, 2022

View reviewed changes

tgross reviewed Jun 10, 2022

View reviewed changes

vercel bot deployed to Preview – nomad-storybook-and-ui June 13, 2022 10:28 View deployment

vercel bot deployed to Preview – nomad June 13, 2022 10:29 View deployment

jrasell added a commit that referenced this pull request Jun 13, 2022

changelog: add entry for #13045.

ca36c54

vercel bot deployed to Preview – nomad-storybook-and-ui June 13, 2022 13:11 View deployment

vercel bot deployed to Preview – nomad-storybook-and-ui June 14, 2022 12:52 View deployment

vercel bot deployed to Preview – nomad June 14, 2022 12:52 View deployment

jrasell force-pushed the f-pause-eval branch from 44d7879 to 9177f83 Compare June 14, 2022 12:59

vercel bot deployed to Preview – nomad-storybook-and-ui June 14, 2022 13:02 View deployment

vercel bot deployed to Preview – nomad June 14, 2022 13:02 View deployment

schmichael reviewed Jun 21, 2022

View reviewed changes

command/operator_scheduler_get_config.go Show resolved Hide resolved

command/operator_scheduler_set_config.go Show resolved Hide resolved

command/operator_scheduler_set_config.go Show resolved Hide resolved

command/operator_scheduler_set_config.go Show resolved Hide resolved

vercel bot deployed to Preview – nomad-storybook-and-ui July 1, 2022 10:16 View deployment

vercel bot deployed to Preview – nomad July 1, 2022 10:17 View deployment

jrasell force-pushed the f-pause-eval branch from f6c63f8 to f985774 Compare July 1, 2022 10:43

vercel bot deployed to Preview – nomad July 1, 2022 10:46 View deployment

vercel bot deployed to Preview – nomad-storybook-and-ui July 1, 2022 10:47 View deployment

jrasell force-pushed the f-pause-eval branch from f985774 to d64cf6d Compare July 1, 2022 10:54

vercel bot deployed to Preview – nomad July 1, 2022 10:57 View deployment

vercel bot deployed to Preview – nomad-storybook-and-ui July 1, 2022 10:57 View deployment

jrasell added 6 commits July 1, 2022 14:37

core: allow pause/un-pause of eval broker on region leader.

4b170e1

agent: add ability to pause eval broker via scheduler config.

9292cdc

cli: add operator scheduler commands to interact with config.

1695012

api: add ability to pause eval broker via scheduler config

8caeb94

e2e: add operator scheduler test for eval broker pause.

df7eb6a

docs: include new opertor scheduler CLI and pause eval API info.

5f8a5ea

jrasell force-pushed the f-pause-eval branch from d64cf6d to 5f8a5ea Compare July 1, 2022 13:40

vercel bot deployed to Preview – nomad July 1, 2022 13:42 View deployment

vercel bot deployed to Preview – nomad-storybook-and-ui July 1, 2022 13:46 View deployment

schmichael approved these changes Jul 6, 2022

View reviewed changes

jrasell added the backport/1.3.x backport to 1.3.x release line label Jul 6, 2022

jrasell merged commit 24220d0 into main Jul 6, 2022

jrasell deleted the f-pause-eval branch July 6, 2022 14:13

hc-github-team-nomad-core mentioned this pull request Jul 6, 2022

Backport of core: allow pausing and un-pausing of leader broker routine into release/1.3.x #13614

Merged

github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core: allow pausing and un-pausing of leader broker routine #13045

core: allow pausing and un-pausing of leader broker routine #13045

jrasell commented May 17, 2022 •

edited

Loading

jrasell commented May 19, 2022 •

edited

Loading

DerekStrickland left a comment

lgfa29 left a comment •

edited

Loading

jrasell commented Jun 14, 2022

lgfa29 commented Jun 14, 2022

tgross commented Jun 14, 2022

lgfa29 commented Jun 14, 2022

github-actions bot commented Nov 4, 2022

core: allow pausing and un-pausing of leader broker routine #13045

core: allow pausing and un-pausing of leader broker routine #13045

Conversation

jrasell commented May 17, 2022 • edited Loading

jrasell commented May 19, 2022 • edited Loading

DerekStrickland left a comment

Choose a reason for hiding this comment

lgfa29 left a comment • edited Loading

Choose a reason for hiding this comment

jrasell commented Jun 14, 2022

lgfa29 commented Jun 14, 2022

tgross commented Jun 14, 2022

lgfa29 commented Jun 14, 2022

github-actions bot commented Nov 4, 2022

jrasell commented May 17, 2022 •

edited

Loading

jrasell commented May 19, 2022 •

edited

Loading

lgfa29 left a comment •

edited

Loading