Standardise using Cronitor for all scheduled tasks #3380

benthorner · 2021-11-24T13:34:49Z

https://www.pivotaltracker.com/story/show/180344153

Previously we only used this wrapper for tasks in "nightly_tasks"
and "reporting_tasks", but it's important these tasks are running
regularly as well, and we shouldn't rely on them emitting lots of
errors in order to know otherwise.

Since I'm about to add another task to this file, this seems like
a good opportunity to fix what seems to be a gap in our monitoring.

Some of the tasks run frequently, others less so. We may want to
tune Cronitor for some of them - the more critical ones - and set
a sensible default for others e.g. a bit over a day.

Previously we only used this wrapper for tasks in "nightly_tasks" and "reporting_tasks", but it's important these tasks are running regularly as well, and we shouldn't rely on them emitting lots of errors in order to know otherwise. Since I'm about to add another task to this file, this seems like a good opportunity to fix what seems to be a gap in our monitoring. Some of the tasks run frequently, others less so. We may want to tune Cronitor for some of them - the more critical ones - and set a sensible default for others e.g. a bit over a day.

idavidmcdonald · 2021-11-24T15:40:20Z

I'd suggest a review from @leohemsted on this but there is a chance that each cronitor task specified in here needs an equivalent set up manually in the cronitor user interface...

leohemsted · 2021-11-24T16:30:08Z

@idavidmcdonald is correct - we'll need to create the alert in cronitor, and then add the URL/task-name to the CRONITOR_KEYS dict in credentials/production/paas/environment-variables

benthorner · 2021-11-24T17:58:19Z

Thanks @leohemsted, I didn't realise.

I'll mention some of what @servingUpAces said on Slack too:

We didn't add cronitor to catch bugs but to catch tasks that run once a day but do not complete because the app is restart while it's processing.

Scheduled jobs is probably alright to check that it runs once an hour, we trigger the task once every 15 minutes, but services can only schedule once an hour.

I'm more convinced about adding this monitor for less frequent jobs. But still not 100% convinced.

Although adding £28/month to our running costs doesn't seem like much, we should still consider it and check with Caley or at least make her aware that the bill is going up.

Resource-wise, it's just a couple of HTTP requests, which aren't going to slow down anything considerably when we're talking in minutes. We'll also still continue running the task if Cronitor is down for some reason.

I'm planning to add a @cronitor for the new hourly task I'm adding, so I'd like to reach a conclusion on the existing ones here. Here's a few different proposals of what we could be checking for:

">= daily" - check tasks that run about once a day, in case they get killed due to a deploy / transient issue.
">= daily + critical tasks" - adds monitoring where we do scheduling on behalf of the user (jobs, alerts).
">= daily + critical tasks + maintenance" - also adds protection for issues that would build up over time.

Here's an overview of what each proposal would look like:

Task	Frequency	proposal (>= daily)	proposal (>= daily + critical tasks)	proposal (>= daily + critical tasks + maintenance)
run-scheduled-jobs	every 15 mins	❌	✅	✅
delete-verify-codes	~hourly	❌	❌	✅
delete-invitations	~hourly	❌	❌	✅
switch-current-sms-provider-on-slow-delivery	minutely	❌	❌	❌
check-job-status	minutely	❌	❌	❌
tend-providers-back-to-middle	every 5 mins	❌	❌	❌
check-for-missing-rows-in-completed-jobs	every 10 mins	❌	❌	❌
replay-created-notifications	every 15 mins	❌	❌	❌
check-if-letters-still-pending-virus-check	week-daily (change to daily)	✅	✅	✅
check-if-letters-still-in-created	week-daily (change to daily)	✅	✅	✅
check-for-services-with-high-failure-rates-or-sending-to-tv-numbers	week-daily	✅	✅	✅
trigger-link-tests	every 15 mins	❌	❌	❌
auto-expire-broadcast-messages	every 5 mins	❌	✅	✅
remove-yesterdays-planned-tests-on-govuk-alerts	daily	✅	✅	✅
(my new hourly task)	hourly	❌	❌	❌

Still feels a bit arbitrary. I'm trying to think about them in terms of "would I bother writing a manual alert for this?". I don't think we need to monitor monitoring tasks like "trigger link tests", and we could potentially drop the "tv numbers" one. We also don't need to have an alert for everything, but (2) strikes me as a valuable addition.

What do we think?

(I would also accept a blanket "let's leave it as-is and not add @cronitor to anything for now".)

leohemsted · 2021-12-02T10:48:55Z

Been quiet on this for a while but I guess one thought I'm struggling with a bit is - what's the next step if a fifteen minutely task fails? What's the lag on us finding out about it through a cronitor email?

We added cronitor originally because nightly tasks were silently failing due to paas cells rolling at a similar time of night to these tasks running, and them being killed without having a chance to emit a stack trace error log. If a nightly task silently fails, we'll likely investigate the cause, and may well need to and manually kick it off again.

If a fifteen minutely task silently fails, if we happen to see the cronitor email before the next run, which isn't a guarantee given there'd be a buffer before the email sends. Would we have any action other than "wait for the next 15 minute interval and see what happens"? Same with the hourly tasks. They're non-essential to run immediately, if we delete old invites an hour late, that shouldn't impact the system badly.

If either task noisily fails we'll get an error log, which will send an email out which we should act on. If we're not acting on those emails, I think the answer is to reduce the amount of error logs we emit that we can't act on, rather than increasing the volume of a different type of error log.

I think I'd rather we just improve the quality of the production-notify-delivery-worker-reporting-exception-warning and production-notify-delivery-worker-periodic-exception-warning alarm emails, rather than add a whole new and relatively costly way of finding out that things didn't work.

Or you'll find before long we're all filtering cronitor emails into a folder somewhere that none of us read and the whole cycle will continue.

idavidmcdonald · 2021-12-06T11:56:12Z

I don't have a strong opinion on this (partially because I'm finding it hard to evaluate which is the best option).

Unless people feel strongly and we have a good idea of how long it might take to start using one of the suggested approaches, I'd suggest we don't do anything about changing our cronitor approach for the moment.

Happy for you to make the call Ben.

benthorner · 2021-12-06T15:00:31Z

@idavidmcdonald I'm content to leave it as-is and just document how we use it now, so it's clear in future, based on some of @leohemsted's comment and talking in Slack. I propose we have a quick decision record like the following, which links back to this PR (can discuss elsewhere if controversial):

More detail in this discussion PR.

We use Cronitor to monitor infrequent but critical tasks have run as expected e.g. collate-letter-pdfs-to-be-sent must run on schedule. While we usually get an alert if a task logs an exception, Cronitor covers us if a task fails silently, which is usually due to a deployment or if an instance gets recycled.

We don't add Cronitor to more frequent tasks. Adding a task requires extra config in -credentials and costs more. There's also a risk of reducing the value of Cronitor alerts if we alert on all tasks indiscriminately, since more frequent tasks naturally recur without needing any action from us.

We did consider the potential for Cronitor to cover us for schedule / scheduler bugs, but we should get alerts about these together with other errors. We don't add Cronitor for daily -alert- tasks (e.g. check-if-letters-still-in-created) - although these are critical, they are really temporary until we have more timely alerts.

What do you think?

idavidmcdonald · 2021-12-06T17:45:56Z

I like your idea Ben and that is a nice length.

One thing that might be good is to clarify that this is currently what we are doing, rather than a decision made today to adopt a new process (I think).

benthorner · 2021-12-07T10:30:44Z

Thanks @idavidmcdonald. I've a note about it being retrospective in the Wiki version.

Closing in favour of new documentation: https://github.com/alphagov/notifications-manuals/wiki/Decision-records#use-cronitor-to-monitor-daily-tasks---7122021

benthorner force-pushed the scheduled-cronitor-180344153 branch from a68eb66 to ac6ce90 Compare November 24, 2021 13:42

benthorner closed this Dec 7, 2021

benthorner deleted the scheduled-cronitor-180344153 branch December 7, 2021 10:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardise using Cronitor for all scheduled tasks #3380

Standardise using Cronitor for all scheduled tasks #3380

benthorner commented Nov 24, 2021

idavidmcdonald commented Nov 24, 2021

leohemsted commented Nov 24, 2021

benthorner commented Nov 24, 2021

leohemsted commented Dec 2, 2021

idavidmcdonald commented Dec 6, 2021

benthorner commented Dec 6, 2021

idavidmcdonald commented Dec 6, 2021

benthorner commented Dec 7, 2021

Standardise using Cronitor for all scheduled tasks #3380

Standardise using Cronitor for all scheduled tasks #3380

Conversation

benthorner commented Nov 24, 2021

idavidmcdonald commented Nov 24, 2021

leohemsted commented Nov 24, 2021

benthorner commented Nov 24, 2021

leohemsted commented Dec 2, 2021

idavidmcdonald commented Dec 6, 2021

benthorner commented Dec 6, 2021

idavidmcdonald commented Dec 6, 2021

benthorner commented Dec 7, 2021