This document describes how we treat errors on GOV.UK.
We've recently migrated to a new error tracking service, Sentry. This gives us an opportunity to rethink how we treat errors.
There are 2 principles:
Applications should report exceptions to Sentry. Applications must not swallow errors.
A Sentry notification should require a developer of the app to take action. It should not be just a piece of information.
The goal of GOV.UK is that applications should not error. When something goes wrong it should be fixed.
A code change makes the application crash.
Desired behaviour: error is sent to Sentry, developers are notified and fix the error. Developers mark the error in Sentry as Resolved. This means a recurrence of the error will alert developers again.
Frontend applications often see timeouts when talking to the content-store.
There's little or no user impact because the request will be answered by the caching layer.
Example: https://sentry.io/govuk/app-finder-frontend/issues/352985400
Desired behaviour: error is not sent to Sentry. Instead, we rely on Smokey and Icinga checks to make sure the site functions.
Publishing applications sometimes see timeouts when talking to publishing-api. This results in the publisher seeing an error page and possibly losing data.
Example: https://sentry.io/govuk/app-content-tagger/issues/367277928
Desired behaviour: apps handle these errors better, for example by offloading the work to a Sidekiq worker. Since these errors aren't actionable, they should not be reported to Sentry. They should be tracked in Graphite.
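As a rough illustration of tracking these timeouts as a metric rather than as Sentry events, the sketch below counts each timeout and then re-raises it. `TimeoutStats`, `track_timeouts` and the metric name `publishing_api.timeouts` are hypothetical names invented for this example; a real app would send the increment to whatever client feeds Graphite.

```ruby
require "timeout"

# Hypothetical in-memory stand-in for a metrics client; a real app would
# send the increment to its Graphite pipeline instead.
class TimeoutStats
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def increment(metric)
    @counts[metric] += 1
  end
end

# Counts a timeout under the given metric name, then re-raises so the caller
# can still show an error page or hand the work to a background job.
def track_timeouts(stats, metric)
  yield
rescue Timeout::Error
  stats.increment(metric)
  raise
end

stats = TimeoutStats.new
begin
  track_timeouts(stats, "publishing_api.timeouts") do
    raise Timeout::Error, "publishing-api did not respond"
  end
rescue Timeout::Error
  # Nothing is sent to the error tracker here; the counter holds the record.
end
puts stats.counts["publishing_api.timeouts"]
```

Re-raising keeps the failure visible to the calling code, so the decision about what the publisher sees stays with the app.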
A Sidekiq worker sends something to the publishing-api, which times out. Sidekiq retries, and the next attempt succeeds.
Desired behaviour: errors are not reported to Sentry until retries are exhausted. See this PR for an example.
Relevant: getsentry/sentry-ruby#784
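The idea of reporting only once retries are exhausted can be sketched in plain Ruby. This is not how Sidekiq implements retries (Sidekiq provides a `sidekiq_retries_exhausted` hook for workers); it is a minimal stand-alone sketch, and `ErrorReporter`, `UpstreamTimeout` and `with_retries` are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for the error tracker; a real app would call its
# Sentry client here.
class ErrorReporter
  attr_reader :reported

  def initialize
    @reported = []
  end

  def notify(error)
    @reported << error
  end
end

# Hypothetical error representing a timed-out call to an upstream API.
class UpstreamTimeout < StandardError; end

# Runs the block up to max_attempts times. Intermediate failures are not
# reported because a retry follows; only the final failure is reported
# and re-raised.
def with_retries(max_attempts:, reporter:)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue UpstreamTimeout => e
    retry if attempts < max_attempts
    reporter.notify(e)
    raise
  end
end

reporter = ErrorReporter.new
result = with_retries(max_attempts: 3, reporter: reporter) do |attempt|
  raise UpstreamTimeout, "publishing-api timed out" if attempt < 2
  :sent
end
puts result
puts reporter.reported.size
```

In this usage the first attempt times out and the second succeeds, so nothing reaches the reporter.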
MySQL errors on staging while data sync happens.
Example: https://sentry.io/govuk/app-whitehall/issues/343619055
Desired behaviour: our environment is set up such that these errors do not occur.
User makes a request the application can't handle (example).
Often happens in security checks.
Example: https://sentry.io/govuk/app-frontend/issues/400074979
Desired behaviour: user gets feedback, error is not reported to Sentry.
Rummager crashes on date parsing, returns 422, which raises an error in finder-frontend.
Example: https://sentry.io/govuk/app-finder-frontend/issues/400074507
Desired behaviour: a 4XX response is returned to the browser, including an error message. Nothing is ever sent to Sentry.
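One way to picture this behaviour is a wrapper that turns an upstream client error into a 4XX response instead of letting it bubble up as an exception. The sketch below is plain Ruby, not Rails or any GOV.UK API client code: `UpstreamHTTPError`, `Response` and `respond_to_search` are hypothetical, and reusing the upstream status code for the browser response is an assumption made for the example.

```ruby
# Hypothetical error an API client might raise for a non-2XX upstream response.
class UpstreamHTTPError < StandardError
  attr_reader :status

  def initialize(status, message = "upstream returned #{status}")
    super(message)
    @status = status
  end
end

# Minimal stand-in for the response the frontend sends to the browser.
Response = Struct.new(:status, :body)

# Runs the upstream call in the block. Client errors (4XX) become a 4XX
# response with a message; anything else is re-raised so real bugs surface.
def respond_to_search
  Response.new(200, yield)
rescue UpstreamHTTPError => e
  raise unless (400..499).cover?(e.status)

  Response.new(e.status, "Sorry, we could not understand that search: #{e.message}")
end

response = respond_to_search do
  raise UpstreamHTTPError.new(422, "invalid date")
end
puts response.status
puts response.body
```

Server-side failures such as a 503 are deliberately re-raised, because those are the errors a developer may need to act on.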
Something goes wrong and we need to let developers know.
Example: Slimmer's old behaviour
Desired behaviour: developers do not use Sentry for logging. The app either raises the actual error (which causes the user to see the error) or logs the error to Kibana.
Rails reports ActionDispatch::RemoteIp::IpSpoofAttackError.
Example: https://sentry.io/govuk/app-service-manual-frontend/issues/365951370
Desired behaviour: HTTP 400 is returned, error is not reported to Sentry.
Often a controller will do something like Thing.find(params[:id]) and rely on Rails to show a 404 page for the ActiveRecord::RecordNotFound it raises (context).
Desired behaviour: errors are not reported to Sentry.
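Exceptions like the two above describe bad or unanswerable requests rather than bugs, so they can be filtered out before reporting. Sentry's Ruby clients offer an `excluded_exceptions` configuration setting for this; the sketch below shows the same idea without the dependency. `IGNORED_EXCEPTIONS` and `reportable?` are hypothetical names, and the `ActiveRecord::RecordNotFound` class defined here is only a stand-in so the example runs without Rails.

```ruby
# Stand-in for the Rails class so the sketch runs without Rails loaded.
module ActiveRecord
  class RecordNotFound < StandardError; end
end

# Exceptions that should never reach the error tracker. Names are compared
# as strings so the list works even when a class is not loaded.
IGNORED_EXCEPTIONS = [
  "ActionDispatch::RemoteIp::IpSpoofAttackError",
  "ActiveRecord::RecordNotFound",
].freeze

# Returns true when the error should be sent to the error tracker.
# Checking ancestors means subclasses of an ignored exception are ignored too.
def reportable?(error)
  ancestor_names = error.class.ancestors.map(&:name)
  (ancestor_names & IGNORED_EXCEPTIONS).empty?
end

puts reportable?(ActiveRecord::RecordNotFound.new("no such Thing"))
puts reportable?(ZeroDivisionError.new("divided by 0"))
```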