Latest commit

f78767c · Nov 29, 2017

rfc-087-dealing-with-errors.md

Dealing with errors

Summary

This RFC describes how we treat errors on GOV.UK.

Problem

We've recently migrated to a new error tracking service, Sentry. This gives us an opportunity to rethink how we treat errors.

Proposal

There are 3 principles:

1. When something goes wrong, we should be notified

Applications should report exceptions to Sentry. Applications must not swallow errors.

2. Notifications should be actionable

A Sentry notification should require a developer of the app to take action. It should not be just a piece of information.

3. Applications should not error

The goal of GOV.UK is that applications should not error. When something goes wrong it should be fixed.

Classifying errors

Bug

A code change makes the application crash.

Desired behaviour: error is sent to Sentry, developers are notified and fix the error. Developers mark the error in Sentry as Resolved. This means a recurrence of the error will alert developers again.

Intermittent errors without user impact

Frontend applications often see timeouts when talking to the content-store.

There's little or no user impact because the request will be answered by the caching layer.

Example: https://sentry.io/govuk/app-finder-frontend/issues/352985400

Desired behaviour: error is not sent to Sentry. Instead, we rely on Smokey and Icinga checks to make sure the site functions.

Intermittent errors with user impact

Publishing applications sometimes see timeouts when talking to publishing-api. This results in the publisher seeing an error page and possibly losing data.

Example: https://sentry.io/govuk/app-content-tagger/issues/367277928

Desired behaviour: apps handle these errors better, for example by offloading the work to a Sidekiq worker. Since these errors aren't actionable, they should not be reported to Sentry. They should be tracked in Graphite.

Intermittent retryable errors

A Sidekiq worker sends something to the publishing-api, which times out. Sidekiq retries, and the next attempt works.

Desired behaviour: errors are not reported to Sentry until retries are exhausted. See this PR for an example.

Relevant: getsentry/sentry-ruby#784
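
The "report only once retries are exhausted" behaviour can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the approach taken in the PR mentioned above: `FakeReporter` and `with_retries` are hypothetical names standing in for the error tracker client and for Sidekiq's own retry mechanism.

```ruby
# Hypothetical stand-in for an error tracker client; this is not the Sentry API.
class FakeReporter
  attr_reader :reported

  def initialize
    @reported = []
  end

  def capture(error)
    @reported << error
  end
end

# Runs the block up to max_attempts times. Intermediate failures are
# retried silently; only the final failure is reported and re-raised.
def with_retries(reporter:, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    retry if attempts < max_attempts
    reporter.capture(e)
    raise
  end
end
```

In a real worker the retry loop belongs to Sidekiq rather than to application code. Sidekiq workers can declare a `sidekiq_retries_exhausted` block, which is one possible place for the final report.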

Expected environment-based errors

MySQL errors on staging while data sync happens.

Example: https://sentry.io/govuk/app-whitehall/issues/343619055

Desired behaviour: our environment is set up such that these errors do not occur.

Bad request errors

User makes a request the application can't handle (example).

Often happens in security checks.

Example: https://sentry.io/govuk/app-frontend/issues/400074979

Desired behaviour: user gets feedback, error is not reported to Sentry.
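
One way to picture this is a Rack-style middleware that turns a "bad request" exception into an HTTP 400 without notifying the error tracker. The sketch below is plain Ruby and makes assumptions: `BadRequestError` and `BadRequestHandler` are hypothetical names, and a real Rails app would more likely use its own exception handling hooks.

```ruby
# Hypothetical error meaning "the request itself is malformed".
class BadRequestError < StandardError; end

# Rack-style middleware: wraps anything that responds to call(env) and
# returns a [status, headers, body] triple.
class BadRequestHandler
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue BadRequestError => e
    # Give the user feedback; deliberately do not notify the error tracker.
    [400, { "Content-Type" => "text/plain" }, ["Bad request: #{e.message}"]]
  end
end
```

Any other exception still propagates, so genuine bugs remain visible.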

Incorrect bubbling up of errors

Rummager crashes on date parsing, returns 422, which raises an error in finder-frontend.

Example: https://sentry.io/govuk/app-finder-frontend/issues/400074507

Desired behaviour: a 4XX response is returned to the browser, including an error message. Nothing is ever sent to Sentry.
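
A minimal sketch of this translation, assuming the API client raises an error that carries the upstream status code. `UpstreamHTTPError`, `render_search` and the `search_client` callable are hypothetical names, and passing the upstream status straight through (rather than always answering 400) is a choice made for this example.

```ruby
# Hypothetical error raised by an API client when the upstream service
# answers with an error status.
class UpstreamHTTPError < StandardError
  attr_reader :status

  def initialize(status, message)
    super(message)
    @status = status
  end
end

# Returns a [status, body] pair. Upstream 4XX errors become a 4XX response
# with a message; anything else is re-raised so it is still noticed.
def render_search(params, search_client:)
  [200, search_client.call(params)]
rescue UpstreamHTTPError => e
  raise unless (400..499).cover?(e.status)

  [e.status, "Sorry, we could not understand that search: #{e.message}"]
end
```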

Manually logged errors

Something goes wrong and we need to let developers know.

Example: Slimmer's old behaviour

Desired behaviour: developers do not use Sentry for logging. The app either raises the actual error (which causes the user to see the error) or logs the error to Kibana.
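
The logging alternative might look like the sketch below, which uses Ruby's standard `Logger`. It assumes that whatever the app writes to its log is shipped to Kibana by the surrounding infrastructure; `log_problem` and the JSON line format are hypothetical choices for illustration.

```ruby
require "json"
require "logger"

# Writes one structured warning line to the given logger instead of
# sending a notification to the error tracker.
def log_problem(logger, message, context = {})
  logger.warn(JSON.generate({ message: message }.merge(context)))
end

# Example: log to standard output.
logger = Logger.new($stdout)
log_problem(logger, "Template could not be fetched", template: "wrapper")
```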

IP spoof errors

Rails reports ActionDispatch::RemoteIp::IpSpoofAttackError.

Example: https://sentry.io/govuk/app-service-manual-frontend/issues/365951370

Desired behaviour: HTTP 400 is returned, error is not reported to Sentry.

Database entry not found

Often a controller will do something like Thing.find(params[:id]) and rely on Rails to show a 404 page for the ActiveRecord::RecordNotFound it raises (context).

Desired behaviour: errors are not reported to Sentry.
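
Several of the cases above come down to "this exception class should never be reported". A minimal sketch of such a filter follows. `FilteringReporter` is a hypothetical class, not the Sentry client, and `RecordNotFound` is a local stand-in for `ActiveRecord::RecordNotFound` so that the example runs without Rails.

```ruby
# Stand-in for ActiveRecord::RecordNotFound so the sketch runs without Rails.
class RecordNotFound < StandardError; end

# Hypothetical reporter that drops errors whose class is on an ignore list.
class FilteringReporter
  attr_reader :sent

  def initialize(ignored:)
    @ignored = ignored
    @sent = []
  end

  # Returns true if the error was recorded, false if it was filtered out.
  def capture(error)
    return false if @ignored.any? { |klass| error.is_a?(klass) }

    @sent << error
    true
  end
end
```

Sentry's Ruby client has an `excluded_exceptions` configuration setting that serves a similar purpose; check the client documentation for the exact form before relying on it.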