Decision records

This page is a record of some of the decisions we've made.

We also have separate documents for individual decisions. If a decision is long and complex or there are lots of aspects to discuss, a separate document may be more appropriate. Otherwise, document the key points here.

Trial a new dependency management process - 17/3/2022

Based on a spike into dependency management bots (PyUp and Dependabot), discussions with the team, and GOV.UK Pay's current workflow, the following decisions have been made:

  1. We will continue using PyUp for Python-related dependencies, and we will use Dependabot for everything else, including Node packages and Terraform. There will be a card to make sure all repos are consistent on this.

  2. We will clear out the backlog of dependency updates to have a clean slate.

  3. For three months, we will trial a dependency management process similar to GOV.UK Pay's: we will try to update all dependencies as soon as the bots tell us about them. This will be handled by the person on support and treated like any other support ticket - that is, if the person on support is too busy to merge and deploy the PRs, they should ask other team members for help. After three months, if it turns out to be too full-on, we can iterate and decide to do something else; if the trial proves overwhelming sooner, we can cut it short. If we can handle it like Pay does, that puts us in a good position where our dependency stack stays up to date as we go.

For Docker images and API clients, we maintain the current processes. That is, we use the "latest" tag for the former, and keep dependencies unpinned where we can for the latter.

More info and links in here

Continue to use the AWS SES global suppression list rather than the account level suppression list - 7/2/2022

AWS SES has introduced a new account-level suppression list to replace the use of the global suppression list. For the moment, we will continue to use the global suppression list and not opt in to the account-level suppression list.

Email addresses that go on the global suppression list are removed within 14 days. We can no longer remove email addresses sooner ourselves, which is a problem for the few cases we see of emails being placed on the global suppression list incorrectly. We get one or two support tickets a month from users not getting our emails because they don't expect to be on the global suppression list.

We are cautious about using the account-level suppression list because we would not get any automatic removal after 14 days. We are worried this may require more support work, and a process for dealing with any email address that ends up on the account-level suppression list (correctly or incorrectly). We don't currently have a realistic idea of how many users are on the global suppression list when they shouldn't be.
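
To make the tradeoff concrete, self-service removal is what the account-level list would give us. A minimal boto3 sketch, assuming the SESv2 API and a hypothetical example address:

```python
import boto3

ses = boto3.client("sesv2")

# List addresses currently on the account-level suppression list
paginator = ses.get_paginator("list_suppressed_destinations")
for page in paginator.paginate():
    for summary in page["SuppressedDestinationSummaries"]:
        print(summary["EmailAddress"], summary["Reason"])

# Remove an address that was suppressed incorrectly (hypothetical address)
ses.delete_suppressed_destination(EmailAddress="user@example.com")
```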

For the moment, we will tolerate the impact on users of not being able to remove them from the global suppression list ourselves, and will ask them to wait a few days before trying again.

We may want to revisit this decision in the future when we are able to give it more thought and investigation.

Use Cronitor to monitor daily tasks - 7/12/2021

Note: this is retrospective documentation of a past decision. More details can be found in this discussion PR.

We use Cronitor to monitor that infrequent but critical tasks have run as expected, e.g. collate-letter-pdfs-to-be-sent must run on schedule. While we usually get an alert if a task logs an exception, Cronitor covers us if a task fails silently, which is usually due to a deployment or an instance being recycled.
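
As an illustration, instrumenting a task amounts to pinging Cronitor around the task run. A minimal sketch using Cronitor's telemetry pings, where the API key and monitor URL are placeholders rather than our real config:

```python
import requests

# Telemetry endpoint for one monitor; the API key here is a placeholder
PING_URL = "https://cronitor.link/p/<api-key>/collate-letter-pdfs-to-be-sent"

def run_with_cronitor(task):
    """Run a task, telling Cronitor when it starts, completes or fails."""
    requests.get(PING_URL, params={"state": "run"}, timeout=10)
    try:
        task()
    except Exception:
        requests.get(PING_URL, params={"state": "fail"}, timeout=10)
        raise
    requests.get(PING_URL, params={"state": "complete"}, timeout=10)
```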

We don't add Cronitor to more frequent tasks. Adding a task requires extra config in -credentials and costs more. There's also a risk of diluting the value of Cronitor alerts if we alert on all tasks indiscriminately, since a missed run of a frequent task is usually followed soon after by a successful one, without needing any action from us.

We did consider the potential for Cronitor to cover us for schedule / scheduler bugs, but we should get alerts about these together with other errors. We don't add Cronitor for daily *alert* tasks (e.g. check-if-letters-still-in-created) - although these are critical, they are really temporary until we have more timely alerts.

Use GitHub container registry (ghcr) instead of Dockerhub - 6/12/2021

We have decided to stop using Dockerhub in favour of ghcr for the following reasons:

  • Cost: Notify has a Dockerhub "Pro" account, which costs $60/year and used to allow up to 50,000 pulls/day [link]. Back in May, Dockerhub notified us that we were averaging ~56,000 pulls/day and that we should either reduce our usage or pay $18,000 for a "service account", which would increase our allowance to 150,000 pulls/day. In contrast, ghcr is already provided as part of our enterprise agreement with GitHub, without any limit on the number of pulls.
  • Security: Since ghcr is part of GitHub, we can take advantage of the same access controls as the rest of our repos, meaning we will have one less thing to manage.

Consequences:

As a result of this decision, we should keep an eye out for any images that we still pull from Dockerhub and use this pipeline to copy them to ghcr instead.
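
The pipeline essentially automates a pull, retag and push. For illustration only, the manual equivalent looks roughly like the sketch below (the image name and ghcr path are hypothetical):

```python
import subprocess

SOURCE = "postgres:13"                   # hypothetical image currently pulled from Dockerhub
TARGET = "ghcr.io/alphagov/postgres:13"  # hypothetical destination on ghcr

# Pull from Dockerhub, retag for ghcr, then push
for command in (
    ["docker", "pull", SOURCE],
    ["docker", "tag", SOURCE, TARGET],
    ["docker", "push", TARGET],
):
    subprocess.run(command, check=True)
```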

No cookies on gov.uk/alerts - 3/3/2021

We have decided we will not have cookies on gov.uk/alerts for the following reasons:

  • We don't have a need for Google Analytics. We think we can get request-per-second stats broken down by URL through Fastly, which will be sufficient
  • We don't have any other use cases that would require cookies
  • We don't want to damage users' trust in the emergency alerts by making them wonder whether we are tracking them
  • We don't want users to see a big cookie banner before the important content telling them what to do to potentially save their lives

This decision could be reversed if we discover new needs for cookies and feel their benefits outweigh the downsides of having them.

Host gov.uk/alerts as a S3 static site - 3/3/2021

We have decided, for the interim, to host gov.uk/alerts as a static website in S3 for the following reasons:

  • We think this is one of the quickest ways to get a website created
  • We think it will be relatively easy to build and run
  • We think S3 will offer us the ability to handle high load (although Fastly will protect us from most of it)
  • We think it will be more reliable than GOV.UK PaaS
  • We get easy built-in security permissions and access logging in AWS for free, matching the rest of our broadcast infrastructure, without needing to replicate them in a different environment like the PaaS

This decision is likely to last only a few months, until we find that we outgrow this solution. When we do more exploration into publishing alerts in real time, our infrastructure needs will become more complex, and we should be open to completely changing this decision if we see fit. The team has already discussed some of the many alternatives we may have at that point.
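
For a rough sense of what the hosting setup involves, enabling static website hosting on a bucket is a single configuration call. A minimal boto3 sketch, with a hypothetical bucket name and document keys (however we actually provision it, e.g. via Terraform):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "govuk-alerts-static-site"  # hypothetical bucket name

# Serve the bucket as a static website with index and error pages
s3.put_bucket_website(
    Bucket=BUCKET,
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "404.html"},
    },
)
```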

Use Pass / GPG to provide Terraform secrets - 1/1/2022

We decided to use a new Terraform provider to retrieve secrets from our Pass secret store since the original provider no longer works on new Mac M1 machines.

Terraform needs secrets for some of the resources it manages, e.g.

The new Pass provider is not backwards compatible: we now need to remember to set a PASSWORD_STORE_DIR environment variable when running any Terraform that uses the provider. We also suspect the new provider won't be maintained - the author has disabled GitHub issues.
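
In practice this means wrapping every Terraform invocation so the variable is set. A minimal sketch of the idea, assuming a hypothetical store path - exporting the variable in a shell achieves the same thing:

```python
import os
import subprocess

# PASSWORD_STORE_DIR must point at our Pass store; the path here is hypothetical
env = dict(os.environ, PASSWORD_STORE_DIR=os.path.expanduser("~/.password-store"))

subprocess.run(["terraform", "plan"], env=env, check=True)
```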

Given these problems, we investigated some alternatives:

  • Use AWS Secrets Manager to store secrets in each environment (see the sketch after this list). This works, but leads to some small duplication of secrets between environments; storing secrets in multiple locations also makes it harder to search for secrets globally if we need to change them.

  • Sync Pass secrets with AWS Secrets Manager in each environment. We already have to do this for Concourse itself, but the link is invisible, and the extra infra required to achieve it is excessive when it's still possible to use Pass directly for our Terraform use case.

  • Use a single AWS Secrets Manager accessible from all environments. This isn't compatible with the current - totally locked down - security model we have for broadcasts infra and would require equivalent controls. We would also have to namespace secrets where they differ by environment.

  • Don't copy / paste secrets in the first place. This is possible between AWS resources but not for external systems like PagerDuty. This approach also doesn't scale to PaaS apps, although PaaS may in future at least introduce a managed service for credentials.
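
For reference, the first alternative above (per-environment AWS Secrets Manager) would look roughly like the sketch below; the secret name and value are hypothetical, and each environment would hold its own copy:

```python
import boto3

secrets = boto3.client("secretsmanager")

# Each environment stores its own copy of the secret (hypothetical name)
secrets.create_secret(
    Name="notify/pagerduty-api-key",
    SecretString="<the secret value>",
)

# Terraform (or any script) can then read it back within that environment
value = secrets.get_secret_value(SecretId="notify/pagerduty-api-key")["SecretString"]
```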

Short version: as far as we know, there's no good solution to managing the variety of application secrets we have.
