Decision records
This page is a record of some of the decisions we've made.
We also have separate documents for individual decisions. If a decision is long and complex or there are lots of aspects to discuss, a separate document may be more appropriate. Otherwise, document the key points here.
Based on Leo's spike into both bots (PyUp and Dependabot), the discussions with the team, and Pay's current workflow, the following decisions have been made:
- We will continue using PyUp for Python-related dependencies, and we will use Dependabot for the rest, including Node packages and Terraform. There will be a card to make sure all repos are consistent on this.
- We will clear out the backlog of dependency updates to have a clean slate.
- For three months, we will trial a dependency management system similar to what GOV.UK Pay has: we will try to update all dependencies as soon as the bots tell us about the updates. This will be handled by the person on support and treated like any other support ticket - that is, if the person on support is too busy to merge and deploy the PRs, they should ask other team members for help. After three months, if it turns out to be too full-on, we can iterate and decide to do something else. If we can handle it like Pay does, that puts us in a good position where our dependency stack is up to date as we go. If this trial proves to be overwhelming, we can cut it short.
- For Docker images and API clients, we maintain the current processes. That is, we use the “latest” tag for the former, and keep dependencies unpinned where we can for the latter.
Continue to use the AWS SES global suppression list rather than the account level suppression list - 7/2/2022
AWS SES have introduced a new account level suppression list to replace the use of the global suppression list. For the moment, we will continue to use the global suppression list and not opt in for the account level suppression list.
Email addresses that go on the global suppression list will be removed within 14 days. We can no longer remove email addresses sooner ourselves, which is a problem for the few cases we get of emails incorrectly being placed on the global suppression list. We get one or two support tickets a month from users not getting our emails because they are on the global suppression list when they don't expect to be.
We are cautious about using the account level suppression list, as we will not get any automatic removal after 14 days. We are worried that this may require more support and a process for dealing with any email address that ends up on the account level suppression list (correctly or incorrectly). We don't currently have a realistic idea of how many users are on the global suppression list when they shouldn't be.
For the moment, we will tolerate the impact on users of not being able to remove them from the global suppression list ourselves, and will ask them to wait a few days before trying again.
We may want to revisit this decision in the future when we are able to give it more thought and investigation.
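For reference, if we did opt in to the account level suppression list, removing an address ourselves would look roughly like the sketch below, using boto3's SES v2 client. The email address and region are placeholders, not our real configuration.

```python
# Sketch only: how we could inspect and remove an address from the SES
# account level suppression list if we ever opted in to it.
# The email address and region below are placeholders.
import boto3

ses = boto3.client("sesv2", region_name="eu-west-1")

# Check why the address was suppressed (BOUNCE or COMPLAINT)
details = ses.get_suppressed_destination(EmailAddress="user@example.com")
print(details["SuppressedDestination"]["Reason"])

# Remove it so we can send to the user again straight away
ses.delete_suppressed_destination(EmailAddress="user@example.com")
```

This is the manual-removal ability we would gain by switching, weighed against losing the automatic 14-day removal described above.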
Note: this is retrospective documentation of a past decision. More details can be found in this discussion PR.
We use Cronitor to monitor that infrequent but critical tasks have run as expected, e.g. collate-letter-pdfs-to-be-sent must run on schedule. While we usually get an alert if a task logs an exception, Cronitor covers us if a task fails silently, which is usually due to a deployment or an instance getting recycled.
We don't add Cronitor to more frequent tasks. Adding a task requires extra config in -credentials and costs more. There's also a risk of reducing the value of Cronitor alerts if we alert on all tasks indiscriminately, since more frequent tasks naturally recur without needing any action from us.
We did consider the potential for Cronitor to cover us for schedule / scheduler bugs, but we should get alerts about these together with other errors. We don't add Cronitor for daily -alert- tasks (e.g. check-if-letters-still-in-created) - although these are critical, they are really temporary until we have more timely alerts.
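As a rough illustration of the pattern (not our exact implementation), monitoring a task with Cronitor amounts to pinging a per-monitor telemetry URL when the task starts, completes, or fails, so Cronitor can alert if a scheduled run never happens. The monitor key and URL shape below are placeholders.

```python
# Sketch of the Cronitor pattern: ping at the start and end of the task
# so a silent failure (no "complete" ping) still raises an alert.
# The telemetry URL and monitor key are placeholders.
import requests

CRONITOR_URL = "https://cronitor.link/p/<api-key>/collate-letter-pdfs-to-be-sent"


def collate_letter_pdfs_to_be_sent():
    requests.get(CRONITOR_URL, params={"state": "run"}, timeout=5)
    try:
        ...  # the actual task body
    except Exception:
        requests.get(CRONITOR_URL, params={"state": "fail"}, timeout=5)
        raise
    requests.get(CRONITOR_URL, params={"state": "complete"}, timeout=5)
```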
We have decided to stop using Dockerhub in favour of ghcr for the following reasons:
- Cost: Notify has a Dockerhub "Pro" account, which costs $60/year and used to allow up to 50,000 pulls/day [link]. Back in May, Dockerhub notified us that we were averaging ~56,000 pulls/day and that we should either reduce our usage or pay $18,000 for a "service account", which would increase our allowance to 150,000 pulls/day. In contrast, ghcr is already provided as part of our enterprise agreement with Github without any limit on the number of pulls.
- Security: Since ghcr is part of Github, we can take advantage of the same access controls as with the rest of our repos, meaning that we will have one less thing to manage.
As a result of this decision, we should keep an eye out for any images that we still pull from Dockerhub and use this pipeline to copy them to ghcr instead.
We have decided we will not have cookies on gov.uk/alerts for the following reasons
- We don't have a need for Google Analytics. We think we can get requests per second and a breakdown of stats by URL through Fastly, which will be sufficient
- We don't have any other use cases that would require cookies
- We don't want to damage users' trust in the emergency alerts by making them wonder whether we are tracking them
- We don't want users to see a big cookie banner before the important content telling them what to do to potentially save their life
This decision could be reversed if we discover new needs for cookies and feel their benefits outweigh the downsides of having them
We have decided for the interim to host gov.uk/alerts as a static website in S3 for the following reasons
- We think this is one of the quickest ways to get a website created
- We think it will be relatively easy to build and run
- We think S3 will offer us the ability to handle high load (although Fastly will protect us from most of it)
- We think it will be more reliable than GOV.UK PaaS
- We get easy built-in security permissions and access logging in AWS that match the rest of our broadcast infrastructure for free, without needing to replicate them in a different environment like the PaaS
This decision is likely one to last a few months, until we find that we outgrow this solution. When we do more exploration into publishing alerts in real time, our infrastructure needs will become more complex and we should be open to completely changing this decision if we see fit. The team has already discussed some of the many alternatives we may have at that point.
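For context, a minimal sketch of what the static hosting setup could involve is below (Fastly sits in front of the bucket). The bucket name, region and file names are illustrative, not our real configuration.

```python
# Sketch only: enabling static website hosting on an S3 bucket and
# uploading a page. Bucket name, region and keys are illustrative.
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")
bucket = "govuk-alerts-static-example"

# Serve index.html by default and a custom error page for missing keys
s3.put_bucket_website(
    Bucket=bucket,
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "404.html"},
    },
)

# Upload a generated page with the right content type
s3.upload_file(
    "build/index.html", bucket, "index.html",
    ExtraArgs={"ContentType": "text/html"},
)
```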