
Adjust reserved concurrency limit for fetch Lambda #71

Merged 1 commit into master from jw-lambda-concurrency on Jun 16, 2022

Conversation

@jacobwinch (Contributor) commented on Jun 14, 2022

What does this change?

tl;dr

This PR allows the Lambda which fetches saved articles to scale more liberally in order to avoid unreliable deployments.

Longer version

AWS Lambda provides a couple of different ways of controlling concurrency. In the Mobile account, it is common to use reserved concurrency.

Reserving concurrency has the following effects.

Other functions can't prevent your function from scaling – All of your account's functions in the same Region without reserved concurrency share the pool of unreserved concurrency. Without reserved concurrency, other functions can use up all of the available concurrency. This prevents your function from scaling up when needed.

Your function can't scale out of control – Reserved concurrency also limits your function from using concurrency from the unreserved pool, which caps its maximum concurrency. You can reserve concurrency to prevent your function from using all the available concurrency in the Region, or from overloading downstream resources.
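
For context, this is roughly how a reserved concurrency limit is declared when a stack is defined with AWS CDK. This is only a sketch: the actual infrastructure in this repo may be defined differently, and the construct name, runtime and limit below are placeholders rather than the real values.

```ts
import { App, Duration, Stack } from "aws-cdk-lib";
import { Code, Function as LambdaFunction, Runtime } from "aws-cdk-lib/aws-lambda";

class FetchStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    // 'reservedConcurrentExecutions' guarantees this much concurrency for the
    // function and, at the same time, caps the function at that value.
    new LambdaFunction(this, "FetchSavedArticlesLambda", {
      runtime: Runtime.NODEJS_18_X,
      handler: "index.handler",
      code: Code.fromAsset("dist"),
      timeout: Duration.seconds(30),
      reservedConcurrentExecutions: 100, // illustrative value, not the real limit
    });
  }
}

new FetchStack(new App(), "FetchSavedArticles");
```

The reserved concurrency value is the setting this PR raises, however it is expressed in the actual stack definition; everything else about the function is unchanged.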

It looks like we introduced reserved concurrency to prevent one of these Lambdas from impacting other services (see #42 and #43 for history). These limits were likely correct at the time, but they are now regularly causing this service to be throttled (N.B. throttling results in us serving 5XXs to clients):

[image: CloudWatch graph showing the fetch Lambda being throttled]

This problem is particularly noticeable when the project is deployed. I presume this is because the execution times immediately after deployment are much slower (due to cold starts), so we need more concurrent executions for a short period in order to deal with the volume of requests.

I'm adjusting the limit in an attempt to avoid these problems.

How to test

There isn't a great way to test this; we'll need to ship to PROD and monitor things.

How can we measure success?

Hopefully this new limit is sufficient to allow deployments to run without impacting reliability.
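
One concrete signal is the function's `Throttles` metric in the `AWS/Lambda` CloudWatch namespace, which should stay at zero during and after deployments. Below is a minimal sketch of pulling that metric with the AWS SDK for JavaScript v3; the region and function name are placeholders, not the real ones.

```ts
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

// Sum the Lambda throttles over the last 24 hours, one datapoint per hour.
const client = new CloudWatchClient({ region: "eu-west-1" }); // placeholder region

async function throttlesLast24h(functionName: string): Promise<void> {
  const now = new Date();
  const result = await client.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/Lambda",
      MetricName: "Throttles",
      Dimensions: [{ Name: "FunctionName", Value: functionName }],
      StartTime: new Date(now.getTime() - 24 * 60 * 60 * 1000),
      EndTime: now,
      Period: 3600,
      Statistics: ["Sum"],
    })
  );
  for (const dp of result.Datapoints ?? []) {
    console.log(dp.Timestamp, dp.Sum);
  }
}

throttlesLast24h("fetch-saved-articles-PROD"); // placeholder function name
```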

Have we considered potential risks?

There are a couple of risks here:

1. As we increase the amount of reserved concurrency used by this Lambda, we shrink the pool of concurrency that is available to Lambdas which use unreserved concurrency.

However:

  • our unreserved concurrency currently accounts for a substantial proportion of the account limit:

[image: account concurrency breakdown showing the unreserved pool as a substantial proportion of the account limit]

  • other core Lambda-based services (e.g. notifications) use reserved concurrency anyway, so this change does not affect them

  • our max unreserved concurrent executions metric suggests that we are well below the throttling limit for this category:

[image: max unreserved concurrent executions metric, well below the throttling limit]

2. We are hardcoding limits and they are likely to go out of date again.

We have a couple of options to mitigate this:

a) Swap to using unreserved concurrency
b) Configure alerting so that we notice if this type of problem happens again (N.B. this was also discussed here); a sketch of one possible alarm follows below
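
For option b), one possibility is a CloudWatch alarm on the function's `Throttles` metric. This is a sketch under the assumption that the stack is defined with AWS CDK and that `fetchLambda` is the `Function` construct for this Lambda; the alarm name, period and threshold are placeholders.

```ts
import { Duration } from "aws-cdk-lib";
import { Function as LambdaFunction } from "aws-cdk-lib/aws-lambda";
import {
  Alarm,
  ComparisonOperator,
  TreatMissingData,
} from "aws-cdk-lib/aws-cloudwatch";

// Alarm as soon as the fetch Lambda records any throttles in a 5-minute window.
function addThrottleAlarm(fetchLambda: LambdaFunction): Alarm {
  return new Alarm(fetchLambda, "FetchLambdaThrottleAlarm", {
    metric: fetchLambda.metricThrottles({
      period: Duration.minutes(5),
      statistic: "Sum",
    }),
    threshold: 0,
    comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
    evaluationPeriods: 1,
    treatMissingData: TreatMissingData.NOT_BREACHING,
  });
}
```

An alarm on the client-facing 5XX rate would catch the symptom even more directly; either way, the alarm needs to be wired to a notification target (e.g. an SNS topic) to be useful.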

@DavidLawes (Contributor) left a comment

Thanks for all the information you've provided here Jacob :)

At the end of the PR description there were 2 mitigations we could put in place to reduce/control the risk (move all lambdas to unreserved concurrency and put monitoring in place). Do you think it's worth reviewing and/or implementing these mitigations? If yes, we'll create tickets for the MSS team to look at them.

@jacobwinch (Contributor, Author)

Thanks for the review @DavidLawes!

Do you think it's worth reviewing and/or implementing these mitigations? If yes, we'll create tickets for the MSS team to look at them.

I'd suggest creating a card for 2b (and I'd be happy to pair on that when it is prioritised, if helpful). On reflection I'm not sure that 2a is advisable anyway because it'd allow this Lambda to stop other services from operating (by using up the whole pool), so it is probably worse for reliability overall.

@jacobwinch jacobwinch merged commit cf95e09 into master Jun 16, 2022
@jacobwinch jacobwinch deleted the jw-lambda-concurrency branch June 16, 2022 08:46
@DavidLawes (Contributor)

Thanks Jacob :) I've created a task to add the 5XX alerting; MSS will reach out for pairing/assistance on this. Thanks again!
