
Optimize Autoscaling Configuration #1357

Open
erichfi opened this issue Jun 14, 2023 · 3 comments

erichfi commented Jun 14, 2023

User Story:

As a backend developer,
I want to optimize the autoscaling configuration,
So that it responds more quickly when the average CPU usage exceeds the threshold.

Acceptance Criteria

GIVEN the average CPU usage has been above the threshold for 30 seconds,
WHEN the autoscaling checks the CPU usage,
THEN it should initiate the scaling process immediately.

Product & Design Links:

N/A

Tech Details:

This task will involve adjusting the autoscaling configuration to use more responsive metrics, potentially including request volume.
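
For illustration only, here is a minimal sketch of the knob this touches, assuming the service scales via AWS Application Auto Scaling target tracking on ECS; the names and values below are placeholders, not the current prod config:

```python
# Hypothetical sketch, not the actual Passport infra code: a target-tracking
# policy on ECS service CPU with a short scale-out cooldown, so new tasks
# start soon after average CPU crosses the target. Names/values are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scaling_policy(
    PolicyName="scorer-api-cpu-target-tracking",      # placeholder policy name
    ServiceNamespace="ecs",
    ResourceId="service/scorer-cluster/scorer-api",   # placeholder cluster/service
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 30.0,      # placeholder CPU target (%)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": 30,   # seconds: react quickly to sustained high CPU
        "ScaleInCooldown": 300,   # seconds: scale in more conservatively
    },
)
```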

Open Questions:

Is the CPU usage the most effective metric for our autoscaling needs, or should we consider alternatives like request volume?

Notes/Assumptions:

This feature assumes that faster autoscaling will better handle sudden surges in traffic and prevent system overloads.

@erichfi erichfi added this to Passport Jun 14, 2023
@erichfi erichfi converted this from a draft issue Jun 14, 2023
@tim-schultz tim-schultz self-assigned this Jun 14, 2023
@tim-schultz tim-schultz moved this from Backlog to In Progress (WIP) in Passport Jun 14, 2023
@tim-schultz tim-schultz moved this from In Progress (WIP) to Backlog in Passport Jun 14, 2023
@tim-schultz tim-schultz removed their assignment Jun 14, 2023

tim-schultz commented Jun 14, 2023

All tests were run with 1000 users over 30 minutes on staging, against the latest prod release.

We currently have the Scale-out cooldown period set to 300 seconds in prod.

Max of 20 tasks for api and each worker - 300 seconds for Scale-out cooldown period
 ✗ is status 200
  ↳  65% — ✓ 162173 / ✗ 86385

http_req_failed................: 34.75% ✓ 86385 ✗ 162173

Max of 30 tasks for api and each worker - 30 seconds for Scale-out cooldown period

 ✗ is status 200
  ↳  94% — ✓ 355560 / ✗ 19019
 ✗ is status 502
  ↳  5% — ✓ 19006 / ✗ 355573

http_req_failed................: 5.07% ✓ 19019 ✗ 355560

Max of 30 tasks for api and each worker - 300 seconds for Scale-out cooldown period

 ✗ is status 200
  ↳  81% — ✓ 175774 / ✗ 38711
 ✗ is status 502
  ↳  18% — ✓ 38672 / ✗ 175813

http_req_failed................: 18.04% ✓ 38711 ✗ 175774

If we lower the Scale-out cooldown period, tasks spin up a lot quicker and everything seems to recover faster as well. It seems like if we keep a high limit on tasks, we should be able to scale a good amount quicker 🚀
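
The cooldown itself lives on the scaling policy (as in the sketch in the issue description above); the task ceiling is set on the scalable target. A hedged sketch of raising that ceiling, with placeholder names and counts:

```python
# Hypothetical sketch: raising the task ceiling so scale-out has headroom.
# Cluster/service names and counts are placeholders, not the prod values.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/scorer-cluster/scorer-api",   # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,    # placeholder floor
    MaxCapacity=30,   # the "max of 30 tasks" case from the results above
)
```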

@tim-schultz tim-schultz self-assigned this Jun 14, 2023
@tim-schultz tim-schultz moved this from Backlog to In Progress (WIP) in Passport Jun 14, 2023

tim-schultz commented Jun 15, 2023

All tests were started with the following task counts:
scorer-api: 4
passport worker: 2
registry worker: 2

Additional tests:

1. Scaling based on CPU
   https://app.k6.io/runs/1806539 - Max Tasks 40
   Scaling Settings: ECSServiceAverageCPUUtilization: Target Value: 30, Scale-out cooldown period: 30

2. Scaling based on CPU and request count
   https://app.k6.io/runs/1806718 - Max Tasks 40
   Scaling Settings: ECSServiceAverageCPUUtilization: Target Value: 30, Scale-out cooldown period: 30
   ALBRequestCountPerTarget: Target Value: 250, Scale-out cooldown period: 20

Compared to # 1, test # 2 showed:
- 10.5% increase in requests made
- 43% decrease in failed requests

Comparing # 2 to the last k6 test described here, we were able to process 23% more requests and decrease the error rate by a factor of 4.

Suggested Settings:
It seems like # 2 is the best option. Here are the settings that were used for that test:

Scorer Service:
[screenshot]
[screenshot]

Passport Worker:
[screenshot]

Registry Worker:
[screenshot]

In terms of scaling in, the settings should remain the same and tasks should spin down to the current minimums in a similar manner.
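
Since the settings above are screenshots, here is a hedged sketch of roughly what option # 2 amounts to in Application Auto Scaling terms: two target-tracking policies on the same ECS service. The cluster/service names and the ALB resource label are placeholders, and anything not stated above (e.g. scale-in cooldowns) is assumed to keep its existing value:

```python
# Hypothetical sketch of option # 2: CPU target tracking plus an
# ALB request-count-per-target policy on the same ECS service.
# Names, the ALB resource label, and unstated values are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")

SERVICE = "service/scorer-cluster/scorer-api"                    # placeholder
ALB_LABEL = "app/scorer-alb/.../targetgroup/scorer-api-tg/..."   # placeholder

# Policy 1: average CPU utilization, target 30, scale-out cooldown 30s.
autoscaling.put_scaling_policy(
    PolicyName="scorer-api-cpu",
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 30.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": 30,
    },
)

# Policy 2: ALB requests per target, target 250, scale-out cooldown 20s.
# With multiple policies, the service scales out to whichever calls for
# the most capacity.
autoscaling.put_scaling_policy(
    PolicyName="scorer-api-request-count",
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 250.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": ALB_LABEL,
        },
        "ScaleOutCooldown": 20,
    },
)
```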


nutrina commented Jun 15, 2023

Looks great

@tim-schultz tim-schultz moved this from In Progress (WIP) to Done in Passport Jun 20, 2023