
Optimize Autoscaling Configuration #1357

Open
erichfi opened this issue Jun 14, 2023 · 3 comments

erichfi commented Jun 14, 2023

User Story:

As a backend developer,
I want to optimize the autoscaling configuration,
So that it responds more quickly when the average CPU usage exceeds the threshold.

Acceptance Criteria

GIVEN the average CPU usage has been above the threshold for 30 seconds,
WHEN the autoscaling checks the CPU usage,
THEN it should initiate the scaling process immediately.

Product & Design Links:

N/A

Tech Details:

This task will involve adjusting the autoscaling configuration to use more responsive metrics, potentially including request volume.
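
For illustration only, here is a minimal sketch of the knob this touches, assuming the service scales via AWS Application Auto Scaling target tracking on ECS; the names and values below are placeholders, not the current prod config:

```python
# Hypothetical sketch, not the actual Passport infra code: a target-tracking
# policy on ECS service CPU with a short scale-out cooldown, so new tasks
# start soon after average CPU crosses the target. Names/values are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scaling_policy(
    PolicyName="scorer-api-cpu-target-tracking",      # placeholder policy name
    ServiceNamespace="ecs",
    ResourceId="service/scorer-cluster/scorer-api",   # placeholder cluster/service
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 30.0,      # placeholder CPU target (%)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": 30,   # seconds: react quickly to sustained high CPU
        "ScaleInCooldown": 300,   # seconds: scale in more conservatively
    },
)
```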

Open Questions:

Is the CPU usage the most effective metric for our autoscaling needs, or should we consider alternatives like request volume?

Notes/Assumptions:

This feature assumes that faster autoscaling will better handle sudden surges in traffic and prevent system overloads.

@erichfi erichfi added this to Passport Jun 14, 2023
@erichfi erichfi converted this from a draft issue Jun 14, 2023
@tim-schultz tim-schultz self-assigned this Jun 14, 2023
@tim-schultz tim-schultz moved this from Backlog to In Progress (WIP) in Passport Jun 14, 2023
@tim-schultz tim-schultz moved this from In Progress (WIP) to Backlog in Passport Jun 14, 2023
@tim-schultz tim-schultz removed their assignment Jun 14, 2023

tim-schultz commented Jun 14, 2023

All tests were run with 1000 users over 30 minutes on staging, against the latest prod release.

We currently have the Scale-out cooldown period set to 300 seconds in prod.

Max of 20 tasks for api and each worker - 300 seconds for Scale-out cooldown period
 ✗ is status 200
  ↳  65% — ✓ 162173 / ✗ 86385

http_req_failed................: 34.75% ✓ 86385 ✗ 162173

Max of 30 tasks for api and each worker - 30 seconds for Scale-out cooldown period

 ✗ is status 200
  ↳  94% — ✓ 355560 / ✗ 19019
 ✗ is status 502
  ↳  5% — ✓ 19006 / ✗ 355573

http_req_failed................: 5.07% ✓ 19019 ✗ 355560

Max of 30 tasks for api and each worker - 300 seconds for Scale-out cooldown period

 ✗ is status 200
  ↳  81% — ✓ 175774 / ✗ 38711
 ✗ is status 502
  ↳  18% — ✓ 38672 / ✗ 175813

http_req_failed................: 18.04% ✓ 38711 ✗ 175774

If we lower the Scale-out cooldown period, tasks spin up a lot quicker and everything seems to recover faster as well. It seems like if we keep a high limit on tasks, we should be able to scale a good amount quicker 🚀
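
The cooldown itself lives on the scaling policy (as in the sketch in the issue description above); the task ceiling is set on the scalable target. A hedged sketch of raising that ceiling, with placeholder names and counts:

```python
# Hypothetical sketch: raising the task ceiling so scale-out has headroom.
# Cluster/service names and counts are placeholders, not the prod values.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/scorer-cluster/scorer-api",   # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,    # placeholder floor
    MaxCapacity=30,   # the "max of 30 tasks" case from the results above
)
```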

@tim-schultz tim-schultz self-assigned this Jun 14, 2023
@tim-schultz tim-schultz moved this from Backlog to In Progress (WIP) in Passport Jun 14, 2023

tim-schultz commented Jun 15, 2023

All tests were started with the following task counts:
scorer-api: 4
passport worker: 2
registry worker: 2

Additional tests:

1. Scaling based on CPU
   https://app.k6.io/runs/1806539 - Max Tasks 40
   Scaling Settings: ECSServiceAverageCPUUtilization: Target Value: 30, Scale-out cooldown period: 30

2. Scaling based on CPU and request count
   https://app.k6.io/runs/1806718 - Max Tasks 40
   Scaling Settings: ECSServiceAverageCPUUtilization: Target Value: 30, Scale-out cooldown period: 30
   ALBRequestCountPerTarget: Target Value: 250, Scale-out cooldown period: 20

Compared to # 1, test # 2 showed:
- 10.5% increase in requests made
- 43% decrease in failed requests

Comparing # 2 to the last k6 test described here, we were able to process 23% more requests and decrease the error rate by a factor of 4.

Suggested Settings:
It seems like # 2 is the best option. Here are the settings that were used for that test:

Scorer Service:
[screenshot]
[screenshot]

Passport Worker:
[screenshot]

Registry Worker:
[screenshot]

In terms of scaling in, the settings should remain the same and tasks should spin down to the current minimums in a similar manner.
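
Since the settings above are screenshots, here is a hedged sketch of roughly what option # 2 amounts to in Application Auto Scaling terms: two target-tracking policies on the same ECS service. The cluster/service names and the ALB resource label are placeholders, and anything not stated above (e.g. scale-in cooldowns) is assumed to keep its existing value:

```python
# Hypothetical sketch of option # 2: CPU target tracking plus an
# ALB request-count-per-target policy on the same ECS service.
# Names, the ALB resource label, and unstated values are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")

SERVICE = "service/scorer-cluster/scorer-api"                    # placeholder
ALB_LABEL = "app/scorer-alb/.../targetgroup/scorer-api-tg/..."   # placeholder

# Policy 1: average CPU utilization, target 30, scale-out cooldown 30s.
autoscaling.put_scaling_policy(
    PolicyName="scorer-api-cpu",
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 30.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": 30,
    },
)

# Policy 2: ALB requests per target, target 250, scale-out cooldown 20s.
# With multiple policies, the service scales out to whichever calls for
# the most capacity.
autoscaling.put_scaling_policy(
    PolicyName="scorer-api-request-count",
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 250.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": ALB_LABEL,
        },
        "ScaleOutCooldown": 20,
    },
)
```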


nutrina commented Jun 15, 2023

Looks great

@tim-schultz tim-schultz moved this from In Progress (WIP) to Done in Passport Jun 20, 2023