aws-asg: configurable retry interval/retry #518

Closed
eidam opened this issue Aug 13, 2021 · 10 comments · Fixed by #594

Comments

eidam commented Aug 13, 2021

Hey,

first of all, thank you for the autoscaler; it seems to be designed really well!

We are using the AWS autoscaler and we have some ELB connection draining, which might take more than 2 minutes in some cases. The autoscaler gives up after a couple of retries; would it be possible to make the retry count and interval configurable?

https://github.com/hashicorp/nomad-autoscaler/blob/main/plugins/builtin/target/aws-asg/plugin/aws.go#L16-L20
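
For reference, the linked lines drive a fixed retry loop roughly like the sketch below. This is a minimal illustration; the constant names, values, and helper signature are assumptions, not the plugin's actual code.

```
// Minimal sketch of a fixed retry loop; names and values are assumptions
// used for illustration, not the plugin's actual code.
package plugin

import (
	"context"
	"errors"
	"time"
)

const (
	// Assumed defaults: checking every 15s for up to 10 attempts gives up
	// after roughly 2.5 minutes.
	defaultRetryInterval = 15 * time.Second
	defaultRetryLimit    = 10
)

// retry calls f until it reports completion or the attempt limit is reached.
// Making the interval and limit configurable is what this issue requests.
func retry(ctx context.Context, interval time.Duration, limit int, f func(ctx context.Context) (bool, error)) error {
	for attempt := 0; attempt < limit; attempt++ {
		done, err := f(ctx)
		if done {
			return err
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return errors.New("reached retry limit")
}
```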

Thanks!

jrasell (Member) commented Aug 13, 2021

Hi @eidam and thanks for the kind words. That sounds like a sensible and useful idea! If you have any logs available showing the autoscaler giving up on the retry, that would be great information to have as well.

eidam (Author) commented Aug 13, 2021

Hey @jrasell, sure thing! Here are the logs :)

```
2021-08-13T10:40:51.090Z [INFO]  policy_eval.worker: scaling target: id=05a354e1-741a-8a26-325c-138d5f639ada policy_id=5a352933-6197-5353-2c2a-c1a402afe1ad queue=cluster target=aws-asg from=5 to=4 reason="scaling down because metric is 4" meta=map[nomad_policy_id:5a352933-6197-5353-2c2a-c1a402afe1ad]
2021-08-13T10:40:51.210Z [INFO]  internal_plugin.aws-asg: triggering drain on node: node_id=05a01c2a-020b-888f-6499-2465cc2b1bc8 deadline=30m0s
2021-08-13T10:40:51.251Z [INFO]  internal_plugin.aws-asg: received node drain message: node_id=05a01c2a-020b-888f-6499-2465cc2b1bc8 msg="Drain complete for node 05a01c2a-020b-888f-6499-2465cc2b1bc8"
2021-08-13T10:41:02.076Z [INFO]  internal_plugin.aws-asg: received node drain message: node_id=05a01c2a-020b-888f-6499-2465cc2b1bc8 msg="All allocations on node "05a01c2a-020b-888f-6499-2465cc2b1bc8" have stopped"
2021-08-13T10:41:05.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:41:20.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:41:35.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:41:50.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:05.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:20.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:35.607Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:50.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:43:05.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:43:20.598Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:43:24.438Z [ERROR] internal_plugin.aws-asg: failed to ensure all activities completed: action=scale_in asg_name=nomad-client-arm-webapp error="reached retry limit"
```

eidam (Author) commented Aug 13, 2021

Worth mentioning: the downscale does happen eventually, as the ASG will remove the instance once the connections are drained, and the Nomad client goes to the dead state afterwards. It's just that Nomad isn't waiting long enough for it to happen.

eidam (Author) commented Aug 17, 2021

I might try implementing this; I'll submit a PR once I have something working. :)

@khaledabdelaziz

Hello,
I'm having the same issue with the Autoscaler. I kicked off a test last evening as I noticed this happening, and the autoscaler stopped in the second round of testing.

Setup:
1- The Autoscaler (v0.3.3) runs as a Nomad (v1.0.4) job
2- The Autoscaler checks Prometheus and adds additional nodes when memory or CPU exceeds 80%
3- Jobs with a scaling policy check Prometheus for certain metrics and scale in/out based on that

Test:
1- I run Apache with 5 instances as the initial count, configured to scale up to 20 if the number of jobs in the cluster goes over 3
2- I set a cron job to start a 4th job (webapp.nomad) at the beginning of the hour and stop it at minute 30
3- I expected the Autoscaler to keep scaling the cluster and Apache in and out while the job runs
4- That happened once, then stopped

Full logs and job definitions are attached.
P.S.: there is a fabio LB system job running across all nodes

autoscaler.log
autoscaler.txt
httpd.txt
webapp.txt

lgfa29 (Contributor) commented Nov 25, 2021

Hi @khaledabdelaziz 👋

Your issue is actually different from what's being reported here. I think you just need to adjust your scaling policy to account for resources that are currently blocked. Looking at the Autoscaler logs you provided, it seems like your job is currently stuck in a deployment, which prevents further scaling.

Your cluster policy is also only taking into account allocated resources, and not the allocations that are pending new resources. Take a look at this message and see if it helps you craft an improved cluster scaling policy.

@alopezsanchez (Contributor)

Hi there 👋🏻

I'm facing the same problem, and I would like to help. You are open to new PRs, right?
I will try to help with this issue!

lgfa29 (Contributor) commented Aug 5, 2022

Hi @alopezsanchez 👋

A PR would be great! Feel free to reach out if you have any questions 🙂

alopezsanchez added a commit to alopezsanchez/nomad-autoscaler that referenced this issue Aug 9, 2022
…ximum waiting time

The maximum waiting time for checking the status of the ASG is 2 minutes, which is not configurable.
When using the ASG target plugin and downscaling, the ELB connection draining could take more
than 2 minutes. If this happens, Nomad will purge the nodes without waiting for the ASG to remove
the instances.

Logs:
```
2022-07-28T15:48:51.939Z [ERROR] internal_plugin.aws-asg: failed to ensure all activities completed: action=scale_in asg_name=preproduction/nomad error="reached retry limit"
2022-07-28T15:48:52.123Z [INFO]  internal_plugin.aws-asg: successfully purged Nomad node: node_id=34862212-f89a-b107-903d-f4036a44543d nomad_evals=["88a3c562-31e6-f303-cae3-1289bd924958", "b0b08d76-0c2c-e5f3-1b8d-d6f032cd89aa"]
```

So, this commit adds a new config property called `retry_attempts`, which is the maximum number
of attempts the plugin will make to ensure that the ASG has completed all its actions.

Apart from that, if this PR is accepted, I will update the [documentation](https://www.nomadproject.io/tools/autoscaling/plugins/target/aws-asg#policy-configuration-options) accordingly.

Resolves hashicorp#518
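
For illustration only, a property like this might be read from the plugin's configuration map along these lines; the helper name, default handling, and error message are assumptions rather than the PR's actual code.

```
package plugin

import (
	"fmt"
	"strconv"
)

// parseRetryAttempts reads the retry_attempts key from a config map and
// falls back to the supplied default when the key is not set. Illustrative
// sketch only.
func parseRetryAttempts(config map[string]string, defaultAttempts int) (int, error) {
	raw, ok := config["retry_attempts"]
	if !ok || raw == "" {
		return defaultAttempts, nil
	}

	attempts, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid value %q for retry_attempts: %v", raw, err)
	}
	return attempts, nil
}
```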
@douglaje

FYI, PR #594 doesn't work correctly. It reads the config set on the plugin struct, which holds details about the Nomad cluster and is different from the config passed in during the Status and Scale calls, which is what actually contains the retry_attempts value set in the config.

lgfa29 (Contributor) commented Dec 21, 2023

Hi @douglaje 👋

Apologies for missing your comment. I think I didn't get the notification because this is a closed issue.

`retry_attempts` is a plugin config option, not a policy config option, so it should be set in your Autoscaler agent configuration:

```
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region     = "us-east-1"
    retry_attempts = "20"
  }
}
```

I noticed we're missing this in our docs, so I opened hashicorp/nomad#19549.
