aws-asg: configurable retry interval/retry #518
Comments
Hi @eidam and thanks for the kind words. That sounds like a sensible and useful idea! If you have any logs available showing the autoscaler giving up on the retry, that would be great information to have as well.
Hey @jrasell, sure thing! Here are the logs :)
Worth mentioning: the downscale does happen eventually, since the ASG will remove the instance once the connections are drained and the Nomad client moves to the dead state afterwards. It's just that Nomad isn't waiting long enough for that to happen.
I might try implementing this; I'll submit a PR once I have something working. :)
Hello, Setup: Test: Full logs and job definitions are attached.
Hi @khaledabdelaziz 👋 Your issue is actually different from what's being reported here. I think you just need to adjust your scaling policy to account for resources that are currently blocked. Looking at the Autoscaler logs you provided, it seems like your job is currently stuck in a deployment, which prevents further scaling. Your cluster policy is also only taking into account allocated resources, and not the allocations that are pending new resources. Take a look at this message and see if it helps you craft an improved cluster scaling policy.
Hi there 👋🏻 I'm facing the same problem, and I would like to help. You're open to new PRs, right?
Hi @alopezsanchez 👋 A PR would be great! Feel free to reach out if you have any questions 🙂
…ximum waiting time

The maximum waiting time for checking the status of the ASG is 2 minutes, which is not configurable. When using the ASG target plugin and downscaling, the ELB connection draining could take more than 2 minutes. If this happens, Nomad will purge the nodes without waiting for the ASG to remove the instances.

Logs:

```
2022-07-28T15:48:51.939Z [ERROR] internal_plugin.aws-asg: failed to ensure all activities completed: action=scale_in asg_name=preproduction/nomad error="reached retry limit"
2022-07-28T15:48:52.123Z [INFO] internal_plugin.aws-asg: successfully purged Nomad node: node_id=34862212-f89a-b107-903d-f4036a44543d nomad_evals=["88a3c562-31e6-f303-cae3-1289bd924958", "b0b08d76-0c2c-e5f3-1b8d-d6f032cd89aa"]
```

So, this commit adds a new config property called `retry_attempts`, which is the maximum number of attempts the plugin will make to ensure that the ASG has completed all its actions. Apart from that, if this PR is accepted, I will update the [documentation](https://www.nomadproject.io/tools/autoscaling/plugins/target/aws-asg#policy-configuration-options) accordingly.

Resolves hashicorp#518
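To make the change concrete, here is a minimal Go sketch of the pattern the PR describes: a polling loop whose attempt count comes from configuration instead of a hardcoded constant. The names (`waitForASGActivities`, `retryAttempts`, `done`) are illustrative and not the plugin's actual identifiers.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForASGActivities polls done() until it reports true, giving up after
// retryAttempts polls spaced retryInterval apart. In a configurable plugin,
// retryAttempts would come from the target config (e.g. retry_attempts)
// rather than from a hardcoded constant.
func waitForASGActivities(ctx context.Context, retryAttempts int, retryInterval time.Duration,
	done func(context.Context) (bool, error)) error {

	for i := 0; i < retryAttempts; i++ {
		ok, err := done(ctx)
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(retryInterval):
		}
	}
	return fmt.Errorf("reached retry limit after %d attempts", retryAttempts)
}

func main() {
	// Toy completion check that succeeds on the third poll, standing in for
	// the AWS call the plugin would make to inspect ASG scaling activities.
	polls := 0
	check := func(ctx context.Context) (bool, error) {
		polls++
		return polls >= 3, nil
	}

	err := waitForASGActivities(context.Background(), 20, 100*time.Millisecond, check)
	fmt.Println("result:", err) // prints "result: <nil>" once the check completes
}
```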
FYI, the PR #594 doesn't work correctly. It's reading the config set on the plugin struct, which holds details about the Nomad cluster and is different from the config passed to the autoscaler during the …
Hi @douglaje 👋 Apologies for missing your comment. I think I didn't get the notification because this is a closed issue. The configuration should look something like this:

```hcl
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region     = "us-east-1"
    retry_attempts = "20"
  }
}
```

I noticed we're missing this in our docs, so I opened hashicorp/nomad#19549.
Hey,
first of all, thank you for the autoscaler; it seems to be designed really well!
We are using the AWS autoscaler and we have some ELB connection draining, which might take more than 2 minutes in some cases. The autoscaler gives up after a couple of retries; would it be possible to make the retry count and interval configurable?
https://github.com/hashicorp/nomad-autoscaler/blob/main/plugins/builtin/target/aws-asg/plugin/aws.go#L16-L20
Thanks!
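For reference, the linked lines pin the retry behaviour with package-level constants. A hedged stand-in (the names and values below are illustrative, not quoted from `aws.go`) shows how those two numbers combine into the roughly two-minute cap discussed above:

```go
// Stand-in for the hardcoded retry parameters the linked lines define; the
// real names and values live in plugins/builtin/target/aws-asg/plugin/aws.go
// and may differ from what is shown here.
package main

import (
	"fmt"
	"time"
)

const (
	defaultRetryInterval = 10 * time.Second // pause between ASG status checks
	defaultRetryLimit    = 12               // checks before "reached retry limit"
)

func main() {
	// Total time the autoscaler waits for the ASG before purging the node.
	// With these stand-in values: 12 * 10s = 2m, which is shorter than a
	// slow ELB connection-draining period.
	fmt.Println("max wait:", defaultRetryLimit*defaultRetryInterval)
}
```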