aws-asg: configurable retry interval/retry #518

Closed
eidam opened this issue Aug 13, 2021 · 10 comments · Fixed by #594

Comments

eidam commented Aug 13, 2021

Hey,

first of all, thank you for the autoscaler; it seems to be designed really well!

We are using the AWS autoscaler and we have some ELB connection draining, which might take more than 2 minutes in some cases. The autoscaler gives up after a couple of retries; would it be possible to make the retry count and interval configurable?

https://github.com/hashicorp/nomad-autoscaler/blob/main/plugins/builtin/target/aws-asg/plugin/aws.go#L16-L20
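
For reference, the linked lines drive a fixed retry loop roughly like the sketch below. This is a minimal illustration; the constant names, values, and helper signature are assumptions, not the plugin's actual code.

```
// Minimal sketch of a fixed retry loop; names and values are assumptions
// used for illustration, not the plugin's actual code.
package plugin

import (
	"context"
	"errors"
	"time"
)

const (
	// Assumed defaults: checking every 15s for up to 10 attempts gives up
	// after roughly 2.5 minutes.
	defaultRetryInterval = 15 * time.Second
	defaultRetryLimit    = 10
)

// retry calls f until it reports completion or the attempt limit is reached.
// Making the interval and limit configurable is what this issue requests.
func retry(ctx context.Context, interval time.Duration, limit int, f func(ctx context.Context) (bool, error)) error {
	for attempt := 0; attempt < limit; attempt++ {
		done, err := f(ctx)
		if done {
			return err
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return errors.New("reached retry limit")
}
```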

Thanks!

jrasell (Member) commented Aug 13, 2021

Hi @eidam and thanks for the kind words. That sounds like a sensible and useful idea! If you have any logs available showing the autoscaler giving up on the retry, that would be great information to have as well.

eidam (Author) commented Aug 13, 2021

Hey @jrasell, sure thing! Here are the logs :)

```
2021-08-13T10:40:51.090Z [INFO]  policy_eval.worker: scaling target: id=05a354e1-741a-8a26-325c-138d5f639ada policy_id=5a352933-6197-5353-2c2a-c1a402afe1ad queue=cluster target=aws-asg from=5 to=4 reason="scaling down because metric is 4" meta=map[nomad_policy_id:5a352933-6197-5353-2c2a-c1a402afe1ad]
2021-08-13T10:40:51.210Z [INFO]  internal_plugin.aws-asg: triggering drain on node: node_id=05a01c2a-020b-888f-6499-2465cc2b1bc8 deadline=30m0s
2021-08-13T10:40:51.251Z [INFO]  internal_plugin.aws-asg: received node drain message: node_id=05a01c2a-020b-888f-6499-2465cc2b1bc8 msg="Drain complete for node 05a01c2a-020b-888f-6499-2465cc2b1bc8"
2021-08-13T10:41:02.076Z [INFO]  internal_plugin.aws-asg: received node drain message: node_id=05a01c2a-020b-888f-6499-2465cc2b1bc8 msg="All allocations on node "05a01c2a-020b-888f-6499-2465cc2b1bc8" have stopped"
2021-08-13T10:41:05.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:41:20.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:41:35.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:41:50.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:05.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:20.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:35.607Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:42:50.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:43:05.597Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:43:20.598Z [WARN]  internal_plugin.aws-asg: node pool status readiness check failed: error="node 05a01c2a-020b-888f-6499-2465cc2b1bc8 is ineligible"
2021-08-13T10:43:24.438Z [ERROR] internal_plugin.aws-asg: failed to ensure all activities completed: action=scale_in asg_name=nomad-client-arm-webapp error="reached retry limit"
```

eidam (Author) commented Aug 13, 2021

Worth mentioning: the downscale does happen eventually, as the ASG will remove the instance once the connections are drained, and the Nomad client goes to the dead state afterwards. It's just that Nomad isn't waiting long enough for it to happen.

eidam (Author) commented Aug 17, 2021

I might try implementing this; I'll submit a PR once I have something working. :)

@khaledabdelaziz

Hello,
I'm having the same issue with the Autoscaler. I kicked off a test last evening as I noticed this happening, and the autoscaler stopped in the second round of testing.

Setup:
1- The Autoscaler (v0.3.3) runs as a Nomad (v1.0.4) job
2- The Autoscaler checks Prometheus and adds additional nodes when memory or CPU exceeds 80%
3- Jobs with a scaling policy check Prometheus for certain metrics and scale in/out based on that

Test:
1- I run Apache with 5 instances as the initial count, configured to scale up to 20 if the number of jobs in the cluster goes over 3
2- I set a cron job to start a 4th job (webapp.nomad) at the beginning of the hour and stop it at minute 30
3- I expected the Autoscaler to keep scaling the cluster and Apache in and out while the job runs
4- That happened once, then stopped

Full logs and job definitions are attached.
P.S.: there is a fabio LB system job running across all nodes

autoscaler.log
autoscaler.txt
httpd.txt
webapp.txt

lgfa29 (Contributor) commented Nov 25, 2021

Hi @khaledabdelaziz 👋

Your issue is actually different from what's being reported here. I think you just need to adjust your scaling policy to account for resources that are currently blocked. Looking at the Autoscaler logs you provided, it seems like your job is currently stuck in a deployment, which prevents further scaling.

Your cluster policy is also only taking into account allocated resources, and not the allocations that are pending new resources. Take a look at this message and see if it helps you craft an improved cluster scaling policy.

@alopezsanchez (Contributor)

Hi there 👋🏻

I'm facing the same problem, and I would like to help. You are open to new PRs, right?
I will try to help with this issue!

lgfa29 (Contributor) commented Aug 5, 2022

Hi @alopezsanchez 👋

A PR would be great! Feel free to reach out if you have any questions 🙂

alopezsanchez added a commit to alopezsanchez/nomad-autoscaler that referenced this issue Aug 9, 2022
…ximum waiting time

The maximum waiting time for checking the status of the ASG is 2 minutes, which is not configurable.
When using the ASG target plugin and downscaling, the ELB connection draining could take more
than 2 minutes. If this happens, Nomad will purge the nodes without waiting for the ASG to remove
the instances.

Logs:
```
2022-07-28T15:48:51.939Z [ERROR] internal_plugin.aws-asg: failed to ensure all activities completed: action=scale_in asg_name=preproduction/nomad error="reached retry limit"
2022-07-28T15:48:52.123Z [INFO]  internal_plugin.aws-asg: successfully purged Nomad node: node_id=34862212-f89a-b107-903d-f4036a44543d nomad_evals=["88a3c562-31e6-f303-cae3-1289bd924958", "b0b08d76-0c2c-e5f3-1b8d-d6f032cd89aa"]
```

So, this commit adds a new config property called `retry_attempts`, which is the maximum number
of attempts the plugin will make to ensure that the ASG has completed all its actions.

Apart from that, if this PR is accepted, I will update the [documentation](https://www.nomadproject.io/tools/autoscaling/plugins/target/aws-asg#policy-configuration-options) accordingly.

Resolves hashicorp#518
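
For illustration only, a property like this might be read from the plugin's configuration map along these lines; the helper name, default handling, and error message are assumptions rather than the PR's actual code.

```
package plugin

import (
	"fmt"
	"strconv"
)

// parseRetryAttempts reads the retry_attempts key from a config map and
// falls back to the supplied default when the key is not set. Illustrative
// sketch only.
func parseRetryAttempts(config map[string]string, defaultAttempts int) (int, error) {
	raw, ok := config["retry_attempts"]
	if !ok || raw == "" {
		return defaultAttempts, nil
	}

	attempts, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid value %q for retry_attempts: %v", raw, err)
	}
	return attempts, nil
}
```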
@douglaje

FYI, PR #594 doesn't work correctly. It reads the config set on the plugin struct, which holds details about the Nomad cluster and is different from the config passed in during the Status and Scale calls, which is what actually contains the retry_attempts value set in the config.

lgfa29 (Contributor) commented Dec 21, 2023

Hi @douglaje 👋

Apologies for missing your comment. I think I didn't get the notification because this is a closed issue.

`retry_attempts` is a plugin config option, not a policy config option, so it should be set in your Autoscaler agent configuration:

```
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region     = "us-east-1"
    retry_attempts = "20"
  }
}
```

I noticed we're missing this in our docs, so I opened hashicorp/nomad#19549.
