AWS up DescribeTags retry quota exceeded when creating multiple clusters in parallel after upgrading kops v1.29.x -> v1.30.x #16886

Closed
AndrewSirenko opened this issue Oct 9, 2024 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@AndrewSirenko
Contributor

AndrewSirenko commented Oct 9, 2024

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

Any version after kops 1.29.6 (e.g. v1.30.1 and/or building from latest master branch commit)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

v1.30

3. What cloud provider are you using?

aws

4. What commands did you run? What is the simplest way to reproduce this issue?

A standard kops create cluster. Reproducible by creating multiple clusters in parallel on a busy AWS account.

5. What happened after the commands executed?

If I upgrade past kops v1.29.6 to kops v1.30.x, I start running into retry attempt exceeded errors when provisioning multiple kops clusters on a busy AWS account. We get the following errors during cluster creation:

I1003 20:06:00.293612   27492 executor.go:171] Continuing to run 2 task(s)
I1003 20:06:10.297912   27492 executor.go:113] Tasks: 126 done / 134 total; 2 can run
W1003 20:06:42.931207   27492 executor.go:141] error running task "TargetGroup/tcp-e2e-18419240495386501-m1detk" (4s remaining to succeed): listing ELB TargetGroups: operation error Elastic Load Balancing v2: DescribeTargetGroups, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: 6ca1f8bf-7dec-469a-aae0-0adcabb6dc81, api error Throttling: Rate exceeded
W1003 20:06:42.931249   27492 executor.go:141] error running task "TargetGroup/kops-controller-e2e-18419-gh2fha" (4s remaining to succeed): listing ELB TargetGroup tags: operation error Elastic Load Balancing v2: DescribeTags, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested
I1003 20:06:42.931276   27492 executor.go:171] Continuing to run 2 task(s)
Error: error running tasks: deadline exceeded executing task TargetGroup/kops-controller-e2e-18419-gh2fha. Example error: listing ELB TargetGroup tags: operation error Elastic Load Balancing v2: DescribeTags, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

Perhaps this is due to a few unused retry_max_attempts configuration constants after kops v1.30's migration to AWS SDK V2?

6. What did you expect to happen?

I expect kops to keep retrying these specific tasks a few more times until the rate-limit buckets refill over the next few seconds, so that cluster creation can finish.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

N/A

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

TODO (need to wait for one more CI run, apologies)

9. Anything else we need to know?

Looks like a couple of max retry constants are no longer used after migrating from AWS Go SDK V1 -> V2.

I have raised a draft PR #16887

A workaround would be to raise AWS account EC2 limits, but we'd have to go through the limit request process each time our CI account changes (every 6 months).

Thank you and have a great week!

@k8s-ci-robot added the kind/bug label Oct 9, 2024
@AndrewSirenko changed the title from "AWS up DescribeTags retry quota exceeded regression in v1.29.x -> v1.30.x" to "AWS up DescribeTags retry quota exceeded when creating multiple clusters in parallel after upgrading kops v1.29.x -> v1.30.x" Oct 9, 2024
@hakman
Member

hakman commented Oct 9, 2024

Thank you @AndrewSirenko!
@rifelpet Could you please take a look?

@AndrewSirenko
Contributor Author

Looks like #16887 cleared our problem, thank you!
