core-services/prow/02_config: Drop AWS Boskos down to 155 leases #14832

Conversation


@wking wking commented Jan 13, 2021

We bumped this from 150 total to 200 total in a9735b5 (Revert "Revert "core-services/prow/02_config/_boskos: Shard AWS, Azure, and GCP by region"", 2020-12-10, #14262). But recently we have been hitting:

  level=error msg=Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
  level=error msg=  status code: 400, request id: ...

in CI. The cause seems to be VPCs leaking out of CI jobs. And the cause of those leaks seems to be stuck teardowns, for example [1]:

  Deprovision failed on the following clusters:
  ci-op-3kfz2j4c
  ci-op-b1qptwsl
  ci-op-btpryb6k
  ...

And the cause of those seems to be AWS throttling making it take a long time to list IAM roles [2]:

time="2021-01-13T19:09:34Z" level=debug msg="search for IAM roles"
time="2021-01-13T19:16:35Z" level=debug msg="search for IAM users"

and thereafter not having enough time to actually clean up the cluster resources before we time out our teardown attempts. By reducing the overall capacity to 155, near our previous 150, we will hopefully reduce AWS IAM API traffic sufficiently to get back under AWS's undocumented throttling cap.

I'm weighting us-east-1 more heavily, because the current VPC limits are 150 for us-east-1, and 55 for our other three AWS regions. I haven't looked into the other AWS limits vs. our expected consumption recently, so still no attempt at rational limits. And if the limits are really "undocumented AWS throttling", maybe rational limits for AWS are not possible.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ipi-deprovision/1349426249335836672#1:build-log.txt%3A1385
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ipi-deprovision/1349426249335836672/artifacts/deprovision/ci-op-btpryb6k/.openshift_install.log
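
(For illustration only: a minimal sketch of what the per-region AWS entries in core-services/prow/02_config/_boskos.yaml could look like after this change. The resource type names, the choice of the other three regions, and the exact per-region split are assumptions; only the 155 total and the heavier us-east-1 weighting come from this PR.)

  resources:
  - type: "aws-us-east-1-quota-slice"   # hypothetical type name; weighted toward us-east-1's 150-VPC limit
    state: "free"
    min-count: 65                       # illustrative split: 65 + 30 + 30 + 30 = 155
    max-count: 65
  - type: "aws-us-east-2-quota-slice"   # hypothetical type name
    state: "free"
    min-count: 30
    max-count: 30
  - type: "aws-us-west-1-quota-slice"   # hypothetical type name
    state: "free"
    min-count: 30
    max-count: 30
  - type: "aws-us-west-2-quota-slice"   # hypothetical type name
    state: "free"
    min-count: 30
    max-count: 30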
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2021

@dobbymoodge dobbymoodge left a comment


LGTM

@dobbymoodge dobbymoodge self-requested a review January 13, 2021 20:34
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dobbymoodge, wking

@openshift-merge-robot openshift-merge-robot merged commit f352ab7 into openshift:master Jan 13, 2021
@openshift-ci-robot

@wking: Updated the following 2 configmaps:

  • resources configmap in namespace ci at cluster api.ci using the following files:
    • key boskos.yaml using file core-services/prow/02_config/_boskos.yaml
  • resources configmap in namespace ci at cluster app.ci using the following files:
    • key boskos.yaml using file core-services/prow/02_config/_boskos.yaml
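
(A minimal sketch, not quoted from the bot: the resources ConfigMap described above carries the Boskos config under a boskos.yaml key, roughly along these lines. The ConfigMap name, namespace, and key come from the bullet list; the rest is standard Kubernetes ConfigMap structure.)

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: resources
    namespace: ci
  data:
    boskos.yaml: |
      # contents of core-services/prow/02_config/_boskos.yaml,
      # with the lease counts adjusted by this PR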


@wking wking deleted the reduce-aws-capacity-for-throttling branch January 13, 2021 21:32