
Revert "Revert "core-services/prow/02_config/_boskos: Shard AWS, Azure, and GCP by region"" #14262

Conversation

@wking wking commented Dec 10, 2020

This reverts commit 8a22fc4, #12842, restoring(ish) #12589.

Boskos' config reloading on dynamic -> static pivots has been fixed by kubernetes-sigs/boskos#54, so we can take another run at static leases for these platforms. Not a clean re-revert, because 4705f26 (#14032) landed in the meantime, but it was easy to update from 120 to 80 here.

Revert "Revert "core-services/prow/02_config/_boskos: Shard AWS, Azure, and GCP by region""

This reverts commit 8a22fc4, openshift#12842.

Boskos' config reloading on dynamic -> static pivots has been fixed by
kubernetes-sigs/boskos@3834f37d8a (Config sync: Avoid deadlock when
static -> dynamic -> static, 2020-12-03, kubernetes-sigs/boskos#54),
so we can take another run at static leases for these platforms.  Not
a clean re-revert, because 4705f26 (core-services/prow/02_config:
Drop GCP Boskos leases to 80, 2020-12-02, openshift#14032) landed in the
meantime, but it was easy to update from 120 to 80 here.
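A static Boskos pool sharded by region uses an explicit `names` list per resource type, instead of the `min-count`/`max-count` pair that dynamic pools use. The fragment below is a minimal sketch of that shape; the type and lease names are illustrative, not the exact entries from `core-services/prow/02_config/_boskos.yaml`:

```yaml
# Hypothetical excerpt: static per-region AWS lease pools.
# The real type and lease names in _boskos.yaml may differ.
resources:
- type: aws-us-east-1-quota-slice
  state: free
  names:
  - aws-us-east-1-quota-slice-00
  - aws-us-east-1-quota-slice-01
  # ... one entry per lease, up to the per-region cap
- type: aws-us-west-2-quota-slice
  state: free
  names:
  - aws-us-west-2-quota-slice-00
  - aws-us-west-2-quota-slice-01
```

Because the lease set is enumerated rather than computed, a static pivot like this is exactly the kind of config reload that kubernetes-sigs/boskos#54 had to fix.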
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

@wking: Updated the following 2 configmaps:

  • resources configmap in namespace ci at cluster app.ci using the following files:
    • key boskos.yaml using file core-services/prow/02_config/_boskos.yaml
  • resources configmap in namespace ci at cluster api.ci using the following files:
    • key boskos.yaml using file core-services/prow/02_config/_boskos.yaml

In response to this:

This reverts commit 8a22fc4, #12842, restoring(ish) #12589.

Boskos' config reloading on dynamic -> static pivots has been fixed by kubernetes-sigs/boskos#54, so we can take another run at static leases for these platforms. Not a clean re-revert, because 4705f26 (#14032) landed in the meantime, but it was easy to update from 120 to 80 here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the restore-by-region-aws-azure-and-gcp-leases branch December 11, 2020 16:55
wking added a commit to wking/openshift-release that referenced this pull request Jan 13, 2021
We bumped this from 150 total to 200 total in a9735b5 (Revert
"Revert "core-services/prow/02_config/_boskos: Shard AWS, Azure, and
GCP by region"", 2020-12-10, openshift#14262).  But recently we have been hitting:

  level=error msg=Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
  level=error msg=  status code: 400, request id: ...

in CI.  The cause seems to be VPCs leaking out of CI jobs.  And the
cause of those leaks seems to be stuck teardowns, for example [1]:

  Deprovision failed on the following clusters:
  ci-op-3kfz2j4c
  ci-op-b1qptwsl
  ci-op-btpryb6k
  ...

And the cause of those seems to be AWS throttling making it take a
long time to list IAM roles [2]:

  time="2021-01-13T19:09:34Z" level=debug msg="search for IAM roles"
  time="2021-01-13T19:16:35Z" level=debug msg="search for IAM users"

and thereafter not having enough time to actually clean up the cluster
resources before we time out our teardown attempts.  By reducing the
overall capacity to 155, near our previous 150, we will hopefully
reduce AWS IAM API traffic sufficiently to get back under AWS's
undocumented throttling cap.

I'm weighting us-east-1 more heavily, because the current VPC limits
are 150 for us-east-1, and 55 for our other three AWS regions.  I
haven't looked into the other AWS limits vs. our expected consumption
recently, so still no attempt at rational limits.  And if the limits
are really "undocumented AWS throttling", maybe rational limits for
AWS are not possible.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ipi-deprovision/1349426249335836672#1:build-log.txt%3A1385
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ipi-deprovision/1349426249335836672/artifacts/deprovision/ci-op-btpryb6k/.openshift_install.log
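The weighting described above can be sanity-checked with a little arithmetic. The sketch below allocates the 155-lease budget proportionally to each region's VPC limit (150 for us-east-1, 55 elsewhere); the resulting per-region counts are illustrative, not the values actually committed:

```python
# Per-region VPC limits from the commit message: 150 for us-east-1,
# 55 for each of the other three AWS regions.
vpc_limits = {
    "us-east-1": 150,
    "us-east-2": 55,
    "us-west-1": 55,
    "us-west-2": 55,
}
total_capacity = 155  # near the previous 150, down from 200

# Allocate leases proportionally to each region's VPC limit,
# flooring so the total never exceeds the budget.
total_limit = sum(vpc_limits.values())  # 315
leases = {
    region: total_capacity * limit // total_limit
    for region, limit in vpc_limits.items()
}
print(leases, sum(leases.values()))  # totals 154 due to flooring
```

This is only a plausibility check: since the real constraint may be undocumented IAM throttling rather than VPC quota, any proportional split is at best a heuristic.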