ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI #6833
Conversation
wking force-pushed from 4bb7e30 to 2ebbcad
/hold

We should confirm this is working in rehearsals.
ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI

I'm leaving the UPI templates and such alone for now, because they mention regions in more places and I don't care as much about them ;). The approach mirrors what we did for Azure with 4d08a9d (ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random Azure regions, 2019-09-18, openshift#5081).

This should reduce resource contention (both for limits and for API throttling) in CI. We could theoretically raise our Boskos caps as well, but I'm leaving that off for now because we're getting hammered by AMI-copy issues [1].

I've also added ZONE_* variables in case we need to dodge broken zones like we have in the past with e8921c3 (ci-operator/templates/openshift: Get e2e-aws out of us-east-1b, 2019-03-22, openshift#3204). I have nothing against the other us-east-1 zones at the moment; the current ZONE_* overrides there are just to demonstrate the idea in case folks need to pivot later. Without explicit ZONE_* choices, we'll fall back to using zones a and b within the chosen region.

The DevProductivity Platform (DPP) folks bumped our quotas in the new regions back in 2019-11 [2]. We should be good for up to 25% of our current us-east-1 capacity in those regions. Increases beyond that, or sharding into additional regions, will require additional DPP <-> AWS work to get limits raised.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1752370#c3
[2]: https://issues.redhat.com/browse/DPP-3108
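A minimal sketch of the mechanism described above, assuming it runs in the template's bash setup step; the non-us-east-1 arms and the exact zone overrides are illustrative, not copied from the actual diff:

```bash
# Shard CI clusters across four AWS regions at random. Explicit ZONE_*
# assignments (as in the us-east-1 arm) let us dodge a known-broken zone.
case $((RANDOM % 4)) in
0) AWS_REGION=us-east-1
   ZONE_1=us-east-1b
   ZONE_2=us-east-1c;;
1) AWS_REGION=us-east-2;;
2) AWS_REGION=us-west-1;;
3) AWS_REGION=us-west-2;;
esac

# Without explicit ZONE_* choices, fall back to zones a and b of the
# chosen region.
ZONE_1="${ZONE_1:-${AWS_REGION}a}"
ZONE_2="${ZONE_2:-${AWS_REGION}b}"
echo "installing into ${AWS_REGION} (${ZONE_1}, ${ZONE_2})"
```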
wking force-pushed from 2ebbcad to b717933
@wking: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@@ -442,6 +440,16 @@ objects:
    elif has_variant "large"; then
      master_type=m5.4xlarge
    fi
    case $((RANDOM % 4)) in
    0) AWS_REGION=us-east-1
       ZONE_1=us-east-1b
Why are we doing this?
Seems premature, but ok.
It is premature, but I wanted folks to be able to see the pattern without having to dig through commit/PR logs. See "I have nothing against the other us-east-1 zones at the moment..." in the PR topic and commit message ;).
and a successful cleanup, e.g.:
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
@wking: Updated the following 6 configmaps:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
ci-operator/templates/openshift/installer: Shared subnets for new regions

Catching up with b717933 (ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI, 2020-01-23, openshift#6833), add shared subnets for the new regions. This is basically as described in 424a04a (Add shared vpc CI job for installer repo and informing periodics, 2019-08-22, openshift#5550), although I've dropped the --stack-policy-body mentioned in that commit message (I dunno what it was about, since the commit message didn't give it the argument that --stack-policy-body expects).

Generated using:

    $ export AWS_PROFILE=ci  # or whatever you call it locally
    $ git fetch origin
    $ date --iso=m --utc
    2020-01-23T20:41+0000
    $ git checkout origin/release-4.3
    $ git --no-pager log --oneline -1
    2055609f9 (HEAD, origin/release-4.3) Merge pull request openshift#2928 from ashcrow/4.3-signed-rhcos-bump

with:

    for REGION in us-east-2 us-west-1 us-west-2
    do
      COUNT=3
      if test us-west-1 = "${REGION}"
      then
        COUNT=2
      fi
      for INDEX in 1 2 3 4
      do
        NAME="do-not-delete-shared-vpc-${INDEX}"
        aws --region "${REGION}" cloudformation create-stack --stack-name "${NAME}" --template-body "$(cat upi/aws/cloudformation/01_vpc.yaml)" --parameters "ParameterKey=AvailabilityZoneCount,ParameterValue=${COUNT}" >/dev/null
        aws --region "${REGION}" cloudformation wait stack-create-complete --stack-name "${NAME}"
        SUBNETS="$(aws --region "${REGION}" cloudformation describe-stacks --stack-name "${NAME}" | jq -c '[.Stacks[].Outputs[] | select(.OutputKey | endswith("SubnetIds")).OutputValue | split(",")[]]' | sed "s/\"/'/g")"
        echo "${REGION}_$((INDEX - 1))) subnets=\"${SUBNETS}\";;"
      done
    done

which spits out:

    us-east-2_0) subnets="['subnet-0faf6d16c378ee7a7','subnet-0e104572db1b7d092','subnet-014ca96c04f36adec','subnet-0ea06057dceadfe8e','subnet-0689efe5e1f9f4212','subnet-0d36bb8edbcb3d916']";;
    us-east-2_1) subnets="['subnet-085787cc4b80b84b2','subnet-09dfbf66e8f6e5b50','subnet-0db5d90ff3087444e','subnet-047f15f2a0210fbe0','subnet-0bf13f041c4233849','subnet-0e2a5320549e289d8']";;
    ...

The COUNT bit for us-west-1 is because our CI account only has access to two zones there:

    $ aws --region us-west-1 ec2 describe-availability-zones --output text
    AVAILABILITYZONES  us-west-1  available  usw1-az3  us-west-1a
    AVAILABILITYZONES  us-west-1  available  usw1-az1  us-west-1b
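The generated `REGION_INDEX) subnets=...` lines are shaped to drop into a shell case statement. A hedged sketch of how a template might consume them, assuming the region was already chosen and that the stack index is also randomized (only the two us-east-2 arms from the output above are shown; the randomization and error arm are assumptions, not the actual template code):

```bash
# Pick one of the four pre-created do-not-delete-shared-vpc-{1..4}
# stacks for the already-chosen region; indexes run 0-3 to match the
# generator's "${REGION}_$((INDEX - 1)))" output.
case "${AWS_REGION}_$((RANDOM % 4))" in
us-east-2_0) subnets="['subnet-0faf6d16c378ee7a7','subnet-0e104572db1b7d092','subnet-014ca96c04f36adec','subnet-0ea06057dceadfe8e','subnet-0689efe5e1f9f4212','subnet-0d36bb8edbcb3d916']";;
us-east-2_1) subnets="['subnet-085787cc4b80b84b2','subnet-09dfbf66e8f6e5b50','subnet-0db5d90ff3087444e','subnet-047f15f2a0210fbe0','subnet-0bf13f041c4233849','subnet-0e2a5320549e289d8']";;
# ...remaining REGION_INDEX arms from the generator's output...
*) echo "no shared subnets recorded for ${AWS_REGION}" >&2
   exit 1;;
esac
echo "using shared subnets: ${subnets}"
```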
Bringing over a number of changes which have landed in ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml as of openshift/release@c64d5d7d8f (Merge pull request #6845 from wking/shared-subnets-for-other-regions, 2020-01-23). One series was AWS region sharding:

* openshift/release@b7179335a3 (ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI, 2020-01-23, openshift/release#6833).
* openshift/release@7e38260d25 (ci-operator/templates/openshift/installer: Shared subnets for new regions, 2020-01-23, openshift/release#6845).

Another series was the password removal followed by a bunch of pipefail fumbling ;)

* openshift/release@4847cb5477 (clean up install log output, 2020-01-13, openshift/release#6692).
* openshift/release@5c6ca8a506 (templates: Use 'pipefail' so that grep doesn't mask install failures, 2020-01-15, openshift/release#6718).
* openshift/release@bee15b9fa8 (add -eou pipefail to remaining job templates, 2020-01-16, openshift/release#6738).
* openshift/release@07bd61d677 (Revert "add -eou pipefail to remaining job templates", 2020-01-17, openshift/release#6748).
* openshift/release@ca655477ca (Revert "Revert "add -eou pipefail to remaining job templates"", 2020-01-17, openshift/release#6750).
* openshift/release@9d7453156b (tolerate undefined env vars in teardown, 2020-01-17, openshift/release#6750).

And there was also:

* openshift/release@c6c2efb3fc (templates: Add ipv6 variant that triggers azure singlestack, 2020-01-22, openshift/release#6809).
* openshift/release@752455a47f (templates: fix check for unset variable, 2020-01-16, openshift/release#6723).
This was originally part of avoiding broken zones; see e8921c3 (ci-operator/templates/openshift: Get e2e-aws out of us-east-1b, 2019-03-22, openshift#3204) and b717933 (ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI, 2020-01-23, openshift#6833). But the installer has had broken-zone avoidance since way back in openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return available zones, 2019-02-07, openshift/installer#1210). I dunno how reliably AWS sets 'state: impaired' and similar; it didn't seem to protect us from e8921c3. But we're getting ready to pivot to using multiple AWS accounts, which creates two issues with hard-coding zone names in the step:

1. References by name are not stable between accounts. From the AWS docs [1]:

       To ensure that resources are distributed across the Availability Zones for a Region, we independently map Availability Zones to names for each AWS account. For example, the Availability Zone us-east-1a for your AWS account might not be the same location as us-east-1a for another AWS account.

   So "aah, us-east-1a is broken, let's use b and c instead" might apply to one account but not the other. And the installer does not currently accept zone IDs.

2. References by name may not exist in other accounts. From the AWS docs [1]:

       As Availability Zones grow over time, our ability to expand them can become constrained. If this happens, we might restrict you from launching an instance in a constrained Availability Zone unless you already have an instance in that Availability Zone. Eventually, we might also remove the constrained Availability Zone from the list of Availability Zones for new accounts. Therefore, your account might have a different number of available Availability Zones in a Region than another account.

   And it turns out that for some reason they sometimes don't name sequentially, e.g. our new account lacks us-west-1a:

       $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
       us-west-1a usw1-az3 available
       us-west-1b usw1-az1 available
       $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
       us-west-1b usw1-az3 available
       us-west-1c usw1-az1 available

I have no idea why they decided to do that, but we have to work with the world as it is ;). Removing the us-east-1 overrides helps reduce our exposure, although we are still vulnerable to (2) with the a/b default line. We'll do something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it to try to set up each available zone, and that causes more API contention and resource consumption than we want. Background on that in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS availability zones, 2019-03-28, openshift#3285) and d87fffb (ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615), as well as the rejected/rotted-out [2].

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2]: openshift/installer#1487
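One possible shape for that follow-up work (a sketch under the assumption that the step can query EC2 at runtime, reusing the describe-availability-zones call quoted above; not the actual change):

```bash
# Instead of assuming zones named a/b exist in this account, ask EC2
# which zones are actually available in the chosen region and take the
# first two, so differing per-account zone names no longer matter.
mapfile -t zones < <(aws --region "${AWS_REGION}" ec2 describe-availability-zones \
  --filters Name=state,Values=available \
  --query 'AvailabilityZones[].ZoneName' --output text | tr '\t' '\n' | sort)
if test "${#zones[@]}" -lt 2
then
  echo "need at least two available zones in ${AWS_REGION}, got: ${zones[*]}" >&2
  exit 1
fi
ZONE_1="${zones[0]}"
ZONE_2="${zones[1]}"
```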
I'm leaving the UPI templates and such alone for now, because they mention regions in more places and I don't care as much about them ;). The approach mirrors what we did for Azure with 4d08a9d (#5081). This should reduce resource contention (both for limits and for API throttling) in CI. We could theoretically raise our Boskos caps as well, but I'm leaving that off for now because we're getting hammered by AMI-copy issues.

I've also added ZONE_* variables in case we need to dodge broken zones like we have in the past with e8921c3 (#3204). I have nothing against the other us-east-1 zones at the moment; the current ZONE_* overrides there are just to demonstrate the idea in case folks need to pivot later. Without explicit ZONE_* choices, we'll fall back to using zones a and b within the chosen region.