
ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI #6833

Merged

merged 1 commit into openshift:master from wking:aws-region-sharding on Jan 23, 2020

Conversation

@wking (Member) commented Jan 23, 2020

I'm leaving the UPI templates and such alone for now, because they mention regions in more places and I don't care as much about them ;). The approach mirrors what we did for Azure with 4d08a9d (#5081). This should reduce resource contention (both for limits and for API throttling) in CI. We could theoretically raise our Boskos caps as well, but I'm leaving that off for now because we're getting hammered by AMI-copy issues.

I've also added ZONE_* variables in case we need to dodge broken zones like we have in the past with e8921c3 (#3204). I have nothing against the other us-east-1 zones at the moment, the current ZONE_* overrides there are just to demonstrate the idea in case folks need to pivot later. Without explicit ZONE_* choices, we'll fall back to using zones a and b within the chosen region.
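
For illustration, the pattern looks roughly like this (the us-east-1 arm and its ZONE_1 override are from the actual diff; the other arms match the regions discussed later in this thread, and the fallback lines are a sketch of the same idea, not the template's exact text):

  case $((RANDOM % 4)) in
  0) AWS_REGION=us-east-1
     ZONE_1=us-east-1b;;  # explicit override, demonstrating the broken-zone dodge
  1) AWS_REGION=us-east-2;;
  2) AWS_REGION=us-west-1;;
  3) AWS_REGION=us-west-2;;
  esac
  # Without explicit ZONE_* choices, fall back to zones a and b of the chosen region.
  ZONE_1="${ZONE_1:-${AWS_REGION}a}"
  ZONE_2="${ZONE_2:-${AWS_REGION}b}"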

openshift-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Jan 23, 2020
openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jan 23, 2020
@wking (Member Author) commented Jan 23, 2020

/hold

We should confirm this is working in rehearsals.

openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Jan 23, 2020
ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random AWS regions for IPI

I'm leaving the UPI templates and such alone for now, because they
mention regions in more places and I don't care as much about them ;).
The approach mirrors what we did for Azure with 4d08a9d
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random Azure regions, 2019-09-18, openshift#5081).  This should reduce resource
contention (both for limits and for API throttling) in CI.  We could
theoretically raise our Boskos caps as well, but I'm leaving that off
for now because we're getting hammered by AMI-copy issues [1].

I've also added ZONE_* variables in case we need to dodge broken zones
like we have in the past with e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204).  I have nothing against the other us-east-1 zones
at the moment, the current ZONE_* overrides there are just to
demonstrate the idea in case folks need to pivot later.  Without
explicit ZONE_* choices, we'll fall back to using zones a and b within
the chosen region.

The DevProductivity Platform (DPP) folks bumped our quotas in the new
regions back in 2019-11 [2].  We should be good for up to 25% of our
current us-east-1 capacity in those regions.  Increases beyond that,
or sharding into additional regions, will require additional DPP <->
AWS work to get limits raised.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1752370#c3
[2]: https://issues.redhat.com/browse/DPP-3108
@openshift-ci-robot (Contributor) commented:

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name                                                                 Commit   Details  Rerun command
ci/rehearse/openshift/cloud-credential-operator/master/e2e-gcp            b717933  link     /test pj-rehearse
ci/rehearse/openshift/cluster-api-actuator-pkg/master/e2e-azure-operator  b717933  link     /test pj-rehearse
ci/rehearse/openshift/cloud-credential-operator/master/e2e-azure          b717933  link     /test pj-rehearse
ci/prow/pj-rehearse                                                       b717933  link     /test pj-rehearse


@@ -442,6 +440,16 @@ objects:
  elif has_variant "large"; then
    master_type=m5.4xlarge
  fi
  case $((RANDOM % 4)) in
  0) AWS_REGION=us-east-1
     ZONE_1=us-east-1b
Contributor commented:

Why are we doing this?

Contributor commented:

Seems premature, but ok.

Member Author commented:

> Seems premature, but ok.

It is premature, but I wanted folks to be able to see the pattern without having to dig through commit/PR logs. See "I have nothing against the other us-east-1 zones at the moment..." in the PR topic and commit message ;).

@wking (Member Author) commented Jan 23, 2020

CRC rehearsal had:

AWS region: us-east-2 (zones: us-east-2a us-east-2b)

and a successful cleanup, e.g.:

time="2020-01-23T15:35:31Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-2:460538899914:vpc/vpc-0c5b38c6d27eba1a8" id=vpc-0c5b38c6d27eba1a8

/hold cancel

openshift-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Jan 23, 2020
@smarterclayton (Contributor) commented:

/lgtm

openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Jan 23, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking


openshift-merge-robot merged commit 7ce78ce into openshift:master on Jan 23, 2020
@openshift-ci-robot (Contributor)

@wking: Updated the following 6 configmaps:

  • prow-job-cluster-launch-installer-src configmap in namespace ci-stg at cluster default using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci at cluster ci/api-build01-ci-devcluster-openshift-com:6443 using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci at cluster default using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg at cluster default using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci at cluster ci/api-build01-ci-devcluster-openshift-com:6443 using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci at cluster default using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml


wking added a commit to wking/openshift-release that referenced this pull request Jan 23, 2020
ci-operator/templates/openshift/installer: Shared subnets for new regions

Catching up with b717933
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random AWS regions for IPI, 2020-01-23, openshift#6833), add shared subnets for
the new regions.  This is basically as described in 424a04a (Add
shared vpc CI job for installer repo and informing periodics,
2019-08-22, openshift#5550), although I've dropped the --stack-policy-body
mentioned in that commit message (I dunno what it was about, since the
commit message didn't give it the argument that --stack-policy-body
expects).  Generated using:

  $ export AWS_PROFILE=ci  # or whatever you call it locally
  $ git fetch origin
  $ date --iso=m --utc
  2020-01-23T20:41+0000
  $ git checkout origin/release-4.3
  $ git --no-pager log --oneline -1
  2055609f9 (HEAD, origin/release-4.3) Merge pull request openshift#2928 from ashcrow/4.3-signed-rhcos-bump

with:

  # Create shared-VPC stacks in the regions newly added by openshift#6833
  # (us-east-1 already has its stacks; the output below starts at us-east-2).
  for REGION in us-east-2 us-west-1 us-west-2
  do
    # Our CI account only has access to two zones in us-west-1 (see below).
    COUNT=3
    if test us-west-1 = "${REGION}"
    then
      COUNT=2
    fi
    for INDEX in 1 2 3 4
    do
      NAME="do-not-delete-shared-vpc-${INDEX}"
      aws --region "${REGION}" cloudformation create-stack --stack-name "${NAME}" --template-body "$(cat upi/aws/cloudformation/01_vpc.yaml)" --parameters "ParameterKey=AvailabilityZoneCount,ParameterValue=${COUNT}" >/dev/null
      aws --region "${REGION}" cloudformation wait stack-create-complete --stack-name "${NAME}"
      SUBNETS="$(aws --region "${REGION}" cloudformation describe-stacks --stack-name "${NAME}" | jq -c '[.Stacks[].Outputs[] | select(.OutputKey | endswith("SubnetIds")).OutputValue | split(",")[]]' | sed "s/\"/'/g")"
      echo "${REGION}_$((INDEX - 1))) subnets=\"${SUBNETS}\";;"
    done
  done

which spits out:

  us-east-2_0) subnets="['subnet-0faf6d16c378ee7a7','subnet-0e104572db1b7d092','subnet-014ca96c04f36adec','subnet-0ea06057dceadfe8e','subnet-0689efe5e1f9f4212','subnet-0d36bb8edbcb3d916']";;
  us-east-2_1) subnets="['subnet-085787cc4b80b84b2','subnet-09dfbf66e8f6e5b50','subnet-0db5d90ff3087444e','subnet-047f15f2a0210fbe0','subnet-0bf13f041c4233849','subnet-0e2a5320549e289d8']";;
  ...

The COUNT bit for us-west-1 is because our CI account only has access
to two zones there:

  $ aws --region us-west-1 ec2 describe-availability-zones --output text
  AVAILABILITYZONES  us-west-1  available  usw1-az3  us-west-1a
  AVAILABILITYZONES  us-west-1  available  usw1-az1  us-west-1b
wking added a commit to wking/ci-tools that referenced this pull request Jan 24, 2020
Bringing over a number of changes which have landed in
ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
as of openshift/release@c64d5d7d8f (Merge pull request #6845 from
wking/shared-subnets-for-other-regions, 2020-01-23).

One series was AWS region sharding:

* openshift/release@b7179335a3
  (ci-operator/templates/openshift/installer/cluster-launch-installer-*:
  Random AWS regions for IPI, 2020-01-23, openshift/release#6833).
* openshift/release@7e38260d25
  (ci-operator/templates/openshift/installer: Shared subnets for new
  regions, 2020-01-23, openshift/release#6845).

Another series was the password removal followed by a bunch of
pipefail fumbling ;)

* openshift/release@4847cb5477 (clean up install log output,
  2020-01-13, openshift/release#6692).
* openshift/release@5c6ca8a506 (templates: Use 'pipefail' so that grep
  doesn't mask install failures, 2020-01-15, openshift/release#6718).
* openshift/release@bee15b9fa8 (add -eou pipefail to remaining job
  templates, 2020-01-16, openshift/release#6738).
* openshift/release@07bd61d677 (Revert "add -eou pipefail to remaining
  job templates", 2020-01-17, openshift/release#6748).
* openshift/release@ca655477ca (Revert "Revert "add -eou pipefail to
  remaining job templates"", 2020-01-17, openshift/release#6750).
* openshift/release@9d7453156b (tolerate undefined env vars in
  teardown, 2020-01-17, openshift/release#6750).

And there was also:

* openshift/release@c6c2efb3fc (templates: Add ipv6 variant that
  triggers azure singlestack, 2020-01-22, openshift/release#6809).
* openshift/release@752455a47f (templates: fix check for unset
  variable, 2020-01-16, openshift/release#6723).
@wking wking deleted the aws-region-sharding branch January 25, 2020 21:08
wking added a commit to wking/openshift-release that referenced this pull request Apr 8, 2021
This was originally part of avoiding broken zones, see e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204) and b717933
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random AWS regions for IPI, 2020-01-23, openshift#6833).  But the installer has
had broken-zone avoidance since way back in
openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return
available zones, 2019-02-07, openshift/installer#1210).  I dunno how
reliably AWS sets 'state: impaired' and similar; it didn't seem to
protect us from e8921c3.  But we're getting ready to pivot to using
multiple AWS accounts, which creates two issues with hard-coding
zone names in the step:

1. References by name are not stable between accounts.  From the AWS
   docs [1]:

     To ensure that resources are distributed across the Availability
     Zones for a Region, we independently map Availability Zones to
     names for each AWS account. For example, the Availability Zone
     us-east-1a for your AWS account might not be the same location as
     us-east-1a for another AWS account.

   So "aah, us-east-1a is broken, let's use b and c instead" might
   apply to one account but not the other.  And the installer does not
   currently accept zone IDs.

2. References by name may not exist in other accounts.  From the AWS
   docs [1]:

     As Availability Zones grow over time, our ability to expand them
     can become constrained. If this happens, we might restrict you
     from launching an instance in a constrained Availability Zone
     unless you already have an instance in that Availability
     Zone. Eventually, we might also remove the constrained
     Availability Zone from the list of Availability Zones for new
     accounts. Therefore, your account might have a different number
     of available Availability Zones in a Region than another account.

   And it turns out that for some reason they sometimes don't name
   sequentially, e.g. our new account lacks us-west-1a:

     $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1a usw1-az3 available
     us-west-1b usw1-az1 available
     $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1b usw1-az3 available
     us-west-1c usw1-az1 available

   I have no idea why they decided to do that, but we have to work
   with the world as it is ;).

Removing the us-east-1 overrides helps reduce our exposure, although
we are still vulnerable to (2) with the a/b default line.  We'll do
something about that in follow-up work.
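
As a sketch of that follow-up (not what this commit implements), the
step could discover which zones actually exist in the active account
at runtime instead of assuming ${AWS_REGION}a and ${AWS_REGION}b:

  # Hypothetical: take the first two zone names that are actually
  # available in this account/region, rather than assuming a and b exist.
  mapfile -t ZONES < <(aws --region "${AWS_REGION}" ec2 describe-availability-zones \
    --filters Name=state,Values=available \
    --query 'AvailabilityZones[].ZoneName' --output text | tr '\t' '\n' | sort)
  ZONE_1="${ZONE_1:-${ZONES[0]}}"
  ZONE_2="${ZONE_2:-${ZONES[1]}}"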

Leaving the "which zones?" decision up to the installer would cause it
to try to set up each available zone, and that causes more API
contention and resource consumption than we want.  Background on that
in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS
availability zones, 2019-03-28, openshift#3285) and d87fffb
(ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615),
as well as the rejected/rotted-out [2].
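
For concreteness, pinning the pools to the two chosen zones means the
install config carries explicit zone lists.  A minimal sketch using the
installer's documented install-config.yaml schema (the real step's
patching mechanics may differ, and /tmp/zones.yaml is just an
illustrative path, not something this repo defines):

  # Hypothetical fragment showing the zone pinning; a real step merges
  # this into the generated install-config.yaml rather than appending
  # duplicate top-level keys.
  cat > /tmp/zones.yaml <<EOF
  controlPlane:
    platform:
      aws:
        zones: [${ZONE_1}, ${ZONE_2}]
  compute:
  - name: worker
    platform:
      aws:
        zones: [${ZONE_1}, ${ZONE_2}]
  EOF

Limiting each job to two zones keeps its API calls and resource counts
bounded, which is the contention concern described above.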

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2]: openshift/installer#1487