
ci-operator/templates/openshift: Drop us-east-1c #3615

Merged

Conversation

wking
Member

@wking commented Apr 26, 2019


We're currently hitting a lot of these.  Over the past 24 hours [1]:

  $ curl -s 'http://localhost:8000/search?name=-e2e-aws&.&q=level%3Derror.*timeout+while+waiting+for+state' | jq -r '. | to_entries[].value[] | to_entries[].value[]' | sed 's/(i-[^)]*/(i-.../;s/(igw-[^)]*/(igw-.../;s/\(master\|nat_gw\|private_routing\|route_net\)\.[0-9]/\1.../' | sort | uniq -c | sort -n
       2 level=error msg="\t* aws_instance.master...: Error waiting for instance (i-...) to become ready: timeout while waiting for state
      10 level=error msg="\t* aws_security_group.bootstrap: timeout while waiting for state
      38 level=error msg="\t* aws_route.igw_route: Error creating route: timeout while waiting for state
      58 level=error msg="\t* aws_internet_gateway.igw: error attaching EC2 Internet Gateway (igw-...): timeout while waiting for state
      76 level=error msg="\t* aws_route_table_association.private_routing...: timeout while waiting for state
      90 level=error msg="\t* aws_route_table_association.route_net...: timeout while waiting for state
     164 level=error msg="\t* aws_route.to_nat_gw...: Error creating route: timeout while waiting for state

Dropping to two zones will reduce our API load for the per-subnet
to_nat_gw routes and route-table associations, which are our leading
breakages.

Generated with:

  $ sed -i '/us-east-1c/d' $(git grep -l us-east-1c)

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/d3
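
As a rough sketch of the arithmetic behind that claim: the indexed resource names above (to_nat_gw..., private_routing..., route_net...) suggest one NAT-gateway route and two route-table associations per zone, so cutting from three zones to two trims the zone-scaled resources by a third. Assuming that per-zone ratio (an inference from the error names, not a measured count):

  $ for zones in 3 2; do echo "$zones zones: $zones to_nat_gw routes, $((2 * zones)) route-table associations"; done
  3 zones: 3 to_nat_gw routes, 6 route-table associations
  2 zones: 2 to_nat_gw routes, 4 route-table associations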
@openshift-ci-robot added the size/XS label (Denotes a PR that changes 0-9 lines, ignoring generated files) Apr 26, 2019
@smarterclayton
Contributor

Hrm...

@smarterclayton
Contributor

@derekwaynecarr can you think of any implications of us only testing in two AZs that would impact us in the long run? I can't think of many.

@droslean
Member

/test pj-rehearse

1 similar comment
@droslean
Member

/test pj-rehearse

@wking
Member Author

wking commented May 1, 2019

e2e-aws:

2019/04/29 14:07:12 Container setup in pod e2e-aws completed successfully
failed to open log file "/var/log/pods/2940f26b-6a84-11e9-b219-42010a8e0002/test/0.log": open /var/log/pods/2940f26b-6a84-11e9-b219-42010a8e0002/test/0.log: no such file or directory
2019/04/29 15:04:06 Container test in pod e2e-aws failed, exit code 1, reason Error
2019/04/29 15:04:14 Container artifacts in pod e2e-aws completed successfully
2019/04/29 15:04:14 Container teardown in pod e2e-aws completed successfully
2019/04/29 15:04:14 error: unable to gather container logs: [error: Unable to retrieve logs from pod container artifacts: pods "e2e-aws" not found, error: Unable to retrieve logs from pod container setup: pods "e2e-aws" not found, error: Unable to retrieve logs from pod container teardown: pods "e2e-aws" not found, error: Unable to retrieve logs from pod container test: pods "e2e-aws" not found]
2019/04/29 15:04:14 error: unable to signal to artifacts container to terminate in pod e2e-aws, triggering deletion: could not run remote command: pods "e2e-aws" is forbidden: pods "e2e-aws" not found
2019/04/29 15:04:14 error: unable to retrieve artifacts from pod e2e-aws: could not read gzipped artifacts: pods "e2e-aws" is forbidden: pods "e2e-aws" not found
2019/04/29 15:17:39 Ran for 1h43m41s
error: could not run steps: step e2e-aws failed: template pod "e2e-aws" failed: pod e2e-aws was already deleted

Dunno what that was about, but looks like a CI-cluster issue.

/retest

@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun them all:

Test name                                                                              Commit   Rerun command
ci/rehearse/operator-framework/operator-lifecycle-manager/master/e2e-aws-console-olm   d87fffb  /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-rhel-scaleup                                d87fffb  /test pj-rehearse
ci/prow/pj-rehearse                                                                    d87fffb  /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@smarterclayton
Contributor

/lgtm

Merging this as a short-term mitigation until rate limits are bumped.

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) May 1, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) May 1, 2019
@openshift-merge-robot merged commit 163fc1d into openshift:master May 1, 2019
@openshift-ci-robot
Contributor

@wking: Updated the following 8 configmaps:

  • prow-job-cluster-launch-installer-src configmap in namespace ci using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-scaleup-e2e-40 configmap in namespace ci using the following files:
    • key cluster-scaleup-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-scaleup-e2e-40.yaml
  • prow-job-cluster-scaleup-e2e-40 configmap in namespace ci-stg using the following files:
    • key cluster-scaleup-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-scaleup-e2e-40.yaml
  • prow-job-cluster-launch-installer-console configmap in namespace ci using the following files:
    • key cluster-launch-installer-console.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-console.yaml
  • prow-job-cluster-launch-installer-console configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-console.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-console.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

In response to this: the PR description quoted above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
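
A quick way to confirm one of those keys really dropped the zone, assuming read access to the ci namespace (the jsonpath key escape below illustrates the general oc syntax and is not output from the bot):

  $ oc -n ci get configmap prow-job-cluster-launch-installer-e2e -o 'jsonpath={.data.cluster-launch-installer-e2e\.yaml}' | grep -c us-east-1c   # expect 0 after this merge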

@wking deleted the two-availablity-zones-on-aws branch May 1, 2019 21:08
wking added a commit to wking/ci-operator that referenced this pull request May 2, 2019
…e changes

Catching up with the state as of openshift/release@f5fe3f5 (2019-05-01).

I've mostly used the release repo's cluster-launch-installer-e2e.yaml,
but I mixed in some code from
cluster-launch-installer-openstack-e2e.yaml because the template here
seemed to have OpenStack support already.

I've also bumped the AWS zones to include us-east-1c, which will mean
three-zone upgrade jobs launched from this config while the
release-templates will result in two-zone jobs since
openshift/release@d87fffb3aa (ci-operator/templates/openshift: Drop
us-east-1c, 2019-04-26, openshift/release#3615).
enxebre added a commit to enxebre/cluster-api-actuator-pkg that referenced this pull request May 2, 2019
Relax the expected number of machineSets and replicas based on openshift/installer#1698 (comment) and openshift/release#3615 to temporarily reduce CI cloud burden
@raghavendra-talur

The comments say that dropping the 3rd zone was a temporary measure. Are there any plans to bring back the 3rd zone in the near future?

I ask because ocs-operator is working on a feature that requires 3 failure domains and preferably wants them to be zones. It is certainly not a blocker, and we can use a different failure domain for our tests, but it would be nice to have a setup that is as close as possible to a real scenario.

wking added a commit to wking/openshift-release that referenced this pull request Apr 8, 2021
This was originally part of avoiding broken zones, see e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204) and b717933
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random AWS regions for IPI, 2020-01-23, openshift#6833).  But the installer has
had broken-zone avoidance since way back in
openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return
available zones, 2019-02-07, openshift/installer#1210).  I dunno how
reliably AWS sets 'state: impaired' and similar; it didn't seem to
protect us from e8921c3.  But we're getting ready to pivot to using
multiple AWS accounts, which creates two issues with hard-coding
region names in the step:

1. References by name are not stable between accounts.  From the AWS
   docs [1]:

     To ensure that resources are distributed across the Availability
     Zones for a Region, we independently map Availability Zones to
     names for each AWS account. For example, the Availability Zone
     us-east-1a for your AWS account might not be the same location as
     us-east-1a for another AWS account.

   So "aah, us-east-1a is broken, let's use b and c instead" might
   apply to one account but not the other.  And the installer does not
   currently accept zone IDs.

2. References by name may not exist in other accounts.  From the AWS
   docs [1]:

     As Availability Zones grow over time, our ability to expand them
     can become constrained. If this happens, we might restrict you
     from launching an instance in a constrained Availability Zone
     unless you already have an instance in that Availability
     Zone. Eventually, we might also remove the constrained
     Availability Zone from the list of Availability Zones for new
     accounts. Therefore, your account might have a different number
     of available Availability Zones in a Region than another account.

   And it turns out that for some reason they sometimes don't name
   sequentially, e.g. our new account lacks us-west-1a:

     $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1a usw1-az3 available
     us-west-1b usw1-az1 available
     $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1b usw1-az3 available
     us-west-1c usw1-az1 available

   I have no idea why they decided to do that, but we have to work
   with the world as it is ;).
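
For a concrete view of that name/ID skew, the same zone ID can be chased back to each account's local name with the two profiles from the output above; this loop is an illustrative sketch, not part of any step:

  $ for p in ci ci-2; do echo "$p: $(AWS_PROFILE=$p aws --region us-west-1 ec2 describe-availability-zones --zone-ids usw1-az3 --query 'AvailabilityZones[0].ZoneName' --output text)"; done
  ci: us-west-1a
  ci-2: us-west-1b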

Removing the us-east-1 overrides helps reduce our exposure, although
we are still vulnerable to (2) with the a/b default line.  We'll do
something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it
to try to set up each available zone, and that causes more API
contention and resource consumption than we want.  Background on that
in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS
availability zones, 2019-03-28, openshift#3285) and d87fffb
(ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615),
as well as the rejected/rotted-out [2].

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2]: openshift/installer#1487
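
By contrast, if a step did want to choose zones itself rather than hard-coding names, one hedged sketch in the style of the queries above is to take the first two available zone names for the region (the jq filter is illustrative, not the repo's actual step logic):

  $ aws --region "$REGION" ec2 describe-availability-zones | jq -r '[.AvailabilityZones[] | select(.State == "available").ZoneName][:2][]'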