
ci-operator/templates/openshift: Get e2e-aws out of us-east-1b #3204

Merged
merged 1 commit into openshift:master from get-out-of-us-east-1b on Mar 22, 2019

Conversation

@wking (Member) commented Mar 22, 2019

We're currently getting a lot of 500s there like:

time="2019-03-22T18:02:51Z" level=debug msg="2019-03-22T18:02:51.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/22 18:02:51 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/RunInstances Details:"
time="2019-03-22T18:02:51Z" level=debug msg="2019-03-22T18:02:51.255Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-03-22T18:02:51Z" level=debug msg="2019-03-22T18:02:51.255Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 500 Internal Server Error"

Looking at the results of this Athena query:

SELECT *
FROM "default"."cloudtrail_logs_cloud_trail_test_clayton"
WHERE from_iso8601_timestamp(eventtime) > date_add('hour', -4, now())
  AND eventname = 'RunInstances'
  AND errorcode = 'Server.InternalError'
ORDER BY eventtime;

They're all in 1b:

$ for ZONE in a b c d e f g; do echo "${ZONE} $(grep -c "us-east-1${ZONE}" c128e363-4791-4049-8936-aafc6e2e36dd.csv)"; done
a 0
b 181
c 0
d 0
e 0
f 0
g 0

This commit moves us to 1d instead.

@staebler, when I try this locally I get:

ERROR  * module.masters.aws_network_interface.master[1]: key "us-east-1c" does not exist in map var.az_to_subnet_id in:
ERROR
ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
ERROR  * module.masters.aws_network_interface.master[2]: key "us-east-1d" does not exist in map var.az_to_subnet_id in:
ERROR
ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
ERROR  * module.masters.aws_network_interface.master[0]: key "us-east-1a" does not exist in map var.az_to_subnet_id in:
ERROR
ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
ERROR  * module.bootstrap.var.subnet_id: key "us-east-1a" does not exist in map module.vpc.az_to_public_subnet_id in:
ERROR
ERROR ${module.vpc.az_to_public_subnet_id[var.aws_master_availability_zones[0]]}

Guesses?

We're currently getting a lot of 500s there like [1]:

  time="2019-03-22T18:02:51Z" level=debug msg="2019-03-22T18:02:51.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/22 18:02:51 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/RunInstances Details:"
  time="2019-03-22T18:02:51Z" level=debug msg="2019-03-22T18:02:51.255Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
  time="2019-03-22T18:02:51Z" level=debug msg="2019-03-22T18:02:51.255Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 500 Internal Server Error"

Looking at the results of this Athena query:

  SELECT *
  FROM "default"."cloudtrail_logs_cloud_trail_test_clayton"
  WHERE from_iso8601_timestamp(eventtime) > date_add('hour', -4, now())
    AND eventname = 'RunInstances'
    AND errorcode = 'Server.InternalError'
  ORDER BY eventtime;

They're all in 1b:

  $ for ZONE in a b c d e f g; do echo "${ZONE} $(grep -c "us-east-1${ZONE}" c128e363-4791-4049-8936-aafc6e2e36dd.csv)"; done
  a 0
  b 181
  c 0
  d 0
  e 0
  f 0
  g 0

This commit moves us to 1d instead.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-lifecycle-manager/777/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1365/artifacts/e2e-aws-olm/installer/.openshift_install.log
@wking (Member, Author) commented Mar 22, 2019

/hold

@openshift-ci-robot added the do-not-merge/hold and size/M labels on Mar 22, 2019
@wking (Member, Author) commented Mar 22, 2019

CC @vrutkovs, @sdodson too, since I'm touching the Ansible stuff.

@wking (Member, Author) commented Mar 22, 2019

/hold cancel

This is working for @abhinavdahiya. My error was forgetting to change my usual platform.aws.region from us-west-2 to us-east-1 🤦‍♂️. After this PR lands I'll file a new validation check ;).
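
For reference, a rough sketch of the kind of pre-flight check such validation could do, using the AWS CLI to confirm each requested zone exists in the configured region (REGION and ZONES below are illustrative placeholders, not the installer's actual inputs):

  #!/usr/bin/env bash
  # Hedged sketch, not the installer's real validation: fail early if a
  # requested availability zone is not part of the configured region.
  set -euo pipefail

  REGION="${REGION:-us-east-1}"               # e.g. platform.aws.region
  ZONES=(us-east-1a us-east-1c us-east-1d)    # e.g. zones from the install-config

  available="$(aws --region "${REGION}" ec2 describe-availability-zones \
    --query 'AvailabilityZones[].ZoneName' --output text)"

  for zone in "${ZONES[@]}"; do
    if ! grep -qw "${zone}" <<<"${available}"; then
      echo "zone ${zone} does not exist in region ${REGION}" >&2
      exit 1
    fi
  done
  echo "all requested zones exist in ${REGION}"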

@openshift-ci-robot removed the do-not-merge/hold label on Mar 22, 2019
@sdodson (Member) commented Mar 22, 2019

Over what time period has this been an issue?

@smarterclayton (Contributor) commented

/lgtm

Also, we need to start moving these out of the template and into a profile. As it is right now, you'd also have to change ci-operator in order to fix the upgrade jobs, but we can let those fail for a bit.

@openshift-ci-robot added the lgtm and approved labels on Mar 22, 2019
@sdodson (Member) commented Mar 22, 2019

/approve

@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 55fc3ec into openshift:master on Mar 22, 2019
@openshift-ci-robot (Contributor) commented

@wking: Updated the following 8 configmaps:

  • prow-job-cluster-launch-e2e-40 configmap in namespace ci using the following files:
    • key cluster-launch-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-launch-e2e-40.yaml
  • prow-job-cluster-launch-e2e-40 configmap in namespace ci-stg using the following files:
    • key cluster-launch-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-launch-e2e-40.yaml
  • prow-job-cluster-scaleup-e2e-40 configmap in namespace ci using the following files:
    • key cluster-scaleup-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-scaleup-e2e-40.yaml
  • prow-job-cluster-scaleup-e2e-40 configmap in namespace ci-stg using the following files:
    • key cluster-scaleup-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-scaleup-e2e-40.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml

In response to this (the PR description, quoted above).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the get-out-of-us-east-1b branch on March 22, 2019, 19:42
wking added a commit to wking/openshift-release that referenced this pull request Mar 22, 2019
To make it easier to distinguish between "everything else is good" and
"this is everything, and we're completely broken" when you don't see a
lot of non-matching failures in a given time range.  To support this,
I'm saving the JSON with all the jobs regardless of state (but still
filtered by job name) into $CACHE/jobs.json.  I've also transitioned
to using a legend to explain the colors, although (at least in my RHEL
7.5's ancient, stock v1.2) that requires a bit of hoop-jumping.

To reflect the changes, I've also rebuilt the examples to show the
us-east-1b RunInstances outage AWS just surprised us with [1] ;).

[1]: openshift#3204
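
A minimal sketch of that kind of job-name filtering over the cached file; the .items[].spec.job and .status.state paths are assumptions about the JSON layout, not the script's confirmed schema:

  # Count states for jobs whose name matches e2e-aws, from $CACHE/jobs.json.
  # Field paths below are assumed (ProwJob-style), not verified.
  jq -r '.items[]
         | select(.spec.job | test("e2e-aws"))
         | .status.state' "${CACHE}/jobs.json" | sort | uniq -c
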
@wking (Member, Author) commented Mar 23, 2019

Over what time period has this been an issue?

Sorry, missed this in the excitement.

[embedded chart: run-instances]

wking added a commit to wking/openshift-installer that referenced this pull request Mar 26, 2019
In the excitement of a zone outage, this one bit me [1], leading to
errors like [2]:

  ERROR  * module.masters.aws_network_interface.master[1]: key "us-east-1c" does not exist in map var.az_to_subnet_id in:
  ERROR
  ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
  ERROR  * module.masters.aws_network_interface.master[2]: key "us-east-1d" does not exist in map var.az_to_subnet_id in:
  ERROR
  ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
  ERROR  * module.masters.aws_network_interface.master[0]: key "us-east-1a" does not exist in map var.az_to_subnet_id in:
  ERROR
  ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
  ERROR  * module.bootstrap.var.subnet_id: key "us-east-1a" does not exist in map module.vpc.az_to_public_subnet_id in:
  ERROR
  ERROR ${module.vpc.az_to_public_subnet_id[var.aws_master_availability_zones[0]]}

With this commit, we require zones in the configured region before
getting into Terraform, and we describe the mismatch with words that
are hopefully more obvious to jumpy users ;).

[1]: openshift/release#3204 (comment)
[2]: openshift/release#3204 (comment)
wking added a commit to wking/openshift-release that referenced this pull request Mar 28, 2019
This is very similar to the earlier e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204).  This time, however, I'm not changing the zones
where the machines will run.  By default, the installer will
provision zone infrastructure in all available zones, but since
openshift/installer@644f705286 (data/aws/vpc: Only create subnet
infrastucture for zones with Machine(Set)s, 2019-03-27,
openshift/installer#1481) users who explicitly set zones in their
install-config will no longer have unused zones provisioned with
subnets, NAT gateways, EIPs, and other related infrastructure.  This
infrastructure reduction has two benefits in CI:

1. We don't have to pay for resources that we won't use, and we will
   have more room under our EIP limits (although we haven't bumped
   into that one in a while, because we're VPC-constrained).

2. We should see fewer clusters failing to install because of
   AWS rate limiting, with results like [1]:

     aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)

   The reduction is because:

   i. We'll be making fewer requests for these resources, because we
      won't need to create (and subsequently tear down) as many of
      them.  This will reduce our overall AWS-API load somewhat,
      although the reduction will be incremental because we have so
      many other resources which are not associated with zones.

   ii. Throttling on these per-zone resources is what tends to break
        Terraform [2].  So even if the rate of timeouts
       per-API request remains unchanged, a given cluster will only
       have half as many (three vs. the old six) per-zone chances of
       hitting one of the timeouts.  This should give us something
       close to a 50% reduction in clusters hitting throttling
       timeouts.

The drawback is that we're diverging further from the stock "I just
called 'openshift-install create cluster' without providing an
install-config.yaml" experience.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log
[2]: With a cache of build-log.txt from the past ~48 hours:

     $ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
     $ wc -l timeouts
     362 timeouts
     $ grep aws_route_table_association timeouts | wc -l
     214
     $ grep 'aws_route\.to_nat_gw' timeouts | wc -l
     102

     So (102+214)/362 is 87% of our timeouts, with the remainder being
     almost entirely related to the internet gateway (which is not
     per-zone).
vrutkovs pushed a commit to vrutkovs/installer that referenced this pull request Apr 1, 2019
In the excitement of a zone outage, this one bit me [1], leading to
errors like [2]:

  ERROR  * module.masters.aws_network_interface.master[1]: key "us-east-1c" does not exist in map var.az_to_subnet_id in:
  ERROR
  ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
  ERROR  * module.masters.aws_network_interface.master[2]: key "us-east-1d" does not exist in map var.az_to_subnet_id in:
  ERROR
  ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
  ERROR  * module.masters.aws_network_interface.master[0]: key "us-east-1a" does not exist in map var.az_to_subnet_id in:
  ERROR
  ERROR ${var.az_to_subnet_id[var.availability_zones[count.index]]}
  ERROR  * module.bootstrap.var.subnet_id: key "us-east-1a" does not exist in map module.vpc.az_to_public_subnet_id in:
  ERROR
  ERROR ${module.vpc.az_to_public_subnet_id[var.aws_master_availability_zones[0]]}

With this commit, we require zones in the configured region before
getting into Terraform, and we describe the mismatch with words that
are hopefully more obvious to jumpy users ;).

[1]: openshift/release#3204 (comment)
[2]: openshift/release#3204 (comment)
wking added a commit to wking/openshift-release that referenced this pull request Jan 23, 2020
… Random AWS regions for IPI

I'm leaving the UPI templates and such alone for now, because they
mention regions in more places and I don't care as much about them ;).
The approach mirrors what we did for Azure with 4d08a9d
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random Azure regions, 2019-09-18, openshift#5081).  This should reduce resource
contention (both for limits and for API throttling) in CI.  We could
theoretically raise our Boskos caps as well, but I'm leaving that off
for now because we're getting hammered by AMI-copy issues [1].

I've also added ZONE_* variables in case we need to dodge broken zones
like we have in the past with e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204).  I have nothing against the other us-east-1 zones
at the moment, the current ZONE_* overrides there are just to
demonstrate the idea in case folks need to pivot later.  Without
explicit ZONE_* choices, we'll fall back to using zones a and b within
the chosen region.

The DevProductivity Platform (DPP) folks bumped our quotas in the new
regions back in 2019-11 [2].  We should be good for up to 25% of our
current us-east-1 capacity in those regions.  Increases beyond that,
or sharding into additional regions, will require additional DPP <->
AWS work to get limits raised.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1752370#c3
[2]: https://issues.redhat.com/browse/DPP-3108
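
A sketch of the zone-fallback behavior described above; the region pool here is hypothetical, but the "use ZONE_* overrides if set, else <region>a and <region>b" logic is what the commit describes:

  # Illustrative only: choose a region, then default the zones unless
  # ZONE_1/ZONE_2 were set explicitly.
  REGIONS=(us-east-1 us-east-2 us-west-1 us-west-2)   # hypothetical pool
  AWS_REGION="${AWS_REGION:-${REGIONS[RANDOM % ${#REGIONS[@]}]}}"
  ZONE_1="${ZONE_1:-${AWS_REGION}a}"
  ZONE_2="${ZONE_2:-${AWS_REGION}b}"
  echo "region=${AWS_REGION} zones=${ZONE_1},${ZONE_2}"
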
wking added a commit to wking/openshift-release that referenced this pull request Apr 8, 2021
This was originally part of avoiding broken zones, see e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204) and b717933
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random AWS regions for IPI, 2020-01-23, openshift#6833).  But the installer has
had broken-zone avoidance since way back in
openshift/installer@71aef620b6 (pkg/asset/machines/aws: Only return
available zones, 2019-02-07, openshift/installer#1210).  I dunno how
reliably AWS sets 'state: impaired' and similar; it didn't seem to
protect us from e8921c3.  But we're getting ready to pivot to using
multiple AWS accounts, which creates two issues with hard-coding
region names in the step:

1. References by name are not stable between accounts.  From the AWS
   docs [1]:

     To ensure that resources are distributed across the Availability
     Zones for a Region, we independently map Availability Zones to
     names for each AWS account. For example, the Availability Zone
     us-east-1a for your AWS account might not be the same location as
     us-east-1a for another AWS account.

   So "aah, us-east-1a is broken, let's use b and c instead" might
   apply to one account but not the other.  And the installer does not
   currently accept zone IDs.

2. References by name may not exist in other accounts.  From the AWS
   docs [1]:

     As Availability Zones grow over time, our ability to expand them
     can become constrained. If this happens, we might restrict you
     from launching an instance in a constrained Availability Zone
     unless you already have an instance in that Availability
     Zone. Eventually, we might also remove the constrained
     Availability Zone from the list of Availability Zones for new
     accounts. Therefore, your account might have a different number
     of available Availability Zones in a Region than another account.

   And it turns out that for some reason they sometimes don't assign
   names sequentially, e.g. our new account lacks us-west-1a:

     $ AWS_PROFILE=ci aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1a usw1-az3 available
     us-west-1b usw1-az1 available
     $ AWS_PROFILE=ci-2 aws --region us-west-1 ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | .ZoneName + " " + .ZoneId + " " + .State' | sort
     us-west-1b usw1-az3 available
     us-west-1c usw1-az1 available

   I have no idea why they decided to do that, but we have to work
   with the world as it is ;).

Removing the us-east-1 overrides helps reduce our exposure, although
we are still vulnerable to (2) with the a/b default line.  We'll do
something about that in follow-up work.

Leaving the "which zones?" decision up to the installer would cause it
to try to set up each available zone, and that causes more API
contention and resource consumption than we want.  Background on that
in 51c4a37 (ci-operator/templates/openshift: Explicitly set AWS
availability zones, 2019-03-28, openshift#3285) and d87fffb
(ci-operator/templates/openshift: Drop us-east-1c, 2019-04-26, openshift#3615),
as well as the rejected/rotted-out [2].

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
[2]: openshift/installer#1487
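
Since the zone IDs (usw1-az1, usw1-az3, ...) are the stable cross-account handles while the names are per-account, one way to resolve an ID to whatever the current account calls it is to filter describe-availability-zones on zone-id; a hedged example built on the same command as above:

  # Map the stable zone ID usw1-az3 to this account's zone name.
  aws --region us-west-1 ec2 describe-availability-zones \
    --filters Name=zone-id,Values=usw1-az3 \
    --query 'AvailabilityZones[].ZoneName' --output text
  # With the "ci" profile above this prints us-west-1a; with "ci-2", us-west-1b.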