ci-operator/templates/openshift: Explicitly set AWS availability zones #3285

Merged

Commits on Mar 28, 2019

  1. ci-operator/templates/openshift: Explicitly set AWS availability zones

    This is very similar to the earlier e8921c3
    (ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
    2019-03-22, openshift#3204).  This time, however, I'm not changing the zones
    where the machines will run.  By default, the installer will
    provision zone infrastructure in all available zones, but since
    openshift/installer@644f705286 (data/aws/vpc: Only create subnet
    infrastucture for zones with Machine(Set)s, 2019-03-27,
    openshift/installer#1481), users who explicitly set zones in their
    install-config will no longer have unused zones provisioned with
    subnets, NAT gateways, EIPs, and other related infrastructure.  This
    infrastructure reduction has two benefits in CI:
    
    1. We don't have to pay for resources that we won't use, and we will
       have more room under our EIP limits (although we haven't bumped
       into that one in a while, because we're VPC-constrained).
    
    2. We should see reduced rates of clusters failing to install because of
       AWS rate limiting, with results like [1]:
    
         aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)
    
       The reduction is because:
    
       i. We'll be making fewer requests for these resources, because we
          won't need to create (and subsequently tear down) as many of
          them.  This will reduce our overall AWS-API load somewhat,
          although the reduction will be incremental because we have so
          many other resources which are not associated with zones.
    
       ii. Throttling on these per-zone resources is what tends to break
            Terraform [2].  So even if the rate of timeouts per API
            request remains unchanged, a given cluster will only have
            half as many (three vs. the old six) per-zone chances of
            hitting one of the timeouts.  This should give us something
            close to a 50% reduction in clusters hitting throttling
            timeouts (see the arithmetic sketch after this list).
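
    To put rough numbers on that last point: if each per-zone resource
    independently hits a throttling timeout with some small probability
    (the 5% below is purely an illustrative guess, not a measured rate),
    then cutting the number of per-zone chances from six to three
    roughly halves the odds that a given cluster hits at least one:

         $ for n in 6 3; do echo "1 - 0.95^$n" | bc -l; done
         .264908109375
         .142625

    The drop is not exactly 50% because the per-attempt survival
    probabilities compound, but for small rates it lands close.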
    
    The drawback is that we're diverging further from the stock "I just
    called 'openshift-install create cluster' without providing an
    install-config.yaml" experience.
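
    For context, explicitly pinning zones means the generated
    install-config.yaml ends up carrying something along these lines (a
    minimal sketch; the zone names and replica counts here are
    illustrative rather than the exact values our templates set):

         compute:
         - name: worker
           platform:
             aws:
               zones:
               - us-east-1a
               - us-east-1c
               - us-east-1d
           replicas: 3
         controlPlane:
           name: master
           platform:
             aws:
               zones:
               - us-east-1a
               - us-east-1c
               - us-east-1d
           replicas: 3
         platform:
           aws:
             region: us-east-1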
    
    [1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log
    [2]: With a cache of build-log.txt from the past ~48 hours:
    
         $ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
         $ wc -l timeouts
         362 timeouts
         $ grep aws_route_table_association timeouts | wc -l
         214
         $ grep 'aws_route\.to_nat_gw' timeouts | wc -l
         102
    
         So per-zone resources account for (102+214)/362, roughly 87%, of
         our timeouts, with the remainder being almost entirely related to
         the internet gateway (which is not per-zone).
    wking committed Mar 28, 2019 (commit 51c4a37)