This is very similar to the earlier e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204). This time, however, I'm not changing the zones
where the machines will run. By default, the installer will
provision zone infrastructure in all available zones, but since
openshift/installer@644f705286 (data/aws/vpc: Only create subnet
infrastucture for zones with Machine(Set)s, 2019-03-27,
openshift/installer#1481) users who explicitly set zones in their
install-config will no longer have unused zones provisioned with
subnets, NAT gateways, EIPs, and other related infrastructure. This
infrastructure reduction has two benefits in CI:
1. We don't have to pay for resources that we won't use, and we will
have more room under our EIP limits (although we haven't bumped
into that one in a while, because we're VPC-constrained).
2. We should see reduced rates in clusters failing install because of
AWS rate limiting, with results like [1]:
aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)
The reduction is because:
i. We'll be making fewer requests for these resources, because we
won't need to create (and subsequently tear down) as many of
them. This will reduce our overall AWS-API load somewhat,
although the reduction will be incremental because we have so
many other resources which are not associated with zones.
ii. Throttles on these per-zone resources are the ones that tend
to break Terraform [2]. So even if the rate of timeouts
per-API request remains unchanged, a given cluster will only
have half as many (three vs. the old six) per-zone chances of
hitting one of the timeouts. This should give us something
close to a 50% reduction in clusters hitting throttling
timeouts.
The drawback is that we're diverging further from the stock "I just
called 'openshift-install create cluster' without providing an
install-config.yaml" experience.
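For context, restricting zones in the install-config looks roughly
like the fragment below. This is only a sketch: the exact field
layout depends on the install-config version in use, and the zone
names are illustrative rather than the ones our templates actually
set.
  controlPlane:
    name: master
    platform:
      aws:
        zones:  # only these zones get subnets, NAT gateways, EIPs, etc.
        - us-east-1c
        - us-east-1d
        - us-east-1e
  compute:
  - name: worker
    platform:
      aws:
        zones:
        - us-east-1c
        - us-east-1d
        - us-east-1e
  platform:
    aws:
      region: us-east-1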
[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log
[2]: With a cache of build-log.txt from the past ~48 hours:
$ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
$ wc -l timeouts
362 timeouts
$ grep aws_route_table_association timeouts | wc -l
214
$ grep 'aws_route\.to_nat_gw' timeouts | wc -l
102
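As a quick sanity check on the combined fraction (the bc invocation
is just illustrative):
$ echo 'scale=2; (102 + 214) / 362' | bc
.87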
So (102+214)/362 is 87% of our timeouts, with the remainder being
almost entirely related to the internet gateway (which is not
per-zone).