ci-operator/templates/openshift: Explicitly set AWS availability zones #3285

Merged

Commits on Mar 28, 2019

  1. ci-operator/templates/openshift: Explicitly set AWS availability zones

    This is very similar to the earlier e8921c3
    (ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
    2019-03-22, openshift#3204).  This time, however, I'm not changing the zones
    where the machines will run.  By default, the installer will
    provision zone infrastructure in all available zones, but since
    openshift/installer@644f705286 (data/aws/vpc: Only create subnet
    infrastucture for zones with Machine(Set)s, 2019-03-27,
    openshift/installer#1481), users who explicitly set zones in their
    install-config will no longer have unused zones provisioned with
    subnets, NAT gateways, EIPs, and other related infrastructure.  This
    infrastructure reduction has two benefits in CI:
    
    1. We don't have to pay for resources that we won't use, and we will
       have more room under our EIP limits (although we haven't bumped
       into that one in a while, because we're VPC-constrained).
    
    2. We should see reduced rates of clusters failing to install because of
       AWS rate limiting, with results like [1]:
    
         aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)
    
       The reduction is because:
    
       i. We'll be making fewer requests for these resources, because we
          won't need to create (and subsequently tear down) as many of
          them.  This will reduce our overall AWS-API load somewhat,
          although the reduction will be incremental because we have so
          many other resources which are not associated with zones.
    
       ii. Throttling on these per-zone resources is what tends to break
            Terraform [2].  So even if the rate of timeouts per API
            request remains unchanged, a given cluster will only have
            half as many (three vs. the old six) per-zone chances of
            hitting one of the timeouts.  This should give us something
            close to a 50% reduction in clusters hitting throttling
            timeouts (see the arithmetic sketch after this list).
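
    To put rough numbers on that last point: if each per-zone resource
    independently hits a throttling timeout with some small probability
    (the 5% below is purely an illustrative guess, not a measured rate),
    then cutting the number of per-zone chances from six to three
    roughly halves the odds that a given cluster hits at least one:

         $ for n in 6 3; do echo "1 - 0.95^$n" | bc -l; done
         .264908109375
         .142625

    The drop is not exactly 50% because the per-attempt survival
    probabilities compound, but for small rates it lands close.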
    
    The drawback is that we're diverging further from the stock "I just
    called 'openshift-install create cluster' without providing an
    install-config.yaml" experience.
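
    For context, explicitly pinning zones means the generated
    install-config.yaml ends up carrying something along these lines (a
    minimal sketch; the zone names and replica counts here are
    illustrative rather than the exact values our templates set):

         compute:
         - name: worker
           platform:
             aws:
               zones:
               - us-east-1a
               - us-east-1c
               - us-east-1d
           replicas: 3
         controlPlane:
           name: master
           platform:
             aws:
               zones:
               - us-east-1a
               - us-east-1c
               - us-east-1d
           replicas: 3
         platform:
           aws:
             region: us-east-1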
    
    [1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log
    [2]: With a cache of build-log.txt from the past ~48 hours:
    
         $ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
         $ wc -l timeouts
         362 timeouts
         $ grep aws_route_table_association timeouts | wc -l
         214
         $ grep 'aws_route\.to_nat_gw' timeouts | wc -l
         102
    
         So per-zone resources account for (102+214)/362, roughly 87%, of
         our timeouts, with the remainder being almost entirely related to
         the internet gateway (which is not per-zone).
    wking committed Mar 28, 2019 (commit 51c4a37)