
pkg/asset/machines/worker: Structured workers #1481

Merged: 4 commits merged into openshift:master on Mar 28, 2019

Conversation

@wking (Member) commented Mar 27, 2019

Building on #1150 (which will hopefully squeak through CI any moment now), this pull request structures our Worker asset (as 7a396d9, #1211, did for Master). This is groundwork for extracting configured availability zones when generating AWS Terraform variables, which will in turn allow us to avoid provisioning availability zones unless we are going to populate them with machines.
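(A minimal sketch of the idea, for illustration only; the type and field names below are stand-ins, not the installer's actual code. The point of structuring the asset is that downstream consumers, like the AWS Terraform variables asset, can read decoded objects directly instead of re-parsing rendered YAML files.)

package main

import "fmt"

// Stand-in for the AWS provider config carried by each MachineSet.
type awsProviderSpec struct {
	AvailabilityZone string
}

type machineSet struct {
	Name     string
	Provider awsProviderSpec
}

// worker keeps the decoded MachineSets on the asset (alongside the
// rendered files), so later assets can pull structured fields from them.
type worker struct {
	MachineSets []machineSet
}

// availabilityZones returns the distinct zones used by the worker
// MachineSets; this is the list the AWS Terraform variables need.
func (w *worker) availabilityZones() []string {
	seen := map[string]bool{}
	var zones []string
	for _, ms := range w.MachineSets {
		if !seen[ms.Provider.AvailabilityZone] {
			seen[ms.Provider.AvailabilityZone] = true
			zones = append(zones, ms.Provider.AvailabilityZone)
		}
	}
	return zones
}

func main() {
	w := &worker{MachineSets: []machineSet{
		{Name: "worker-a", Provider: awsProviderSpec{AvailabilityZone: "us-east-1a"}},
		{Name: "worker-b", Provider: awsProviderSpec{AvailabilityZone: "us-east-1b"}},
	}}
	fmt.Println(w.availabilityZones()) // [us-east-1a us-east-1b]
}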

openshift-ci-robot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels on Mar 27, 2019.
@wking (Member Author) commented Mar 27, 2019

e2e-aws passed for c0bc4dc, and I've just pushed c0bc4dc -> 009963e, adding commits that push this into Terraform and remove infrastructure from subnets that are unused at install time.

openshift-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) label and removed the size/XXL label on Mar 28, 2019.
wking force-pushed the structured-workers branch 3 times, most recently from ba76eea to 8516a53, on March 28, 2019 04:10.
@wking (Member Author) commented Mar 28, 2019

I've pushed 009963e -> 8516a53, rebasing onto master and fixing a segfault. Turns out []*File -> Machine conversion doesn't work on MachineSets ;). I've dropped the old 97c2d40 and gone back to structuring methods now that I realize I won't be able to share the implementation.
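(A hedged illustration of that class of crash, not the actual installer code; all names below are stand-ins. A Machine carries its provider config at spec.providerSpec, while a MachineSet nests it under spec.template.spec, so decoding MachineSet YAML into a Machine-shaped struct leaves the provider-spec pointer nil, and an unguarded dereference then panics.)

package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2" // assumed decoder, for illustration
)

type providerSpec struct {
	Value string `yaml:"value"`
}

// machineSpec carries the provider config directly...
type machineSpec struct {
	ProviderSpec *providerSpec `yaml:"providerSpec"`
}

type machine struct {
	Spec machineSpec `yaml:"spec"`
}

func main() {
	// ...but a MachineSet nests it one level deeper, under
	// spec.template.spec, so none of the Machine fields line up.
	machineSetYAML := `
spec:
  template:
    spec:
      providerSpec:
        value: aws-config
`
	var m machine
	if err := yaml.Unmarshal([]byte(machineSetYAML), &m); err != nil {
		panic(err)
	}
	if m.Spec.ProviderSpec == nil {
		// Without this guard, m.Spec.ProviderSpec.Value would be a
		// nil-pointer dereference: the kind of segfault described above.
		fmt.Println("MachineSet decoded as Machine: ProviderSpec is nil")
		return
	}
	fmt.Println(m.Spec.ProviderSpec.Value)
}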

@abhinavdahiya (Contributor)

/lgtm
/hold

@wking feel free to hold cancel if #1481 (comment) doesn't resonate with you. :)

openshift-ci-robot added the do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command) and lgtm (indicates that a PR is ready to be merged) labels on Mar 28, 2019.
@wking (Member Author) commented Mar 28, 2019

/hold

openshift-ci-robot removed the lgtm label on Mar 28, 2019.
@wking (Member Author) commented Mar 28, 2019

/hold cancel

Ok, I've pushed 6723bbd -> c323003, inserting 549a704 to limit default MachineSets to the number needed to satisfy the configured replicas. Now our default cluster will top out at three zones.
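(The mechanics of that cap, as a hedged sketch; the function name is mine, not 549a704's.)

package main

import "fmt"

// defaultWorkerZones trims the region's zone list to at most the
// configured replica count, so a 3-replica default pool in a 6-zone
// region like us-east-1 gets MachineSets (and therefore subnets and
// NAT gateways) in only 3 zones.
func defaultWorkerZones(zones []string, replicas int64) []string {
	if int64(len(zones)) > replicas {
		return zones[:replicas]
	}
	return zones
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1e", "us-east-1f"}
	fmt.Println(defaultWorkerZones(zones, 3)) // [us-east-1a us-east-1b us-east-1c]
}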

openshift-ci-robot removed the do-not-merge/hold label on Mar 28, 2019.
@wking (Member Author) commented Mar 28, 2019

There we go:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1481/pull-ci-openshift-installer-master-e2e-aws/4737/artifacts/e2e-aws/installer/.openshift_install.log | grep 'aws_nat.*Creation'
time="2019-03-28T06:50:04Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 1m36s (ID: nat-0d67c2bffb6621025)"
time="2019-03-28T06:50:14Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m46s (ID: nat-058c9c84f360675ca)"
time="2019-03-28T06:50:35Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m6s (ID: nat-0e2918fa6c686c519)"

And we're all green.

@sdodson (Member) commented Mar 28, 2019

/test e2e-aws

@sdodson (Member) commented Mar 28, 2019

Looks like, of the tests that made it to the install, we're around 50/50.

@abhinavdahiya (Contributor)

> /hold cancel
>
> Ok, I've pushed 6723bbd -> c323003, inserting 549a704 to limit default MachineSets to the number needed to satisfy the configured replicas. Now our default cluster will top out at three zones.

/hold

We should not be decreasing the default deployment; we have a way for customers or CI to pick a subset of AZs.

openshift-ci-robot added the do-not-merge/hold label on Mar 28, 2019.
The final consumer was removed in 361c082 (govcloud: remove
terraform configs, 2018-03-23, coreos/tectonic-installer#3134).
…hines

Since f2eacf3 (asset/machines/master: allow adding MachineConfigs
for control-plane machinepool, 2019-03-25, openshift#1150), we no longer need
to filter a File slice to get a Master object's Machine files.  Drop
the obsoleted Machines() implementation and rename the previous
StructuredMachines implementation to take its place.
Like the earlier 7a396d9 (pkg/asset/machines: Convert Master to a
WriteableAsset, 2019-02-07, openshift#1211), but for Worker.  This is
groundwork for extracting configured availability zones when
generating AWS Terraform variables, which will in turn allow us to
avoid provisioning availability zones unless we are going to populate
them with machines.
data/aws/vpc: Only create subnet infrastructure for zones with Machine(Set)s

This commit updates our Terraform variables to include the worker
subnets, and then switches on that (and the master zones) in Terraform
to avoid creating subnet infrastructure (NAT gateways, routes, etc.)
in zones that have no Machine(Set)s.  This helps address limit issues
in high-zone regions like us-east-1, as seen in the limits.md change.
Note that without a reduction in our default MachineSet creation, the
installer defaults will still not work on us-east-1 without a limit
bump.

The drawback is that users are now on the hook to provision their own
subnets in other zones if they decide that they want to grow into a
new zone as a day-2 Machine(Set) operation.  For now, they'll have to
provide their own infrastructure for that, and our
user-provided-infrastructure docs should give them sufficient
grounding to do so.  It's possible that in the future the machine-API
or other infrastructure operator could dynamically provision subnets
in zones that were not populated at install-time, but I can't hazard a
guess as to how likely that will be.

The HCL functions for combining the zone lists are documented in [1,2].

[1]: https://www.terraform.io/docs/configuration-0-11/interpolation.html#concat-list1-list2-
[2]: https://www.terraform.io/docs/configuration-0-11/interpolation.html#distinct-list-
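(A minimal Terraform 0.11-style sketch of that zone union; variable and resource names here are illustrative, not the installer's actual ones.)

variable "vpc_id" {}

variable "vpc_cidr" {
  default = "10.0.0.0/16"
}

variable "master_availability_zones" {
  type = "list"
}

variable "worker_availability_zones" {
  type = "list"
}

locals {
  # Union of the zones that actually carry Machine(Set)s.
  availability_zones = "${distinct(concat(var.master_availability_zones, var.worker_availability_zones))}"
}

# Subnets (and by extension NAT gateways, routes, and EIPs) are only
# created for zones in the union, not for every zone in the region.
resource "aws_subnet" "private" {
  count             = "${length(local.availability_zones)}"
  vpc_id            = "${var.vpc_id}"
  cidr_block        = "${cidrsubnet(var.vpc_cidr, 4, count.index)}"
  availability_zone = "${element(local.availability_zones, count.index)}"
}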
@wking (Member Author) commented Mar 28, 2019

> We should not be decreasing the default deployment; we have a way for customers or CI to pick a subset of AZs.

I'm not convinced yet ;). But we seem to be on the same page for the rest of this PR, so I've pushed c323003 -> 644f705, rebasing onto master, dropping 549a704, and adjusting the docs accordingly. I'll circle back with 549a704 in a follow-up PR, and we can hash this out there.

@abhinavdahiya (Contributor)

/lgtm

openshift-ci-robot added the lgtm label on Mar 28, 2019.
@abhinavdahiya (Contributor)

/retest

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya (Contributor)

Forgot to lift the hold.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 28, 2019
@wking (Member Author) commented Mar 28, 2019

images:

release "release-latest" failed: pod release-latest was already deleted

/retest

openshift-merge-robot merged commit 655fdc7 into openshift:master on Mar 28, 2019.
wking added a commit to wking/openshift-release that referenced this pull request Mar 28, 2019
This is very similar to the earlier e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204).  This time, however, I'm not changing the zones
where the machines will run.  By default, the installer will
provision zone infrastructure in all available zones, but since
openshift/installer@644f705286 (data/aws/vpc: Only create subnet
infrastructure for zones with Machine(Set)s, 2019-03-27,
openshift/installer#1481) users who explicitly set zones in their
install-config will no longer have unused zones provisioned with
subnets, NAT gateways, EIPs, and other related infrastructure.  This
infrastructure reduction has two benefits in CI:

1. We don't have to pay for resources that we won't use, and we will
   have more room under our EIP limits (although we haven't bumped
into that one in a while, because we're VPC-constrained).

2. We should see reduced rates in clusters failing install because of
   AWS rate limiting, with results like [1]:

     aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)

   The reduction is because:

   i. We'll be making fewer requests for these resources, because we
      won't need to create (and subsequently tear down) as many of
      them.  This will reduce our overall AWS-API load somewhat,
      although the reduction will be incremental because we have so
      many other resources which are not associated with zones.

   ii. These per-zone resources are the ones whose throttling tends
       to break Terraform [2].  So even if the rate of timeouts
       per-API request remains unchanged, a given cluster will only
       have half as many (three vs. the old six) per-zone chances of
       hitting one of the timeouts.  This should give us something
       close to a 50% reduction in clusters hitting throttling
       timeouts.

The drawback is that we're diverging further from the stock "I just
called 'openshift-install create cluster' without providing an
install-config.yaml" experience.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log
[2]: With a cache of build-log.txt from the past ~48 hours:

     $ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
     $ wc -l timeouts
     362 timeouts
     $ grep aws_route_table_association timeouts | wc -l
     214
     $ grep 'aws_route\.to_nat_gw' timeouts | wc -l
     102

     So (102+214)/362 is 87% of our timeouts, with the remainder being
     almost entirely related to the internet gateway (which is not
     per-zone).
wking deleted the structured-workers branch on April 1, 2019 06:35.