
pkg/asset/machines/worker: Structured workers #1481

Merged: 4 commits merged into openshift:master on Mar 28, 2019

Conversation

@wking (Member) commented Mar 27, 2019

Building on #1150 (which will hopefully squeak through CI any moment now), this pull request structures our Worker asset (as 7a396d9, #1211, did for Master). This is groundwork for extracting configured availability zones when generating AWS Terraform variables, which will in turn allow us to avoid provisioning availability zones unless we are going to populate them with machines.
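(A minimal sketch of the idea, for illustration only; the type and field names below are stand-ins, not the installer's actual code. The point of structuring the asset is that downstream consumers, like the AWS Terraform variables asset, can read decoded objects directly instead of re-parsing rendered YAML files.)

package main

import "fmt"

// Stand-in for the AWS provider config carried by each MachineSet.
type awsProviderSpec struct {
	AvailabilityZone string
}

type machineSet struct {
	Name     string
	Provider awsProviderSpec
}

// worker keeps the decoded MachineSets on the asset (alongside the
// rendered files), so later assets can pull structured fields from them.
type worker struct {
	MachineSets []machineSet
}

// availabilityZones returns the distinct zones used by the worker
// MachineSets; this is the list the AWS Terraform variables need.
func (w *worker) availabilityZones() []string {
	seen := map[string]bool{}
	var zones []string
	for _, ms := range w.MachineSets {
		if !seen[ms.Provider.AvailabilityZone] {
			seen[ms.Provider.AvailabilityZone] = true
			zones = append(zones, ms.Provider.AvailabilityZone)
		}
	}
	return zones
}

func main() {
	w := &worker{MachineSets: []machineSet{
		{Name: "worker-a", Provider: awsProviderSpec{AvailabilityZone: "us-east-1a"}},
		{Name: "worker-b", Provider: awsProviderSpec{AvailabilityZone: "us-east-1b"}},
	}}
	fmt.Println(w.availabilityZones()) // [us-east-1a us-east-1b]
}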

openshift-ci-robot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels on Mar 27, 2019.
@wking (Member Author) commented Mar 27, 2019

e2e-aws passed for c0bc4dc, and I've just pushed c0bc4dc -> 009963e, adding commits that push this into Terraform and remove infrastructure from subnets that are unused at install time.

openshift-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) label and removed the size/XXL label on Mar 28, 2019.
wking force-pushed the structured-workers branch 3 times, most recently from ba76eea to 8516a53, on March 28, 2019 04:10.
@wking (Member Author) commented Mar 28, 2019

I've pushed 009963e -> 8516a53, rebasing onto master and fixing a segfault. Turns out []*File -> Machine conversion doesn't work on MachineSets ;). I've dropped the old 97c2d40 and gone back to structuring methods now that I realize I won't be able to share the implementation.
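(A hedged illustration of that class of crash, not the actual installer code; all names below are stand-ins. A Machine carries its provider config at spec.providerSpec, while a MachineSet nests it under spec.template.spec, so decoding MachineSet YAML into a Machine-shaped struct leaves the provider-spec pointer nil, and an unguarded dereference then panics.)

package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2" // assumed decoder, for illustration
)

type providerSpec struct {
	Value string `yaml:"value"`
}

// machineSpec carries the provider config directly...
type machineSpec struct {
	ProviderSpec *providerSpec `yaml:"providerSpec"`
}

type machine struct {
	Spec machineSpec `yaml:"spec"`
}

func main() {
	// ...but a MachineSet nests it one level deeper, under
	// spec.template.spec, so none of the Machine fields line up.
	machineSetYAML := `
spec:
  template:
    spec:
      providerSpec:
        value: aws-config
`
	var m machine
	if err := yaml.Unmarshal([]byte(machineSetYAML), &m); err != nil {
		panic(err)
	}
	if m.Spec.ProviderSpec == nil {
		// Without this guard, m.Spec.ProviderSpec.Value would be a
		// nil-pointer dereference: the kind of segfault described above.
		fmt.Println("MachineSet decoded as Machine: ProviderSpec is nil")
		return
	}
	fmt.Println(m.Spec.ProviderSpec.Value)
}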

@abhinavdahiya (Contributor)

/lgtm
/hold

@wking feel free to hold cancel if #1481 (comment) doesn't resonate with you. :)

openshift-ci-robot added the do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command) and lgtm (indicates that a PR is ready to be merged) labels on Mar 28, 2019.
@wking (Member Author) commented Mar 28, 2019

/hold

openshift-ci-robot removed the lgtm label on Mar 28, 2019.
@wking (Member Author) commented Mar 28, 2019

/hold cancel

Ok, I've pushed 6723bbd -> c323003, inserting 549a704 to limit default MachineSets to the number needed to satisfy the configured replicas. Now our default cluster will top out at three zones.
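(The mechanics of that cap, as a hedged sketch; the function name is mine, not 549a704's.)

package main

import "fmt"

// defaultWorkerZones trims the region's zone list to at most the
// configured replica count, so a 3-replica default pool in a 6-zone
// region like us-east-1 gets MachineSets (and therefore subnets and
// NAT gateways) in only 3 zones.
func defaultWorkerZones(zones []string, replicas int64) []string {
	if int64(len(zones)) > replicas {
		return zones[:replicas]
	}
	return zones
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1e", "us-east-1f"}
	fmt.Println(defaultWorkerZones(zones, 3)) // [us-east-1a us-east-1b us-east-1c]
}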

openshift-ci-robot removed the do-not-merge/hold label on Mar 28, 2019.
@wking (Member Author) commented Mar 28, 2019

There we go:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1481/pull-ci-openshift-installer-master-e2e-aws/4737/artifacts/e2e-aws/installer/.openshift_install.log | grep 'aws_nat.*Creation'
time="2019-03-28T06:50:04Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 1m36s (ID: nat-0d67c2bffb6621025)"
time="2019-03-28T06:50:14Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m46s (ID: nat-058c9c84f360675ca)"
time="2019-03-28T06:50:35Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m6s (ID: nat-0e2918fa6c686c519)"

And we're all green.

@sdodson (Member) commented Mar 28, 2019

/test e2e-aws

@sdodson (Member) commented Mar 28, 2019

Looks like, of the tests that made it to the install, we're around 50/50.

@abhinavdahiya (Contributor)

> /hold cancel
>
> Ok, I've pushed 6723bbd -> c323003, inserting 549a704 to limit default MachineSets to the number needed to satisfy the configured replicas. Now our default cluster will top out at three zones.

/hold

We should not be decreasing the default deployment; we have a way for customers or CI to pick a subset of AZs.

openshift-ci-robot added the do-not-merge/hold label on Mar 28, 2019.
The final consumer was removed in 361c082 (govcloud: remove
terraform configs, 2018-03-23, coreos/tectonic-installer#3134).
…hines

Since f2eacf3 (asset/machines/master: allow adding MachineConfigs
for control-plane machinepool, 2019-03-25, openshift#1150), we no longer need
to filter a File slice to get a Master object's Machine files.  Drop
the obsoleted Machines() implementation and rename the previous
StructuredMachines implementation to take its place.
Like the earlier 7a396d9 (pkg/asset/machines: Convert Master to a
WriteableAsset, 2019-02-07, openshift#1211), but for Worker.  This is
groundwork for extracting configured availability zones when
generating AWS Terraform variables, which will in turn allow us to
avoid provisioning availability zones unless we are going to populate
them with machines.
data/aws/vpc: Only create subnet infrastructure for zones with Machine(Set)s

This commit updates our Terraform variables to include the worker
subnets, and then switches on that (and the master zones) in Terraform
to avoid creating subnet infrastructure (NAT gateways, routes, etc.)
in zones that have no Machine(Set)s.  This helps address limit issues
in high-zone regions like us-east-1, as seen in the limits.md change.
Note that without a reduction in our default MachineSet creation, the
installer defaults will still not work on us-east-1 without a limit
bump.

The drawback is that users are now on the hook to provision their own
subnets in other zones if they decide that they want to grow into a
new zone as a day-2 Machine(Set) operation.  For now, they'll have to
provide their own infrastructure for that, and our
user-provided-infrastructure docs should give them sufficient
grounding to do so.  It's possible that in the future the machine-API
or other infrastructure operator could dynamically provision subnets
in zones that were not populated at install-time, but I can't hazard a
guess as to how likely that will be.

The HCL functions for combining the zone lists are documented in [1,2].

[1]: https://www.terraform.io/docs/configuration-0-11/interpolation.html#concat-list1-list2-
[2]: https://www.terraform.io/docs/configuration-0-11/interpolation.html#distinct-list-
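(A minimal Terraform 0.11-style sketch of that zone union; variable and resource names here are illustrative, not the installer's actual ones.)

variable "vpc_id" {}

variable "vpc_cidr" {
  default = "10.0.0.0/16"
}

variable "master_availability_zones" {
  type = "list"
}

variable "worker_availability_zones" {
  type = "list"
}

locals {
  # Union of the zones that actually carry Machine(Set)s.
  availability_zones = "${distinct(concat(var.master_availability_zones, var.worker_availability_zones))}"
}

# Subnets (and by extension NAT gateways, routes, and EIPs) are only
# created for zones in the union, not for every zone in the region.
resource "aws_subnet" "private" {
  count             = "${length(local.availability_zones)}"
  vpc_id            = "${var.vpc_id}"
  cidr_block        = "${cidrsubnet(var.vpc_cidr, 4, count.index)}"
  availability_zone = "${element(local.availability_zones, count.index)}"
}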
@wking (Member Author) commented Mar 28, 2019

> We should not be decreasing the default deployment; we have a way for customers or CI to pick a subset of AZs.

I'm not convinced yet ;). But we seem to be on the same page for the rest of this PR, so I've pushed c323003 -> 644f705, rebasing onto master, dropping 549a704, and adjusting the docs accordingly. I'll circle back with 549a704 in a follow-up PR, and we can hash this out there.

@abhinavdahiya (Contributor)

/lgtm

openshift-ci-robot added the lgtm label on Mar 28, 2019.
@abhinavdahiya (Contributor)

/retest

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya (Contributor)

Forgot to lift the hold.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 28, 2019
@wking (Member Author) commented Mar 28, 2019

images:

release "release-latest" failed: pod release-latest was already deleted

/retest

openshift-merge-robot merged commit 655fdc7 into openshift:master on Mar 28, 2019.
wking added a commit to wking/openshift-release that referenced this pull request Mar 28, 2019
This is very similar to the earlier e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204).  This time, however, I'm not changing the zones
where the machines will run.  By default, the installer will
provision zone infrastructure in all available zones, but since
openshift/installer@644f705286 (data/aws/vpc: Only create subnet
infrastructure for zones with Machine(Set)s, 2019-03-27,
openshift/installer#1481) users who explicitly set zones in their
install-config will no longer have unused zones provisioned with
subnets, NAT gateways, EIPs, and other related infrastructure.  This
infrastructure reduction has two benefits in CI:

1. We don't have to pay for resources that we won't use, and we will
   have more room under our EIP limits (although we haven't bumped
into that one in a while, because we're VPC-constrained).

2. We should see reduced rates in clusters failing install because of
   AWS rate limiting, with results like [1]:

     aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)

   The reduction is because:

   i. We'll be making fewer requests for these resources, because we
      won't need to create (and subsequently tear down) as many of
      them.  This will reduce our overall AWS-API load somewhat,
      although the reduction will be incremental because we have so
      many other resources which are not associated with zones.

   ii. These per-zone resources are the ones whose throttling tends
       to break Terraform [2].  So even if the rate of timeouts
       per-API request remains unchanged, a given cluster will only
       have half as many (three vs. the old six) per-zone chances of
       hitting one of the timeouts.  This should give us something
       close to a 50% reduction in clusters hitting throttling
       timeouts.

The drawback is that we're diverging further from the stock "I just
called 'openshift-install create cluster' without providing an
install-config.yaml" experience.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log
[2]: With a cache of build-log.txt from the past ~48 hours:

     $ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
     $ wc -l timeouts
     362 timeouts
     $ grep aws_route_table_association timeouts | wc -l
     214
     $ grep 'aws_route\.to_nat_gw' timeouts | wc -l
     102

     So (102+214)/362 is 87% of our timeouts, with the remainder being
     almost entirely related to the internet gateway (which is not
     per-zone).
wking deleted the structured-workers branch on April 1, 2019 06:35.