
data/bootstrap/files/usr/local/bin/bootkube.sh.template: Set --tear-down-event #1147

Closed

Conversation

@wking (Member) commented Jan 29, 2019

In master there's a bit of a window during the bootstrap-teardown dance:

  1. cluster-bootstrap sees the requested pods.
  2. cluster-bootstrap shuts itself down.
  3. openshift.sh pushes the OpenShift-specific manifests.
  4. report-progress.sh pushes bootstrap-complete.
  5. The installer sees bootstrap-complete and removes the bootstrap resources, including the bootstrap load-balancer targets.
  6. Subsequent Kubernetes API traffic hits the production control plane.

That leaves a fairly large window, from step 3 through step 5, where Kubernetes API requests could be routed to the bootstrap machine and dropped, because it no longer has anything listening on port 6443.
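
For orientation, the ordering on the bootstrap node looks roughly like this. This is a simplified sketch, not the actual template: the real bootkube.sh.template wraps these steps in podman invocations and systemd units, and the exact commands differ.

# Simplified sketch of the master-branch ordering (illustrative only)
cluster-bootstrap start --asset-dir=/assets   # steps 1-2: waits for the requested pods, then shuts itself down
./openshift.sh                                # step 3: pushes the OpenShift-specific manifests
./report-progress.sh                          # step 4: creates the bootstrap-complete event
# steps 5-6 happen on the installer side: tear down the bootstrap resources,
# after which Kubernetes API traffic lands on the production control plane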

With this commit, we take advantage of openshift/cluster-bootstrap@d07548e3 (Add --tear-down-event flag to delay tear down, 2019-01-24, openshift/cluster-bootstrap#9) to drop step 2 (waiting for an event we never send). That leaves the bootstrap control plane running until we destroy that machine. We take advantage of openshift/cluster-bootstrap@180599bc (pkg/start/asset: Add support for post-pod-manifests, 2019-01-29, openshift/cluster-bootstrap#13) to replace our previous openshift.sh (with a minor change to the manifest directory). And we take advantage of openshift/cluster-bootstrap@e5095848 (Create bootstrap-success event before tear down, 2019-01-24, openshift/cluster-bootstrap#9) to replace our previous report-progress.sh (with a minor change to the event name).

Also set --strict, because we want to fail-fast for these resources. The user is unlikely to scrape them out of the installer state and push them by hand if we fail to push them from the bootstrap node.
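
Concretely, the cluster-bootstrap invocation in bootkube.sh.template ends up looking roughly like the hunk quoted in the review below (excerpt only; the podman/systemd wrapping around the start subcommand and any further flags after the trailing backslash are omitted):

start \
--asset-dir=/assets \
--strict \
--tear-down-event run-forever \

Here --strict makes cluster-bootstrap fail fast if any of the rendered manifests cannot be pushed, and --tear-down-event names an event ("run-forever") that is deliberately never sent, so cluster-bootstrap keeps the bootstrap control plane serving until the machine itself is destroyed.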

With these changes, the new transition is:

  1. cluster-bootstrap sees the requested pods.
  2. cluster-bootstrap pushes the OpenShift-specific manifests.
  3. cluster-bootstrap pushes bootstrap-success.
  4. The installer sees bootstrap-success and removes the bootstrap resources, including the bootstrap load-balancer targets.
  5. Subsequent Kubernetes API traffic hits the production control plane.

There's still a small window for lost Kubernetes API traffic:

  • The Terraform tear-down could remove the bootstrap machine before it removes the bootstrap load-balancer target, leaving the target pointing into empty space.
  • Bootstrap teardown does not give existing client connections time to drain between removing the load-balancer target and removing the bootstrap machine.

Both of these could be addressed by:

  1. Remove the bootstrap load-balancer targets.
  2. Wait 30 seconds (healthy_threshold * interval for our aws_lb_target_group; see https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html) for the load balancer to notice the production control-plane targets are live. This assumes the post-pod manifests are all pushed in zero seconds, so it's overly conservative, but waiting an extra 30 seconds isn't a large cost.
  3. Remove the remaining bootstrap resources, including the bootstrap machine.

But even without that delay, this commit reduces the window compared to what we have in master. I'll land the delay in follow-up work.
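
Sketched as pseudocode, the delayed teardown would look something like the following. The function names are placeholders, not real installer or Terraform commands:

remove_bootstrap_lb_targets             # 1. detach the bootstrap targets from the load balancer
sleep 30                                # 2. healthy_threshold * interval for the aws_lb_target_group health checks
remove_remaining_bootstrap_resources    # 3. destroy everything else, including the bootstrap machine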

CC @sttts

We'll need openshift/cluster-bootstrap#13 for this, so...

/hold

@openshift-ci-robot added the do-not-merge/hold label on Jan 29, 2019
@openshift-ci-robot added the size/L label on Jan 29, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

@openshift-ci-robot added the approved label on Jan 29, 2019
@wking added the kind/bug label on Jan 29, 2019
@abhinavdahiya (Contributor)

> We take advantage of openshift/cluster-bootstrap@180599b (openshift/cluster-bootstrap#13) to replace our previous openshift.sh (with a minor change to the manifest directory).

This is probably not going to work; see openshift/cluster-bootstrap#13 (comment).

Review comment on the following bootkube.sh.template hunk:

start \
--asset-dir=/assets \
--strict \
--tear-down-event run-forever \

I am not convinced this is a good idea. We do not solve the load-balancer problem; we only push it from being ELB-related to being kube-proxy-related. All clients connecting in-cluster through the service IP will fall over (in the best case, when the bootstrap machine is terminated). In the worst case, they will have pending watch TCP connections to a machine which is not there anymore.

@squeed can you comment about that? How do TCP connections to service IPs behave if the target machine disappears?

@sttts (Contributor) commented Jan 30, 2019

I am against this PR. It takes the ELB out of the failure equation, but at the same time introduces a time period of uncertainty with service IPs. This is bad. Let's solve the problem by proper synchronization instead of fixing some issues and creating new ones for other teams.

A correct way would be:

  1. cluster-bootstrap sees the requested pods.
  2. cluster-bootstrap pushes the OpenShift-specific manifests.
  3. cluster-bootstrap pushes bootstrap-success.
  4. The installer sees bootstrap-success and removes the bootstrap resources, including the bootstrap load-balancer targets.
  5. The installer sends the tear-down event to cluster-bootstrap (see the sketch just after this list).
  6. cluster-bootstrap shuts down and removes itself from the in-cluster service IP endpoints.
  7. Subsequent Kubernetes API traffic, both through the ELB and in-cluster, hits the production control plane.
  8. Terraform takes down the bootstrap node.
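
For step 5, "sending the tear-down event" would presumably mean creating an Event object whose namespace and name match whatever --tear-down-event is configured to wait for. A minimal sketch of what that might look like from the installer host; the kubeconfig path, namespace, event name, reason, and field set here are all illustrative, not the actual installer behavior:

# Illustrative only: create the event cluster-bootstrap is waiting for.
oc --kubeconfig=auth/kubeconfig create -f - <<'EOF'
apiVersion: v1
kind: Event
metadata:
  name: tear-down            # must match the name passed to --tear-down-event
  namespace: kube-system
involvedObject:
  namespace: kube-system
reason: BootstrapTearDownRequested
message: Bootstrap load-balancer targets removed; tear down the bootstrap control plane.
EOF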

@mfojtik (Contributor) commented Jan 30, 2019

Agree with @sttts. This change can potentially make other components unstable and introduce even more flakes. I think the proper solution would be what Stefan described: cluster-bootstrap should tear down the bootstrap control plane only when it receives an event from the installer saying, in effect, "I acknowledged bootstrap-success, removed whatever needed to be removed (the ELB targets), and am asking you to tear down the control plane." Then the bootstrap node can send another event saying "bootstrap-ready-to-shutdown", and the installer can delete the bootstrap node.

@wking (Member, author) commented Jan 30, 2019

> This change can potentially make other components unstable and introduce even more flakes.

I still don't see how a hard-shutdown of the bootstrap machine would be more disruptive than a production-master reboot (e.g. for a RHCOS update). But once some form of openshift/cluster-bootstrap#13 (which we all agree we want?) has landed, we can use CI testing for this PR to explore these different proposals.

@sttts (Contributor) commented Jan 30, 2019

> I still don't see how a hard-shutdown of the bootstrap machine would be more disruptive than a production-master reboot (e.g. for a RHCOS update).

What is a hard shutdown? Switching off the machine, like "power off"? Or is it a controlled shutdown where every process gets a TERM signal, some time to terminate, and only then a kill? If it's the latter, we should be fine. If it is the pure removal of the machine, it's not good.

We were fighting with those cases in other contexts like etcd, where connections got stuck (timeouts like 30 min); only very recent etcd clients got that under control. Killing the bootstrap machine will bring us into the same situation again (not necessarily with etcd, of course, but with client connections to the apiserver), and we want to avoid that.

In short: kube-apiserver needs the TERM signal. Everything else is bad.

@sttts (Contributor) commented Jan 30, 2019

Also note that a RHCOS update will be preceded by draining the node. That way every service can do whatever it has to do for graceful termination.

@crawford (Contributor)

> In short: kube-apiserver needs the TERM signal. Everything else is bad.

Is this just the current state of affairs, or is there some technical reason this will always be the case? We must be able to handle hard reboots and networking outages.

@deads2k (Contributor) commented Jan 31, 2019

> In short: kube-apiserver needs the TERM signal. Everything else is bad.

> Is this just the current state of affairs, or is there some technical reason this will always be the case? We must be able to handle hard reboots and networking outages.

Handling an unexpected condition like a machine shutdown is different from handling a controlled shutdown. In the example Stefan provided, a hard shutdown is tolerated in the sense that traffic is routed, dropped, and well-written clients can retry. However, given an appropriate signal, we can avoid unnecessary blips by driving traffic away and quiescing over a few seconds: short-lived calls get to complete, leases and endpoints are removed in a controlled way, and load balancers see the endpoint rejecting traffic before it disappears entirely.

We aren't saying we can't handle being shut down. We are saying that it is objectively more disruptive, in a way that appears to be unnecessary.
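
What "given an appropriate signal" buys you is easiest to see in miniature. A toy sketch of the pattern being described, in plain bash; it has nothing to do with the actual kube-apiserver code, which handles SIGTERM internally:

#!/bin/bash
# Toy graceful-termination loop: stop taking new work on SIGTERM, drain, then exit.
draining=0
trap 'echo "got SIGTERM: stop accepting new work"; draining=1' TERM

while true; do
  if [ "$draining" -eq 1 ]; then
    # Finish in-flight work, withdraw from endpoints, let load-balancer health
    # checks notice we are going away before we disappear entirely.
    sleep 5
    echo "drained, exiting cleanly"
    exit 0
  fi
  sleep 1   # stand-in for serving requests
done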

@openshift-ci-robot (Contributor)

@wking: PR needs rebase.

@openshift-ci-robot added the needs-rebase label on Feb 18, 2019
@openshift-ci-robot (Contributor)

@wking: The following tests failed, say /retest to rerun them all:

  • ci/prow/e2e-aws-upgrade (commit 4f68502): rerun with /test e2e-aws-upgrade
  • ci/prow/verify-vendor (commit 4f68502): rerun with /test verify-vendor
  • ci/prow/e2e-aws (commit 4f68502): rerun with /test e2e-aws

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@abhinavdahiya (Contributor)

The installer needs to continue to be only a READER of the API. Any dependency of cluster bootstrapping on the installer having access to the API to push things is a no-go.

/close

@openshift-ci-robot (Contributor)

@abhinavdahiya: Closed this PR.

