data/bootstrap/files/usr/local/bin/bootkube.sh.template: Set --tear-down-event #1147
Conversation
In master there's a bit of a window during the bootstrap-teardown dance:

1. cluster-bootstrap sees the requested pods.
2. cluster-bootstrap shuts itself down.
3. openshift.sh pushes the OpenShift-specific manifests.
4. report-progress.sh pushes bootstrap-complete.
5. The installer sees bootstrap-complete and removes the bootstrap resources, including the bootstrap load-balancer targets.
6. Subsequent Kubernetes API traffic hits the production control plane.

That leaves a fairly large window from 3 through 5 where Kubernetes API requests could be routed to the bootstrap machine and dropped, because it no longer has anything listening on 6443.

With this commit, we take advantage of openshift/cluster-bootstrap@d07548e3 (Add --tear-down-event flag to delay tear down, 2019-01-24, openshift/cluster-bootstrap#9) to drop step 2 (waiting for an event we never send). That leaves the bootstrap control plane running until we destroy that machine. We take advantage of openshift/cluster-bootstrap@180599bc (pkg/start/asset: Add support for post-pod-manifests, 2019-01-29, openshift/cluster-bootstrap#13) to replace our previous openshift.sh (with a minor change to the manifest directory). And we take advantage of openshift/cluster-bootstrap@e5095848 (Create bootstrap-success event before tear down, 2019-01-24, openshift/cluster-bootstrap#9) to replace our previous report-progress.sh (with a minor change to the event name).

We also set --strict, because we want to fail fast for these resources. The user is unlikely to scrape them out of the installer state and push them by hand if we fail to push them from the bootstrap node.

With these changes, the new transition is:

1. cluster-bootstrap sees the requested pods.
2. cluster-bootstrap pushes the OpenShift-specific manifests.
3. cluster-bootstrap pushes bootstrap-success.
4. The installer sees bootstrap-success and removes the bootstrap resources, including the bootstrap load-balancer targets.
5. Subsequent Kubernetes API traffic hits the production control plane.

There's still a small window for lost Kubernetes API traffic:

* The Terraform tear-down could remove the bootstrap machine before it removes the bootstrap load-balancer target, leaving the target pointing into empty space.
* Bootstrap teardown does not allow existing client connections to drain after removing the load-balancer target before removing the bootstrap machine.

Both of these could be addressed by:

1. Remove the bootstrap load-balancer targets.
2. Wait 30 seconds (healthy_threshold * interval for our aws_lb_target_group [1]) for the load balancer to notice the production control-plane targets are live. This assumes the post-pod manifests are all pushed in zero seconds, so it's overly conservative, but waiting an extra 30 seconds isn't a large cost.
3. Remove the remaining bootstrap resources, including the bootstrap machine.

But even without that delay, this commit reduces the window compared to what we have in master. I'll land the delay in follow-up work.

[1]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
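The three-step delayed teardown described above could be sketched roughly as follows. This is not installer code: the helper functions are stubs, and the health-check numbers (healthy_threshold of 3, interval of 10 seconds) are assumed aws_lb_target_group settings that add up to the 30-second figure mentioned in the description.

```shell
#!/bin/sh
# Rough sketch of the proposed delayed bootstrap teardown; illustrative only.

HEALTHY_THRESHOLD=3  # consecutive successful health checks (assumed setting)
INTERVAL=10          # seconds between health checks (assumed setting)
DRAIN_SECONDS=$((HEALTHY_THRESHOLD * INTERVAL))

remove_bootstrap_lb_targets() {
  echo "1. removing bootstrap load-balancer targets"  # stub
}

remove_remaining_bootstrap_resources() {
  echo "3. removing remaining bootstrap resources, including the machine"  # stub
}

remove_bootstrap_lb_targets
echo "2. waiting ${DRAIN_SECONDS}s for production targets to be marked healthy"
# sleep "${DRAIN_SECONDS}"  # elided so the sketch runs instantly
remove_remaining_bootstrap_resources
```

The ordering is the point: the load-balancer targets go first, the machine goes last, and the wait in between covers the time the health checker needs to mark the production control-plane targets healthy.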
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: wking. The full list of commands accepted by this bot can be found here; the pull request process is described here.
this is probably not going to work: openshift/cluster-bootstrap#13 (comment)
```
   start \
       --asset-dir=/assets \
       --strict \
       --tear-down-event run-forever \
```
I am not convinced this is a good idea. We do not solve the load-balancer problem, we only shift it from being ELB-related to being kube-proxy-related. All clients using in-cluster connections via the service IP will fall over (in the best case) when the bootstrap machine is terminated. In the worst case, they will have pending watch TCP connections to a machine which is not there anymore.
@squeed can you comment about that? How do TCP connections to service IPs behave if the target machine disappears?
I am against this PR. It takes the ELB out of the failure equation, but at the same time introduces a period of uncertainty with service IPs. This is bad. Let's solve the problem with proper synchronization instead of fixing some issues while creating new ones for other teams. A correct way would be:
Agree with @sttts. This change can potentially make other components unstable and introduce even more flakes. I think the proper solution would be what Stefan described: cluster-bootstrap should tear down the bootstrap control plane when it receives an event from the installer saying "I acknowledged bootstrap-success, removed whatever needed to be removed (the ELB), and am asking you to tear down the control plane." Then bootstrap can send another event saying "bootstrap-ready-to-shutdown", and then the installer can delete the bootstrap node.
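The handshake proposed in these two comments could be sketched as follows. Only the ordering comes from the thread: `emit` and `wait_for` are hypothetical stand-ins for however events would actually be published and watched (e.g. Kubernetes Events), and the `tear-down-requested` name is invented for illustration (only `bootstrap-success` and `bootstrap-ready-to-shutdown` are named above).

```shell
#!/bin/sh
# Illustrative event handshake; emit/wait_for are hypothetical stubs.
emit()     { echo "emit: $1"; }
wait_for() { echo "wait: $1"; }

# bootstrap-node side:
emit bootstrap-success            # control plane is up; installer may act
wait_for tear-down-requested      # installer has removed the ELB target
# ...cluster-bootstrap TERMs the bootstrap control plane and drains here...
emit bootstrap-ready-to-shutdown  # installer may now delete the machine
```

The design point is that each side only acts after the other has confirmed its half of the teardown, so no client-facing resource disappears while traffic can still reach it.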
I still don't see how a hard shutdown of the bootstrap machine would be more disruptive than a production-master reboot (e.g. for a RHCOS update). But once some form of openshift/cluster-bootstrap#13 (which we all agree we want?) has landed, we can use CI testing for this PR to explore these different proposals.
What is a hard shutdown? Switching off the machine like "power off"? Or is it a controlled shutdown where every process gets a TERM signal and some time to terminate, and only then a kill? If it's the latter, we should be fine. If it is the pure removal of the machine, it's not good. We fought with those cases in other contexts like etcd, where connections got stuck (timeouts around 30 minutes); only very recent etcd clients got that under control. Killing the bootstrap machine would bring us into the same situation again (not necessarily with etcd, of course, but with client connections to the apiserver), and we want to avoid that. In short: kube-apiserver needs the TERM signal. Everything else is bad.
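The distinction matters because a trapped TERM gives a process a window to clean up before exiting, while a hard power-off skips that window entirely. A minimal shell illustration of the mechanism (nothing here is OpenShift-specific):

```shell
#!/bin/sh
# A trapped TERM lets the process run cleanup before exiting; a hard
# power-off skips all of this, leaving clients with stuck connections.
got_term=0
trap 'got_term=1' TERM

kill -TERM $$   # simulate a controlled shutdown signal to ourselves

# The trap runs before the next command executes, so cleanup fits here:
if [ "$got_term" -eq 1 ]; then
  echo "TERM received: close listeners, drain connections, then exit"
fi
```

A real server would use this window to stop accepting new connections and finish in-flight requests, which is exactly what a machine deletion without TERM denies it.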
Also note that a RHCOS update will be preceded by a draining of the node. That way every service can do whatever it has to do for graceful termination.
Is this just the current state of affairs, or is there some technical reason this will always be the case? We must be able to handle hard reboots and networking outages.
Handling an unexpected condition like a machine shutdown is different from handling a controlled shutdown. In the example Stefan provided, a hard shutdown is tolerated in the sense that traffic is routed, dropped, and well-written clients can retry. However, given an appropriate signal, we can avoid unnecessary blips by driving traffic away and quiescing over a few seconds: allowing short-lived calls to complete, removing leases and endpoints in a controlled way, and letting load balancers see the target rejecting traffic before it disappears entirely. We aren't saying we can't handle being shut down. We are saying that it is objectively more disruptive, in a way that appears to be unnecessary.
@wking: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: The following tests failed. Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
The installer needs to continue to be a READER of the API. Any dependency of cluster bootstrapping on the installer having access to the API to push things is a no-go. /close
@abhinavdahiya: Closed this PR.
CC @sttts
We'll need openshift/cluster-bootstrap#13 for this, so...
/hold