…own-event
In master there's a bit of a window during the bootstrap-teardown
dance:
1. cluster-bootstrap sees the requested pods.
2. cluster-bootstrap shuts itself down.
3. openshift.sh pushes the OpenShift-specific manifests.
4. report-progress.sh pushes bootstrap-complete.
5. The installer sees bootstrap-complete and removes the bootstrap
resources, including the bootstrap load-balancer targets.
6. Subsequent Kubernetes API traffic hits the production control
plane.
That leaves a fairly large window, from step 3 through step 5, where
Kubernetes API requests could be routed to the bootstrap machine and
dropped, because it no longer has anything listening on port 6443.
With this commit, we take advantage of
openshift/cluster-bootstrap@d07548e3 (Add --tear-down-event flag to
delay tear down, 2019-01-24, openshift/cluster-bootstrap#9) to drop
step 2: we point --tear-down-event at an event we never send, so
cluster-bootstrap skips its self-teardown and leaves the bootstrap
control plane running until we destroy that machine. We take
advantage of openshift/cluster-bootstrap@180599bc (pkg/start/asset:
Add support for post-pod-manifests, 2019-01-29,
openshift/cluster-bootstrap#13) to replace our previous openshift.sh
(with a minor change to the manifest directory). And we take
advantage of openshift/cluster-bootstrap@e5095848 (Create
bootstrap-success event before tear down, 2019-01-24,
openshift/cluster-bootstrap#9) to replace our previous
report-progress.sh (with a minor change to the event name).
Also set --strict, because we want to fail fast for these resources.
The user is unlikely to scrape them out of the installer state and
push them by hand if we fail to push them from the bootstrap node.
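For illustration, the resulting invocation on the bootstrap node looks
roughly like the sketch below. Only the two flags named in this
message are shown; the binary path, asset directory, and event name
are placeholder assumptions, not the exact bootkube.sh contents.

    # Sketch only: paths and names are placeholders, not the real template.
    /usr/bin/cluster-bootstrap start \
        --asset-dir=/assets \
        --strict \
        --tear-down-event=an-event-we-never-send
    # Because that event never arrives, cluster-bootstrap skips its
    # self-teardown and keeps the bootstrap control plane serving on
    # 6443 until we destroy the machine.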
With these changes, the new transition is:
1. cluster-bootstrap sees the requested pods.
2. cluster-bootstrap pushes the OpenShift-specific manifests.
3. cluster-bootstrap pushes bootstrap-success.
4. The installer sees bootstrap-success and removes the bootstrap
resources, including the bootstrap load-balancer targets.
5. Subsequent Kubernetes API traffic hits the production control
plane.
There's still a small window for lost Kubernetes API traffic:
* The Terraform teardown could remove the bootstrap machine before it
removes the bootstrap load-balancer target, leaving the target
pointing into empty space.
* Bootstrap teardown does not wait for existing client connections to
drain between removing the load-balancer target and removing the
bootstrap machine.
Both of these could be addressed by the following sequence (a rough
sketch follows the list):
1. Remove the bootstrap load-balancer targets.
2. Wait 30 seconds (healthy_threshold * interval for our
aws_lb_target_group [1]) for the load balancer to notice that the
production control-plane targets are live. The health checks have
actually been running since the production control plane came up, so
a full 30-second wait assumes the post-pod manifests were all pushed
in zero seconds. That's overly conservative, but waiting an extra 30
seconds isn't a large cost.
3. Remove the remaining bootstrap resources, including the bootstrap
machine.
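A rough sketch of that sequence, assuming the bootstrap target-group
attachments and the bootstrap machine are separate Terraform
resources (the resource addresses and health-check numbers here are
illustrative, not the installer's actual configuration):

    # 1. Detach the bootstrap instance from the load-balancer target groups.
    terraform destroy -target=aws_lb_target_group_attachment.bootstrap
    # 2. Give the health checks time to mark the production control-plane
    #    targets healthy: healthy_threshold * interval, e.g. 3 * 10s = 30s.
    sleep 30
    # 3. Remove the remaining bootstrap resources, including the machine.
    terraform destroy -target=aws_instance.bootstrap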
But even without that delay, this commit reduces the window compared
to what we have in master. I'll land the delay in follow-up work.
[1]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html