pkg/destroy/bootstrap: Separate load-balancer target teardown
Close a small window for lost Kubernetes API traffic:

* The Terraform tear-down could remove the bootstrap machine before it
  removes the bootstrap load-balancer target, leaving the target
  pointing into empty space.
* Bootstrap teardown does not allow existing client connections to
  drain after removing the load balancer target before removing the
  bootstrap machine.

With this commit, we:

1. Wait 30 seconds for the production control plane to come up.
2. Remove the bootstrap load-balancer targets.
3. Wait 10 seconds for requests to the bootstrap machine to drain out.
4. Remove the remaining bootstrap resources, including the bootstrap
   machine.

The 30-second calculation is provider specific. On AWS, it takes
30 seconds for AWS to notice that our production control-plane targets
are live (healthy_threshold * interval for our aws_lb_target_group
[1]). This assumes the post-pod manifests are all pushed in zero
seconds, so it's overly conservative, but waiting an extra 30 seconds
isn't a large cost.

The 30-second delay doesn't really matter for libvirt, because clients
will have been banging away at the production control plane the whole
time, with those requests failing until the control plane came up to
listen.  But an extra 30-second delay is not a big deal either.

The 10-second delay for draining works around a Terraform plugin
limitation on AWS.  From the AWS network load-balancer docs [2]:

> Connection draining ensures that in-flight requests complete before
> existing connections are closed.  The initial state of a
> deregistering target is draining.  By default, the state of a
> deregistering target changes to unused after 300 seconds.  To change
> the amount of time that Elastic Load Balancing waits before changing
> the state to unused, update the deregistration delay value.  We
> recommend that you specify a value of at least 120 seconds to ensure
> that requests are completed.

And from [3]:

> Deregistering a target removes it from your target group, but does
> not affect the target otherwise.  The load balancer stops routing
> requests to a target as soon as it is deregistered.  The target
> enters the draining state until in-flight requests have completed.

The Terraform attachment-deletion logic is in [4], and while it fires
a deregister request, it does not wait around for draining to
complete.  I don't see any issues in the provider repository about
waiting for the unused state, but we could push something like that
[5] if we wanted more finesse here than a 10-second cross-platform
sleep.  For the moment, I'm just saying "we know who our consumers are
at this point, and none of them will keep an open request going for
more than 10 seconds".

The 10-second drain delay also seems sufficient for libvirt's
round-robin DNS, since clients should be able to fall back to
alternative IPs on their own.  We may be able to set shorter TTLs on
libvirt DNS entries to firm that up, but clean transitions are less
important for dev-only libvirt clusters anyway.  And, as for the
30-second delay for the production control plane to come up, clients
have been banging away on all of these IPs throughout the whole
bootstrap process.

I'm not sure how OpenStack handles this teardown; naively grepping
through data/data/openstack didn't turn up anything that looked much
like a bootstrap load-balancer target resource.

[1]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[2]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#registered-targets
[4]: pkg/terraform/exec/plugins/vendor/github.com/terraform-providers/terraform-provider-aws/aws/resource_aws_lb_target_group_attachment.go#L80-L106
[5]: https://docs.aws.amazon.com/sdk-for-go/api/service/elbv2/#ELBV2.WaitUntilTargetDeregistered
wking committed Jan 30, 2019
1 parent 4f68502 commit 1c772ac
Showing 6 changed files with 26 additions and 23 deletions.
3 changes: 3 additions & 0 deletions cmd/openshift-install/create.go
@@ -275,6 +275,9 @@ func destroyBootstrap(ctx context.Context, config *rest.Config, directory string
 		return errors.Wrap(err, "waiting for bootstrap-success")
 	}

+	logrus.Info("Waiting 30 seconds for the production control-plane to enter the load balancer")
+	time.Sleep(30*time.Second)
+
 	logrus.Info("Destroying the bootstrap resources...")
 	return destroybootstrap.Destroy(rootOpts.dir)
 }
4 changes: 2 additions & 2 deletions data/data/aws/main.tf
@@ -18,8 +18,8 @@ module "bootstrap" {
   iam_role = "${var.aws_master_iam_role_name}"
   ignition = "${var.ignition_bootstrap}"
   subnet_id = "${module.vpc.master_subnet_ids[0]}"
-  target_group_arns = "${module.vpc.aws_lb_target_group_arns}"
-  target_group_arns_length = "${module.vpc.aws_lb_target_group_arns_length}"
+  target_group_arns = "${var.bootstrap_load_balancer_targets ? module.vpc.aws_lb_target_group_arns : []}"
+  target_group_arns_length = "${var.bootstrap_load_balancer_targets ? module.vpc.aws_lb_target_group_arns_length : 0}"
   vpc_id = "${module.vpc.vpc_id}"
   vpc_security_group_ids = "${list(module.vpc.master_sg_id)}"

5 changes: 5 additions & 0 deletions data/data/config.tf
@@ -2,6 +2,11 @@ terraform {
   required_version = ">= 0.10.7"
 }

+variable "bootstrap_load_balancer_targets" {
+  default = true
+  description = "Whether to include load-balancer targets for the bootstrap machine."
+}
+
 variable "machine_cidr" {
   type = "string"

2 changes: 1 addition & 1 deletion data/data/libvirt/main.tf
@@ -90,7 +90,7 @@ resource "libvirt_domain" "master" {
 }

 data "libvirt_network_dns_host_template" "bootstrap" {
-  count = "${var.bootstrap_dns ? 1 : 0}"
+  count = "${var.bootstrap_load_balancer_targets ? 1 : 0}"
   ip = "${var.libvirt_bootstrap_ip}"
   hostname = "${var.cluster_name}-api"
 }
5 changes: 0 additions & 5 deletions data/data/libvirt/variables-libvirt.tf
@@ -1,8 +1,3 @@
-variable "bootstrap_dns" {
-  default = true
-  description = "Whether to include DNS entries for the bootstrap node or not."
-}
-
 variable "libvirt_uri" {
   type = "string"
   description = "libvirt connection URI"
30 changes: 15 additions & 15 deletions pkg/destroy/bootstrap/bootstrap.go
@@ -6,6 +6,7 @@ import (
 	"io/ioutil"
 	"os"
 	"path/filepath"
+	"time"

 	"github.com/openshift/installer/pkg/asset/cluster"
 	"github.com/openshift/installer/pkg/terraform"
@@ -24,17 +25,12 @@ func Destroy(dir string) (err error) {
 		return errors.New("no platform configured in metadata")
 	}

-	copyNames := []string{terraform.StateFileName, cluster.TfVarsFileName}
-
-	if platform == "libvirt" {
-		err = ioutil.WriteFile(filepath.Join(dir, "disable-bootstrap.tfvars"), []byte(`{
-  "bootstrap_dns": false
+	err = ioutil.WriteFile(filepath.Join(dir, "disable-bootstrap-load-balancer-targets.tfvars"), []byte(`{
+  "bootstrap_load_balancer_targets": false
 }
 `), 0666)
-		if err != nil {
-			return err
-		}
-		copyNames = append(copyNames, "disable-bootstrap.tfvars")
+	if err != nil {
+		return err
 	}

 	tempDir, err := ioutil.TempDir("", "openshift-install-")
@@ -43,20 +39,24 @@ func Destroy(dir string) (err error) {
 	}
 	defer os.RemoveAll(tempDir)

-	for _, filename := range copyNames {
+	for _, filename := range []string{
+		terraform.StateFileName,
+		cluster.TfVarsFileName,
+		"disable-bootstrap-load-balancer-targets.tfvars",
+	} {
 		err = copy(filepath.Join(dir, filename), filepath.Join(tempDir, filename))
 		if err != nil {
 			return errors.Wrapf(err, "failed to copy %s to the temporary directory", filename)
 		}
 	}

-	if platform == "libvirt" {
-		_, err = terraform.Apply(tempDir, platform, fmt.Sprintf("-var-file=%s", filepath.Join(tempDir, "disable-bootstrap.tfvars")))
-		if err != nil {
-			return errors.Wrap(err, "Terraform apply")
-		}
+	_, err = terraform.Apply(tempDir, platform, fmt.Sprintf("-var-file=%s", filepath.Join(tempDir, "disable-bootstrap-load-balancer-targets.tfvars")))
+	if err != nil {
+		return errors.Wrap(err, "Terraform apply")
 	}

+	time.Sleep(10 * time.Second)
+
 	err = terraform.Destroy(tempDir, platform, "-target=module.bootstrap")
 	if err != nil {
 		return errors.Wrap(err, "Terraform destroy")
