Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/destroy/bootstrap: Remove bootstrap from DNS for libvirt #614

Merged

Conversation

wking
Copy link
Member

@wking wking commented Nov 5, 2018

Without this, round-robin clients will fail when they hit the bootstrap DNS entry (after the bootstrap node stops serving its control plane).

The implementation is a bit awkward; I'd have preferred the AWS approach, with:

resource "aws_lb_target_group_attachment" "bootstrap" {
  count = "${var.target_group_arns_length}"

  target_group_arn = "${var.target_group_arns[count.index]}"
  target_id        = "${aws_instance.bootstrap.private_ip}"
}

in the bootstrap module. But the libvirt host entries are only available as a subsection of a libvirt_network resource (because the whole network is defined in a single XML object, including the DNS entries). So instead I've added an additional variable which we can tweak to disable the bootstrap entry. The default value for the new variable includes the bootstrap entry for the initial cluster apply call; on destry I override it via an *.auto.tfvars file (which Terraform loads automatically) to remove the bootstrap entry.

CC @sjenning

Without this, round-robin clients will fail when they hit the
bootstrap DNS entry (after the bootstrap node stops serving its
control plane).

The implementation is a bit awkward; I'd have preferred the AWS
approach, with:

  resource "aws_lb_target_group_attachment" "bootstrap" {
    count = "${var.target_group_arns_length}"

    target_group_arn = "${var.target_group_arns[count.index]}"
    target_id        = "${aws_instance.bootstrap.private_ip}"
  }

in the bootstrap module.  But the libvirt host entries are only
available as a subsection of a libvirt_network resource (because the
whole network is defined in a single XML object, including the DNS
entries [1]).  So instead I've added an additional variable which we
can tweak to disable the bootstrap entry.  The default value for the
new variable includes the bootstrap entry for the initial cluster
'apply' call; on destry I override it via an *.auto.tfvars file (which
Terraform loads automatically [2]) to remove the bootstrap entry.

[1]: https://libvirt.org/formatnetwork.html#elementsAddress
[2]: https://www.terraform.io/docs/configuration/variables.html
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 5, 2018
@wking
Copy link
Member Author

wking commented Nov 5, 2018

Logs from a run of this look like:

$ openshift-install --dir=wking --log-level=debug create cluster
...
libvirt_network.tectonic_net: Destroying... (ID: d5b38c5b-6105-4b57-b0b6-448bb1f4e87c)
libvirt_network.tectonic_net: Destruction complete after 5s
libvirt_network.tectonic_net: Creating...
  addresses.#:            "" => "1"
  addresses.0:            "" => "192.168.126.0/24"
  autostart:              "" => "true"
  bridge:                 "" => "tt0"
  dns.#:                  "" => "1"
  dns.0.hosts.#:          "" => "3"
  dns.0.hosts.0.hostname: "" => "wking-api"
  dns.0.hosts.0.ip:       "" => "192.168.126.11"
  dns.0.hosts.1.hostname: "" => "wking-etcd-0"
  dns.0.hosts.1.ip:       "" => "192.168.126.11"
  dns.0.hosts.2.hostname: "" => "wking"
  dns.0.hosts.2.ip:       "" => "192.168.126.50"
  dns.0.local_only:       "" => "true"
  dns.0.srvs.#:           "" => "1"
  dns.0.srvs.0.domain:    "" => "wking.installer.testing"
  dns.0.srvs.0.port:      "" => "2380"
  dns.0.srvs.0.protocol:  "" => "tcp"
  dns.0.srvs.0.service:   "" => "etcd-server-ssl"
  dns.0.srvs.0.target:    "" => "wking-etcd-0.installer.testing"
  dns.0.srvs.0.weight:    "" => "10"
  domain:                 "" => "installer.testing"
  mode:                   "" => "nat"
  name:                   "" => "wking"
libvirt_network.tectonic_net: Creation complete after 6s (ID: 787999bb-5eed-44fa-aa0e-4c440cecce4a)
...
module.bootstrap.libvirt_domain.bootstrap: Destroying... (ID: e6849cb6-f146-444d-aa63-edf9c70f4a86)
module.bootstrap.libvirt_domain.bootstrap: Destruction complete after 1s
module.bootstrap.libvirt_volume.bootstrap: Destroying... (ID: /home/trking/VirtualMachines/bootstrap)
module.bootstrap.libvirt_ignition.bootstrap: Destroying... (ID: /home/trking/VirtualMachines/bootstrap.ign;5be087d6-cfd0-ed41-e644-dedb336a089d)
module.bootstrap.libvirt_volume.bootstrap: Destruction complete after 0s
module.bootstrap.libvirt_ignition.bootstrap: Destruction complete after 0s

Destroy complete! Resources: 3 destroyed.

And you can confirm the cleaned-up DNS entries with:

$ dig wking-api.installer.testing +short
192.168.126.11

@wking
Copy link
Member Author

wking commented Nov 5, 2018

ok, while this gives us the correct DNS entries, it seems to clear the master IP address. After a successful auto-bootstrap-removal:

$ dig wking-api.installer.testing +short
192.168.126.11
$ ping -c1 192.168.126.11
PING 192.168.126.11 (192.168.126.11) 56(84) bytes of data.
From 192.168.126.1 icmp_seq=1 Destination Host Unreachable

--- 192.168.126.11 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

The Terraform state still claims IP addresses for the master:

$ terraform state show libvirt_domain.master | grep address
network_interface.0.addresses.#    = 1
network_interface.0.addresses.0    = 192.168.126.11

But virsh shows no such address:

$ virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" domifaddr master0
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------

I'm not sure what's going on there yet.

@wking
Copy link
Member Author

wking commented Nov 5, 2018

Ah, perhaps this is what's going on:

Sometimes, one needs to edit the network definition and apply the changes on the fly. The most common scenario for this is adding new static MAC+IP mappings for the network's DHCP server. If you edit the network with "virsh net-edit", any changes you make won't take effect until the network is destroyed and re-started, which unfortunately will cause a all guests to lose network connectivity with the host until their network interfaces are explicitly re-attached.

We may be able to update the logic from @eparis' dmacvicar/terraform-provider-libvirt#382 to allow for hosts manipulation without re-creating the whole network.

@wking
Copy link
Member Author

wking commented Nov 5, 2018

We may be able to update the logic from @eparis' dmacvicar/terraform-provider-libvirt#382 to allow for hosts manipulation without re-creating the whole network.

I've filed dmacvicar/terraform-provider-libvirt#468 for this and will see if I can work up a fix.

@wking
Copy link
Member Author

wking commented Nov 6, 2018

We may be able to update the logic from @eparis' dmacvicar/terraform-provider-libvirt#382 to allow for hosts manipulation without re-creating the whole network.

I've filed dmacvicar/terraform-provider-libvirt#468 for this and will see if I can work up a fix.

Fix in flight with dmacvicar/terraform-provider-libvirt#469, after which, the post-destroy bootstrap environment looks like:

$ virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" domifaddr master0
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet1      f2:0f:d9:25:11:aa    ipv4         192.168.126.11/24

$ virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" net-dumpxml wking | grep -A9 '<dns>'
  <dns>
    <srv service='etcd-server-ssl' protocol='tcp' domain='wking.installer.testing' target='wking-etcd-0.installer.testing' port='2380' weight='10'/>
    <host ip='192.168.126.11'>
      <hostname>wking-api</hostname>
      <hostname>wking-etcd-0</hostname>
    </host>
    <host ip='192.168.126.50'>
      <hostname>wking</hostname>
    </host>
  </dns>
$ dig wking-api.installer.testing +short
192.168.126.11

@sjenning, can you try again with a fresh plugin build from that upstream PR?

$ ~/.terraform.d/plugins/terraform-provider-libvirt -version
/home/trking/.terraform.d/plugins/terraform-provider-libvirt ef44231327555f33fb4311de442c287bc0d7fbf6
Compiled against library: libvirt 3.9.0
Using library: libvirt 3.9.0
Running hypervisor: QEMU 2.9.0
Running against daemon: 3.9.0

@abhinavdahiya
Copy link
Contributor

/hold
we need dmacvicar/terraform-provider-libvirt#469
so we have 2 options before we merge, we use the forked repo for time being or wait for upstream to merge.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 6, 2018
@wking
Copy link
Member Author

wking commented Nov 6, 2018

so we have 2 options before we merge, we use the forked repo for time being or wait for upstream to merge.

I'm fine waiting on upstream. If they take more than a few days, we can revisit temporarily pointing at a fork.

@abhinavdahiya
Copy link
Contributor

abhinavdahiya commented Nov 6, 2018

so we have 2 options before we merge, we use the forked repo for time being or wait for upstream to merge.

I'm fine waiting on upstream. If they take more than a few days, we can revisit temporarily pointing at a fork.

👍 sounds good.

unless @crawford recommends going with fork quicker.

@wking
Copy link
Member Author

wking commented Nov 9, 2018

/hold cancel

dmacvicar/terraform-provider-libvirt#469 landed :).

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2018
@abhinavdahiya
Copy link
Contributor

abhinavdahiya commented Nov 9, 2018

testing locally. automatic bootstrap deletion was stuck and had to manually invoke ./bin/openshift-install --dir dev destroy bootstrap
but,

$ dig adahiya-0-api.tt.testing +noall +answer

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> adahiya-0-api.tt.testing +noall +answer
;; global options: +cmd
adahiya-0-api.tt.testing. 0     IN      A       192.168.126.11
$ oc get nodes
NAME                       STATUS    ROLES     AGE       VERSION
adahiya-0-master-0         Ready     master    18m       v1.11.0+d4cacc0
adahiya-0-worker-0-tb78z   Ready     worker    9m        v1.11.0+d4cacc0

/approve

@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 9, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 5b3964d into openshift:master Nov 9, 2018
@wking wking deleted the libvirt-bootstrap-dns-removal branch November 10, 2018 17:42
@cgwalters
Copy link
Member

Just did an install this morning that seemed to be successful but AFAICS my master and worker didn't have IP addresses; looking in the logs:

Nov 13 13:18:03 vanadium libvirtd[1102]: 2018-11-13 18:18:03.498+0000: 1137: error : virFileReadAll:1420 : Failed to open file '/var/lib/libvirt/dnsmasq/tt0.status': No such file or directory

I did have an ocp4 network (the name of my cluster). Could that be fallout from this? Hmm, although I hadn't updated my terraform plugin in a while, trying with that updated too...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants