
Libvirt installation failures: master node trying to access Ignition config from... itself #889

Closed
jlebon opened this issue Dec 12, 2018 · 13 comments


@jlebon
Member

jlebon commented Dec 12, 2018

Version

$ bin/openshift-install version
bin/openshift-install v0.5.0-master-95-gcbc7c5c89c7fe2e612932c9d3b4134289dfa2e19
Terraform v0.11.10
$ ~/.terraform.d/plugins/terraform-provider-libvirt -version
/home/jlebon/.terraform.d/plugins/terraform-provider-libvirt was not built correctly
Compiled against library: libvirt 4.1.0
Using library: libvirt 4.1.0
Running hypervisor: QEMU 2.11.2
Running against daemon: 4.1.0

Platform (aws|libvirt|openstack):

libvirt

What happened?

The master VM is stuck in the initrd because it's trying to fetch its Ignition config from itself:

$ virsh console test1-master-0
Connected to domain test1-master-0
Escape character is ^]
[*     ] A start job is running for Ignition (disks) (23min 38s / no limit)[ 1420.792998] ignition[463]: GET https://test1-api.mco.testing:49500/config/master: attempt #288
[ 1420.801267] ignition[463]: GET error: Get https://test1-api.mco.testing:49500/config/master: dial tcp 192.168.126.11:49500: getsockopt: connection refused
[  *** ] A start job is running for Ignition (disks) (23min 43s / no limit)[ 1425.801591] ignition[463]: GET https://test1-api.mco.testing:49500/config/master: attempt #289
[ 1425.809157] ignition[463]: GET error: Get https://test1-api.mco.testing:49500/config/master: dial tcp 192.168.126.11:49500: getsockopt: connection refused

$ virsh domifaddr test1-master-0
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      aa:0b:ae:0e:a0:35    ipv4         192.168.126.11/24

Shouldn't it be fetching it from the bootstrap VM?
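
(For reference, the console output above is just Ignition retrying the config fetch indefinitely. A minimal sketch of that behaviour, with the URL and roughly 5-second interval copied from the log; this is only an illustration, not Ignition's actual code:)

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// URL and retry cadence taken from the console log above; this only
	// illustrates "retry until the endpoint answers", it is not Ignition code.
	url := "https://test1-api.mco.testing:49500/config/master"
	client := &http.Client{Timeout: 10 * time.Second}
	for attempt := 1; ; attempt++ {
		fmt.Printf("GET %s: attempt #%d\n", url, attempt)
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("GET error: %v\n", err)
			time.Sleep(5 * time.Second)
			continue
		}
		resp.Body.Close()
		fmt.Printf("GET result: %s\n", resp.Status)
		return
	}
}
```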

How to reproduce it (as minimally and precisely as possible)?

$ OPENSHIFT_INSTALL_BASE_DOMAIN=mco.testing OPENSHIFT_INSTALL_CLUSTER_NAME=test1 OPENSHIFT_INSTALL_EMAIL_ADDRESS=jlebon@redhat.com OPENSHIFT_INSTALL_PASSWORD=admin OPENSHIFT_INSTALL_PLATFORM=libvirt OPENSHIFT_INSTALL_PULL_SECRET_PATH=$PWD/tectonic.pull-secret OPENSHIFT_INSTALL_SSH_PUB_KEY_PATH=~/.ssh/id_rsa.lux.pub OPENSHIFT_INSTALL_LIBVIRT_URI=qemu+tcp://192.168.122.1/system OPENSHIFT_INSTALL_LIBVIRT_IMAGE=file:///var/srv/imgs/redhat-coreos-maipo-47.199-qemu.qcow2 bin/openshift-install create cluster
INFO Creating cluster...
INFO Waiting 30m0s for the Kubernetes API...
@crawford
Contributor

We use RRDNS to put both the masters and the bootstrap node behind test1-api.mco.testing. The master will eventually pull the config from the bootstrap node.
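
Roughly speaking, the single API name resolves to several A records and the client is expected to try them; as an illustration only (hostname taken from this cluster, not installer code):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Illustration of what RRDNS gives the client: one name (the cluster's
	// API name from this issue) mapping to several A records (the bootstrap
	// node and the masters), which the client should try until one answers.
	addrs, err := net.LookupHost("test1-api.mco.testing")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, addr := range addrs {
		fmt.Println("candidate endpoint:", addr)
	}
}
```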

@jlebon
Member Author

jlebon commented Dec 12, 2018

OK, right, I do see some attempts to reach the bootstrap node now as well. But:

[root@test1-bootstrap ~]# ss -lntp | grep 49500
[root@test1-bootstrap ~]# uptime
 22:04:45 up 45 min,  1 user,  load average: 1.11, 1.11, 0.96

I'm not sure what I should be looking for on the bootstrap node to debug why whatever is supposed to listen there isn't up yet. bootkube.service is just stuck printing this over and over:

Dec 12 22:01:38 test1-bootstrap bootkube.sh[5715]: https://test1-etcd-0.mco.testing:2379 is unhealthy: failed to connect: dial tcp 192.168.126.11:2379: getsockopt: connection refused
Dec 12 22:01:38 test1-bootstrap bootkube.sh[5715]: Error:  unhealthy cluster
Dec 12 22:01:38 test1-bootstrap bootkube.sh[5715]: etcdctl failed. Retrying in 5 seconds...

@jlebon
Member Author

jlebon commented Dec 12, 2018

Ahh, I do see this:

Dec 12 21:20:44 test1-bootstrap bootkube.sh[3102]: cp: cannot stat ‘mco-bootstrap/bootstrap/manifests/*’: No such file or directory

Regression from #879 perhaps?

@crawford
Contributor

/cc @abhinavdahiya

@abhinavdahiya
Contributor

The release image is not getting promoted (https://origin-release.svc.ci.openshift.org/), so registry.svc.ci.openshift.org/openshift/origin-release:v4.0 still points to a very old machine-config-operator.

The PRs openshift/machine-config-operator#226 and #879 were merged with green CI because CI always uses the latest images from the origin-v4.0 imagestream.

@cgwalters
Member

Is there a /retest equivalent for the payload gating?

@cgwalters
Member

I can confirm that e.g. env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2018-12-13-150530 openshift-install ... gets me much farther.

@cgwalters
Member

Well... except that since that release payload failed e2e-gcp, it got GC'd right afterward, and now my cluster is half-up in ImagePullBackOff 😢

@ironcladlou
Contributor

Same symptoms on AWS. All non-bootstrap nodes stuck in Ignition:

[  166.591687] ignition[446]: GET error: Get https://dmace-api.devcluster.openshift.com:49500/config/master: dial tcp: lookup dmace-api.devcluster.openshift.com on 10.0.0.2:53: no such host
[  *** ] A start job is running for Ignition (disks) (2min 44s / no limit) ... (same spinner message repeating through 2min 48s)    2018-12-13T14:49:59.000Z

According to the internal LB target pool, all the nodes are registered but only the bootstrap node is healthy. DNS looks okay so far.

@cgwalters
Member

We use RRDNS

It looks like the glibc resolver doesn't do that: https://daniel.haxx.se/blog/2012/01/03/getaddrinfo-with-round-robin-dns-and-happy-eyeballs/

@wking
Member

wking commented Dec 13, 2018

The release image is not getting promoted...

4.0.0-0.alpha-2018-12-13-221300 was just accepted, which should address this particular issue.

/close

@openshift-ci-robot
Contributor

@wking: Closing this issue.

In response to this:

The release image is not getting promoted...

4.0.0-0.alpha-2018-12-13-221300 was just accepted, which should address this particular issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@crawford
Contributor

@WalterS it's up to the application to implement the address randomization since the order in which addresses are returned is deterministic and well defined. Ignition has a specific workaround for Go (since the HTTP package doesn't allow address resolution to be intercepted) which achieves this.
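
A minimal sketch of that kind of workaround (not Ignition's actual implementation; the function name here is hypothetical): resolve the host yourself, shuffle the returned addresses, and give the HTTP transport a custom DialContext so each connection attempt can land on a different backend:

```go
package main

import (
	"context"
	"math/rand"
	"net"
	"net/http"
)

// shufflingDial resolves the target host itself, randomizes the returned
// addresses, and dials them in order, so repeated requests do not always hit
// the first A record. Sketch of the idea only, not Ignition's code.
func shufflingDial(ctx context.Context, network, addr string) (net.Conn, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	ips, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err
	}
	rand.Shuffle(len(ips), func(i, j int) { ips[i], ips[j] = ips[j], ips[i] })

	var dialer net.Dialer
	var lastErr error
	for _, ip := range ips {
		conn, dialErr := dialer.DialContext(ctx, network, net.JoinHostPort(ip, port))
		if dialErr == nil {
			return conn, nil
		}
		lastErr = dialErr
	}
	return nil, lastErr
}

func main() {
	client := &http.Client{
		Transport: &http.Transport{DialContext: shufflingDial},
	}
	// Hypothetical use against the endpoint from this issue.
	if resp, err := client.Get("https://test1-api.mco.testing:49500/config/master"); err == nil {
		resp.Body.Close()
	}
}
```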
