
Libvirt installation failures: master node trying to access Ignition config from... itself #889

Closed
jlebon opened this issue Dec 12, 2018 · 13 comments


@jlebon
Member

jlebon commented Dec 12, 2018

Version

$ bin/openshift-install version
bin/openshift-install v0.5.0-master-95-gcbc7c5c89c7fe2e612932c9d3b4134289dfa2e19
Terraform v0.11.10
$ ~/.terraform.d/plugins/terraform-provider-libvirt -version
/home/jlebon/.terraform.d/plugins/terraform-provider-libvirt was not built correctly
Compiled against library: libvirt 4.1.0
Using library: libvirt 4.1.0
Running hypervisor: QEMU 2.11.2
Running against daemon: 4.1.0

Platform (aws|libvirt|openstack):

libvirt

What happened?

The master VM is stuck in the initrd because it's trying to fetch its Ignition config from itself:

$ virsh console test1-master-0
Connected to domain test1-master-0
Escape character is ^]
[*     ] A start job is running for Ignition (disks) (23min 38s / no limit)[ 1420.792998] ignition[463]: GET https://test1-api.mco.testing:49500/config/master: attempt #288
[ 1420.801267] ignition[463]: GET error: Get https://test1-api.mco.testing:49500/config/master: dial tcp 192.168.126.11:49500: getsockopt: connection refused
[  *** ] A start job is running for Ignition (disks) (23min 43s / no limit)[ 1425.801591] ignition[463]: GET https://test1-api.mco.testing:49500/config/master: attempt #289
[ 1425.809157] ignition[463]: GET error: Get https://test1-api.mco.testing:49500/config/master: dial tcp 192.168.126.11:49500: getsockopt: connection refused

$ virsh domifaddr test1-master-0
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet0      aa:0b:ae:0e:a0:35    ipv4         192.168.126.11/24

Shouldn't it be fetching it from the bootstrap VM?
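
(For reference, the console output above is just Ignition retrying the config fetch indefinitely. A minimal sketch of that behaviour, with the URL and roughly 5-second interval copied from the log; this is only an illustration, not Ignition's actual code:)

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// URL and retry cadence taken from the console log above; this only
	// illustrates "retry until the endpoint answers", it is not Ignition code.
	url := "https://test1-api.mco.testing:49500/config/master"
	client := &http.Client{Timeout: 10 * time.Second}
	for attempt := 1; ; attempt++ {
		fmt.Printf("GET %s: attempt #%d\n", url, attempt)
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("GET error: %v\n", err)
			time.Sleep(5 * time.Second)
			continue
		}
		resp.Body.Close()
		fmt.Printf("GET result: %s\n", resp.Status)
		return
	}
}
```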

How to reproduce it (as minimally and precisely as possible)?

$ OPENSHIFT_INSTALL_BASE_DOMAIN=mco.testing OPENSHIFT_INSTALL_CLUSTER_NAME=test1 OPENSHIFT_INSTALL_EMAIL_ADDRESS=jlebon@redhat.com OPENSHIFT_INSTALL_PASSWORD=admin OPENSHIFT_INSTALL_PLATFORM=libvirt OPENSHIFT_INSTALL_PULL_SECRET_PATH=$PWD/tectonic.pull-secret OPENSHIFT_INSTALL_SSH_PUB_KEY_PATH=~/.ssh/id_rsa.lux.pub OPENSHIFT_INSTALL_LIBVIRT_URI=qemu+tcp://192.168.122.1/system OPENSHIFT_INSTALL_LIBVIRT_IMAGE=file:///var/srv/imgs/redhat-coreos-maipo-47.199-qemu.qcow2 bin/openshift-install create cluster
INFO Creating cluster...
INFO Waiting 30m0s for the Kubernetes API...
@crawford
Contributor

We use RRDNS to put both the masters and the bootstrap node behind test1-api.mco.testing. The master will eventually pull the config from the bootstrap node.
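
Roughly speaking, the single API name resolves to several A records and the client is expected to try them; as an illustration only (hostname taken from this cluster, not installer code):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Illustration of what RRDNS gives the client: one name (the cluster's
	// API name from this issue) mapping to several A records (the bootstrap
	// node and the masters), which the client should try until one answers.
	addrs, err := net.LookupHost("test1-api.mco.testing")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, addr := range addrs {
		fmt.Println("candidate endpoint:", addr)
	}
}
```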

@jlebon
Member Author

jlebon commented Dec 12, 2018

OK, right, I do see some attempts to reach the bootstrap node now as well. But:

[root@test1-bootstrap ~]# ss -lntp | grep 49500
[root@test1-bootstrap ~]# uptime
 22:04:45 up 45 min,  1 user,  load average: 1.11, 1.11, 0.96

I'm not sure what I should be looking for on the bootstrap node to debug why whatever is supposed to listen there isn't up yet. bootkube.service is just stuck printing this over and over:

Dec 12 22:01:38 test1-bootstrap bootkube.sh[5715]: https://test1-etcd-0.mco.testing:2379 is unhealthy: failed to connect: dial tcp 192.168.126.11:2379: getsockopt: connection refused
Dec 12 22:01:38 test1-bootstrap bootkube.sh[5715]: Error:  unhealthy cluster
Dec 12 22:01:38 test1-bootstrap bootkube.sh[5715]: etcdctl failed. Retrying in 5 seconds...

@jlebon
Member Author

jlebon commented Dec 12, 2018

Ahh, I do see this:

Dec 12 21:20:44 test1-bootstrap bootkube.sh[3102]: cp: cannot stat ‘mco-bootstrap/bootstrap/manifests/*’: No such file or directory

Regression from #879 perhaps?

@crawford
Contributor

/cc @abhinavdahiya

@abhinavdahiya
Contributor

The release image is not getting promoted (https://origin-release.svc.ci.openshift.org/), so registry.svc.ci.openshift.org/openshift/origin-release:v4.0 still points to a very old machine-config-operator.

The PRs openshift/machine-config-operator#226 and #879 were merged with green CI because CI always uses the latest images from the origin-v4.0 imagestream.

@cgwalters
Member

Is there a /retest equivalent for the payload gating?

@cgwalters
Member

I can confirm that e.g. env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/openshift/origin-release:4.0.0-0.alpha-2018-12-13-150530 openshift-install ... gets me much farther.

@cgwalters
Member

Well... except that since that release payload failed e2e-gcp, it got GC'd right afterward, and now my cluster is half-up in ImagePullBackOff 😢

@ironcladlou
Contributor

Same symptoms on AWS. All non-bootstrap nodes stuck in Ignition:

[  166.591687] ignition[446]: GET error: Get https://dmace-api.devcluster.openshift.com:49500/config/master: dial tcp: lookup dmace-api.devcluster.openshift.com on 10.0.0.2:53: no such host
[  *** ] A start job is running for Ignition (disks) (2min 44s / no limit) ... (same spinner message repeating through 2min 48s)    2018-12-13T14:49:59.000Z

According to the internal LB target pool, all the nodes are registered but only the bootstrap node is healthy. DNS looks okay so far.

@cgwalters
Member

We use RRDNS

It looks like the glibc resolver doesn't do that: https://daniel.haxx.se/blog/2012/01/03/getaddrinfo-with-round-robin-dns-and-happy-eyeballs/

@wking
Member

wking commented Dec 13, 2018

The release image is not getting promoted...

4.0.0-0.alpha-2018-12-13-221300 was just accepted, which should address this particular issue.

/close

@openshift-ci-robot
Contributor

@wking: Closing this issue.

In response to this:

The release image is not getting promoted...

4.0.0-0.alpha-2018-12-13-221300 was just accepted, which should address this particular issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@crawford
Contributor

@WalterS it's up to the application to implement the address randomization since the order in which addresses are returned is deterministic and well defined. Ignition has a specific workaround for Go (since the HTTP package doesn't allow address resolution to be intercepted) which achieves this.
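
A minimal sketch of that kind of workaround (not Ignition's actual implementation; the function name here is hypothetical): resolve the host yourself, shuffle the returned addresses, and give the HTTP transport a custom DialContext so each connection attempt can land on a different backend:

```go
package main

import (
	"context"
	"math/rand"
	"net"
	"net/http"
)

// shufflingDial resolves the target host itself, randomizes the returned
// addresses, and dials them in order, so repeated requests do not always hit
// the first A record. Sketch of the idea only, not Ignition's code.
func shufflingDial(ctx context.Context, network, addr string) (net.Conn, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	ips, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err
	}
	rand.Shuffle(len(ips), func(i, j int) { ips[i], ips[j] = ips[j], ips[i] })

	var dialer net.Dialer
	var lastErr error
	for _, ip := range ips {
		conn, dialErr := dialer.DialContext(ctx, network, net.JoinHostPort(ip, port))
		if dialErr == nil {
			return conn, nil
		}
		lastErr = dialErr
	}
	return nil, lastErr
}

func main() {
	client := &http.Client{
		Transport: &http.Transport{DialContext: shufflingDial},
	}
	// Hypothetical use against the endpoint from this issue.
	if resp, err := client.Get("https://test1-api.mco.testing:49500/config/master"); err == nil {
		resp.Body.Close()
	}
}
```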
