master machineconfig pool reports degraded on new cluster #367
Token rotation issues seen again (as seen in #358):
Nothing is removed right now, see #354. I don't think this is token rotation. The problem here is likely related to #338: if the MC generated at bootstrap time isn't identical to the one the cluster generates at start, the nodes will fail to find their MC. I saw this with osImageURL, but maybe it's possible that e.g. we don't get the kubelet config in the bootstrap MC sometimes?
I provided quite a data dump there, but the tl;dr is this in the daemon logs:
Seems like something is jumping the gun on deleting the old master
To repeat: nothing is deleting MCs today. The problem is much more complex than that. A "rendered" MachineConfig object is a function of a variety of inputs, from the base templates we ship with the operator, to SSH keys, kubelet config, soon osImageURL, etc. The generated MC name includes a hash of its contents, so if there's any difference they'll have different names.

The bootstrap path generates MCs via a different codepath than the one used after the cluster comes up. You can see this in openshift/installer#1149, where I need to change things to pass the osImageURL there too. So if the bootstrap output differs from what the renderer outputs in the main cluster, the booted nodes which talk to the MCS on the bootstrap node will be looking for a MC that doesn't exist.

I think maintaining both of these paths is going to be a long-term struggle, particularly as we go and work on e.g. adding the crio CRD etc. The failure case here, of booted nodes simply not being able to find their MC, is pretty bad.

I think there are two options. First, we could change the bootstrap to pass the MC object to the first booted master, and have the operator inject it into the real cluster. The advantage is that any "drift" from the bootstrap gets reconciled; the downside of course is that we're rebooting on node bringup to change config. The other path is to try to de-duplicate the bootstrap codepath more. I need to study the code more to understand how hard that would be.
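To make the naming problem concrete, here is a minimal sketch (not the MCO's actual code; the input struct and hashing details are made up for illustration) of how a content-addressed rendered MC name behaves: any drift in the inputs between the bootstrap and in-cluster render paths produces a different name, so a node booted against the bootstrap MCS looks up a name the cluster never created.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// renderedInputs is a simplified stand-in for the inputs that feed a
// rendered MachineConfig: base templates, SSH keys, kubelet config,
// osImageURL, and so on.
type renderedInputs struct {
	OSImageURL    string   `json:"osImageURL"`
	SSHKeys       []string `json:"sshKeys"`
	KubeletConfig string   `json:"kubeletConfig"`
}

// renderedName derives a content-addressed name for a pool, in the spirit of
// "rendered-<pool>-<hash of contents>". The exact hashing the MCO uses may
// differ; the point is that any input difference changes the name.
func renderedName(pool string, in renderedInputs) (string, error) {
	b, err := json.Marshal(in)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return fmt.Sprintf("rendered-%s-%x", pool, sum[:16]), nil
}

func main() {
	bootstrap := renderedInputs{OSImageURL: "registry.example/os-content@sha256:aaa"}
	cluster := bootstrap
	cluster.OSImageURL = "registry.example/os-content@sha256:bbb" // drift between the two render paths

	a, _ := renderedName("master", bootstrap)
	b, _ := renderedName("master", cluster)
	fmt.Println(a)
	fmt.Println(b) // different name: nodes booted via the bootstrap MCS can't find their MC
}
```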
@abhinavdahiya does ⬆️ sound right? Any other ideas?
OK, been reading installer code this morning. Today on the bootstrap node you'll see this:
These are generated by the MCC in bootstrap mode, which runs as a static pod on the bootstrap node. There's a separate (strangely named?) bootstrap server:
machine-config-operator/pkg/server/bootstrap_server.go Lines 42 to 58 in e621b7c
@abhinavdahiya can you add a couple of words to that? Remember, most of us here are learning a codebase we didn't create; getting up to speed on all of it is going to take some time. Are you thinking that the bootstrap MCS would also serve the bootstrap MC directly embedded in the Ignition, and e.g. the MCD would find it on disk and create it if not found?
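If that is the idea, the control flow in the MCD might look roughly like the sketch below. The `machineConfig` type and `mcStore` are hypothetical stand-ins (the real types live in the MCO API package and its generated clientset); the point is only the "create it in the cluster if the API doesn't have it" step.

```go
package main

import (
	"errors"
	"fmt"
)

// machineConfig and mcStore are simplified stand-ins for the real MCO types
// and client; mcStore here is just an in-memory fake for the example.
type machineConfig struct {
	Name string
}

type mcStore map[string]*machineConfig

var errNotFound = errors.New("machineconfig not found")

func (s mcStore) Get(name string) (*machineConfig, error) {
	if mc, ok := s[name]; ok {
		return mc, nil
	}
	return nil, errNotFound
}

func (s mcStore) Create(mc *machineConfig) error {
	s[mc.Name] = mc
	return nil
}

// ensureBootstrapMC: if the MC the node booted with isn't in the cluster
// (bootstrap vs. in-cluster render drift), inject the on-disk copy instead
// of leaving the node degraded.
func ensureBootstrapMC(s mcStore, onDisk *machineConfig) error {
	if _, err := s.Get(onDisk.Name); err == nil {
		return nil // the in-cluster renderer produced the same MC
	} else if !errors.Is(err, errNotFound) {
		return err
	}
	fmt.Printf("MC %s missing from cluster, creating from on-disk copy\n", onDisk.Name)
	return s.Create(onDisk)
}

func main() {
	cluster := mcStore{} // empty: the renderer produced a differently named MC
	booted := &machineConfig{Name: "master-b41804f2dd413f9ac7a730e8a241d716"}
	if err := ensureBootstrapMC(cluster, booted); err != nil {
		panic(err)
	}
}
```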
Another idea I had: given that a MachineConfig object and Ignition are almost the same thing, I could imagine that we embed the necessary data inside the Ignition JSON and turn it into a MC. Then the MCD wouldn't need to hit the cluster to find its current config; it could theoretically make GC easier too.
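As a rough illustration of that idea (the `machineConfig` carrier field, type layout, and names below are made up for the sketch and are not part of the Ignition 2.2 spec or the real MCO types): read the Ignition JSON the node booted with, pull out an embedded MachineConfig payload, and reconstruct the object locally without an API round trip.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical layout: imagine the render step stashing a serialized
// MachineConfig inside the Ignition JSON it serves to the node.
type ignitionConfig struct {
	Ignition struct {
		Version string `json:"version"`
	} `json:"ignition"`
	// Hypothetical carrier for the MC payload; not part of the Ignition 2.2 spec.
	MachineConfig json.RawMessage `json:"machineConfig,omitempty"`
}

type machineConfig struct {
	Name       string `json:"name"`
	OSImageURL string `json:"osImageURL"`
}

// machineConfigFromIgnition reconstructs the MC the node booted with from the
// embedded payload, so the MCD wouldn't need to look it up in the cluster.
func machineConfigFromIgnition(raw []byte) (*machineConfig, error) {
	var ign ignitionConfig
	if err := json.Unmarshal(raw, &ign); err != nil {
		return nil, err
	}
	if len(ign.MachineConfig) == 0 {
		return nil, fmt.Errorf("no embedded MachineConfig in Ignition config")
	}
	var mc machineConfig
	if err := json.Unmarshal(ign.MachineConfig, &mc); err != nil {
		return nil, err
	}
	return &mc, nil
}

func main() {
	raw := []byte(`{"ignition":{"version":"2.2.0"},"machineConfig":{"name":"rendered-master-abc123","osImageURL":"registry.example/os-content@sha256:..."}}`)
	mc, err := machineConfigFromIgnition(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("booted with MC %s (osImageURL %s)\n", mc.Name, mc.OSImageURL)
}
```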
Would we need to expand the MC spec a bit (which is doable), or are you thinking of adding a section to the Ignition config?
Currently testing:
The node gets Ignition JSON; not sure yet actually if Ignition saves it around somewhere for other processes to read. But let's say Ignition saved it for us somewhere on disk.
I don't believe so, but @ajeddeloh could answer.
OK, I follow. So it would be utilizing an unused section within the Ignition config?
The current Ignition 2.2 spec: https://coreos.com/ignition/docs/latest/configuration-v2_2.html
This is an ugly fix for openshift/machine-config-operator#367
PR in openshift/installer#1189
One thing that's a pain about this is that it's hard to debug why the MCs are different; the bootstrap node will often already have been torn down, so you can't just go and diff it against the current one. I haven't done that yet for the cases I've hit; it may work to patch the installer to disable bootstrap destruction. Or, with my installer patch, they should be reliably in the target cluster, and we'll be able to see whether e.g. the MCO does node updates on an initial install due to drift.
PRs welcome that help collect that debug information in CI. But openshift/installer#1189 doesn't seem like a solution.
Re. sneaking in third-party keys in the Ignition config, this is discussed here: coreos/ignition#696. |
Mmm...I guess we could scrape off the bootstrap MCs in the e2e-aws Prow job and put them in artifacts? Hm, but not sure how to do that, since the bootstrap node will usually be torn down by the installer.
OK, but...given that we've never had CI gating on degraded, we simply don't know when this started (I think it was relatively recently, but I can't be sure). IMO we need to fix it as quickly as possible and start gating on not regressing here, even if the fix isn't ideal.
OK, this is fallout from #343. Here's a diff of my local libvirt cluster's bootstrap MC versus the in-cluster one:
It looks to me like 6f0f3ff needs a corresponding change on the installer side.

Edit: Bigger picture, it feels to me like there's too tight a coupling between the installer and the MCO; the dance necessary to shuffle data from the release payload into the MCO isn't worth it. Rather, we could just pass the whole imagestream mapping into the MCO bootstrap or so, particularly since the MCO is carrying multiple external images (etcd, machine-os-content) and is necessarily heavily involved in cluster setup.
This was a missed corresponding change after openshift/machine-config-operator@6f0f3ff landed. Ref: openshift/machine-config-operator#367
Took a stab at this in openshift/installer#1194 - trying to test locally but the libvirt download is being slow for some reason. |
Referenced PRs are merged.
This should be fixed now. |
Since having a mismatch here will result in a broken cluster, change things so that the command line arguments are required and drop the `docker.io` image references. We never want to pull those into a real cluster. Ref: openshift#367
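The shape of that change, roughly (the flag names below are illustrative, not the actual MCO flags): fail fast when an image reference isn't passed, instead of silently falling back to a hard-coded docker.io default that a real cluster should never pull.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Illustrative flag names; the point is requiring them rather than
	// defaulting to docker.io images we never want in a real cluster.
	etcdImage := flag.String("etcd-image", "", "etcd image pullspec (required)")
	osContent := flag.String("machine-os-content", "", "machine-os-content pullspec (required)")
	flag.Parse()

	for name, val := range map[string]string{
		"etcd-image":         *etcdImage,
		"machine-os-content": *osContent,
	} {
		if val == "" {
			fmt.Fprintf(os.Stderr, "--%s is required; refusing to fall back to a docker.io default\n", name)
			os.Exit(1)
		}
	}
	fmt.Println("bootstrap render inputs:", *etcdImage, *osContent)
}
```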
Just installed the cluster and the MCO reports failure. In fact, all masters show `state: Degraded`. However, for 2 of the 3, the `currentConfig` and `desiredConfig` are the same, which conflicts with the `state`. `machine-config-daemon` is running on all nodes.

logs from the daemon on the degraded master

Indeed, the old `MachineConfig` `master-b41804f2dd413f9ac7a730e8a241d716` has been removed.