Cloud-init and etcd.service mount requires cause a race in systemd and failure to start #3807
Labels: bug
Describe the bug
Short story: etcd on a master will fail to start on our clusters if the master is restarted via `az vm deallocate` followed by `az vm start`.
Longer story: during a hard reboot (such as the vm deallocate/start cycle; note this is not a re-imaging), the systemd dependency system gets confused by a race with cloud-init around the mounting of the /var/lib/etcddisk volume. This causes systemd to never even try to start etcd, so the master never comes back online, since Kubernetes needs to talk to the local etcd. A soft reboot (or just a manual start of etcd.service) recovers, because cloud-init does not do as much during those reboots.
Steps To Reproduce
This is with a standard aks-engine 0.54.0, Kubernetes 1.18.6 cluster (skyman v7).
From the Azure CLI, I restart one of the masters like this:
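(The exact command line was not preserved here; a deallocate/start cycle along these lines reproduces it, where the resource group and VM name are placeholders rather than values from the original report.)

```sh
# Hard "reboot" of a master: deallocate releases the VM's compute
# resources, start provisions them again.
# <resource-group> and <master-vm-name> are placeholders.
az vm deallocate --resource-group <resource-group> --name <master-vm-name>
az vm start --resource-group <resource-group> --name <master-vm-name>
```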
Without a fix, if you ssh to that master after it is up and you do:
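(The exact command was not preserved; presumably a status check along these lines.)

```sh
# Check whether systemd ever attempted to start etcd after the reboot.
sudo systemctl status etcd
```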
You will see that systemd has not even attempted to start etcd after the reboot.
Expected behavior
The etcd service should start after a reboot.
AKS Engine version
0.54.0
Kubernetes version
1.18.6
Proposed Fix
There is a fix for this problem at the systemd configuration level. The deeper fix is arguably a systemd matter, but it is triggered by an interaction with cloud-init, so there is some question whether cloud-init's behavior is hitting a known limitation/issue in systemd or whether systemd should support that behavior. It appears that cloud-init effectively changes systemd's dependency graph (collection) while systemd is using (enumerating) it, and that is what causes the `Requires` arc to be lost.
However, we have a clean fix that, in our testing, completely removes the problem while retaining the needed ordering constraints for etcd.
The current etcd.service file looks like this:
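(The file contents were not preserved in this copy of the issue; based on the change list below, the relevant portion would have looked roughly like this sketch, with the Description and ExecStart lines illustrative rather than the exact aks-engine values.)

```ini
[Unit]
Description=etcd
# RequiresMountsFor adds an implicit Requires= and After= on
# var-lib-etcddisk.mount; this is the graph arc that gets broken.
RequiresMountsFor=/var/lib/etcddisk

[Service]
# Illustrative start command, not the exact aks-engine invocation.
ExecStart=/usr/bin/etcd
Restart=always
```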
The following is the working etcd.service file (plus a safety check):
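(Again a reconstruction: only the `After`, `Wants`, and `ExecStartPre` lines are from the change list below; the rest is illustrative scaffolding.)

```ini
[Unit]
Description=etcd
# After= only orders etcd behind the mount; it does not create the
# Requires-style arc that cloud-init's re-mount invalidates.
After=var-lib-etcddisk.mount
# Wants= pulls the mount in when etcd is started manually before the
# mount has otherwise been triggered.
Wants=var-lib-etcddisk.mount

[Service]
# Safety check: fail fast if /var/lib/etcddisk is not actually mounted.
ExecStartPre=/bin/mountpoint -q /var/lib/etcddisk
# Illustrative start command, not the exact aks-engine invocation.
ExecStart=/usr/bin/etcd
Restart=always
```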
Explanations of the changes (1 and 2 are required; 3 and 4 are for completeness):
1. Remove the `RequiresMountsFor`. That is the mechanism (the arc in the graph) that is messed up by cloud-init.
2. Add `var-lib-etcddisk.mount` to `After`. This makes sure the etcd service starts only after the etcddisk mount has run.
3. Add `var-lib-etcddisk.mount` to `Wants`. This makes sure that manually starting the etcd service will also pull in the mount if it has not already happened. (This should be optional, but I can imagine edge cases where it could be needed.)
4. Add `ExecStartPre=/bin/mountpoint -q /var/lib/etcddisk`. This is purely a safety check: since we no longer use `Requires`, etcd could run even if the mount failed (`After` just means after, it does not mean after-success; after-success is `Requires`, which fails due to the above). The check will only cause a failure if the mount failed, so it effectively blocks etcd from starting without the mount.
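After editing the unit, the usual systemd reload/restart sequence applies; these commands are not from the original report, just the standard way to pick up the change and confirm the new ordering:

```sh
# Reload unit definitions so systemd sees the edited etcd.service,
# then restart etcd and inspect what it is now ordered after.
sudo systemctl daemon-reload
sudo systemctl restart etcd
systemctl list-dependencies --after etcd.service
```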
Additional context
More details as to what is happening: during the boot process, systemd starts cloud-init, which, under certain reboot scenarios, does some extra work, including manually mounting /var/lib/etcddisk. This triggers `systemd-fsck@dev-disk-by\x2dlabel-etcd_disk.service` to be restarted, which means the first instance of that unit has now failed, and that unfortunately breaks the required-mounts graph arc to the etcd service. The disk mounts just fine, but the graph arc is broken, and thus systemd does not think it can start etcd.
With the change, the arc we have is `After` and not `Requires`, so it just needs the mount to have been completed. This solves the problem. However, `After` just means after, without `Wants` (`Requires` == `After` + `Wants` + success), so we also add the `Wants` to make sure that if, for some reason, the mount was not triggered elsewhere, it will be triggered by starting etcd; and since we are `After` it, the mount will run before etcd is started.
The final change just makes etcd's startup wait, retrying in `ExecStartPre`, until the mount point is actually there. If the mount is in permanent failure, the `ExecStartPre` check will fail, systemd will retry a few times, and then etcd will simply fail (not start) with a good log of the fact that the mount point is not there (always nice to point at the right failure).
Now, in all of our testing, only changes 1 and 2 were required, and we have not noticed any problem with just those two. However, the other two changes are the belt-and-suspenders mechanism by which we make a successful start of etcd behave like `Requires` without the problem that `Requires` has.