This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Cloud-init and etcd.server mount requires cause a race in systemd and failure to start #3807

Closed
Michael-Sinz opened this issue Sep 9, 2020 · 1 comment · Fixed by #3809
Labels
bug Something isn't working

Comments

@Michael-Sinz
Collaborator

Describe the bug
Short story: etcd on masters will fail to start on our clusters if the master is "az vm deallocate" then "az vm start" to restart it.

Longer story: during a hard reboot (such as the vm deallocate/start above - note this is not a re-imaging), the systemd dependency system gets confused by a race with cloud-init around the mounting of the /var/lib/etcddisk volume. As a result, systemd never even tries to start etcd, and the master never comes back online, since the local etcd is what kubernetes needs to talk to. A soft reboot (or just a manual start of etcd.service) recovers, because cloud-init does not do as much work during those reboots.

Steps To Reproduce
This is with a standard aks-engine 0.54.0 kubernetes 1.18.6 cluster (skyman v7)

From the azure CLI, I restart one of the masters like this:

az vm deallocate --resource-group SCM__MSINZ1 --name k8s-master-18861755-0
az vm start --resource-group SCM__MSINZ1 --name k8s-master-18861755-0 --no-wait

Without a fix, if you ssh to that master after it is back up and run:

systemctl status etcd

You will see that systemd has not even attempted to start etcd after the reboot.

Expected behavior
The etcd service should start after a reboot.

AKS Engine version
0.54.0

Kubernetes version
1.18.6

Proposed Fix
There is a fix for this problem at the systemd configuration level. The deeper issue is arguably a systemd bug, but it is triggered by an interaction with cloud-init, so it is debatable whether cloud-init's behavior is hitting a known limitation in systemd or whether systemd should handle this case. It appears that cloud-init is effectively modifying systemd's dependency graph (collection) while systemd is enumerating it, and that is what causes the Requires edge to be lost.

However, we have a clean fix that, in our testing, fully removes the problem while retaining the needed ordering constraints for etcd.

The current etcd.service file looks like this:

[Unit]
Description=etcd - highly-available key value store
Documentation=https://github.com/coreos/etcd
Documentation=man:etcd
After=network.target
Wants=network-online.target
RequiresMountsFor=/var/lib/etcddisk
[Service]
Environment=DAEMON_ARGS=
Environment=ETCD_NAME=%H
Environment=ETCD_DATA_DIR=
EnvironmentFile=-/etc/default/%p
Type=notify
User=etcd
PermissionsStartOnly=true
ExecStart=/usr/bin/etcd $DAEMON_ARGS
Restart=always
[Install]
WantedBy=multi-user.target

The following is the working etcd.service file: (plus a safety check)

[Unit]
Description=etcd - highly-available key value store
Documentation=https://github.com/coreos/etcd
Documentation=man:etcd
After=network.target var-lib-etcddisk.mount
Wants=network-online.target var-lib-etcddisk.mount
[Service]
Environment=DAEMON_ARGS=
Environment=ETCD_NAME=%H
Environment=ETCD_DATA_DIR=
EnvironmentFile=-/etc/default/%p
Type=notify
User=etcd
PermissionsStartOnly=true
ExecStartPre=/bin/mountpoint -q /var/lib/etcddisk
ExecStart=/usr/bin/etcd $DAEMON_ARGS
Restart=always
[Install]
WantedBy=multi-user.target

Explanations of the changes: (1 and 2 are required, 3 and 4 are for completeness)

  1. Remove RequiresMountsFor - this is the mechanism/edge in the dependency graph that gets broken by cloud-init.

  2. Add var-lib-etcddisk.mount to After - this ensures the etcd service starts only after the etcddisk mount has run.

  3. Add var-lib-etcddisk.mount to Wants - this ensures that manually starting the etcd service will also pull in the mount if it has not already happened. (This should be optional, but there are imaginable edge cases where it could be needed.)

  4. Add ExecStartPre=/bin/mountpoint -q /var/lib/etcddisk - this is purely a safety check. Since we no longer use Requires, etcd could otherwise run even if the mount failed (After only means after; it does not mean after-success - that is what Requires provides, and Requires is what breaks as described above). This check only fails if the mount failed, so it effectively blocks etcd from starting without the mount.
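One way to apply changes 1-4 without replacing the shipped unit file is a systemd drop-in override. The sketch below is an assumption on my part, not from the issue: it relies on the standard drop-in path convention and on the fact that assigning an empty value to a list-valued setting like RequiresMountsFor= resets any prior assignments.

```ini
# Hypothetical drop-in: /etc/systemd/system/etcd.service.d/10-mount-ordering.conf
# Apply with: systemctl daemon-reload && systemctl restart etcd
[Unit]
# Reset the list, dropping the RequiresMountsFor from the base unit
RequiresMountsFor=
# Order etcd strictly after the mount, and pull the mount in on manual starts
After=var-lib-etcddisk.mount
Wants=var-lib-etcddisk.mount

[Service]
# Safety check: refuse to start if the mount is not actually present
ExecStartPre=/bin/mountpoint -q /var/lib/etcddisk
```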

Additional context
More details as to what is happening: during the boot process, systemd starts cloud-init which, under certain reboot scenarios, does extra work, including manually mounting /var/lib/etcddisk. This triggers a restart of systemd-fsck@dev-disk-by\x2dlabel-etcd_disk.service, which means the first instance of that unit is now in a failed state, and that unfortunately breaks the RequiresMountsFor graph edge to the etcd service. The disk mounts just fine, but with the edge broken, systemd does not think it can start etcd.

With the change, the edge we have is "After" rather than "Requires", so etcd just needs the mount to have completed. This solves the problem. However, After only provides ordering, not activation (Requires == After + Wants + Success), so we add the Wants to ensure that if, for some reason, the mount was not triggered elsewhere, starting etcd will trigger it - and since we are After it, the mount runs before etcd starts.

The final change makes etcd fail in ExecStartPre until the mount point is actually present. If the mount is in permanent failure, ExecStartPre will fail, retry a few times, and then leave etcd failed (not started) with a clear log showing that the mount point is missing (it is always nice to point at the right failure).
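The ExecStartPre guard works because mountpoint(1) exits 0 when its argument is a mount point and non-zero otherwise, and systemd aborts the start if any ExecStartPre command fails. A quick sketch of that exit-status behavior (the temp directory here is illustrative, standing in for an unmounted /var/lib/etcddisk):

```shell
#!/bin/sh
# / is always a mount point, so this succeeds (exit status 0)
mountpoint -q / && echo "root is a mount point"

# A freshly created temp directory is not a mount point, so
# mountpoint exits non-zero - in the unit, this failure would
# block etcd from starting, just like Requires would have
d=$(mktemp -d)
if ! mountpoint -q "$d"; then
  echo "not a mount point: etcd start would be blocked"
fi
rmdir "$d"
```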

Now, in all of our testing, only changes 1 and 2 were required, and we have not noticed any problems with just those two. However, the other two changes are the "belt and suspenders" mechanism that makes a successful start of etcd behave like Requires without inheriting the Requires problem.

@Michael-Sinz Michael-Sinz added the bug Something isn't working label Sep 9, 2020
@Michael-Sinz
Collaborator Author

Note that only changes #1 and #2 are absolutely needed - in fact, our tests show these alone were sufficient for reliable operation. But that is "scary" in that the result is not semantically the same as the original unit, only operationally sufficient. With changes #3 and #4 we become not just operationally sufficient but semantically equivalent. Well, not quite the same - there are some subtle differences, but those differences are exactly what make it work (fix the problem).

I am going to be putting a patch into our current clusters that makes these changes, as the loss of masters happens often enough.

I still don't know all of the different boot conditions that cause the cloud-init/systemd interaction to happen. I do know the ones we have used to reproduce the problem but since this is somewhat outside of my control, having the system not be susceptible to that interaction/race is critical for running our clusters at this scale.
