feat: support etcd volume #305

Merged
merged 13 commits into main from support-etcd-volume
Feb 15, 2024
Conversation

okozachenko1203
Member

fix #291

@okozachenko1203 okozachenko1203 marked this pull request as draft February 2, 2024 18:51
@okozachenko1203 okozachenko1203 marked this pull request as ready for review February 7, 2024 09:10
@mnaser
Member

mnaser commented Feb 14, 2024

@okozachenko1203: The only thing I'm 'worried' about is the behaviour of CAPO (Cluster API Provider OpenStack) when a new VM is created -- does it preserve the volume or create a new one?

If it creates a new one, that's what we want, since all the etcd data will get synced automatically. If it doesn't, we might fail to kubeadm join the control plane node.

@okozachenko1203
Member Author

> @okozachenko1203: The only thing I'm 'worried' about is the behaviour of CAPO (Cluster API Provider OpenStack) when a new VM is created -- does it preserve the volume or create a new one?
>
> If it creates a new one, that's what we want, since all the etcd data will get synced automatically. If it doesn't, we might fail to kubeadm join the control plane node.

It does recreate volumes.
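
For reference, a minimal sketch of how a dedicated etcd volume can be declared through CAPO's additionalBlockDevices field on an OpenStackMachineTemplate (v1alpha7 API; the names and values below are illustrative, not the exact manifest this PR generates). Because the block device is part of the Machine template, CAPO provisions a fresh volume for each new VM instead of reattaching the old one:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
kind: OpenStackMachineTemplate
metadata:
  name: kube-control-plane       # hypothetical name
spec:
  template:
    spec:
      flavor: m1.large           # illustrative flavor
      # A per-machine volume, mounted later at /var/lib/etcd; it is recreated
      # together with the VM, so etcd resyncs its data from the other members.
      additionalBlockDevices:
        - name: etcd
          sizeGiB: 10            # e.g. from the etcd_volume_size label
          storage:
            type: Volume
            volume:
              type: encrypted-volumes   # e.g. from the etcd_volume_type label
```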

@mnaser mnaser merged commit 544cb77 into main Feb 15, 2024
24 checks passed
@mnaser mnaser deleted the support-etcd-volume branch February 15, 2024 17:41
@fnpanic

fnpanic commented Feb 27, 2024

Can we also set the volume type with this patch, e.g. to use an encrypted volume?
According to the code, it is not possible.

@robincron
Contributor

I don't know exactly why or how, but it seems that the etcd data dir is not empty when a control plane node is built with etcd_volume_size=10 and etcd_volume_type=encrypted-volumes in our deployment.
For example, here is a control plane node with the lost+found folder inside /var/lib/etcd:
[screenshot: /var/lib/etcd containing a lost+found directory]

It looks (to me) like there is some sort of timing problem, where cloud-init fails because the folder is not empty before the preKubeadmCommands even run?

Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 audit: BPF prog-id=20 op=UNLOAD
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:23] Cloud-init v. 23.1.2-0ubuntu0~22.04.1 running 'modules:final' at Tue, 19 Mar 2024 11:22:22 +0000. Up 21.20 sec>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [init] Using Kubernetes version: v1.27.3
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [preflight] Running pre-flight checks
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] To see the stack trace of this error execute with --v=5 or higher
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] 2024-03-19 11:22:24,453 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] 2024-03-19 11:22:24,454 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] Cloud-init v. 23.1.2-0ubuntu0~22.04.1 finished at Tue, 19 Mar 2024 11:22:24 +0000. Datasource DataSourceOpenSt>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 systemd[1]: dmesg.service: Deactivated successfully.
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=dmesg comm="systemd" exe="/usr/lib/systemd/systemd" hostn>
Mar 19 11:22:27 kube-bxbqe-xx6sh-cndp2 chronyd[818]: Selected source 158.101.188.125 (2.ubuntu.pool.ntp.org)
Mar 19 11:22:27 kube-bxbqe-xx6sh-cndp2 chronyd[818]: System clock wrong by -278.022585 seconds
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 kubelet[1144]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox im>
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 kubelet[1144]: I0319 11:22:00.003195 1144 server.go:199] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet>
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 kubelet[1144]: E0319 11:22:00.003494 1144 run.go:74] "command failed" err="failed to load kubelet config file, error: failed to load Kubelet confi>
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 systemd[1]: kubelet.service: Failed with result 'exit-code'.

We are running:
[screenshot: deployed component versions]

@okozachenko1203
Member Author

> I don't know exactly why or how, but it seems that the etcd data dir is not empty when a control plane node is built with etcd_volume_size=10 and etcd_volume_type=encrypted-volumes in our deployment. For example, here is a control plane node with the lost+found folder inside /var/lib/etcd: [screenshot]
>
> It looks (to me) like there is some sort of timing problem, where cloud-init fails because the folder is not empty before the preKubeadmCommands even run?

Yeah, that is why we have "rm /var/lib/etcd/lost+found -rf" as the first command in preKubeadmCommands.
https://github.com/vexxhost/magnum-cluster-api/pull/305/files#diff-43c5da84e3410f51531d7b4f6bdfe0b0f83fe9a19018cf1dde7ee2ef352bb019R603-R605
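
For context, the bootstrap side of this roughly takes the shape below, a sketch assuming the standard kubeadm bootstrap fields diskSetup, mounts, and preKubeadmCommands (the device name and filesystem label are illustrative; see the linked diff for the actual generated values):

```yaml
kubeadmConfigSpec:
  diskSetup:
    filesystems:
      - device: /dev/vdb      # hypothetical device the etcd volume attaches as
        filesystem: ext4
        label: etcd_disk      # illustrative label
  mounts:
    - - LABEL=etcd_disk
      - /var/lib/etcd
  preKubeadmCommands:
    # mkfs creates lost+found on the fresh filesystem, and kubeadm's
    # DirAvailable--var-lib-etcd preflight check requires /var/lib/etcd to be
    # empty, so the directory is removed before kubeadm runs.
    - rm /var/lib/etcd/lost+found -rf
```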

@robincron
Contributor

> Yeah, that is why we have "rm /var/lib/etcd/lost+found -rf" as the first command in preKubeadmCommands.
> https://github.com/vexxhost/magnum-cluster-api/pull/305/files#diff-43c5da84e3410f51531d7b4f6bdfe0b0f83fe9a19018cf1dde7ee2ef352bb019R603-R605

I see that, but this does not seem to work 100% of the time. Perhaps it is a timing problem, where the rm command sometimes runs before cloud-init mounts the etcd volume?
I was thinking the command could be changed to something like
while [ ! -d /var/lib/etcd ] ; do sleep 1 ; done ; rm /var/lib/etcd/lost+found -rf
to wait until the directory actually exists -- if timing is even the problem.
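
If the race is real, the guard could also key on the mount itself rather than the directory, since the /var/lib/etcd mountpoint may already exist before the volume is mounted. An untested sketch using mountpoint(1) in place of the -d test:

```yaml
preKubeadmCommands:
  # Block until cloud-init has actually mounted the etcd volume at
  # /var/lib/etcd, then clear the lost+found directory mkfs created.
  - while ! mountpoint -q /var/lib/etcd; do sleep 1; done
  - rm -rf /var/lib/etcd/lost+found
```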
