feat: support etcd volume #305

Merged
merged 13 commits into main from support-etcd-volume
Feb 15, 2024
Conversation

okozachenko1203
Member

fix #291

@okozachenko1203 okozachenko1203 marked this pull request as draft February 2, 2024 18:51
@okozachenko1203 okozachenko1203 marked this pull request as ready for review February 7, 2024 09:10
@mnaser
Member

mnaser commented Feb 14, 2024

@okozachenko1203: The only thing I'm 'worried' about is the behaviour of CAPO (Cluster API Provider OpenStack) when a new VM is created -- does it preserve the volume or create a new one?

If it creates a new one, that's what we want, since all the etcd data will get synced automatically. If it doesn't, we might fail to kubeadm join the control plane node.

@okozachenko1203
Member Author

> @okozachenko1203: The only thing I'm 'worried' about is the behaviour of CAPO (Cluster API Provider OpenStack) when a new VM is created -- does it preserve the volume or create a new one?
>
> If it creates a new one, that's what we want, since all the etcd data will get synced automatically. If it doesn't, we might fail to kubeadm join the control plane node.

It does recreate volumes.
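
For reference, a minimal sketch of how a dedicated etcd volume can be declared through CAPO's additionalBlockDevices field on an OpenStackMachineTemplate (v1alpha7 API; the names and values below are illustrative, not the exact manifest this PR generates). Because the block device is part of the Machine template, CAPO provisions a fresh volume for each new VM instead of reattaching the old one:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
kind: OpenStackMachineTemplate
metadata:
  name: kube-control-plane       # hypothetical name
spec:
  template:
    spec:
      flavor: m1.large           # illustrative flavor
      # A per-machine volume, mounted later at /var/lib/etcd; it is recreated
      # together with the VM, so etcd resyncs its data from the other members.
      additionalBlockDevices:
        - name: etcd
          sizeGiB: 10            # e.g. from the etcd_volume_size label
          storage:
            type: Volume
            volume:
              type: encrypted-volumes   # e.g. from the etcd_volume_type label
```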

@mnaser mnaser merged commit 544cb77 into main Feb 15, 2024
24 checks passed
@mnaser mnaser deleted the support-etcd-volume branch February 15, 2024 17:41
@fnpanic

fnpanic commented Feb 27, 2024

Can we also set the volume type with this patch, e.g. to use an encrypted volume?
According to the code, it is not possible.

@robincron
Contributor

I don't know exactly why or how, but it seems that the etcd data dir is not empty when a control plane node is built with etcd_volume_size=10 and etcd_volume_type=encrypted-volumes in our deployment.
For example, here is a control plane node with the lost+found folder inside /var/lib/etcd:
[screenshot: /var/lib/etcd containing a lost+found directory]

It looks (to me) like there is some sort of timing problem, where cloud-init fails because the folder is not empty before the preKubeadmCommands even run?

Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 audit: BPF prog-id=20 op=UNLOAD
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:23] Cloud-init v. 23.1.2-0ubuntu0~22.04.1 running 'modules:final' at Tue, 19 Mar 2024 11:22:22 +0000. Up 21.20 sec>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [init] Using Kubernetes version: v1.27.3
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [preflight] Running pre-flight checks
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] error execution phase preflight: [preflight] Some fatal errors occurred:
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] [preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] To see the stack trace of this error execute with --v=5 or higher
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] 2024-03-19 11:22:24,453 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] 2024-03-19 11:22:24,454 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 cloud-init[901]: [2024-03-19 11:22:24] Cloud-init v. 23.1.2-0ubuntu0~22.04.1 finished at Tue, 19 Mar 2024 11:22:24 +0000. Datasource DataSourceOpenSt>
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 systemd[1]: dmesg.service: Deactivated successfully.
Mar 19 11:22:24 kube-bxbqe-xx6sh-cndp2 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=dmesg comm="systemd" exe="/usr/lib/systemd/systemd" hostn>
Mar 19 11:22:27 kube-bxbqe-xx6sh-cndp2 chronyd[818]: Selected source 158.101.188.125 (2.ubuntu.pool.ntp.org)
Mar 19 11:22:27 kube-bxbqe-xx6sh-cndp2 chronyd[818]: System clock wrong by -278.022585 seconds
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 kubelet[1144]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox im>
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 kubelet[1144]: I0319 11:22:00.003195 1144 server.go:199] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet>
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 kubelet[1144]: E0319 11:22:00.003494 1144 run.go:74] "command failed" err="failed to load kubelet config file, error: failed to load Kubelet confi>
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Mar 19 11:22:00 kube-bxbqe-xx6sh-cndp2 systemd[1]: kubelet.service: Failed with result 'exit-code'.

We are running:
[screenshot: deployed component versions]

@okozachenko1203
Member Author

> I don't know exactly why or how, but it seems that the etcd data dir is not empty when a control plane node is built with etcd_volume_size=10 and etcd_volume_type=encrypted-volumes in our deployment. For example, here is a control plane node with the lost+found folder inside /var/lib/etcd: [screenshot]
>
> It looks (to me) like there is some sort of timing problem, where cloud-init fails because the folder is not empty before the preKubeadmCommands even run?

Yeah, that is why we have "rm /var/lib/etcd/lost+found -rf" as the first command in preKubeadmCommands.
https://github.com/vexxhost/magnum-cluster-api/pull/305/files#diff-43c5da84e3410f51531d7b4f6bdfe0b0f83fe9a19018cf1dde7ee2ef352bb019R603-R605
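
For context, the bootstrap side of this roughly takes the shape below, a sketch assuming the standard kubeadm bootstrap fields diskSetup, mounts, and preKubeadmCommands (the device name and filesystem label are illustrative; see the linked diff for the actual generated values):

```yaml
kubeadmConfigSpec:
  diskSetup:
    filesystems:
      - device: /dev/vdb      # hypothetical device the etcd volume attaches as
        filesystem: ext4
        label: etcd_disk      # illustrative label
  mounts:
    - - LABEL=etcd_disk
      - /var/lib/etcd
  preKubeadmCommands:
    # mkfs creates lost+found on the fresh filesystem, and kubeadm's
    # DirAvailable--var-lib-etcd preflight check requires /var/lib/etcd to be
    # empty, so the directory is removed before kubeadm runs.
    - rm /var/lib/etcd/lost+found -rf
```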

@robincron
Contributor

> Yeah, that is why we have "rm /var/lib/etcd/lost+found -rf" as the first command in preKubeadmCommands.
> https://github.com/vexxhost/magnum-cluster-api/pull/305/files#diff-43c5da84e3410f51531d7b4f6bdfe0b0f83fe9a19018cf1dde7ee2ef352bb019R603-R605

I see that, but this does not seem to work 100% of the time. Perhaps it is a timing problem, where the rm command sometimes runs before cloud-init mounts the etcd volume?
I was thinking the command could be changed to something like
while [ ! -d /var/lib/etcd ] ; do sleep 1 ; done ; rm /var/lib/etcd/lost+found -rf
to wait until the directory actually exists -- if timing is even the problem.
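
If the race is real, the guard could also key on the mount itself rather than the directory, since the /var/lib/etcd mountpoint may already exist before the volume is mounted. An untested sketch using mountpoint(1) in place of the -d test:

```yaml
preKubeadmCommands:
  # Block until cloud-init has actually mounted the etcd volume at
  # /var/lib/etcd, then clear the lost+found directory mkfs created.
  - while ! mountpoint -q /var/lib/etcd; do sleep 1; done
  - rm -rf /var/lib/etcd/lost+found
```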
