
k3s startup fails with "starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000024946269-0000000024946590: found 321 entries (expected 322)" #1403

Closed
bokysan opened this issue Feb 9, 2020 · 3 comments

bokysan commented Feb 9, 2020

Version: k3s version v1.17.2+k3s1 (cdab19b0)

Description:

The k3s master fails to start, with the following in the log: "starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000024946269-0000000024946590: found 321 entries (expected 322)"

This happened after the machines were forcefully shut down (power loss). There's no information available on how to resolve this error or what to do next.
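As an aside, the "expected 322" in the error appears to come from the segment filename itself: a closed dqlite segment seems to be named after the first and last raft log index it contains. A small sketch of that arithmetic (this is an inference from the error message, not documented behavior):

```python
# Hedged illustration: the closed-segment filename looks like
# "<first index>-<last index>", so the expected entry count would be
# the size of that inclusive range.
segment = "0000000024946269-0000000024946590"
first, last = (int(part) for part in segment.split("-"))
expected_entries = last - first + 1  # inclusive range
print(expected_entries)  # 322 -- matches the "expected 322" in the error
# "found 321" would then mean the file is one entry short,
# consistent with truncation during the power loss.
```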

To Reproduce:

  • install cluster using Ansible scripts on at least two nodes
  • unplug power (I guess?)

Expected behavior:

  • cluster survives power outages, or at least gives a clear path for restoring it manually

Actual behavior:

  • cluster doesn't start up anymore

Additional context

  • k3s is (was) running on a cluster of TWO machines
  • k3s non-master node seems to start up successfully
  • k3s is installed on almost clean Armbian, on Pine64
  • cluster was working fine before the power loss
uname -a
Linux ariana 5.4.7-sunxi64 #19.11.6 SMP Sat Jan 4 19:40:10 CET 2020 aarch64 GNU/Linux


lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster


cat /etc/systemd/system/k3s.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target
[Service]
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --cluster-init --write-kubeconfig-mode 664
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

/var/log/syslog:

...
Feb  9 00:00:12 ariana systemd[1]: Starting Lightweight Kubernetes...
Feb  9 00:00:12 ariana systemd[1]: Started Lightweight Kubernetes.
Feb  9 00:00:13 ariana k3s[3961]: time="2020-02-09T00:00:13.429349422Z" level=info msg="Starting k3s v1.17.2+k3s1 (cdab19b0)"
Feb  9 00:00:16 ariana k3s[3961]: time="2020-02-09T00:00:16.592512841Z" level=fatal msg="starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000024946269-0000000024946590: found 321 entries (expected 322)"
Feb  9 00:00:16 ariana systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Feb  9 00:00:16 ariana systemd[1]: k3s.service: Failed with result 'exit-code'.
Feb  9 00:00:21 ariana systemd[1]: k3s.service: Service RestartSec=5s expired, scheduling restart.
Feb  9 00:00:21 ariana systemd[1]: k3s.service: Scheduled restart job, restart counter is at 5380.
Feb  9 00:00:21 ariana systemd[1]: Stopped Lightweight Kubernetes.
...
Kampe commented Feb 27, 2020

Seeing the same issue. I was purposefully deleting master nodes at various intervals and discovered this on reboot after a couple of times.


brandond commented Feb 27, 2020

This appears to be the upstream dqlite issue: canonical/dqlite#190

dqlite is still experimental; there does not appear to be a way to recover from this at the moment. If you need production-ready HA, you should probably use an external DB.
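For reference, k3s can be pointed at an external datastore via the `--datastore-endpoint` flag. A hedged sketch only — the user, password, host, and database name below are placeholders, and the endpoint string must follow the format documented for your chosen backend:

```shell
# Sketch: run the server against an external MySQL-compatible datastore
# instead of the embedded dqlite store. All connection details here
# (user, pass, db-host, database "k3s") are placeholders.
k3s server \
  --datastore-endpoint="mysql://user:pass@tcp(db-host:3306)/k3s" \
  --write-kubeconfig-mode 664
```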

Also, a two-node dqlite cluster won't meet Raft consensus requirements (no quorum if either node goes down), so this setup probably won't ever work as expected.
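The quorum point above can be made concrete with the standard Raft majority formula — a sketch, not k3s-specific code:

```python
# Raft commits an entry only once a strict majority of voting members
# has it, so availability depends on how many failures leave a majority.
def quorum(n: int) -> int:
    """Smallest strict majority of an n-node cluster."""
    return n // 2 + 1

for n in (1, 2, 3):
    tolerated = n - quorum(n)
    print(f"{n} node(s): quorum={quorum(n)}, failures tolerated={tolerated}")
# 2 node(s): quorum=2, failures tolerated=0 -- losing either node stalls
# the cluster, which is why odd-sized (e.g. 3-node) control planes are
# the usual recommendation.
```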

stale bot commented Jul 30, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 30, 2021
@stale stale bot closed this as completed Aug 14, 2021