
Unable to setup cluster with 1 server and 1 agent #4839

Closed · 1 task
jamezrin opened this issue Dec 25, 2021 · 6 comments

jamezrin commented Dec 25, 2021

Environmental Info:
K3s Version: v1.22.5+k3s1

Node(s) CPU architecture, OS, and Version:
Linux ocvm-a1-1 5.11.0-1022-oracle #23~20.04.1-Ubuntu SMP Fri Nov 12 15:45:47 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration:
1 server, 1 agent. Both have identical hardware and software, with unrestricted connectivity between them (all ports open on the private network). They have internet access, but only ports 80 and 443 are open to the internet.

Describe the bug:
I am trying to set up a cluster with 1 server/master node and 1 worker node. When I set up the agent on the worker VM, it never successfully joins the cluster, and both VMs log an error during the agent configuration process; please see the logs below.

Steps To Reproduce:

  • Configured two VMs with complete access between them (all ports are open on the private network). They have different hostnames.
  • Installed k3s on the master with curl -sfL https://get.k3s.io | sh -
  • Deployed some services, everything works perfectly
  • Tried installing the agent on the other VM with curl -sfL https://get.k3s.io | K3S_URL=https://<ip of master node>:6443 K3S_TOKEN="<content of node-token in the master>" sh -

Expected behavior:
I expected the worker node to join the cluster and be shown in kubectl get nodes.

Actual behavior:
Both nodes seem to exchange some information, but the worker node never joins the cluster, and both have log entries regarding this failure.

Additional context / logs:

Output of journalctl -u k3s -f on the master node:

Dec 25 18:11:38 ocvm-a1-1 k3s[8221]: time="2021-12-25T18:11:38Z" level=error msg="unable to verify hash for node 'ocvm-a1-2': hash does not match"

Output of the agent installation:

[INFO]  Finding release for channel stable
[INFO]  Using v1.22.5+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.22.5+k3s1/sha256sum-arm64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.22.5+k3s1/k3s-arm64
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Creating /usr/local/bin/ctr symlink to k3s
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s-agent.service
[INFO]  systemd: Enabling k3s-agent unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s-agent.service → /etc/systemd/system/k3s-agent.service.
[INFO]  systemd: Starting k3s-agent

Output of journalctl -u k3s-agent.service on the worker node (the last line is repeated infinitely):

Dec 22 15:57:58 ocvm-a1-2 systemd[1]: Started Lightweight Kubernetes.
Dec 22 15:57:58 ocvm-a1-2 k3s[1532]: time="2021-12-22T15:57:58Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Dec 22 15:57:58 ocvm-a1-2 k3s[1532]: time="2021-12-22T15:57:58Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/e06dacf3811b44a394f7daff154613935a5bbab34238ccb4361e4e7ceeeb77fb"
Dec 22 15:58:00 ocvm-a1-2 k3s[1532]: time="2021-12-22T15:58:00Z" level=info msg="Starting k3s agent v1.22.5+k3s1 (405bf79d)"
Dec 22 15:58:00 ocvm-a1-2 k3s[1532]: time="2021-12-22T15:58:00Z" level=info msg="Running load balancer 127.0.0.1:6444 -> [10.0.0.172:6443]"
Dec 22 15:58:00 ocvm-a1-2 k3s[1532]: time="2021-12-22T15:58:00Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

Backporting

  • Needs backporting to older releases
brandond (Member) commented Dec 26, 2021

The message you're getting:
Dec 22 15:58:00 ocvm-a1-2 k3s[1532]: time="2021-12-22T15:58:00Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
makes me suspect both VMs have the same hostname. Is that true?
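
By default, k3s uses the machine's hostname as the node name, which is why a duplicate hostname triggers this rejection. A quick sketch of the check being suggested (the hostnames below are taken from the logs in this thread; on real machines you would substitute the output of `hostname` from each VM):

```shell
# k3s derives the default node name from the hostname, so the two VMs
# must report different values (or one must pass --with-node-id).
server_host=ocvm-a1-1   # from the server's journalctl output above
agent_host=ocvm-a1-2    # from the agent's journalctl output above

if [ "$server_host" = "$agent_host" ]; then
  echo "duplicate hostname: pass --with-node-id or rename one node"
else
  echo "hostnames differ: $server_host vs $agent_host"
fi
```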

jamezrin (Author) commented:


No, they have different hostnames

brandond (Member) commented Dec 27, 2021

Ok; in that case can you take a look at https://rancher.com/docs/k3s/latest/en/architecture/#how-agent-node-registration-works and see if you somehow have a node and secret for the agent node, perhaps left over from a previous join attempt?
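
For context, the linked documentation describes the mechanism behind the error: on first join the agent generates a random password and stores it in `/etc/rancher/node/password`, while the server records it (hashed) in a `kube-system` secret named `<hostname>.node-password.k3s`. On every later join the agent must present the same password, so stale server-side state from an earlier join attempt causes exactly this rejection. A minimal illustration of the idea (this is a sketch of the general check, not k3s's actual code; the passwords are made up):

```shell
# Sketch: the server keeps a hash of the password from the first join;
# a reinstalled agent presents a freshly generated password, the hashes
# no longer match, and the join is rejected.
stored=$(printf 'password-from-first-join' | sha256sum | cut -d' ' -f1)
presented=$(printf 'password-after-reinstall' | sha256sum | cut -d' ' -f1)

if [ "$stored" != "$presented" ]; then
  echo "Node password rejected"
fi
```

Deleting the server-side secret (or the agent's password file) resets this state, which is why the fix below works.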

jamezrin (Author) commented:

That was it, thank you! I don't know how, but I never got around to reading that part of the documentation.
I uninstalled the agent from the second VM, deleted the secret corresponding to the second node on the first VM, and it started working!

ubuntu@ocvm-a1-1:~$ kubectl get secret -n kube-system | grep ocvm-a1-2
ocvm-a1-2.node-password.k3s                          Opaque                                1      64d
ubuntu@ocvm-a1-1:~$ kubectl delete secret -n kube-system ocvm-a1-2.node-password.k3s
secret "ocvm-a1-2.node-password.k3s" deleted
<VM 2 joins the cluster>
ubuntu@ocvm-a1-1:~$ kubectl get nodes
NAME        STATUS   ROLES                  AGE   VERSION
ocvm-a1-2   Ready    <none>                 18s   v1.22.5+k3s1
ocvm-a1-1   Ready    control-plane,master   65d   v1.22.5+k3s1
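
For anyone hitting the same error, the full reset on the agent side can be sketched as follows (paths are the ones shown in this thread's install log and error message; run at your own discretion):

```shell
# On the agent (worker) VM: remove stale local join state.
# k3s-agent-uninstall.sh is created by the install script (see the
# install log above); /etc/rancher/node/password is the file named
# in the "Node password rejected" error.
sudo /usr/local/bin/k3s-agent-uninstall.sh

# On the server VM: delete the leftover node-password secret
# (named "<agent hostname>.node-password.k3s" in kube-system).
kubectl delete secret -n kube-system ocvm-a1-2.node-password.k3s

# Then re-run the agent install command from "Steps To Reproduce".
```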

KunLiu1210 commented:


That helped me, thank you!

einsteinarbert commented:

Thank you. RKE2 has many unknown errors :(
