Failed Verify that all nodes actually joined / No VIP on any master #280

CanisHelix · 2023-04-17T12:54:28Z

Expected Behavior

I expected the Playbook to complete without failure, and I expected the VIP to be pingable.

Current Behavior

The playbook fails to complete. All three masters fail the node verification task:

FAILED - RETRYING: [10.23.2.37]: Verify that all nodes actually joined (check k3s-init.service if this fails) (1 retries left).
FAILED - RETRYING: [10.23.2.38]: Verify that all nodes actually joined (check k3s-init.service if this fails) (1 retries left).
FAILED - RETRYING: [10.23.2.36]: Verify that all nodes actually joined (check k3s-init.service if this fails) (1 retries left).
fatal: [10.23.2.37]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.083028", "end": "2023-04-17 12:41:07.069478", "msg": "non-zero return code", "rc": 1, "start": "2023-04-17 12:41:06.986450", "stderr": "The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
fatal: [10.23.2.38]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.104864", "end": "2023-04-17 12:41:09.726744", "msg": "", "rc": 0, "start": "2023-04-17 12:41:09.621880", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
fatal: [10.23.2.36]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:00.197526", "end": "2023-04-17 12:41:11.921748", "msg": "", "rc": 0, "start": "2023-04-17 12:41:11.724222", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
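
The failing task runs the `k3s kubectl get nodes` command shown in the output above. On 10.23.2.36 and 10.23.2.38 it returns `rc: 0` with an empty `stdout`, meaning no node carries the master label yet; on 10.23.2.37 the API server refuses the connection. Below is a minimal sketch for repeating that check by hand on a master. The helper names `count_nodes` and `all_masters_joined` are hypothetical, and the expected count of 3 is taken from the [master] group in this report. How the playbook itself evaluates the output is not shown in this thread, so treat the comparison as an approximation.

```shell
#!/bin/sh
# Hypothetical helper: count whitespace-separated node names in the
# jsonpath output of the kubectl command from the log above.
count_nodes() {
    printf '%s\n' "$1" | wc -w | tr -d '[:space:]'
}

# Hypothetical helper: succeed only if the number of listed masters
# ($2, a space-separated list) equals the expected number ($1).
all_masters_joined() {
    [ "$(count_nodes "$2")" -eq "$1" ]
}

# Run the real query only where k3s is installed (i.e. on a master node).
if command -v k3s >/dev/null 2>&1; then
    names=$(k3s kubectl get nodes -l node-role.kubernetes.io/master=true \
        -o=jsonpath='{.items[*].metadata.name}' 2>/dev/null)
    if all_masters_joined 3 "$names"; then
        echo "all 3 masters joined: $names"
    else
        echo "only $(count_nodes "$names") master(s) listed: $names"
    fi
fi
```

An empty list with a zero exit code, as on two of the masters here, suggests the API server answers but the nodes have not registered with the master role.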

Steps to Reproduce

Deploy 8 Debian-11 GenericCloud Images with unique cloud-init
Configure hosts.ini and all.yml
Run playbook

Context (variables)

Operating system: macOS (Ansible machine), Debian 11 GenericCloud images (k3s VMs)

Hardware: Intel 10th Gen NUC, Proxmox 7.2-4

Variables Used

all.yml

---
k3s_version: v1.24.12+k3s1
# this is the user that has ssh access to these machines
ansible_user: nigel
ansible_ssh_private_key_file: ~/.ssh/nigel.george
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "Etc/UTC"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "10.23.2.35"

# k3s_token is required so that masters can talk together securely
# this token should be alphanumeric only
k3s_token: "mySuperSecretToken"

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'

# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

# these arguments are recommended for servers as well as agents:
extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
  --datastore-endpoint mysql://k3s:myPassword@tcp(10.23.2.20:3306)/kubernetes
extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.5.11"

# metallb type frr or native
metal_lb_type: "native"

# metallb mode layer2 or bgp
metal_lb_mode: "layer2"

# bgp options
# metal_lb_bgp_my_asn: "64513"
# metal_lb_bgp_peer_asn: "64512"
# metal_lb_bgp_peer_address: "192.168.30.1"

# image tag for metal lb
metal_lb_frr_tag_version: "v7.5.1"
metal_lb_speaker_tag_version: "v0.13.9"
metal_lb_controller_tag_version: "v0.13.9"

# metallb ip range for load balancer
metal_lb_ip_range: "10.23.2.100-10.23.2.149"

# Only enable if your nodes are proxmox LXC nodes, make sure to configure your proxmox nodes
# in your hosts.ini file.
# Please read https://gist.github.com/triangletodd/02f595cd4c0dc9aac5f7763ca2264185 before using this.
# Most notably, your containers must be privileged, and must not have nesting set to true.
# Please note this script disables most of the security of lxc containers, with the trade-off being that lxc
# containers are significantly more resource efficient compared to full VMs.
# Mixing and matching VMs and lxc containers is not supported, ymmv if you want to do this.
# I would only really recommend using this if you have particularly low powered proxmox nodes where the overhead of
# VMs would use a significant portion of your available resources.
proxmox_lxc_configure: false
# the user that you would use to ssh into the host, for example if you run ssh some-user@my-proxmox-host,
# set this value to some-user
proxmox_lxc_ssh_user: root
# the unique proxmox ids for all of the containers in the cluster, both worker and master nodes
proxmox_lxc_ct_ids:
  - 200
  - 201
  - 202
  - 203
  - 204

Hosts

hosts.ini

[master]
10.23.2.36
10.23.2.37
10.23.2.38

[node]
10.23.2.41
10.23.2.42
10.23.2.50
10.23.2.51
10.23.2.52

# only required if proxmox_lxc_configure: true
# must contain all proxmox instances that have a master or worker node
# [proxmox]
# 192.168.30.43

[k3s_cluster:children]
master
node

Additional Information

Removing the --datastore-endpoint option to use etcd gives the same result.
The reset playbook (reset.yml) is executed before each attempt.
systemctl status k3s-init.service reports:

Apr 17 12:38:41 Iron-Priest-0 k3s[1291]: time="2023-04-17T12:38:41Z" level=info msg="Waiting for CRD addons.k3s.cattle.io to become available"
Apr 17 12:38:42 Iron-Priest-0 k3s[1291]: time="2023-04-17T12:38:42Z" level=info msg="Waiting for CRD addons.k3s.cattle.io to become available"
Apr 17 12:38:42 Iron-Priest-0 k3s[1291]: time="2023-04-17T12:38:42Z" level=info msg="Waiting for CRD addons.k3s.cattle.io to become available"
Apr 17 12:38:43 Iron-Priest-0 k3s[1291]: time="2023-04-17T12:38:43Z" level=info msg="Waiting for CRD addons.k3s.cattle.io to become available"

Attempting to connect to the port fails because of a self-signed certificate:

❯ curl https://127.0.0.1:6443/v1-k3s/readyz
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
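
A certificate error from curl means the connection reached a TLS listener that presented a certificate, so something is listening on port 6443 on this host. A rough sketch for separating "nothing listening" from "listening but untrusted or unauthorized" is to skip verification with curl's `--insecure` flag and look only at the HTTP status. The `/readyz` path and the helper names `classify_http_code` and `probe_apiserver` are assumptions for illustration; an unauthenticated request may well return 401, which still shows that the listener is up.

```shell
#!/bin/sh
# Hypothetical helper: interpret the status code printed by
# curl --write-out '%{http_code}'. curl prints 000 when no HTTP
# response arrived (connection refused, timeout, and so on).
classify_http_code() {
    case "$1" in
        000)     echo "no response: connection refused or timed out" ;;
        2??)     echo "listening and ready (HTTP $1)" ;;
        401|403) echo "listening, but the request was not authorized (HTTP $1)" ;;
        *)       echo "listening, not ready or unexpected status (HTTP $1)" ;;
    esac
}

# Hypothetical helper: probe port 6443 on the given host without
# verifying the self-signed chain. --insecure is for diagnostics only.
probe_apiserver() {
    code=$(curl --insecure --silent --output /dev/null --max-time 5 \
        --write-out '%{http_code}' "https://$1:6443/readyz")
    classify_http_code "${code:-000}"
}
```

Calling `probe_apiserver 127.0.0.1` on a master and `probe_apiserver 10.23.2.35` from any machine on the network would distinguish a local API server that is up from a VIP that was never assigned.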

10.23.2.35 is never reachable, presumably because init is failing.

Possible Solution

I've checked the General Troubleshooting Guide

Answered by CanisHelix

Apr 18, 2023

timothystewart6 · 2023-04-18T02:36:25Z

Maintainer

This scenario works in CI as well as on my machines. Are you sure you have the correct interface name?
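
One way to check the interface name on each node, sketched under assumptions: all.yml sets `flannel_iface: "eth0"`, so an interface with that exact name has to exist on every host. The helper names `iface_exists` and `check_flannel_iface` and the `SYSFS_NET` override are hypothetical; `/sys/class/net` is the standard Linux location that lists network interfaces.

```shell
#!/bin/sh
# Directory where Linux exposes one entry per network interface.
# SYSFS_NET is a hypothetical override, mainly useful for testing.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Hypothetical helper: succeed if the named interface exists on this host.
iface_exists() {
    [ -e "$SYSFS_NET/$1" ]
}

# Hypothetical helper: report whether the interface named in
# flannel_iface is present, listing the available ones if it is not.
check_flannel_iface() {
    if iface_exists "$1"; then
        echo "$1: present"
    else
        echo "$1: missing (available: $(ls "$SYSFS_NET" 2>/dev/null | tr '\n' ' '))"
        return 1
    fi
}
```

Running `check_flannel_iface eth0` on each VM, followed by `ip -4 addr show dev eth0`, would confirm both the name and the address that `k3s_node_ip` resolves to.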

CanisHelix · 2023-04-18T02:52:28Z

Author

@timothystewart6 Thanks for the quick response, I was able to get it working using etcd eventually, but I had to run the reset.yml playbook twice before re-attempting site.yml. A single reset.yml prior to a site.yml was producing the same results.

I restored the VMs to their original debian-11-genericcloud-amd64-20230124-1270.qcow2 state, plus a udev change (and reboot) to remap all of them to eth0. Running site.yml on the clean images produced the same problem.

This morning I restored them to the debian-11-genericcloud-amd64-20230124-1270.qcow2 state with the udev (and reboot) changes once more, but executed reset.yml twice and then site.yml. Now it's all working fine, with the MySQL datastore too.

Perhaps this is unique to this cloud image, but it seems reset.yml must be run even on a clean install that uses this cloud image.
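
The sequence that worked here (reset.yml twice, then site.yml) can be sketched as a small wrapper. The inventory path below is an assumption; substitute whichever path holds your hosts.ini. The `run` and `redeploy` helper names and the `DRY_RUN` switch are hypothetical, and the functions only print the commands unless `DRY_RUN` is set to 0.

```shell
#!/bin/sh
# Assumed inventory location; adjust to your checkout.
INVENTORY=${INVENTORY:-inventory/my-cluster/hosts.ini}

# Hypothetical helper: print the command in dry-run mode (the default),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reset twice, then deploy, stopping at the first failure.
redeploy() {
    run ansible-playbook reset.yml -i "$INVENTORY" &&
    run ansible-playbook reset.yml -i "$INVENTORY" &&
    run ansible-playbook site.yml -i "$INVENTORY"
}
```

Calling `redeploy` prints the three commands for review; exporting `DRY_RUN=0` first makes it run them. Why a second reset is needed on this image is not established in this thread.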
