Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

controller join not working with AWS NLB #4557

Closed
4 tasks done
ianb-mp opened this issue Jun 6, 2024 · 2 comments · Fixed by #4607
Closed
4 tasks done

controller join not working with AWS NLB #4557

ianb-mp opened this issue Jun 6, 2024 · 2 comments · Fixed by #4607
Assignees
Labels
bug Something isn't working

Comments

@ianb-mp
Copy link
Contributor

ianb-mp commented Jun 6, 2024

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux 6.1.0-13-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Version

v1.30.1+k0s.0

Sysinfo

`k0s sysinfo`
Total memory: 1.9 GiB (pass)
Disk space available for /var/lib/k0s: 17.9 GiB (pass)
Name resolution: localhost: [::1 127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.1.0-13-cloud-amd64 (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: active (pass)
  Executable in PATH: modprobe: /usr/sbin/modprobe (pass)
  Executable in PATH: mount: /usr/bin/mount (pass)
  Executable in PATH: umount: /usr/bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (is a listed root controller) (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (is a listed root controller) (pass)
    cgroup controller "memory": available (is a listed root controller) (pass)
    cgroup controller "devices": available (device filters attachable) (pass)
    cgroup controller "freezer": available (cgroup.freeze exists) (pass)
    cgroup controller "pids": available (is a listed root controller) (pass)
    cgroup controller "hugetlb": available (is a listed root controller) (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
        CONFIG_IP_VS_SH: Source hashing scheduling: module (pass)
        CONFIG_IP_VS_RR: Round-robin scheduling: module (pass)
        CONFIG_IP_VS_WRR: Weighted round-robin scheduling: module (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: built-in (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

I have three k0s controllers running on AWS ec2 instances behind an AWS NLB (network load balancer) and using k0sctl to deploy k0s to them. I'm finding that a secondary controller will sometimes fail to join the primary controller. k0sctl gives up waiting for the join and fails the job.

I'm finding it hard to diagnose the issue due to a lack of debug log telling me what's going on (see full log output below). For example, secondary controller's last log message is:

Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=info msg="initialized join client successfully"

If I do k0sctl reset and try again it often will work. Failure vs success is perhaps 50/50. I can't see any pattern to it otherwise.

I did wonder if it might be related to a feature AWS NLB Client IP Preservation which is enabled by default. However turning this off doesn't help.

Steps to reproduce

Expected behavior

No response

Actual behavior

No response

Screenshots and logs

172.31.47.173 = primary controller
172.31.37.26 = secondary controller

primary k0scontroller log (full log) - https://gist.github.com/ianb-mp/326a61a02927fbf3d112c41e73d40a18#file-k0scontroller-primary-log

secondary k0scontroller log (full log)
Jun 06 03:41:19 ip-172-31-37-26 systemd[1]: Started k0scontroller.service - k0s - Zero Friction Kubernetes.
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check=memory property="1.9 GiB"
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check="disk:/var/lib/k0s" property="18.5 GiB"
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check="nameResolution:localhost" property="[::1 127.0.0.1]"
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check=os property=Linux
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check=os/kernelRelease property=6.1.0-13-cloud-amd64
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check=os/RLIMIT_NOFILE property="current: 999999 / max: 999999"
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check=os/AppArmor property=active
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug msg="Starting debug server" debug_server=":6060"
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check=os/NET property=built-in
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=debug pre-flight-check=os/NET/INET property=built-in
Jun 06 03:41:19 ip-172-31-37-26 k0s[91966]: time="2024-06-06 03:41:19" level=info msg="initialized join client successfully"
k0sctl output (partial log - secondary controller failing to join)
[...]
DEBU[0077] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://ianb-k0s-0a055dd957096fe6.elb.ap-southeast-2.amazonaws.com:6443/healthz"` 
DEBU[0078] [ssh] 172.31.37.26:22: 401                   
INFO[0078] [ssh] 172.31.47.173:22: generating token     
DEBU[0078] [ssh] 172.31.47.173:22: executing `sudo -s -- /usr/local/bin/k0s token create --role controller --expiry 10m0s --data-dir=/var/lib/k0s` 
DEBU[0078] [ssh] 172.31.47.173:22: join token ID: 8g98h8 
INFO[0078] [ssh] 172.31.37.26:22: writing join token    
DEBU[0078] [ssh] 172.31.37.26:22: executing `mktemp 2> /dev/null` 
DEBU[0078] [ssh] 172.31.37.26:22: /tmp/tmp.kmtKw0tHxU   
DEBU[0078] [ssh] 172.31.37.26:22: executing `cat > /tmp/tmp.kmtKw0tHxU` 
DEBU[0078] [ssh] 172.31.37.26:22: writing 1808 bytes to command stdin 
DEBU[0078] [ssh] 172.31.37.26:22: executing `sudo -s -- install -D -m 0640 -- /tmp/tmp.kmtKw0tHxU /etc/k0s/k0stoken` 
INFO[0078] [ssh] 172.31.37.26:22: installing k0s controller 
DEBU[0078] [ssh] 172.31.37.26:22: executing `sudo -s -- /usr/local/bin/k0s install controller --debug --enable-dynamic-config --data-dir=/var/lib/k0s --token-file "/etc/k0s/k0stoken" --config "/etc/k0s/k0s.yaml"` 
INFO[0080] [ssh] 172.31.37.26:22: starting service      
DEBU[0080] [ssh] 172.31.37.26:22: executing `sudo -s -- stat /run/systemd/system` 
DEBU[0080] [ssh] 172.31.37.26:22: File: /run/systemd/system 
DEBU[0080] [ssh] 172.31.37.26:22: Size: 80              Blocks: 0          IO Block: 4096   directory 
DEBU[0080] [ssh] 172.31.37.26:22: Device: 0,22  Inode: 209         Links: 3 
DEBU[0080] [ssh] 172.31.37.26:22: Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root) 
DEBU[0080] [ssh] 172.31.37.26:22: Access: 2024-06-06 03:41:18.700748701 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: Modify: 2024-06-06 03:41:18.596750448 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: Change: 2024-06-06 03:41:18.596750448 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: Birth: 2024-05-16 02:59:59.940000000 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: executing `sudo -s -- systemctl start k0scontroller 2> /dev/null` 
INFO[0080] [ssh] 172.31.37.26:22: waiting for the k0s service to start 
DEBU[0080] [ssh] 172.31.37.26:22: executing `sudo -s -- stat /run/systemd/system` 
DEBU[0080] [ssh] 172.31.37.26:22: File: /run/systemd/system 
DEBU[0080] [ssh] 172.31.37.26:22: Size: 80              Blocks: 0          IO Block: 4096   directory 
DEBU[0080] [ssh] 172.31.37.26:22: Device: 0,22  Inode: 209         Links: 3 
DEBU[0080] [ssh] 172.31.37.26:22: Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root) 
DEBU[0080] [ssh] 172.31.37.26:22: Access: 2024-06-06 03:41:18.700748701 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: Modify: 2024-06-06 03:41:18.596750448 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: Change: 2024-06-06 03:41:18.596750448 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: Birth: 2024-05-16 02:59:59.940000000 +0000 
DEBU[0080] [ssh] 172.31.37.26:22: executing `sudo -s -- systemctl status k0scontroller 2> /dev/null | grep -q "(running)"` 
INFO[0081] [ssh] 172.31.37.26:22: waiting for kubernetes api to respond 
DEBU[0081] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0081] [ssh] 172.31.37.26:22: 000                   
DEBU[0084] [ssh] 10.2.5.21:22: executing command        
DEBU[0084] [ssh] 172.31.47.173:22: executing command    
DEBU[0084] [ssh] 172.31.37.26:22: executing command     
DEBU[0084] [ssh] 172.31.47.8:22: executing command      
DEBU[0084] [ssh] 10.2.5.20:22: executing command        
DEBU[0084] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:23.370372 +0000' -- /run/lock/k0sctl` 
DEBU[0085] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:23.454921 +0000' -- /run/lock/k0sctl` 
DEBU[0085] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:23.425792 +0000' -- /run/lock/k0sctl` 
DEBU[0085] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:23.425823 +0000' -- /run/lock/k0sctl` 
DEBU[0085] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:23.420148 +0000' -- /run/lock/k0sctl` 
DEBU[0086] retrying, attempt 1 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0086] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0086] [ssh] 172.31.37.26:22: 000                   
DEBU[0091] retrying, attempt 2 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0091] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0091] [ssh] 172.31.37.26:22: 000                   
DEBU[0094] [ssh] 10.2.5.21:22: executing command        
DEBU[0094] [ssh] 172.31.47.8:22: executing command      
DEBU[0094] [ssh] 172.31.47.173:22: executing command    
DEBU[0094] [ssh] 172.31.37.26:22: executing command     
DEBU[0094] [ssh] 10.2.5.20:22: executing command        
DEBU[0094] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:33.370837 +0000' -- /run/lock/k0sctl` 
DEBU[0095] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:33.455287 +0000' -- /run/lock/k0sctl` 
DEBU[0095] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:33.420322 +0000' -- /run/lock/k0sctl` 
DEBU[0095] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:33.426019 +0000' -- /run/lock/k0sctl` 
DEBU[0095] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:33.42606 +0000' -- /run/lock/k0sctl` 
DEBU[0096] retrying, attempt 3 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0096] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0096] [ssh] 172.31.37.26:22: 000                   
DEBU[0101] retrying, attempt 4 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0101] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0101] [ssh] 172.31.37.26:22: 000                   
DEBU[0104] [ssh] 10.2.5.21:22: executing command        
DEBU[0104] [ssh] 172.31.47.8:22: executing command      
DEBU[0104] [ssh] 172.31.47.173:22: executing command    
DEBU[0104] [ssh] 172.31.37.26:22: executing command     
DEBU[0104] [ssh] 10.2.5.20:22: executing command        
DEBU[0104] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:43.370941 +0000' -- /run/lock/k0sctl` 
DEBU[0105] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:43.455717 +0000' -- /run/lock/k0sctl` 
DEBU[0105] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:43.420212 +0000' -- /run/lock/k0sctl` 
DEBU[0105] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:43.42632 +0000' -- /run/lock/k0sctl` 
DEBU[0105] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:43.426401 +0000' -- /run/lock/k0sctl` 
DEBU[0106] retrying, attempt 5 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0106] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0106] [ssh] 172.31.37.26:22: 000                   
DEBU[0111] retrying, attempt 6 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0111] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0111] [ssh] 172.31.37.26:22: 000                   
DEBU[0114] [ssh] 10.2.5.21:22: executing command        
DEBU[0114] [ssh] 172.31.47.8:22: executing command      
DEBU[0114] [ssh] 172.31.47.173:22: executing command    
DEBU[0114] [ssh] 172.31.37.26:22: executing command     
DEBU[0114] [ssh] 10.2.5.20:22: executing command        
DEBU[0114] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:53.371415 +0000' -- /run/lock/k0sctl` 
DEBU[0115] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:53.455739 +0000' -- /run/lock/k0sctl` 
DEBU[0115] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:53.426703 +0000' -- /run/lock/k0sctl` 
DEBU[0115] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:53.420443 +0000' -- /run/lock/k0sctl` 
DEBU[0115] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:41:53.426744 +0000' -- /run/lock/k0sctl` 
DEBU[0116] retrying, attempt 7 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0116] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0116] [ssh] 172.31.37.26:22: 000                   
DEBU[0121] retrying, attempt 8 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0121] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0121] [ssh] 172.31.37.26:22: 000                   
DEBU[0124] [ssh] 10.2.5.21:22: executing command        
DEBU[0124] [ssh] 172.31.47.8:22: executing command      
DEBU[0124] [ssh] 172.31.37.26:22: executing command     
DEBU[0124] [ssh] 172.31.47.173:22: executing command    
DEBU[0124] [ssh] 10.2.5.20:22: executing command        
DEBU[0124] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:03.371658 +0000' -- /run/lock/k0sctl` 
DEBU[0125] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:03.456994 +0000' -- /run/lock/k0sctl` 
DEBU[0125] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:03.420599 +0000' -- /run/lock/k0sctl` 
DEBU[0125] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:03.427086 +0000' -- /run/lock/k0sctl` 
DEBU[0125] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:03.426979 +0000' -- /run/lock/k0sctl` 
DEBU[0126] retrying, attempt 9 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0126] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0126] [ssh] 172.31.37.26:22: 000                   
DEBU[0131] retrying, attempt 10 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0131] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0131] [ssh] 172.31.37.26:22: 000                   
DEBU[0134] [ssh] 10.2.5.21:22: executing command        
DEBU[0134] [ssh] 172.31.47.8:22: executing command      
DEBU[0134] [ssh] 172.31.47.173:22: executing command    
DEBU[0134] [ssh] 172.31.37.26:22: executing command     
DEBU[0134] [ssh] 10.2.5.20:22: executing command        
DEBU[0134] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:13.372017 +0000' -- /run/lock/k0sctl` 
DEBU[0135] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:13.457087 +0000' -- /run/lock/k0sctl` 
DEBU[0135] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:13.420869 +0000' -- /run/lock/k0sctl` 
DEBU[0135] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:13.426664 +0000' -- /run/lock/k0sctl` 
DEBU[0135] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:13.42671 +0000' -- /run/lock/k0sctl` 
DEBU[0136] retrying, attempt 11 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0136] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0136] [ssh] 172.31.37.26:22: 000                   
DEBU[0141] retrying, attempt 12 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0141] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0141] [ssh] 172.31.37.26:22: 000                   
DEBU[0144] [ssh] 10.2.5.21:22: executing command        
DEBU[0144] [ssh] 172.31.47.8:22: executing command      
DEBU[0144] [ssh] 172.31.47.173:22: executing command    
DEBU[0144] [ssh] 172.31.37.26:22: executing command     
DEBU[0144] [ssh] 10.2.5.20:22: executing command        
DEBU[0144] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:23.372848 +0000' -- /run/lock/k0sctl` 
DEBU[0145] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:23.4574 +0000' -- /run/lock/k0sctl` 
DEBU[0145] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:23.427317 +0000' -- /run/lock/k0sctl` 
DEBU[0145] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:23.427288 +0000' -- /run/lock/k0sctl` 
DEBU[0145] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:23.421421 +0000' -- /run/lock/k0sctl` 
DEBU[0146] retrying, attempt 13 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0146] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0146] [ssh] 172.31.37.26:22: 000                   
DEBU[0151] retrying, attempt 14 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0151] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0151] [ssh] 172.31.37.26:22: 000                   
DEBU[0154] [ssh] 10.2.5.21:22: executing command        
DEBU[0154] [ssh] 172.31.47.8:22: executing command      
DEBU[0154] [ssh] 172.31.37.26:22: executing command     
DEBU[0154] [ssh] 172.31.47.173:22: executing command    
DEBU[0154] [ssh] 10.2.5.20:22: executing command        
DEBU[0154] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:33.372946 +0000' -- /run/lock/k0sctl` 
DEBU[0155] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:33.457676 +0000' -- /run/lock/k0sctl` 
DEBU[0155] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:33.421915 +0000' -- /run/lock/k0sctl` 
DEBU[0155] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:33.427369 +0000' -- /run/lock/k0sctl` 
DEBU[0155] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:33.427345 +0000' -- /run/lock/k0sctl` 
DEBU[0156] retrying, attempt 15 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0156] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0156] [ssh] 172.31.37.26:22: 000                   
DEBU[0161] retrying, attempt 16 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0161] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0161] [ssh] 172.31.37.26:22: 000                   
DEBU[0164] [ssh] 10.2.5.21:22: executing command        
DEBU[0164] [ssh] 172.31.47.8:22: executing command      
DEBU[0164] [ssh] 172.31.37.26:22: executing command     
DEBU[0164] [ssh] 172.31.47.173:22: executing command    
DEBU[0164] [ssh] 10.2.5.20:22: executing command        
DEBU[0164] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:43.372425 +0000' -- /run/lock/k0sctl` 
DEBU[0165] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:43.457441 +0000' -- /run/lock/k0sctl` 
DEBU[0165] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:43.422034 +0000' -- /run/lock/k0sctl` 
DEBU[0165] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:43.427762 +0000' -- /run/lock/k0sctl` 
DEBU[0165] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:43.427721 +0000' -- /run/lock/k0sctl` 
DEBU[0166] retrying, attempt 17 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0166] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0166] [ssh] 172.31.37.26:22: 000                   
DEBU[0171] retrying, attempt 18 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0171] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0171] [ssh] 172.31.37.26:22: 000                   
DEBU[0174] [ssh] 10.2.5.21:22: executing command        
DEBU[0174] [ssh] 172.31.47.8:22: executing command      
DEBU[0174] [ssh] 172.31.37.26:22: executing command     
DEBU[0174] [ssh] 172.31.47.173:22: executing command    
DEBU[0174] [ssh] 10.2.5.20:22: executing command        
DEBU[0174] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:53.372984 +0000' -- /run/lock/k0sctl` 
DEBU[0175] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:53.457387 +0000' -- /run/lock/k0sctl` 
DEBU[0175] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:53.421887 +0000' -- /run/lock/k0sctl` 
DEBU[0175] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:53.427627 +0000' -- /run/lock/k0sctl` 
DEBU[0175] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:42:53.427681 +0000' -- /run/lock/k0sctl` 
DEBU[0176] retrying, attempt 19 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0176] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0176] [ssh] 172.31.37.26:22: 000                   
DEBU[0181] retrying, attempt 20 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0181] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0181] [ssh] 172.31.37.26:22: 000                   
DEBU[0184] [ssh] 10.2.5.21:22: executing command        
DEBU[0184] [ssh] 172.31.47.8:22: executing command      
DEBU[0184] [ssh] 172.31.37.26:22: executing command     
DEBU[0184] [ssh] 172.31.47.173:22: executing command    
DEBU[0184] [ssh] 10.2.5.20:22: executing command        
DEBU[0184] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:03.372862 +0000' -- /run/lock/k0sctl` 
DEBU[0185] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:03.45748 +0000' -- /run/lock/k0sctl` 
DEBU[0185] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:03.422653 +0000' -- /run/lock/k0sctl` 
DEBU[0185] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:03.428458 +0000' -- /run/lock/k0sctl` 
DEBU[0185] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:03.428515 +0000' -- /run/lock/k0sctl` 
DEBU[0186] retrying, attempt 21 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0186] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0186] [ssh] 172.31.37.26:22: 000                   
DEBU[0191] retrying, attempt 22 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0191] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0191] [ssh] 172.31.37.26:22: 000                   
DEBU[0194] [ssh] 10.2.5.21:22: executing command        
DEBU[0194] [ssh] 172.31.47.8:22: executing command      
DEBU[0194] [ssh] 172.31.37.26:22: executing command     
DEBU[0194] [ssh] 172.31.47.173:22: executing command    
DEBU[0194] [ssh] 10.2.5.20:22: executing command        
DEBU[0194] [ssh] 10.2.5.21:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:13.374054 +0000' -- /run/lock/k0sctl` 
DEBU[0195] [ssh] 10.2.5.20:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:13.458679 +0000' -- /run/lock/k0sctl` 
DEBU[0195] [ssh] 172.31.47.8:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:13.422569 +0000' -- /run/lock/k0sctl` 
DEBU[0195] [ssh] 172.31.37.26:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:13.428077 +0000' -- /run/lock/k0sctl` 
DEBU[0195] [ssh] 172.31.47.173:22: executing `sudo -s -- env -i LC_ALL=C TZ=UTC touch -m -d '2024-06-06 03:43:13.428145 +0000' -- /run/lock/k0sctl` 
DEBU[0196] retrying, attempt 23 - last error: command failed: client exec: ssh session wait: Process exited with status 7 
DEBU[0196] [ssh] 172.31.37.26:22: executing `curl -kso /dev/null --connect-timeout 20 -w "%{http_code}" "https://localhost:6443/version"` 
DEBU[0196] [ssh] 172.31.37.26:22: 000                   
INFO[0201] * Running clean-up for phase: Acquire exclusive host lock 
DEBU[0201] [ssh] 10.2.5.20:22: stopped lock cycle, removing file 
DEBU[0201] [ssh] 172.31.37.26:22: stopped lock cycle, removing file 
DEBU[0201] [ssh] 172.31.47.8:22: stopped lock cycle, removing file 
DEBU[0201] [ssh] 10.2.5.21:22: stopped lock cycle, removing file 
DEBU[0201] [ssh] 172.31.47.173:22: stopped lock cycle, removing file 
DEBU[0201] [ssh] 10.2.5.20:22: executing `sudo -s -- rm -f -- /run/lock/k0sctl 2> /dev/null` 
DEBU[0201] [ssh] 10.2.5.21:22: executing `sudo -s -- rm -f -- /run/lock/k0sctl 2> /dev/null` 
DEBU[0201] [ssh] 172.31.37.26:22: executing `sudo -s -- rm -f -- /run/lock/k0sctl 2> /dev/null` 
DEBU[0201] [ssh] 172.31.47.8:22: executing `sudo -s -- rm -f -- /run/lock/k0sctl 2> /dev/null` 
DEBU[0201] [ssh] 172.31.47.173:22: executing `sudo -s -- rm -f -- /run/lock/k0sctl 2> /dev/null` 
INFO[0201] * Running clean-up for phase: Initialize the k0s cluster 
INFO[0201] [ssh] 172.31.47.173:22: cleaning up          
DEBU[0201] [ssh] 172.31.47.173:22: executing `sudo -s -- /usr/local/bin/k0s reset --data-dir=/var/lib/k0s` 
WARN[0201] [ssh] 172.31.47.173:22: k0s reset failed     
INFO[0201] * Running clean-up for phase: Install controllers 
DEBU[0201] [ssh] 172.31.47.173:22: invalidating join token for controller 1 
DEBU[0201] [ssh] 172.31.47.173:22: executing `sudo -s -- /usr/local/bin/k0s token invalidate --data-dir=/var/lib/k0s 8g98h8` 
DEBU[0201] [ssh] 172.31.47.173:22: token 8g98h8 deleted successfully 
DEBU[0201] [ssh] 172.31.37.26:22: executing `mktemp 2> /dev/null` 
DEBU[0202] [ssh] 172.31.37.26:22: /tmp/tmp.7h72WKiOVs   
DEBU[0202] [ssh] 172.31.37.26:22: executing `cat > /tmp/tmp.7h72WKiOVs` 
DEBU[0202] [ssh] 172.31.37.26:22: writing 35 bytes to command stdin: [REDACTED] 
DEBU[0202] [ssh] 172.31.37.26:22: executing `sudo -s -- install -D -m 0600 -- /tmp/tmp.7h72WKiOVs /etc/k0s/k0stoken` 
INFO[0202] [ssh] 172.31.47.8:22: cleaning up            
INFO[0202] [ssh] 172.31.37.26:22: cleaning up           
DEBU[0202] [ssh] 172.31.37.26:22: executing `sudo -s -- /usr/local/bin/k0s reset --data-dir=/var/lib/k0s` 
INFO[0203] ==> Apply failed                    
FATA[0204] apply failed - log file saved to /Users/ian.bishop/Library/Caches/k0sctl/k0sctl.log: context deadline exceeded
command failed: client exec: ssh session wait: Process exited with status 7 

Additional context

No response

@ianb-mp ianb-mp added the bug Something isn't working label Jun 6, 2024
@twz123 twz123 self-assigned this Jun 10, 2024
@twz123
Copy link
Member

twz123 commented Jun 11, 2024

After the log line "initialized join client successfully", the second controller attempts to make an HTTP request to obtain certificates and key material from the first controller. My educated guess is that this request is blocking on network I/O. The load balancer is probably not responding about 50% of the time because it is balancing the load between the two controllers. (I assume the load balancer is using a round-robin or random load balancing strategy). When the load balancer tries to connect to the secondary controller, it fails because the secondary controller cannot respond to a join request from itself. It's not even listening on the join port yet. So I think the LB is trying to reach the secondary controller, keeping the HTTP request open and not responding yet. K0s retries join requests up to ten times. However, it doesn't define any timeouts, so it probably gets stuck on the first request.

Can you maybe check the configuration of your load balancer? Does it have some backend health checks defined? What are the frontend and backend timeouts?

@ianb-mp
Copy link
Contributor Author

ianb-mp commented Jun 12, 2024

Thankyou for taking the time to respond. My hunch is that the issue lies with the load balancer. Yesterday I tore down the LB and backend hosts and recreated them. Since then, I have done many reset/apply actions and I haven't seen the issue occur. This time the 'client IP preservation' setting is disabled. I had previously disabled it, but still saw the issue occur (or so I thought), so I'm confused why it seems to be consistently working now...

However, it doesn't define any timeouts, so it probably gets stuck on the first request.

Would you be willing to add timeouts for read/write? I think this would be good in terms of best practice and to aid in debugging these kinds of unusual scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants