Debian Testing worker nodes cannot reach out to the network #2157

Closed · cro opened this issue Sep 15, 2022 · 10 comments
Labels: area/network, bug (Something isn't working)

cro commented Sep 15, 2022

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at the docs for the released version; "main" branch docs are usually ahead of released versions.

Platform

Linux 5.19.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.6-1 (2022-09-01) x86_64 GNU/Linux
PRETTY_NAME="Debian GNU/Linux bookworm/sid"
NAME="Debian GNU/Linux"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Version

v1.24.4+k0s.0

Sysinfo

`k0s sysinfo`
Machine ID: "ff1fec33320066321228dd75911bbf8e4d6d5024de836102c49acae5580af0a3" (from machine) (pass)
Total memory: 975.0 MiB (warning: 1.0 GiB recommended)
Disk space available for /var/lib/k0s: 24.7 GiB (pass)
Operating system: Linux (pass)
  Linux kernel release: 5.19.0-1-amd64 (pass)
  Max. file descriptors per process: current: 1024 / max: 1048576 (warning: < 65536)
  Executable in path: modprobe: /usr/sbin/modprobe (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (pass)
    cgroup controller "memory": available (pass)
    cgroup controller "devices": available (assumed) (pass)
    cgroup controller "freezer": available (assumed) (pass)
    cgroup controller "pids": available (pass)
    cgroup controller "hugetlb": available (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: module (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

I deployed k0s to 3 control nodes and 2 worker nodes via k0sctl. The worker nodes lost network connectivity at deployment time.

Steps to reproduce

  1. Using two machines with Debian Testing installed, create a k0sctl.yaml file deploying one controller and one worker node (a minimal sketch follows these steps).
  2. Note that the worker node install will not complete.
  3. Attempt to ssh to the worker node; this will fail.
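
For reference, a minimal k0sctl.yaml along these lines reproduces the setup (the addresses, user, and key path below are placeholders, not my actual values):

  apiVersion: k0sctl.k0sproject.io/v1beta1
  kind: Cluster
  metadata:
    name: k0s-debian-test
  spec:
    k0s:
      version: v1.24.4+k0s.0
    hosts:
      - role: controller
        ssh:
          address: 10.0.0.10       # placeholder controller address
          user: root
          keyPath: ~/.ssh/id_rsa
      - role: worker
        ssh:
          address: 10.0.0.11       # placeholder worker address
          user: root
          keyPath: ~/.ssh/id_rsa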

Expected behavior

k0sctl should deploy functioning worker nodes.

Actual behavior

After deployment, k0s kubectl get nodes shows two worker nodes in NotReady status. I can no longer ssh to the worker nodes. If I run iptables --flush on the worker nodes and reboot them, get nodes shows them for a second or two, but they are unable to pull any container images.

Screenshots and logs

No response

Additional context

I note there is an open PR dealing with iptables-nft; I'm not sure whether it addresses this problem. I can restore network access by going to the console of the worker nodes and running iptables --flush, but of course that's not a workable solution.

cro added the bug (Something isn't working) label Sep 15, 2022
makhov (Contributor) commented Sep 15, 2022

Hello, @cro!

Thank you for creating the issue. Could you post the output of iptables -V and /var/lib/k0s/bin/iptables -V from the worker node?
If possible, please also include the output of /var/lib/k0s/bin/iptables-save from the broken node.

cro (Author) commented Sep 15, 2022

root@k0w01:~# iptables -V
iptables v1.8.8 (nf_tables)
root@k0w01:~# /var/lib/k0s/bin/iptables -V
bash: /var/lib/k0s/bin/iptables: No such file or directory
root@k0w01:~# ls /var/lib/k0s/bin
containerd               containerd-shim-runc-v2  ip6tables-save    kubelet
containerd-shim          ip6tables                iptables-restore  runc
containerd-shim-runc-v1  ip6tables-restore        iptables-save     xtables-legacy-multi
root@k0w01:~# find / -xdev -name iptables
/usr/share/doc/iptables
/usr/share/iptables
/usr/share/bash-completion/completions/iptables
/usr/sbin/iptables
/etc/alternatives/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/73/fs/etc/alternatives/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/73/fs/var/lib/dpkg/alternatives/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/629/fs/sbin/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/22/fs/sbin/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/4/fs/usr/share/doc/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/4/fs/usr/share/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/4/fs/usr/sbin/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/4/fs/etc/alternatives/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/4/fs/var/lib/dpkg/alternatives/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/605/fs/etc/alternatives/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/605/fs/var/lib/dpkg/alternatives/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/15/fs/etc/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/15/fs/sbin/iptables
/var/lib/k0s/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/15/fs/var/lib/iptables
/var/lib/dpkg/alternatives/iptables
root@k0w01:~#

cro (Author) commented Sep 15, 2022

Just to be clear, k0sctl could not determine my distribution automatically; I added os: debian to the .yaml config (sketched below).
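
For anyone hitting the same detection problem, the hint goes on the host entry; roughly like this (address is a placeholder):

  spec:
    hosts:
      - role: worker
        os: debian        # manual override when k0sctl cannot detect the distro
        ssh:
          address: 10.0.0.11
          user: root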

makhov (Contributor) commented Sep 15, 2022

Thanks for providing the information.

iptables v1.8.8 (nf_tables) behaves differently from the previous version and doesn't work correctly with Kubernetes. You can find more info here:
kubernetes/kubernetes#112477
https://bugzilla.netfilter.org/show_bug.cgi?id=1632

We are working on a release that bundles our own iptables binary for the kubelet, and we will try to ship it ASAP.

For now, you can work around the issue by downgrading iptables to v1.8.7 (see the sketch below).
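
On Debian, a rough sketch of two possible routes (untested in this thread; the exact 1.8.7 package version is deliberately left as a placeholder):

  # Option A (the workaround suggested above): downgrade the iptables package
  # to a 1.8.7 build from your mirror or snapshot.debian.org, then hold it so
  # apt doesn't upgrade it again.
  apt-get install iptables=<1.8.7 package version>
  apt-mark hold iptables

  # Option B (a related workaround, not the one suggested above): point the
  # Debian alternatives at the legacy iptables backend, avoiding the 1.8.8
  # nf_tables behavior entirely.
  update-alternatives --set iptables /usr/sbin/iptables-legacy
  update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy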

jnummelin (Member) commented

@cro if possible, could you test on the same hosts with the 1.24.5-rc.1+k0s.0 release from yesterday? It contains a fix for this iptables incompatibility issue. (One way to roll it out is sketched below.)
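
Assuming the cluster is managed with k0sctl, bumping the version in k0sctl.yaml and re-applying should be enough (sketch):

  spec:
    k0s:
      version: v1.24.5-rc.1+k0s.0   # RC with the bundled iptables fix

  $ k0sctl apply --config k0sctl.yaml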

cro (Author) commented Sep 20, 2022

Deployment was successful for the 3 control and 2 worker nodes.

I have a different issue now:

kc --kubeconfig=k0s.kubeconfig run -ti --image=debian:latest -- bash
If you don't see a command prompt, try pressing enter.
Error attaching, falling back to logs: error dialing backend: No agent available
Error from server: Get "https://172.23.23.215:10250/containerLogs/default/bash/bash": No agent available
kc --kubeconfig=k0s.kubeconfig get pods
NAME   READY   STATUS    RESTARTS   AGE
bash   1/1     Running   0          24s
kc --kubeconfig=k0s.kubeconfig exec -ti -- bash
Error from server: error dialing backend: No agent available

Some possibly relevant details:

kc --kubeconfig=k0s.kubeconfig get all -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
default       pod/bash                              1/1     Running   0          4m3s
kube-system   pod/coredns-ddddfbd5c-75z9r           1/1     Running   0          5m4s
kube-system   pod/coredns-ddddfbd5c-9lzh2           1/1     Running   0          5m4s
kube-system   pod/konnectivity-agent-2rkrz          1/1     Running   0          4m59s
kube-system   pod/konnectivity-agent-n5g94          1/1     Running   0          4m59s
kube-system   pod/kube-proxy-28bzp                  1/1     Running   0          5m11s
kube-system   pod/kube-proxy-vd9pb                  1/1     Running   0          5m11s
kube-system   pod/kube-router-ct5nz                 1/1     Running   0          5m11s
kube-system   pod/kube-router-vdmfd                 1/1     Running   0          5m11s
kube-system   pod/metrics-server-7d7c4887f4-56ffv   1/1     Running   0          5m9s

NAMESPACE     NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes       ClusterIP   10.96.0.1      <none>        443/TCP                  6m13s
kube-system   service/kube-dns         ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP,9153/TCP   5m33s
kube-system   service/metrics-server   ClusterIP   10.100.98.30   <none>        443/TCP                  5m9s

NAMESPACE     NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/konnectivity-agent   2         2         2       2            2           kubernetes.io/os=linux   5m38s
kube-system   daemonset.apps/kube-proxy           2         2         2       2            2           kubernetes.io/os=linux   5m38s
kube-system   daemonset.apps/kube-router          2         2         2       2            2           <none>                   5m34s

NAMESPACE     NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns          2/2     2            2           5m34s
kube-system   deployment.apps/metrics-server   1/1     1            1           5m9s

NAMESPACE     NAME                                        DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-ddddfbd5c           2         2         2       5m33s
kube-system   replicaset.apps/metrics-server-7d7c4887f4   1         1         1       5m9s
kc --kubeconfig=k0s.kubeconfig get nodes -A
NAME    STATUS   ROLES    AGE     VERSION
k0w00   Ready    <none>   5m31s   v1.24.5+k0s
k0w01   Ready    <none>   5m32s   v1.24.5+k0s

cro (Author) commented Sep 20, 2022

Upgrading again to the other RC available, v1.25.1-rc.1+k0s.0, did not correct this issue.

cro (Author) commented Sep 20, 2022

Furthermore, my other test cluster, running Alpine 3.16, seems to deploy OK, but I see this:

kc --kubeconfig=alp.kubeconfig get pods
Unable to connect to the server: stream error: stream ID 59; INTERNAL_ERROR; received from peer

jnummelin (Member) commented

Exec and logs failing is usually a symptom of other issues in the setup, so I believe the RCs did fix the initial iptables-related issue.

Exec and logs failing like this usually indicates broken connections in the konnectivity-agent services. In this case, as you have multiple controllers (in pure controller mode), it seems the agents cannot establish connections with ALL the controllers. As the docs say, an HA control plane REQUIRES an LB with a single address in front to allow the konnectivity-agents to properly establish HA connections; see the sketch below.

We know this is a PITA requirement to have, but it stems from architectural decisions in upstream konnectivity about how it establishes HA comms tunnels. The k0s team is working on a solution to lift this requirement for most deployment scenarios, but unfortunately it did not make it into the 1.25 releases yet.
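
Concretely, that means pointing the k0s config at the LB address; with k0sctl it could look roughly like this (the address is a placeholder, not from this thread):

  spec:
    k0s:
      config:
        spec:
          api:
            externalAddress: 10.0.0.100   # LB address in front of all controllers (placeholder)
            sans:
              - 10.0.0.100                # include the LB address in the API server cert SANs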

cro (Author) commented Sep 21, 2022

I already had a load balancer in place from my previous experiments with k3s, so I repurposed it this morning. As you deduced, this fixed my remaining issues.

> As the docs say, an HA control plane REQUIRES an LB with a single address in front to allow the konnectivity-agents to properly establish HA connections.

Guilty as charged! 😁 In my defense, I did read the docs, but missed the part about the LB being required.

Closing this ticket as the RCs fix the original issue. Thanks so much for your help and responsiveness.

cro closed this as completed Sep 21, 2022