Flannel fails to communicate between pods after node reboot #1474

Closed

Slyke opened this issue Aug 29, 2021 · 21 comments

Slyke commented Aug 29, 2021

No inter-pod communication works after nodes are restarted. Docker has to be manually stopped and started on each node before connectivity comes back.
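For reference, the manual workaround looks like this (a rough sketch; it assumes Docker is managed by systemd, and has to be run on each affected node):

# Assumes Docker is controlled via systemd, as on a default Ubuntu install.
sudo systemctl stop docker
sudo systemctl start docker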

Expected Behavior

Pod-to-pod communication (DNS included) should keep working after a node reboot.

Current Behavior

DNS and all other connections time out when trying to reach other pods.

Possible Solution

Not sure, that's why I'm here!

Steps to Reproduce (for bugs)

Full steps from a fresh Ubuntu install and details are here: kubernetes/kubernetes#104645, but TL;DR:

  1. Install Flannel
  2. Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default. It works
  3. Restart Node
  4. Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default in the pod on the Node that restarted. It fails with ;; connection timed out; no servers could be reached

Context

I'm new to Kubernetes and this was really annoying to figure out. I went down many wrong paths and it took ages to work out what was going on, though I learned a lot. I have tried several suggested solutions with no success.

Flannel logs (See line entry: I0828 09:00:22.327495):

# (Comment: This log is from the flannel pod on the restarted node, i.e. the node where the dnsutils pod runs).
$ kubectl logs kube-flannel-ds-zv7nf -n kube-system
I0828 09:00:20.802275       1 main.go:520] Determining IP address of default interface
I0828 09:00:20.803003       1 main.go:533] Using interface with name enp3s0 and address 10.7.60.12
I0828 09:00:20.803045       1 main.go:550] Defaulting external address to interface address (10.7.60.12)
W0828 09:00:20.804272       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 09:00:21.236919       1 kube.go:116] Waiting 10m0s for node controller to sync
I0828 09:00:21.237044       1 kube.go:299] Starting kube subnet manager
I0828 09:00:22.237113       1 kube.go:123] Node controller sync successful
I0828 09:00:22.237173       1 main.go:254] Created subnet manager: Kubernetes Subnet Manager - k-w-002
I0828 09:00:22.237186       1 main.go:257] Installing signal handlers
I0828 09:00:22.237447       1 main.go:392] Found network config - Backend type: vxlan
I0828 09:00:22.237559       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0828 09:00:22.327495       1 main.go:357] Current network or subnet (10.244.0.0/16, 10.244.2.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules
I0828 09:00:22.804776       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
I0828 09:00:22.806656       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.131186       1 main.go:307] Setting up masking rules
I0828 09:00:23.133140       1 main.go:315] Changing default FORWARD chain policy to ACCEPT
I0828 09:00:23.133319       1 main.go:323] Wrote subnet file to /run/flannel/subnet.env
I0828 09:00:23.133340       1 main.go:327] Running backend.
I0828 09:00:23.133362       1 main.go:345] Waiting for all goroutines to exit
I0828 09:00:23.133393       1 vxlan_network.go:59] watching for new subnet leases
I0828 09:00:23.200217       1 iptables.go:148] Some iptables rules are missing; deleting and recreating rules
I0828 09:00:23.200246       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.200581       1 iptables.go:148] Some iptables rules are missing; deleting and recreating rules
I0828 09:00:23.200738       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I0828 09:00:23.202487       1 iptables.go:172] Deleting iptables rule: -d 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.299811       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.302004       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.302092       1 iptables.go:172] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.2.0/24 -j RETURN
I0828 09:00:23.397339       1 iptables.go:160] Adding iptables rule: -d 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.397598       1 iptables.go:172] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
I0828 09:00:23.399463       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I0828 09:00:23.499174       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.502667       1 iptables.go:160] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.2.0/24 -j RETURN
I0828 09:00:23.599067       1 iptables.go:160] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully


# (Comment: This is from the node that wasn't restarted and is working fine. If this node is restarted as well, DNS queries break there too and the cluster needs to be reinstalled)
$ kubectl logs kube-flannel-ds-j5n42 -n kube-system
I0828 08:54:55.315700       1 main.go:520] Determining IP address of default interface
I0828 08:54:55.316066       1 main.go:533] Using interface with name eno1 and address 10.7.60.11
I0828 08:54:55.316084       1 main.go:550] Defaulting external address to interface address (10.7.60.11)
W0828 08:54:55.316103       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 08:54:55.416101       1 kube.go:116] Waiting 10m0s for node controller to sync
I0828 08:54:55.416149       1 kube.go:299] Starting kube subnet manager
I0828 08:54:56.416389       1 kube.go:123] Node controller sync successful
I0828 08:54:56.416436       1 main.go:254] Created subnet manager: Kubernetes Subnet Manager - k-w-001
I0828 08:54:56.416464       1 main.go:257] Installing signal handlers
I0828 08:54:56.416673       1 main.go:392] Found network config - Backend type: vxlan
I0828 08:54:56.416733       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0828 08:54:56.443901       1 main.go:307] Setting up masking rules
I0828 08:54:56.719917       1 main.go:315] Changing default FORWARD chain policy to ACCEPT
I0828 08:54:56.720021       1 main.go:323] Wrote subnet file to /run/flannel/subnet.env
I0828 08:54:56.720035       1 main.go:327] Running backend.
I0828 08:54:56.720047       1 main.go:345] Waiting for all goroutines to exit
I0828 08:54:56.720072       1 vxlan_network.go:59] watching for new subnet leases

Your Environment

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:44:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

$ kubelet --version
Kubernetes v1.22.1
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 21.04"
VERSION_ID="21.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=hirsute
UBUNTU_CODENAME=hirsute

$ uname -a
Linux k-m-001 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

jkowalski commented Aug 30, 2021

I'm seeing the same issue here.

Deleting flannel pods fixes the issue for me but is super annoying.

$ kubectl delete pod -n kube-system -l app=flannel

Using Ubuntu 20.04.3 nodes (VMs), Kubernetes 1.22.1 and flannel:v0.14.0

I had been running Flannel successfully on older Kubernetes versions (<= v1.19) for several quarters and never noticed this behavior. It all started after upgrading Kubernetes and Flannel; I'm not sure which one is the culprit.

cst152 commented Aug 31, 2021

Oh, thanks for reporting! I'm glad I'm not the only one :)

The problem appears on my Debian Bullseye machine as well:

Kubernetes: v1.21.3
Flannel versions tested: 0.12 and 0.14

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

root@octo00:~# uname -a
Linux somehost 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux

I'm running v1.21.3 and Kubelet talks to /run/containerd/containerd.sock.

No issues on Debian Buster with Linux someotherhost 4.19.0-17-amd64 #1 SMP Debian 4.19.194-1 (2021-06-10) x86_64 GNU/Linux with Flannel 0.12 as well as Flannel 0.14.

mpartel commented Sep 1, 2021

Same issue with Debian Buster + backport kernel (5.10.46-4~bpo10+1) and Kubernetes 1.19.4, using the 'extension' backend.

DrEngi commented Sep 6, 2021

Also having this issue with Fedora CoreOS 34.20210808.3.0. Works fine if I restart all flannel pods, but very troublesome that I have to do this every time I need to take a node offline for maintenance.

@bengtfredh

I had this issue and got a tip that it may be connected to MACAddressPolicy. The default on Fedora is MACAddressPolicy=persistent. After setting MACAddressPolicy=none for the flannel interface, connectivity between nodes works fine after a reboot.
Add the file /etc/systemd/network/50-flannel.link:

[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none

https://www.freedesktop.org/software/systemd/man/systemd.link.html#MACAddressPolicy=
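A minimal sketch of applying this without a full reboot (assuming systemd-udevd processes .link files, which is the default on Fedora / Fedora CoreOS):

# Install the link file, then reload udev so the policy is picked up.
sudo tee /etc/systemd/network/50-flannel.link > /dev/null <<'EOF'
[Match]
OriginalName=flannel*

[Link]
MACAddressPolicy=none
EOF
sudo udevadm control --reload
# The policy applies the next time the flannel.1 interface is created,
# e.g. after the local flannel pod is restarted or the node is rebooted.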

mpartel commented Sep 9, 2021

That didn't fix it for me.
My awful workaround: ping other nodes and a few external IPs from a daemonset and restart the local flanneld if more than a few pings fail. Fortunately restarting flanneld doesn't seem to disrupt anything.

DrEngi commented Sep 15, 2021

Hey @mpartel, would you happen to have specifics about how you implemented this? I'm also having this problem and it's really annoying to have to restart flannel all the time.

mpartel commented Sep 15, 2021

Sorry, I can't share the code (it'd be tangled with stuff specific to our setup anyway). A bit more detail: I run a daemonset that loops doing ping -c 5 -i 0.3 -W 3 <IP> for each node IP, plus a few well-known external IPs. If more than two pings fail (quite an arbitrary number), then it does pkill -e flanneld and waits for one minute before resuming pinging. It needs to run with hostPID: true to be able to kill Flannel. I also set tolerations: [{ operator: 'Exists', effect: 'NoSchedule' }], so it also runs on drained/non-ready/master-only nodes.

Slyke (Author) commented Sep 15, 2021

@DrEngi I put this together based on @mpartel's description:

#!/bin/bash

# Configuration: hosts to ping (add your node IPs) and how many consecutive
# unreachable hosts trigger a flannel restart.
declare -a ipList=("8.8.8.8" "8.8.4.4")
MAX_FAILURES=2

# Code
FAILED_COUNTER=0

for ipCheck in "${ipList[@]}"; do
  # 5 pings, 0.3s apart, 3s timeout per reply
  if ping -c 5 -i 0.3 -W 3 "$ipCheck" > /dev/null 2>&1; then
    FAILED_COUNTER=0
  else
    ((FAILED_COUNTER++))
  fi

  # Restart flannel once MAX_FAILURES hosts in a row were unreachable
  # (-ge rather than -gt, otherwise the threshold can never be reached with only two IPs)
  if [[ $FAILED_COUNTER -ge $MAX_FAILURES ]]; then
    echo "Ping failed '$FAILED_COUNTER' times, with last failure ip: '$ipCheck'"
    FAILED_COUNTER=0
    echo "kubectl delete pod -n kube-system -l app=flannel"
    kubectl delete pod -n kube-system -l app=flannel
    sleep 60
  fi
done

I haven't tested it though.

ghost commented Oct 1, 2021

I am experiencing the same issue after upgrading the nodes in a Kubernetes cluster from Debian Buster to Bullseye. The version of the flannel image is v0.13.0-rancher1.

kskalski commented Oct 1, 2021

Same issue with a k3s cluster on kernel 5.11.0-37-generic, Ubuntu 21.04. Some observations:

  • I think this started (?) or became more frequent with the 1.22 release of k3s, which switched to flannel 0.14 (from 0.12, or maybe some branch off 0.13; I'd need to track their exact version history). There might even be a relation: the more nodes running the newer release, the more likely some pod pair will lose connectivity
  • it is more prominent with vxlan; switching to ipsec seems to remedy the issue somewhat, but in a treacherous way, since it still breaks, just after a few hours of correct operation (a quick way to confirm which backend a node is using is sketched below)
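A quick sketch for confirming whether a node is actually using the vxlan backend (assuming the default VNI of 1, so the device is named flannel.1):

# Show flannel's VXLAN device with driver details; the output should contain
# a "vxlan id 1 ..." line when the vxlan backend is active.
ip -d link show flannel.1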

lossos commented Oct 19, 2021

Problem appears on Debian Bullseye (Debian 11) with Kernel 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) and Kubernetes 1.22.2 with flannel 0.14.0 or 0.15.0-rc1 as well.

It seems to be a problem with vxlan, see also k3s-io/k3s#3863

Slyke (Author) commented Oct 19, 2021

Looks like this PR may fix this issue: #1485

@manuelbuil (Collaborator)

Yes, that PR should fix it. I have just created release v0.15.1. Let's close the issue.

Slyke (Author) commented Oct 25, 2021

@manuelbuil what's the best way to upgrade to the latest version on an existing cluster?

@manuelbuil (Collaborator)

I'd edit the daemonset and point to the new image
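For example, something along these lines should work (a sketch; the daemonset and container names assume the stock kube-flannel manifest, adjust them to your cluster):

# Point the flannel daemonset at the fixed image; pods then roll out according
# to the daemonset's update strategy.
kubectl -n kube-system set image daemonset/kube-flannel-ds \
  kube-flannel=quay.io/coreos/flannel:v0.15.1
# Watch the rollout complete
kubectl -n kube-system rollout status daemonset/kube-flannel-ds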

lossos commented Oct 26, 2021

It seems the images are not yet publicly available; the latest version on Quay is flannel:v0.15.0:
Error response from daemon: manifest for quay.io/coreos/flannel:v0.15.1 not found: manifest unknown: manifest unknown
https://quay.io/repository/coreos/flannel?tab=tags
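For reference, one way to list the tags that have actually been published (a sketch; assumes skopeo is installed):

skopeo list-tags docker://quay.io/coreos/flannel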

@manuelbuil (Collaborator)

It seems the Images are not yet publicly available, Last version on Quay is flannel:v0.15.0 Error response from daemon: manifest for quay.io/coreos/flannel:v0.15.1 not found: manifest unknown: manifest unknown https://quay.io/repository/coreos/flannel?tab=tags

@rajatchopra could you please push the v0.15.1 images to the repo?

Slyke (Author) commented Nov 8, 2021

Hello @manuelbuil, still unable to update Flannel to 0.15.1. It seems the image isn't pushed yet.

@manuelbuil (Collaborator)

Hello @manuelbuil, still unable to update Flannel to 0.15.1. It seems the image isn't pushed yet.

@rajatchopra is the person with the permissions to push the image

Slyke (Author) commented Nov 15, 2021

I tested out 0.15.1 and can confirm that this is now fixed. Thank you!
