Flannel fails to communicate between pods after node reboot #1474

Closed

Slyke opened this issue Aug 29, 2021 · 21 comments

Slyke commented Aug 29, 2021

No inter-pod communication works after nodes are restarted. Docker has to be manually stopped and started on each node before connectivity comes back.
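For reference, the manual workaround looks like this (a rough sketch; it assumes Docker is managed by systemd, and has to be run on each affected node):

# Assumes Docker is controlled via systemd, as on a default Ubuntu install.
sudo systemctl stop docker
sudo systemctl start docker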

Expected Behavior

Pod-to-pod communication (DNS included) should keep working after a node reboot.

Current Behavior

DNS and all other connections time out when trying to reach other pods.

Possible Solution

Not sure, that's why I'm here!

Steps to Reproduce (for bugs)

Full steps from a fresh Ubuntu install and details are here: kubernetes/kubernetes#104645, but TL;DR:

  1. Install Flannel
  2. Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default. It works
  3. Restart Node
  4. Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default in the pod on the Node that restarted. It fails with ;; connection timed out; no servers could be reached

Context

I'm new to Kubernetes and this was really annoying to figure out. I went down many wrong paths and it took ages to work out what was going on, though I learned a lot. I have tried several suggested solutions with no success.

Flannel logs (See line entry: I0828 09:00:22.327495):

# (Comment: This log is from the flannel pod on the restarted node, i.e. the node where the dnsutils pod runs).
$ kubectl logs kube-flannel-ds-zv7nf -n kube-system
I0828 09:00:20.802275       1 main.go:520] Determining IP address of default interface
I0828 09:00:20.803003       1 main.go:533] Using interface with name enp3s0 and address 10.7.60.12
I0828 09:00:20.803045       1 main.go:550] Defaulting external address to interface address (10.7.60.12)
W0828 09:00:20.804272       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 09:00:21.236919       1 kube.go:116] Waiting 10m0s for node controller to sync
I0828 09:00:21.237044       1 kube.go:299] Starting kube subnet manager
I0828 09:00:22.237113       1 kube.go:123] Node controller sync successful
I0828 09:00:22.237173       1 main.go:254] Created subnet manager: Kubernetes Subnet Manager - k-w-002
I0828 09:00:22.237186       1 main.go:257] Installing signal handlers
I0828 09:00:22.237447       1 main.go:392] Found network config - Backend type: vxlan
I0828 09:00:22.237559       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0828 09:00:22.327495       1 main.go:357] Current network or subnet (10.244.0.0/16, 10.244.2.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules
I0828 09:00:22.804776       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
I0828 09:00:22.806656       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.131186       1 main.go:307] Setting up masking rules
I0828 09:00:23.133140       1 main.go:315] Changing default FORWARD chain policy to ACCEPT
I0828 09:00:23.133319       1 main.go:323] Wrote subnet file to /run/flannel/subnet.env
I0828 09:00:23.133340       1 main.go:327] Running backend.
I0828 09:00:23.133362       1 main.go:345] Waiting for all goroutines to exit
I0828 09:00:23.133393       1 vxlan_network.go:59] watching for new subnet leases
I0828 09:00:23.200217       1 iptables.go:148] Some iptables rules are missing; deleting and recreating rules
I0828 09:00:23.200246       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.200581       1 iptables.go:148] Some iptables rules are missing; deleting and recreating rules
I0828 09:00:23.200738       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I0828 09:00:23.202487       1 iptables.go:172] Deleting iptables rule: -d 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.299811       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.302004       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.302092       1 iptables.go:172] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.2.0/24 -j RETURN
I0828 09:00:23.397339       1 iptables.go:160] Adding iptables rule: -d 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.397598       1 iptables.go:172] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
I0828 09:00:23.399463       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I0828 09:00:23.499174       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.502667       1 iptables.go:160] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.2.0/24 -j RETURN
I0828 09:00:23.599067       1 iptables.go:160] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully


# (Comment: This is from the node that wasn't restarted and is working fine. If this node is restarted as well, DNS queries break there too and the cluster needs to be reinstalled)
$ kubectl logs kube-flannel-ds-j5n42 -n kube-system
I0828 08:54:55.315700       1 main.go:520] Determining IP address of default interface
I0828 08:54:55.316066       1 main.go:533] Using interface with name eno1 and address 10.7.60.11
I0828 08:54:55.316084       1 main.go:550] Defaulting external address to interface address (10.7.60.11)
W0828 08:54:55.316103       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 08:54:55.416101       1 kube.go:116] Waiting 10m0s for node controller to sync
I0828 08:54:55.416149       1 kube.go:299] Starting kube subnet manager
I0828 08:54:56.416389       1 kube.go:123] Node controller sync successful
I0828 08:54:56.416436       1 main.go:254] Created subnet manager: Kubernetes Subnet Manager - k-w-001
I0828 08:54:56.416464       1 main.go:257] Installing signal handlers
I0828 08:54:56.416673       1 main.go:392] Found network config - Backend type: vxlan
I0828 08:54:56.416733       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0828 08:54:56.443901       1 main.go:307] Setting up masking rules
I0828 08:54:56.719917       1 main.go:315] Changing default FORWARD chain policy to ACCEPT
I0828 08:54:56.720021       1 main.go:323] Wrote subnet file to /run/flannel/subnet.env
I0828 08:54:56.720035       1 main.go:327] Running backend.
I0828 08:54:56.720047       1 main.go:345] Waiting for all goroutines to exit
I0828 08:54:56.720072       1 vxlan_network.go:59] watching for new subnet leases

Your Environment

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:44:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

$ kubelet --version
Kubernetes v1.22.1
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 21.04"
VERSION_ID="21.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=hirsute
UBUNTU_CODENAME=hirsute

$ uname -a
Linux k-m-001 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

jkowalski commented Aug 30, 2021

I'm seeing the same issue here.

Deleting flannel pods fixes the issue for me but is super annoying.

$ kubectl delete pod -n kube-system -l app=flannel

Using Ubuntu 20.04.3 nodes (VMs), Kubernetes 1.22.1 and flannel:v0.14.0

I had been running Flannel successfully on older Kubernetes versions (<= v1.19) for several quarters and never noticed this behavior. It all started after upgrading Kubernetes and Flannel; I'm not sure which one is the culprit.

cst152 commented Aug 31, 2021

Oh, thanks for reporting! I'm glad I'm not the only one :)

The problem appears on my Debian Bullseye machine as well:

Kubernetes: v1.21.3
Flannel versions tested: 0.12 and 0.14

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

root@octo00:~# uname -a
Linux somehost 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux

I'm running v1.21.3 and Kubelet talks to /run/containerd/containerd.sock.

No issues on Debian Buster with Linux someotherhost 4.19.0-17-amd64 #1 SMP Debian 4.19.194-1 (2021-06-10) x86_64 GNU/Linux with Flannel 0.12 as well as Flannel 0.14.

mpartel commented Sep 1, 2021

Same issue with Debian Buster + backport kernel (5.10.46-4~bpo10+1) and Kubernetes 1.19.4, using the 'extension' backend.

DrEngi commented Sep 6, 2021

Also having this issue with Fedora CoreOS 34.20210808.3.0. Works fine if I restart all flannel pods, but very troublesome that I have to do this every time I need to take a node offline for maintenance.

@bengtfredh

I had this issue and got a tip that it may be connected to MACAddressPolicy. The default on Fedora is MACAddressPolicy=persistent. After setting MACAddressPolicy=none for the flannel interface, connectivity between nodes works fine after a reboot.
Add the file /etc/systemd/network/50-flannel.link:

[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none

https://www.freedesktop.org/software/systemd/man/systemd.link.html#MACAddressPolicy=
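A minimal sketch of applying this without a full reboot (assuming systemd-udevd processes .link files, which is the default on Fedora / Fedora CoreOS):

# Install the link file, then reload udev so the policy is picked up.
sudo tee /etc/systemd/network/50-flannel.link > /dev/null <<'EOF'
[Match]
OriginalName=flannel*

[Link]
MACAddressPolicy=none
EOF
sudo udevadm control --reload
# The policy applies the next time the flannel.1 interface is created,
# e.g. after the local flannel pod is restarted or the node is rebooted.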

mpartel commented Sep 9, 2021

That didn't fix it for me.
My awful workaround: ping other nodes and a few external IPs from a daemonset and restart the local flanneld if more than a few pings fail. Fortunately restarting flanneld doesn't seem to disrupt anything.

DrEngi commented Sep 15, 2021

Hey @mpartel, would you happen to have specifics about how you implemented this? I'm also having this problem and it's really annoying to have to restart flannel all the time.

mpartel commented Sep 15, 2021

Sorry, I can't share the code (it'd be tangled with stuff specific to our setup anyway). A bit more detail: I run a daemonset that loops doing ping -c 5 -i 0.3 -W 3 <IP> for each node IP, plus a few well-known external IPs. If more than two pings fail (quite an arbitrary number), then it does pkill -e flanneld and waits for one minute before resuming pinging. It needs to run with hostPID: true to be able to kill Flannel. I also set tolerations: [{ operator: 'Exists', effect: 'NoSchedule' }], so it also runs on drained/non-ready/master-only nodes.

Slyke (Author) commented Sep 15, 2021

@DrEngi I put this together based on @mpartel's description:

#!/bin/bash

# Configuration: hosts to ping (add your node IPs) and how many consecutive
# unreachable hosts trigger a flannel restart.
declare -a ipList=("8.8.8.8" "8.8.4.4")
MAX_FAILURES=2

# Code
FAILED_COUNTER=0

for ipCheck in "${ipList[@]}"; do
  # 5 pings, 0.3s apart, 3s timeout per reply
  if ping -c 5 -i 0.3 -W 3 "$ipCheck" > /dev/null 2>&1; then
    FAILED_COUNTER=0
  else
    ((FAILED_COUNTER++))
  fi

  # Restart flannel once MAX_FAILURES hosts in a row were unreachable
  # (-ge rather than -gt, otherwise the threshold can never be reached with only two IPs)
  if [[ $FAILED_COUNTER -ge $MAX_FAILURES ]]; then
    echo "Ping failed '$FAILED_COUNTER' times, with last failure ip: '$ipCheck'"
    FAILED_COUNTER=0
    echo "kubectl delete pod -n kube-system -l app=flannel"
    kubectl delete pod -n kube-system -l app=flannel
    sleep 60
  fi
done

I haven't tested it though.

ghost commented Oct 1, 2021

I am experiencing the same issue after upgrading the nodes in a Kubernetes cluster from Debian Buster to Bullseye. The version of the flannel image is v0.13.0-rancher1.

kskalski commented Oct 1, 2021

Same issue with a k3s cluster on kernel 5.11.0-37-generic, Ubuntu 21.04. Some observations:

  • I think this started (?) or became more frequent with the 1.22 release of k3s, which switched to flannel 0.14 (from 0.12, or maybe some branch off 0.13; I'd need to track their exact version history). There might even be a relation: the more nodes running the newer release, the more likely some pod pair will lose connectivity
  • it is more prominent with vxlan; switching to ipsec seems to remedy the issue somewhat, but in a treacherous way, since it still breaks, just after a few hours of correct operation (a quick way to confirm which backend a node is using is sketched below)
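A quick sketch for confirming whether a node is actually using the vxlan backend (assuming the default VNI of 1, so the device is named flannel.1):

# Show flannel's VXLAN device with driver details; the output should contain
# a "vxlan id 1 ..." line when the vxlan backend is active.
ip -d link show flannel.1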

lossos commented Oct 19, 2021

Problem appears on Debian Bullseye (Debian 11) with Kernel 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) and Kubernetes 1.22.2 with flannel 0.14.0 or 0.15.0-rc1 as well.

It seems to be a problem with vxlan, see also k3s-io/k3s#3863

Slyke (Author) commented Oct 19, 2021

Looks like this PR may fix this issue: #1485

@manuelbuil (Collaborator)

Yes, that PR should fix it. I have just created release v0.15.1. Let's close the issue.

Slyke (Author) commented Oct 25, 2021

@manuelbuil what's the best way to upgrade to the latest version on an existing cluster?

@manuelbuil (Collaborator)

I'd edit the daemonset and point to the new image
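For example, something along these lines should work (a sketch; the daemonset and container names assume the stock kube-flannel manifest, adjust them to your cluster):

# Point the flannel daemonset at the fixed image; pods then roll out according
# to the daemonset's update strategy.
kubectl -n kube-system set image daemonset/kube-flannel-ds \
  kube-flannel=quay.io/coreos/flannel:v0.15.1
# Watch the rollout complete
kubectl -n kube-system rollout status daemonset/kube-flannel-ds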

lossos commented Oct 26, 2021

It seems the images are not yet publicly available; the latest version on Quay is flannel:v0.15.0:
Error response from daemon: manifest for quay.io/coreos/flannel:v0.15.1 not found: manifest unknown: manifest unknown
https://quay.io/repository/coreos/flannel?tab=tags
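For reference, one way to list the tags that have actually been published (a sketch; assumes skopeo is installed):

skopeo list-tags docker://quay.io/coreos/flannel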

@manuelbuil (Collaborator)

It seems the Images are not yet publicly available, Last version on Quay is flannel:v0.15.0 Error response from daemon: manifest for quay.io/coreos/flannel:v0.15.1 not found: manifest unknown: manifest unknown https://quay.io/repository/coreos/flannel?tab=tags

@rajatchopra could you please push the v0.15.1 images to the repo?

Slyke (Author) commented Nov 8, 2021

Hello @manuelbuil, still unable to update Flannel to 0.15.1. It seems the image isn't pushed yet.

@manuelbuil (Collaborator)

Hello @manuelbuil, still unable to update Flannel to 0.15.1. It seems the image isn't pushed yet.

@rajatchopra is the person with the permissions to push the image

Slyke (Author) commented Nov 15, 2021

I tested out 0.15.1 and can confirm that this is now fixed. Thank you!
