Calico node networking errors #1606

Closed

piwi91 opened this issue Aug 29, 2019 · 25 comments

piwi91 commented Aug 29, 2019

RKE version:

v0.2.8

Docker version: (docker version, docker info preferred)

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

CentOS 7.6 Kernel 3.10.0-957.1.3.el7.x86_64
and
CentOS 7.6 Kernel 3.10.0-957.27.2.el7.x86_64

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

OpenStack

cluster.yml file:

# Nodes: this is the only required configuration. Everything else is optional.
nodes:
  # Controlplane & Etcd nodes
  - address: 10.253.10.7
    user: ansible
    role:
      - controlplane
      - etcd
    hostname_override: xxxxxxx
  - address: 10.253.10.8
    user: ansible
    role:
      - controlplane
      - etcd
    hostname_override: xxxxxxx
  - address: 10.253.10.9
    user: ansible
    role:
      - controlplane
      - etcd
    hostname_override: xxxxxxx
  # Worker nodes
  - address: 10.253.10.6
    user: ansible
    role:
      - worker
    hostname_override: xxxxxxx
  - address: 10.253.10.4
    user: ansible
    role:
      - worker
    hostname_override: xxxxxxx
  - address: 10.253.10.5
    user: ansible
    role:
      - worker
    hostname_override: xxxxxxx

# Enable use of SSH agent to use SSH private keys with passphrase
# This requires the environment `SSH_AUTH_SOCK` configured pointing to your SSH agent which has the private key added
ssh_agent_auth: true

# Set the name of the Kubernetes cluster
cluster_name: xxxxxxxxxxxx

# Check out the Kubernetes version support on the rancher/rke GitHub page: https://github.com/rancher/rke/releases/
kubernetes_version: v1.15.3-rancher1-1

services:
  etcd:
    backup_config:
      interval_hours: 12
      retention: 6
  kube-api:
    # IP range for any services created on Kubernetes
    # This must match the service_cluster_ip_range in kube-controller
    service_cluster_ip_range: 10.21.0.0/16
    # Expose a different port range for NodePort services
    service_node_port_range: 30000-32767
    pod_security_policy: false
    extra_args:
      oidc-client-id: "spn:xxxxxxxxxx"
      oidc-issuer-url: "https://sts.windows.net/xxxxxxxxxx/"
      oidc-username-claim: "upn"
      oidc-groups-claim: "groups"
      v: 2
  kube-controller:
    # CIDR pool used to assign IP addresses to pods in the cluster
    cluster_cidr: 10.20.0.0/16
    # IP range for any services created on Kubernetes
    # This must match the service_cluster_ip_range in kube-api
    service_cluster_ip_range: 10.21.0.0/16
    extra_args:
      v: 2
  kubelet:
    # Base domain for the cluster
    cluster_domain: xxxxxxxxxxx
    # IP address for the DNS service endpoint
    cluster_dns_server: 10.21.0.10
    # Fail if swap is on
    fail_swap_on: true
    extra_args:
      v: 2

# Currently, the only authentication strategy supported is x509.
# You can optionally create additional SANs (hostnames or IPs) to add to
#  the API server PKI certificate.
# This is useful if you want to use a load balancer for the control plane servers.
authentication:
  strategy: x509 # Use x509 for cluster administrator credentials and keep them very safe after you've created them
  sans:
    - "xxx.xxx.xxx.xxx"

cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: xxxxxxxx
      password: xxxxxxxx
      auth-url: xxxxxxx
      tenant-id: xxxxxxx
      domain-id: default
    load_balancer:
      subnet-id: 88a8968f-2d6d-494e-a67e-dab207d068f0
    block_storage:
      bs-version: v3
      trust-device-path: false
      ignore-volume-az: false

# There are several network plug-ins that work, but we default to canal
network:
  plugin: canal

# Specify DNS provider (coredns or kube-dns)
dns:
  provider: coredns

# We disable the ingress controller deployment because we are going to run multiple ingress controllers with our own configuration
ingress:
  provider: none

# All add-on manifests MUST specify a namespace
# addons: ''
# addons_include: []

Steps to Reproduce:

Deploy an empty cluster with RKE
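
With the cluster.yml above, this amounts to running:

rke up --config cluster.yml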

Results:

2019-08-29 14:26:48.610 [INFO][9] startup.go 256: Early log level set to info
2019-08-29 14:26:48.610 [INFO][9] startup.go 272: Using NODENAME environment for node name
2019-08-29 14:26:48.610 [INFO][9] startup.go 284: Determined node name: nlsvpkubec01
2019-08-29 14:26:48.614 [INFO][9] k8s.go 228: Using Calico IPAM
2019-08-29 14:26:48.614 [INFO][9] startup.go 316: Checking datastore connection
2019-08-29 14:26:48.630 [INFO][9] startup.go 340: Datastore connection verified
2019-08-29 14:26:48.630 [INFO][9] startup.go 95: Datastore is ready
2019-08-29 14:26:48.655 [INFO][9] startup.go 530: FELIX_IPV6SUPPORT is false through environment variable
2019-08-29 14:26:48.661 [INFO][9] startup.go 181: Using node name: nlsvpkubec01
2019-08-29 14:26:48.693 [INFO][18] k8s.go 228: Using Calico IPAM
CALICO_NETWORKING_BACKEND is none - no BGP daemon running
Calico node started successfully
2019-08-29 14:26:49.845 [WARNING][38] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019-08-29 14:26:49.881 [WARNING][38] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-942558251): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-xdp-942558251
Error: failed to open object file
2019-08-29 14:27:03.250 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf52160db4d62435, ext:13105494327, loc:(*time.Location)(0x2b08080)}}
2019-08-29 14:28:26.819 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf521622a8ce8c8a, ext:96903670157, loc:(*time.Location)(0x2b08080)}}
2019-08-29 14:29:36.819 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf5216341ce3e9fd, ext:166703743746, loc:(*time.Location)(0x2b08080)}}
2019-08-29 14:31:06.819 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf52164aa8e3ca35, ext:256905062112, loc:(*time.Location)(0x2b08080)}}

cespo commented Sep 5, 2019

Hitting the same error here:

2019-09-05 04:36:18.167 [WARNING][23732] daemon.go 592: Felix is shutting down reason="config changed"
2019-09-05 04:36:19.288 [WARNING][23732] health.go 190: Reporter failed readiness checks name="int_dataplane" reporter-state=&health.reporterState{name:"int_dataplane", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf544210893fef4b, ext:105631748, loc:(*time.Location)(0x2b314c0)}}
2019-09-05 04:36:20.304 [WARNING][23772] int_dataplane.go 362: Failed to query VXLAN device error=Link not found
2019-09-05 04:36:20.333 [WARNING][23772] int_dataplane.go 392: Failed to cleanup preexisting XDP state error=failed to load BPF program (/tmp/felix-bpf-927537225): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-bpf-927537225
Error: failed to open object file


cespo commented Sep 5, 2019

I think this is related to https://github.com/projectcalico/calico/issues/2191.
Fixed it by disabling IPv6 on the node:

echo 1 > /proc/sys/net/ipv6/conf/default/disable_ipv6
echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
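
To make the workaround survive reboots, a minimal sketch using a sysctl.d drop-in (the file name is arbitrary):

# /etc/sysctl.d/99-disable-ipv6.conf
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.all.disable_ipv6 = 1

Reload with sysctl --system (or reboot) for it to take effect.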

@mateuszkwiatkowski (Contributor) commented:

Hello,
We hit the same error when deploying 1.15.3 with canal. We haven't seen this error with older k8s versions and canal, nor with 1.15.3 and calico.


eroji commented Oct 9, 2019

This problem seems to be present with Rancher 2.3.0 and 1.15.4.


combor commented Oct 10, 2019

I can confirm that it exists in Rancher 2.3.0 and 1.15.4 on RancherOS:

2019-10-10 09:47:02.482 [INFO][9] startup.go 256: Early log level set to info
2019-10-10 09:47:02.482 [INFO][9] startup.go 272: Using NODENAME environment for node name
2019-10-10 09:47:02.482 [INFO][9] startup.go 284: Determined node name: etcd2
2019-10-10 09:47:02.483 [INFO][9] k8s.go 228: Using Calico IPAM
2019-10-10 09:47:02.484 [INFO][9] startup.go 316: Checking datastore connection
2019-10-10 09:47:02.497 [INFO][9] startup.go 340: Datastore connection verified
2019-10-10 09:47:02.497 [INFO][9] startup.go 95: Datastore is ready
2019-10-10 09:47:02.520 [INFO][9] startup.go 530: FELIX_IPV6SUPPORT is false through environment variable
2019-10-10 09:47:02.526 [INFO][9] startup.go 181: Using node name: etcd2
2019-10-10 09:47:02.552 [INFO][17] k8s.go 228: Using Calico IPAM
CALICO_NETWORKING_BACKEND is none - no BGP daemon running
Calico node started successfully
2019-10-10 09:47:03.642 [WARNING][35] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019-10-10 09:47:03.686 [WARNING][35] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-243267895): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-xdp-243267895
Error: failed to open object file

2019-10-10 09:47:36.630 [WARNING][35] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf5fdd6e24d6f8d1, ext:33033435314, loc:(*time.Location)(0x2b08080)}}


Malpractis commented Oct 10, 2019

Also seeing the same with Rancher 2.3.0 and kube 1.15.4 on Ubuntu 16.04 with IPv6 disabled. Fresh install of OS and cluster.

@michaellqu commented:

This problem exists in Rancher 2.3.0 and Kubernetes 1.15.4 on Ubuntu 19.04:

2019-10-14 11:06:44.361 [INFO][9] startup.go 256: Early log level set to info
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.361 [INFO][9] startup.go 272: Using NODENAME environment for node name
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.361 [INFO][9] startup.go 284: Determined node name: k8s-master01
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.363 [INFO][9] k8s.go 228: Using Calico IPAM
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.363 [INFO][9] startup.go 316: Checking datastore connection
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.386 [INFO][9] startup.go 340: Datastore connection verified
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.386 [INFO][9] startup.go 95: Datastore is ready
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.403 [INFO][9] startup.go 530: FELIX_IPV6SUPPORT is false through environment variable
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.407 [INFO][9] startup.go 181: Using node name: k8s-master01
2019/10/14 7:06:44 PM 2019-10-14 11:06:44.439 [INFO][17] k8s.go 228: Using Calico IPAM
2019/10/14 7:06:44 PM CALICO_NETWORKING_BACKEND is none - no BGP daemon running
2019/10/14 7:06:44 PM Calico node started successfully
2019/10/14 7:06:45 PM 2019-10-14 11:06:45.552 [WARNING][35] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019/10/14 7:06:45 PM 2019-10-14 11:06:45.593 [WARNING][35] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-107591071): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
2019/10/14 7:06:45 PM libbpf: failed to get EHDR from /tmp/felix-xdp-107591071
2019/10/14 7:06:45 PM Error: failed to open object file
2019/10/14 7:06:45 PM
2019/10/14 7:08:09 PM 2019-10-14 11:08:09.791 [WARNING][35] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf6133a663a64dca, ext:84121564611, loc:(*time.Location)(0x2b08080)}}
2019/10/14 7:09:07 PM 2019-10-14 11:09:07.037 [WARNING][35] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf6133b4afa5eddd, ext:141322866623, loc:(*time.Location)(0x2b08080)}}
2019/10/14 7:09:19 PM 2019-10-14 11:09:19.790 [WARNING][35] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf6133b7e3af81db, ext:154122167734, loc:(*time.Location)(0x2b08080)}}

@superseb (Contributor) commented:

Please see rancher/rancher#23430 (comment) and let me know if it resolves the issue.

piwi91 (Author) commented Oct 18, 2019

@superseb this resolved the health checks, but the int_dataplane errors are still present:

2019-10-18 08:13:22.651 [INFO][9] startup.go 256: Early log level set to info
2019-10-18 08:13:22.653 [INFO][9] startup.go 272: Using NODENAME environment for node name
2019-10-18 08:13:22.653 [INFO][9] startup.go 284: Determined node name: nlsvpkubec01
2019-10-18 08:13:22.655 [INFO][9] k8s.go 228: Using Calico IPAM
2019-10-18 08:13:22.655 [INFO][9] startup.go 316: Checking datastore connection
2019-10-18 08:13:22.667 [INFO][9] startup.go 340: Datastore connection verified
2019-10-18 08:13:22.667 [INFO][9] startup.go 95: Datastore is ready
2019-10-18 08:13:22.694 [INFO][9] startup.go 530: FELIX_IPV6SUPPORT is false through environment variable
2019-10-18 08:13:22.736 [INFO][9] startup.go 181: Using node name: nlsvpkubec01
2019-10-18 08:13:22.772 [INFO][18] k8s.go 228: Using Calico IPAM
CALICO_NETWORKING_BACKEND is none - no BGP daemon running
Calico node started successfully
2019-10-18 08:13:23.913 [WARNING][38] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019-10-18 08:13:24.036 [WARNING][38] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-082326122): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-xdp-082326122
Error: failed to open object file

@vitobotta commented:

Hi @superseb, I am seeing the same errors in the logs. Applying the CRDs from the other thread fixed some errors, but I still see the ones pasted by @piwi91 above. I am also having a problem with a node.kubernetes.io/network-unavailable:NoSchedule taint which I haven't managed to remove. Could it be caused by this problem with Calico? Thanks
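
For reference, a taint like that can be cleared manually with standard kubectl syntax (substitute your node name):

kubectl taint nodes <node-name> node.kubernetes.io/network-unavailable:NoSchedule-

Note that the node controller will re-add it as long as the node's network condition stays unavailable, so the Calico errors above are the underlying thing to fix.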


rbq commented Jan 21, 2020

Since upgrading to Rancher v2.3.4 and Kubernetes v1.17.0-rancher1-2 I'm getting Calico errors on some of my nodes, specifically the ones that happen to be virtual machines (Hyper-V). Bare-metal ones are fine.

Pod: canal-xyzabc, container calico-node (image rancher/calico-node:v3.10.2):

[…]
2020-01-21 15:57:40.097 [WARNING][38878] int_dataplane.go 776: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-457814611): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory 
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program. 
libbpf: failed to load object '/tmp/felix-bpf-457814611' 
Error: failed to load object file 
 try=8 
2020-01-21 15:57:40.137 [WARNING][38878] int_dataplane.go 776: failed to wipe the XDP state error=failed to load BPF program (/tmp/felix-bpf-090885526): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory 
libbpf: Error in bpf_object__probe_name():Operation not permitted(1). Couldn't load basic 'r0 = 0' BPF program. 
libbpf: failed to load object '/tmp/felix-bpf-090885526' 
Error: failed to load object file 
 try=9 
2020-01-21 15:57:40.137 [PANIC][38878] int_dataplane.go 779: Failed to wipe the XDP state after 10 tries 
panic: (*logrus.Entry) (0x1a8e900,0xc000186140) 
 
goroutine 1 [running]: 
github.com/sirupsen/logrus.Entry.log(0xc0000d2050, 0xc0001d0f30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f6700000000, ...) 
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:112 +0x2d2 
github.com/sirupsen/logrus.(*Entry).Panic(0xc0006603c0, 0xc0005d2250, 0x1, 0x1) 
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:182 +0x103 
github.com/sirupsen/logrus.(*Entry).Panicf(0xc0006603c0, 0x1b11e1b, 0x2b, 0xc0005d2300, 0x1, 0x1) 
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/entry.go:230 +0xd4 
github.com/sirupsen/logrus.(*Logger).Panicf(0xc0000d2050, 0x1b11e1b, 0x2b, 0xc0005d2300, 0x1, 0x1) 
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/logger.go:173 +0x86 
github.com/sirupsen/logrus.Panicf(...) 
	/go/pkg/mod/github.com/projectcalico/logrus@v0.0.0-20180627202928-fc9bbf2f57995271c5cd6911ede7a2ebc5ea7c6f/exported.go:145 
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).shutdownXDPCompletely(0xc0000f6d80) 
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20191003065011-e01caf688c90/dataplane/linux/int_dataplane.go:779 +0x2cd 
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).doStaticDataplaneConfig(0xc0000f6d80) 
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20191003065011-e01caf688c90/dataplane/linux/int_dataplane.go:724 +0xc22 
github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).Start(0xc0000f6d80) 
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20191003065011-e01caf688c90/dataplane/linux/int_dataplane.go:584 +0x2f 
github.com/projectcalico/felix/dataplane.StartDataplaneDriver(0xc0005f4000, 0xc000162390, 0xc000576d20, 0x1, 0xc0005d37c0, 0x0) 
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20191003065011-e01caf688c90/dataplane/driver.go:186 +0xf09 
github.com/projectcalico/felix/daemon.Run(0x1ae3b51, 0x15, 0x1db21b0, 0x7, 0x1e08600, 0x28, 0x1ddf1c0, 0x18) 
	/go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20191003065011-e01caf688c90/daemon/daemon.go:304 +0x18d7 
main.main() 
	/go/src/github.com/projectcalico/node/cmd/calico-node/main.go:102 +0x423 

@theAkito commented:

I can confirm @rbq's error. I am experiencing pretty much the same thing.

@barankaynak commented:

> @superseb this resolved the health checks but int_dataplane errors are still present: [quoting piwi91's Oct 18 comment and log above]

Same errors on my cluster.

@skaven81 commented:

Any resolution to this? I'm seeing this in one of our test clusters that we just upgraded to 1.15.5 using Rancher 2.2.9.


imle commented Feb 3, 2020

I had this issue as well. I did an empty config gen and copied over the new container versions and that seems to have resolved everything for me.


rbq commented Feb 12, 2020

> I had this issue as well. I did an empty config gen and copied over the new container versions and that seems to have resolved everything for me.

@imle Could you please provide the exact steps you took?

@thomashoell commented:

I just upgraded my cluster from 1.15.5 to 1.15.10, which solved my immediate problems. Afterwards I upgraded Rancher to 2.3.5 and my cluster to 1.17.3. No issues so far.


mcmcghee commented Mar 7, 2020

I was having this issue, and it was due to a combination of Ubuntu, Linux kernel 5.3, and Secure Boot. The newer kernels have lockdown enabled, which breaks BPF. There is a bug report here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863234

If you're having this problem, you'll see errors like the ones below in dmesg.

Kernel is locked down from EFI secure boot; see man kernel_lockdown.7
Lockdown: systemd: BPF is restricted; see man kernel_lockdown.7
systemd[1]: File /lib/systemd/system/systemd-journald.service:36 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
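
Two quick ways to check whether this applies to a node are the kernel's lockdown state and the Secure Boot state (mokutil may need to be installed):

cat /sys/kernel/security/lockdown
mokutil --sb-state

On an affected machine the first command shows [integrity] or [confidentiality] selected rather than [none].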

@flauschi commented:

My current workaround is to disable XDP until the problem @mcmcghee described is fixed:
kubectl -n kube-system patch daemonset/canal -p '{"spec": {"template": {"spec": {"containers": [{"name": "calico-node", "env": [{"name": "FELIX_XDPENABLED", "value": "false"}]}]}}}}'
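
To confirm the patch landed (the daemonset then rolls the canal pods with the new variable):

kubectl -n kube-system get daemonset canal -o yaml | grep -A1 FELIX_XDPENABLED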


miraculixx commented Jun 13, 2020

On a sandbox cluster that had this problem I was able to recover by doing the following (just fishing, as nothing else worked). I advise against trying this unless you are quite sure you can live with a failed cluster, but it worked for me.

# very loosely following https://docs.projectcalico.org/getting-started/kubernetes/flannel/flannel
$ kubectl -n kube-system delete daemonset canal
$ kubectl delete clusterrolebinding calico-node
$ kubectl delete clusterrolebinding canal-calico
$ kubectl apply -f https://docs.projectcalico.org/manifests/canal.yaml
$ kubectl create clusterrolebinding canal --clusterrole=cluster-admin --serviceaccount=kube-system:canal
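
You can then watch the replacement pods come up; the canal manifest labels them k8s-app=canal:

$ kubectl -n kube-system get pods -l k8s-app=canal -w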


stale bot commented Oct 8, 2020

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the status/stale label Oct 8, 2020

@TeroPihlaja commented:

We are seeing similar issues with Rancher 2.5.0, Kubernetes 1.18.8, and rancher/calico-node:v3.13.4.

I tried disabling IPv6 with

echo 1 > /proc/sys/net/ipv6/conf/default/disable_ipv6
echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6

but that doesn't seem to help.
I also tried setting FELIX_XDPENABLED to "false", but that has not helped with the issue either.

We are using Fedora CoreOS.

The only thing that seems to work after a node restart is to wait for everything to start on the node and then manually restart the canal pod. That seems to restore network connectivity.
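
Concretely, restarting the canal pod on one node looks something like this (assuming the pods carry the usual k8s-app=canal label; the daemonset recreates the pod immediately):

kubectl -n kube-system delete pod -l k8s-app=canal --field-selector spec.nodeName=<node-name>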

stale bot removed the status/stale label Oct 9, 2020

@olivierlemasle commented:

This seems to be this issue: flannel-io/flannel#1321

Adding a file /etc/systemd/network/50-flannel.link with the following content should fix the issue:

[Match]
OriginalName=flannel*
[Link]
MACAddressPolicy=none

E.g. with ignition:

    - path: /etc/systemd/network/50-flannel.link
      contents:
        inline: |
          [Match]
          OriginalName=flannel*
          [Link]
          MACAddressPolicy=none
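
After a reboot you can check which .link file udev applied to flannel's VXLAN device (named flannel.1 by default):

udevadm test-builtin net_setup_link /sys/class/net/flannel.1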

@TeroPihlaja commented:

@olivierlemasle Thank you! This appears to solve our issues!


stale bot commented Dec 19, 2020

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the status/stale label Dec 19, 2020