
NetworkPolicy broken when pods on different nodes #1830

Closed
remche opened this issue Dec 13, 2019 · 10 comments
remche commented Dec 13, 2019

RKE version:

INFO[0000] Running RKE version: v1.0.0                  
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Docker version: (docker version,docker info preferred)

Docker version 18.09.4, build d14af54

Containers: 33
 Running: 28
 Paused: 0
 Stopped: 5
Images: 61
Server Version: 18.09.4
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-141-generic
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859GiB
Name: osug-test-rke-worker-02
ID: OC5X:4WUZ:MM7D:ORS4:KJIG:V6UB:BAEA:V2MI:G7DY:ZRYH:S2IA:3YC3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Ubuntu 16.04.5 LTS
Kernel 4.4.0-141-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

vSphere

cluster.yml file:


nodes:
-   address: x.x.x.x
    role:
    - controlplane
    - etcd
    hostname_override: rke-test-master-01
    user: user
-   address: x.x.x.x
    role:
    - worker
    hostname_override: rke-test-worker-01
    user: user
-   address: x.x.x.x
    role:
    - worker
    hostname_override: rke-test-worker-02
    user: user
services:
    kubelet:
        fail_swap_on: true
ssh_agent_auth: true
ignore_docker_version: false
kubernetes_version: v1.15.6-rancher1-2
ingress:
    provider: none
cluster_name: test
addon_job_timeout: 0

Steps to Reproduce:

Deploy a cluster with v1.15.6 and the default CNI (Canal).
Add a network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test
spec:
  podSelector:
    matchLabels:
      app: dice-canary
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: default
      podSelector:
        matchLabels:
          role: test
    ports:
    - protocol: TCP
      port: 8081

Results:

Only pods on the same node are allowed to reach the selected pod on port 8081.
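One way to exercise the cross-node case directly is to pin a client pod to a different node than the target and hit the allowed port. A sketch, where the node name and target pod IP are placeholders for this cluster; `role=test` and port 8081 come from the policy above, and the client must run in a namespace labeled `name=default` to satisfy the policy's namespaceSelector:

```shell
# Start a client pod carrying the label the policy allows, pinned to a
# specific node (node name is a placeholder).
kubectl run np-client --image=busybox:1.31 --restart=Never \
  --labels="role=test" \
  --overrides='{"spec":{"nodeName":"rke-test-worker-01"}}' \
  -- sleep 3600

# With the policy working, this should succeed no matter which node
# hosts the target pod (replace the placeholder IP).
kubectl exec np-client -- wget -qO- -T 3 http://<target-pod-ip>:8081
```

If the same request succeeds against a same-node pod but times out against a pod on another node, it reproduces the behaviour reported here.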

We have faced this issue with NetworkPolicy on our development and production clusters since the upgrade, and I can reproduce it on our test cluster.

I'm not sure whether it comes from Calico, flannel, or kube-proxy.
calico-node shows:

2019-12-10 10:46:51.252 [WARNING][34] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019-12-10 10:46:51.328 [WARNING][34] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-708183231): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-xdp-708183231
Error: failed to open object file

IPv6 is disabled on the nodes, as suggested in #1606.


remche commented Dec 13, 2019

I tried to apply a GlobalNetworkPolicy to get iptables logs, but once applied, all traffic between pods is dropped!

Policy:

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: iptables-logging
spec:
  types:
  - Ingress
  - Egress
  ingress:
  - action: Log
    protocol: TCP
  - action: Log
    protocol: UDP
  egress:
  - action: Log
    protocol: TCP
  - action: Log
    protocol: UDP

Log:

Dec 13 20:43:11 rke-worker-02 kernel: [21812004.530427] calico-packet: IN=cali1037a54e65e OUT=calib22d46149c5 MAC=ee:ee:ee:ee:ee:ee:06:20:e2:d5:3b:94:08:00 SRC=10.42.2.104 DST=10.42.2.87 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=18958 DF PROTO=TCP SPT=53406 DPT=8081 WINDOW=29200 RES=0x00 SYN URGP=0 
Dec 13 20:43:19 rke-worker-02 kernel: [21812012.554216] calico-packet: IN=cali1037a54e65e OUT=calib22d46149c5 MAC=ee:ee:ee:ee:ee:ee:06:20:e2:d5:3b:94:08:00 SRC=10.42.2.104 DST=10.42.2.87 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=18959 DF PROTO=TCP SPT=53406 DPT=8081 WINDOW=29200 RES=0x00 SYN URGP=0 


remche commented Dec 17, 2019

I dug further into this issue; the problem only happens on upgraded clusters.

I'm able to reproduce it:

  1. Install a cluster with rke 0.2.8 (v1.14.6-rancher1-1). Everything works fine.
  2. Upgrade with rke 1.0.0 (v1.15.6-rancher1-2). NetworkPolicies are broken as described.
  3. Rebooting the nodes fixes the problem.

I chose this version because it matches our production clusters, but v1.16.3-rancher1-1 seems affected too.
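Given that a reboot clears it, one hedged suspicion is stale VXLAN/flannel state surviving the upgrade on the nodes (the "Failed to query VXLAN device" warning above points the same way). A sketch for inspecting this in place, assuming Canal's default flannel.1 VXLAN device and root access on the node:

```shell
# Inspect the VXLAN device flannel uses; compare the vxlan id and port
# with a freshly provisioned, working node.
ip -d link show flannel.1

# If the device looks stale, deleting it lets flannel recreate it when
# the canal pod on this node restarts (an alternative to a full reboot).
ip link delete flannel.1
```

This is speculative; the thread only establishes that a node reboot clears the broken state.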


@andrezaycev

The same issue after upgrading to 1.17.4 (Rancher 2.4.2). NetworkPolicy is broken: it works only between pods on the same node.


ErikLundJensen commented Apr 20, 2020

We see the same error when running RKE 1.1.0 with Canal as the network plugin.
Snippet from the RKE cluster.yml:

network:
  plugin: canal
  options:
    canal_iface: ens192
    canal_flannel_backend_type: vxlan

Running Ubuntu 18.04 on the nodes.

We get the same result no matter if we use the service IP or the pod IP.
@remche and @andrezaycev are you also using Canal and have you found a solution to the problem?

@andrezaycev

> 3. Rebooting the nodes fixes the problem.

@ErikLundJensen


andrezaycev commented Apr 24, 2020

> We get the same result no matter if we use the service IP or the pod IP.
>
> @remche and @andrezaycev are you also using Canal and have you found a solution to the problem?

I had the problem after upgrading a cluster with existing NetworkPolicies to 1.17. Yes, I am using Canal.
Rebooting the worker nodes helped.

@ErikLundJensen

I found the bug in my setup: the "from" pod was using hostNetwork: true, which changes the policy evaluation. See also https://github.com/projectcalico/felix/issues/1361.
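For reference, a host-networked client reaches the target with its node's IP rather than a pod IP, so pod/namespace selectors will not match it. A hedged sketch of a workaround is an extra ingress rule allowing the node subnet via ipBlock (the CIDR below is a placeholder, not taken from any cluster in this thread):

```yaml
# Extra ingress rule for host-networked clients, which appear to come
# from the node subnet (CIDR is a placeholder).
  - from:
    - ipBlock:
        cidr: 192.168.10.0/24
    ports:
    - protocol: TCP
      port: 8081
```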

@lukegriffith

lukegriffith commented Jul 21, 2020

We're running RKE on Kubernetes v1.18.3 with Canal, and we're seeing this behaviour out of the box, with no upgrade.
Update: this was an issue with my pod CIDR range clashing with the docker network range.
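A clash like that is easy to check mechanically. A sketch comparing the two ranges, where 10.42.0.0/16 is RKE's default cluster_cidr and 172.17.0.0/16 is docker's default bridge network (both are assumptions; substitute the real values from cluster.yml and `docker network inspect bridge`):

```shell
# Compare the cluster pod CIDR with docker's bridge network for overlap.
# Both values below are defaults, not read from a live cluster.
POD_CIDR="10.42.0.0/16"
DOCKER_CIDR="172.17.0.0/16"
python3 -c "
import ipaddress
a = ipaddress.ip_network('$POD_CIDR')
b = ipaddress.ip_network('$DOCKER_CIDR')
print('overlap' if a.overlaps(b) else 'no overlap')
"
```

With the defaults this prints `no overlap`; an `overlap` result means pod traffic can be mis-routed as described above.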


stale bot commented Oct 8, 2020

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the status/stale label Oct 8, 2020
@stale stale bot closed this as completed Oct 22, 2020