
NetworkPolicy broken when pods on different nodes #1830

Closed
remche opened this issue Dec 13, 2019 · 10 comments
remche commented Dec 13, 2019

RKE version:

INFO[0000] Running RKE version: v1.0.0                  
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Docker version: (docker version,docker info preferred)

Docker version 18.09.4, build d14af54

Containers: 33
 Running: 28
 Paused: 0
 Stopped: 5
Images: 61
Server Version: 18.09.4
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-141-generic
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859GiB
Name: osug-test-rke-worker-02
ID: OC5X:4WUZ:MM7D:ORS4:KJIG:V6UB:BAEA:V2MI:G7DY:ZRYH:S2IA:3YC3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Ubuntu 16.04.5 LTS
Kernel 4.4.0-141-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

vSphere

cluster.yml file:


nodes:
-   address: x.x.x.x
    role:
    - controlplane
    - etcd
    hostname_override: rke-test-master-01
    user: user
-   address: x.x.x.x
    role:
    - worker
    hostname_override: rke-test-worker-01
    user: user
-   address: x.x.x.x
    role:
    - worker
    hostname_override: rke-test-worker-02
    user: user
services:
    kubelet:
        fail_swap_on: true
ssh_agent_auth: true
ignore_docker_version: false
kubernetes_version: v1.15.6-rancher1-2
ingress:
    provider: none
cluster_name: test
addon_job_timeout: 0

Steps to Reproduce:

Deploy a cluster with v1.15.6 and the default CNI (Canal).
Add a network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test
spec:
  podSelector:
    matchLabels:
      app: dice-canary
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: default
      podSelector:
        matchLabels:
          role: test
    ports:
    - protocol: TCP
      port: 8081

Results:

Only pods on the same node are allowed to reach the selected pod on port 8081.
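One way to exercise the cross-node case directly is to pin a client pod to a different node than the target and hit the allowed port. A sketch, where the node name and target pod IP are placeholders for this cluster; `role=test` and port 8081 come from the policy above, and the client must run in a namespace labeled `name=default` to satisfy the policy's namespaceSelector:

```shell
# Start a client pod carrying the label the policy allows, pinned to a
# specific node (node name is a placeholder).
kubectl run np-client --image=busybox:1.31 --restart=Never \
  --labels="role=test" \
  --overrides='{"spec":{"nodeName":"rke-test-worker-01"}}' \
  -- sleep 3600

# With the policy working, this should succeed no matter which node
# hosts the target pod (replace the placeholder IP).
kubectl exec np-client -- wget -qO- -T 3 http://<target-pod-ip>:8081
```

If the same request succeeds against a same-node pod but times out against a pod on another node, it reproduces the behaviour reported here.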

We have faced this issue with NetworkPolicy on our development and production clusters since the upgrade, and I can reproduce it on our test cluster.

I'm not sure whether it comes from Calico, flannel, or kube-proxy.
calico-node shows:

2019-12-10 10:46:51.252 [WARNING][34] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019-12-10 10:46:51.328 [WARNING][34] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-708183231): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-xdp-708183231
Error: failed to open object file

IPv6 is disabled on the nodes, as suggested in #1606.


remche commented Dec 13, 2019

I tried to apply a GlobalNetworkPolicy to get iptables logs, but once applied, all traffic between pods is dropped!

Policy:

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: iptables-logging
spec:
  types:
  - Ingress
  - Egress
  ingress:
  - action: Log
    protocol: TCP
  - action: Log
    protocol: UDP
  egress:
  - action: Log
    protocol: TCP
  - action: Log
    protocol: UDP

Log:

Dec 13 20:43:11 rke-worker-02 kernel: [21812004.530427] calico-packet: IN=cali1037a54e65e OUT=calib22d46149c5 MAC=ee:ee:ee:ee:ee:ee:06:20:e2:d5:3b:94:08:00 SRC=10.42.2.104 DST=10.42.2.87 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=18958 DF PROTO=TCP SPT=53406 DPT=8081 WINDOW=29200 RES=0x00 SYN URGP=0 
Dec 13 20:43:19 rke-worker-02 kernel: [21812012.554216] calico-packet: IN=cali1037a54e65e OUT=calib22d46149c5 MAC=ee:ee:ee:ee:ee:ee:06:20:e2:d5:3b:94:08:00 SRC=10.42.2.104 DST=10.42.2.87 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=18959 DF PROTO=TCP SPT=53406 DPT=8081 WINDOW=29200 RES=0x00 SYN URGP=0 


remche commented Dec 17, 2019

I dug further into this issue; the problem only happens on upgraded clusters.

I'm able to reproduce it:

  1. Install a cluster with rke 0.2.8 (v1.14.6-rancher1-1). Everything works fine.
  2. Upgrade with rke 1.0.0 (v1.15.6-rancher1-2). NetworkPolicies are broken as described.
  3. Rebooting the nodes fixes the problem.

I chose this version because it matches our production clusters, but v1.16.3-rancher1-1 seems affected too.
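Given that a reboot clears it, one hedged suspicion is stale VXLAN/flannel state surviving the upgrade on the nodes (the "Failed to query VXLAN device" warning above points the same way). A sketch for inspecting this in place, assuming Canal's default flannel.1 VXLAN device and root access on the node:

```shell
# Inspect the VXLAN device flannel uses; compare the vxlan id and port
# with a freshly provisioned, working node.
ip -d link show flannel.1

# If the device looks stale, deleting it lets flannel recreate it when
# the canal pod on this node restarts (an alternative to a full reboot).
ip link delete flannel.1
```

This is speculative; the thread only establishes that a node reboot clears the broken state.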


@andrezaycev

The same issue after upgrading to 1.17.4 (Rancher 2.4.2). NetworkPolicy is broken: it works only between pods on the same node.


ErikLundJensen commented Apr 20, 2020

We see the same error when running RKE 1.1.0 with Canal as the network plugin.
Snippet from the RKE cluster.yml:

network:
  plugin: canal
  options:
    canal_iface: ens192
    canal_flannel_backend_type: vxlan

Running Ubuntu 18.04 on the nodes.

We get the same result no matter if we use the service IP or the pod IP.
@remche and @andrezaycev are you also using Canal and have you found a solution to the problem?

@andrezaycev

> 3. Rebooting the nodes fixes the problem.

@ErikLundJensen


andrezaycev commented Apr 24, 2020

> We get the same result no matter if we use the service IP or the pod IP.
>
> @remche and @andrezaycev are you also using Canal and have you found a solution to the problem?

I had the problem after upgrading a cluster with existing NetworkPolicies to 1.17. Yes, I am using Canal.
Rebooting the worker nodes helped.

@ErikLundJensen

I found the bug in my setup: the "from" pod was using hostNetwork: true, which changes the policy evaluation. See also https://github.com/projectcalico/felix/issues/1361.
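For reference, a host-networked client reaches the target with its node's IP rather than a pod IP, so pod/namespace selectors will not match it. A hedged sketch of a workaround is an extra ingress rule allowing the node subnet via ipBlock (the CIDR below is a placeholder, not taken from any cluster in this thread):

```yaml
# Extra ingress rule for host-networked clients, which appear to come
# from the node subnet (CIDR is a placeholder).
  - from:
    - ipBlock:
        cidr: 192.168.10.0/24
    ports:
    - protocol: TCP
      port: 8081
```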

@lukegriffith

lukegriffith commented Jul 21, 2020

We're running RKE on Kubernetes v1.18.3 with Canal, and we're seeing this behaviour out of the box, with no upgrade.
Update: this was an issue with my pod CIDR range clashing with the docker network range.
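A clash like that is easy to check mechanically. A sketch comparing the two ranges, where 10.42.0.0/16 is RKE's default cluster_cidr and 172.17.0.0/16 is docker's default bridge network (both are assumptions; substitute the real values from cluster.yml and `docker network inspect bridge`):

```shell
# Compare the cluster pod CIDR with docker's bridge network for overlap.
# Both values below are defaults, not read from a live cluster.
POD_CIDR="10.42.0.0/16"
DOCKER_CIDR="172.17.0.0/16"
python3 -c "
import ipaddress
a = ipaddress.ip_network('$POD_CIDR')
b = ipaddress.ip_network('$DOCKER_CIDR')
print('overlap' if a.overlaps(b) else 'no overlap')
"
```

With the defaults this prints `no overlap`; an `overlap` result means pod traffic can be mis-routed as described above.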


stale bot commented Oct 8, 2020

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the status/stale label Oct 8, 2020
@stale stale bot closed this as completed Oct 22, 2020