
IP address sharing + externalTrafficPolicy: Local; exact same selector mixes traffic between pods #271

Closed
uablrek opened this issue Jun 27, 2018 · 2 comments

Comments

uablrek (Contributor) commented Jun 27, 2018

Is this a bug report or a feature request?:

Bug

What happened:

The MetalLB documentation states that if address sharing and externalTrafficPolicy: Local are combined, the services must "have the exact same selector".

But the implementing pods must then also carry the same labels, or Kubernetes complains. Kubernetes then regards the pods as interchangeable and distributes traffic for all services with the shared IP (and the same selector) to all of the pods, regardless of which ports they actually serve.

Hence my two sets of pods, serving ports 22 and 5001 respectively, each receive a mix of traffic for both ports, and half the traffic is lost.

What you expected to happen:

Traffic to a service should be distributed only to the pods implementing that service.

How to reproduce it (as minimally and precisely as possible):

I have two services and deployments:

apiVersion: v1
kind: Service
metadata:
  name: cgen
  annotations:
    metallb.universe.tf/allow-shared-ip: ekvm
spec:
  selector:
    app: ekvm
  ports:
  - port: 5001
  externalTrafficPolicy: Local
  loadBalancerIP: 10.0.0.2
  type: LoadBalancer
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: cgen-deployment
spec:
  selector:
    matchLabels:
      app: ekvm
  replicas: 4
  template:
    metadata:
      labels:
        app: ekvm
    spec:
      containers:
      - name: cgen
        image: example.com/cgen:0.0.1
        ports:
        - containerPort: 5001

and

apiVersion: v1
kind: Service
metadata:
  name: ekvm-busybox
  annotations:
    metallb.universe.tf/allow-shared-ip: ekvm
spec:
  selector:
    app: ekvm
  ports:
  - port: 1022
    name: ssh
    targetPort: 22
  externalTrafficPolicy: Local
  loadBalancerIP: 10.0.0.2
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ekvm-busybox-deployment
spec:
  selector:
    matchLabels:
      app: ekvm
  replicas: 4
  template:
    metadata:
      labels:
        app: ekvm
    spec:
      containers:
      - name: ekvm-busybox
        image: example.com/ekvm-busybox:0.0.1
        ports:
        - containerPort: 22
          name: ssh

Once started, every other attempt to connect to e.g. port 1022 fails:

vm-201 ~ # ssh -p 1022 10.0.0.2 netstat -putan
dbclient: Caution, skipping hostkey check for 10.0.0.2

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      35/dropbear
tcp        0      0 11.0.2.2:22             192.168.0.201:56530     ESTABLISHED 56/dropbear
tcp        0      0 :::80                   :::*                    LISTEN      28/inetd
tcp        0      0 :::22                   :::*                    LISTEN      35/dropbear
tcp        0      0 :::23                   :::*                    LISTEN      28/inetd
vm-201 ~ # ssh -p 1022 10.0.0.2 netstat -putan

dbclient: Connection to root@10.0.0.2:1022 exited: Connect failed: Connection refused

The source address is preserved (192.168.0.201), as seen in the first attempt, so far so good. But the second attempt fails because it is routed to the other pod, which serves only port 5001.

I run proxy-mode=ipvs, so it is quite easy to visualize the problem:

vm-003 ~ # ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.0.3:31374 rr
  -> 11.0.3.2:5001                Masq    1      0          0
  -> 11.0.3.3:5001                Masq    1      0          0
TCP  10.0.0.2:1022 rr
  -> 11.0.3.2:22                  Masq    1      0          0
  -> 11.0.3.3:22                  Masq    1      0          0
TCP  10.0.0.2:5001 rr
  -> 11.0.3.2:5001                Masq    1      0          0
  -> 11.0.3.3:5001                Masq    1      0          0
TCP  11.0.3.1:30919 rr
...

The loadBalancerIP 10.0.0.2 is distributed locally (good), but traffic to port 1022 is distributed to both local pods, which is a bug IMO. Maybe not in MetalLB though.

Anything else we need to know?:

Actually, I don't understand the restriction of requiring exactly the same selector.

By removing the check I can make it work perfectly the way I want it.

Environment:

  • MetalLB version:
    Built from source, commit d38ad1e
  • Kubernetes version:
    v1.10.5-beta
  • BGP router type/version:
    gobgp
  • OS (e.g. from /etc/os-release):
    Own BusyBox-based
  • Kernel (e.g. uname -a):
    Linux vm-003 4.16.2 #2 SMP Mon Jun 11 16:02:46 CEST 2018 x86_64 GNU/Linux
danderson (Contributor)

Thanks for the report.

You're right, I tried to be generic in the documentation, but I actually need to be more specific: to use externalTrafficPolicy: Local on a shared-IP service, all services must send traffic to the same set of pods.

This is because MetalLB can only control traffic flow at L3, so it's all-or-nothing. MetalLB can't tell the outside world "I want traffic for IP 1.2.3.4, but only port 80", it can only say "I want all traffic for 1.2.3.4". This is fine with the Cluster traffic policy, because kube-proxy will forward traffic per-port to the right destinations, no matter where they are. But with the Local traffic policy, kube-proxy can only forward to pods on the same node.

So, we have 2 constraints:

  • MetalLB can only attract traffic for all ports on an IP
  • kube-proxy can only route traffic to pods on the same node

There are 2 ways to work with these constraints:

  • Only receive traffic if all shared services have >=1 healthy pod on the current node. This leads to confusing behavior, where one unhealthy service triggers traffic shifts in other services.
  • Force all services to use the exact same pods as backend. That way, the ready/unready endpoint data is identical for all services, and it's easy to decide if a node should receive traffic.

MetalLB picks the second option, because the first leads to a bunch of confusing behaviors that would mean a lot more bugs.
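
Concretely, the supported pattern looks roughly like the following (a sketch only, names and image are made up): a single Deployment whose pods serve both ports, fronted by two Services that share the IP and use exactly the same selector.

apiVersion: v1
kind: Service
metadata:
  name: svc-5001
  annotations:
    metallb.universe.tf/allow-shared-ip: shared-key
spec:
  selector:
    app: shared-backend
  ports:
  - port: 5001
  externalTrafficPolicy: Local
  loadBalancerIP: 10.0.0.2
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: svc-ssh
  annotations:
    metallb.universe.tf/allow-shared-ip: shared-key
spec:
  selector:
    app: shared-backend
  ports:
  - port: 1022
    targetPort: 22
  externalTrafficPolicy: Local
  loadBalancerIP: 10.0.0.2
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-backend
spec:
  selector:
    matchLabels:
      app: shared-backend
  replicas: 4
  template:
    metadata:
      labels:
        app: shared-backend
    spec:
      containers:
      # one pod image that serves both ports, so the endpoint sets
      # of the two Services are identical on every node
      - name: backend
        image: example.com/combined-backend:0.0.1
        ports:
        - containerPort: 22
        - containerPort: 5001

With that layout the ready/unready endpoints are identical for both Services, so a node either attracts all traffic for 10.0.0.2 or none of it.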

Unfortunately this means that IP sharing with externalTrafficPolicy: Local is only really useful as a workaround for kubernetes/kubernetes#23880 , and not as a generic IP sharing mechanism.

It's pretty clear that I need to make the docs more explicit about the limitations for your scenario. If you have suggestions on how to make this behavior better (e.g. how to remove this constraint without a flood of bug reports about confusing traffic routing because of what I explained above), I'd love to make IP sharing more generally useful.

uablrek (Contributor, Author) commented Jul 1, 2018

Thanks for your detailed answer, and I see your point.

But IMHO a bare-metal k8s system will never be as "forgiving" as e.g. GCE, so one may assume somewhat better knowledge from those users. The local traffic policy is often a very desirable, if not required, feature because of the source address preservation. I think a reasonable division of responsibility would be:

  • MetalLB makes sure that external traffic (L3) is routed to a set of nodes.
  • The application makes sure there are traffic-handling pods on those nodes if the local traffic policy is used.

Most likely an application will be assigned one external IP but will be implemented as several services with different functions and ports for external traffic, which is why I want both IP sharing and the local traffic policy.

I also foresee a scaling problem that I hope can be turned into a feature using the local traffic policy:

In a large system not all nodes can be ECMP targets. In that case I would like to use "frontend" pods (e.g. Ingress Controllers) forming a "load-balancing tier" and using the local traffic policy to avoid an unnecessary hop. External traffic flow would become very efficient, I think, and the source would always be preserved.
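
A rough sketch of what I mean (node label, names, image, and address are made up): pin the ingress controller pods to the designated frontend nodes and expose them with a Local-policy LoadBalancer Service, so the announcement only comes from nodes that actually run a frontend pod.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-ingress
spec:
  selector:
    matchLabels:
      app: frontend-ingress
  replicas: 3
  template:
    metadata:
      labels:
        app: frontend-ingress
    spec:
      # hypothetical label carried only by the nodes that act as ECMP targets
      nodeSelector:
        node-role/frontend: "true"
      containers:
      - name: ingress
        image: example.com/ingress-controller:0.0.1
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: frontend-ingress
spec:
  selector:
    app: frontend-ingress
  ports:
  - port: 80
  externalTrafficPolicy: Local
  loadBalancerIP: 10.0.0.3
  type: LoadBalancer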

I would suggest some "expert" option for MetalLB, like --unsafe-local-policy, that would disable the check and also allow the local traffic policy in L2 mode. From long experience I know that it would not prevent the flood of bug reports, but if you think the flow would be bearable, please consider it.

For now I can patch MetalLB as described. Another option is to fix externalIPs in k8s. It already does the local traffic policy and drops traffic if there is no local pod, but ... it still does the SNAT, which is totally unnecessary.
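
For reference, the externalIPs variant I mean looks roughly like this (a sketch only, reusing the address and selector from the example above); routing of 10.0.0.2 to the nodes then has to be arranged outside of Kubernetes:

apiVersion: v1
kind: Service
metadata:
  name: cgen-external
spec:
  selector:
    app: ekvm
  ports:
  - port: 5001
  # the external address; how it is routed to the nodes is up to the operator
  externalIPs:
  - 10.0.0.2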
