
MetalLB cannot peer with BGP routers that Calico is already peering with #114

Closed
WillieWookiee opened this issue Dec 19, 2017 · 80 comments

@WillieWookiee

Is this a bug report or a feature request?:

Question actually.

What happened:
Can't get MetalLB to peer with my core router.

What you expected to happen:
Peering is expected to happen and for me to see the routes for each node in the routing table.

How to reproduce it (as minimally and precisely as possible):
Setup MetalLB and peer it with a Cisco L3 routing device.

Anything else we need to know?:
I am not sure whether this is something on the Cisco side or the MetalLB side. I also have Calico peering with the same Cisco device, from the same IP address, which could be the problem, but I wanted to verify. I am not sure that it is a bug.

Getting this in the log:
{"log":"E1213 22:09:35.960710 1 bgp.go:48] read OPEN from \"10.1.105.1:179\": message type is not OPEN, got 3, want 1\n","stream":"stderr","time":"2017-12-13T22:09:35.961076973Z"}

Makes me think the connection to calico is hijacking the metallb connection.

Environment:

  • MetalLB version: Not sure how to find this.
  • Kubernetes version: Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5"
  • BGP router type/version: Cisco 4500 Version 03.09.00.E
  • OS (e.g. from /etc/os-release): "16.04.3 LTS (Xenial Xerus)"
  • Kernel (e.g. uname -a): Linux 4.4.0-103-generic
@danderson
Contributor

Thank you very much for the report! As you'll see in this message, this is a rich bug report, containing at least 4 separate bugs/improvements :)

So, the error you're getting is that the peer router is sending MetalLB a BGP NOTIFICATION message ("oops, there's a problem, I'm closing the connection") instead of BGP OPEN ("Hi, I'm a router and here's my capabilities").

Ease of use/debugging bugs

This points to a first bug: MetalLB does not decode the full notification, it aborts parsing as soon as it sees that the message is not the one it expected. This means we throw away the debugging information that the peer is sending us. Filed #115 to fix that.
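To make the missing decoding step concrete, here is a minimal sketch (not MetalLB's actual code) of pulling the error code and subcode out of a NOTIFICATION instead of bailing out on the unexpected message type, following the RFC 4271 wire layout:

```go
package main

import (
	"errors"
	"fmt"
)

// BGP message types, per RFC 4271 §4.1.
const (
	msgOpen         = 1
	msgNotification = 3
)

// decodeNotification parses a raw BGP message and, if it is a
// NOTIFICATION, returns its error code and subcode instead of
// discarding them. Wire layout: 16-byte marker, 2-byte length,
// 1-byte type, then (for NOTIFICATION) 1-byte code, 1-byte subcode.
func decodeNotification(msg []byte) (code, subcode byte, err error) {
	if len(msg) < 21 {
		return 0, 0, errors.New("message too short for NOTIFICATION")
	}
	if msg[18] != msgNotification {
		return 0, 0, fmt.Errorf("message type is not NOTIFICATION, got %d", msg[18])
	}
	return msg[19], msg[20], nil
}

func main() {
	// A NOTIFICATION carrying Cease (6) / Connection Collision
	// Resolution (7) — the code/subcode pair that shows up later
	// in this thread's Cisco debug log.
	msg := make([]byte, 21)
	for i := 0; i < 16; i++ {
		msg[i] = 0xff // all-ones marker
	}
	msg[16], msg[17] = 0, 21 // length
	msg[18] = msgNotification
	msg[19], msg[20] = 6, 7
	code, subcode, err := decodeNotification(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("NOTIFICATION code=%d subcode=%d\n", code, subcode)
}
```

With codes decoded like this, the log line would have named the Cease/collision error directly instead of only reporting an unexpected message type.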

Second, you said that you're not sure how to find the MetalLB version you're using. Good point! If you're running a released version, you can look at the container image version (with e.g. kubectl get po -n metallb-system -l app=controller -oyaml). Filed #116 to have MetalLB log its version on startup, and to update the issue template.

The peering bug

Now, about your actual issue... I think there's probably 2 separate things going on here.

Conflict with Calico

As you said, Calico is probably going to be an issue. AFAIK, BGP only allows 1 session between a pair of IPs. So if Calico is already peering, MetalLB cannot also peer with the router.

I have a couple of potential answers to what we can do about that. Some are policy/documentation (peer calico with your "datacenter fabric" routers, and peer MetalLB with your "datacenter edge" routers), another might be a technical solution (make MetalLB peer with the calico BGP daemon on the node, and inject routes that way). I've filed #117 to investigate that more.

What error is the Cisco router sending?

The failure mode we would see with a Calico conflict doesn't seem to match what you're seeing. BGP has a resolution algorithm for conflicting BGP sessions (it has to, because the peering logic is not race-free, so it's common for routers to temporarily end up with >1 connection to a peer), but that algorithm iirc involves ungracefully closing the BGP session, not sending a NOTIFY. In fact, there doesn't seem to be any notification code that means "sorry, I have another BGP session for you already". So... The notification may be a separate interop issue with Cisco IOS specifically.

To confirm this, we could do one of two things:

  • Are you comfortable with grabbing and sharing packet captures? If so, could you share a capture of tcp port 179 traffic, so that I can examine the BGP traffic and see what the error is? On your k8s nodes, the command for that would be tcpdump -i any -w dump.pcap tcp port 179. This will capture both Calico and MetalLB BGP traffic. Leave the capture running for at least 1min, to make sure it captures some connection attempts. Attach the pcap to this issue, and I'll examine it.
  • Alternatively, later today I can prepare a debugging container image for MetalLB, with better logging for BGP notification messages. You can deploy that test image to your cluster, and grab the debugging data we need directly from MetalLB's logs.

@danderson danderson self-assigned this Dec 19, 2017
@danderson danderson changed the title Message Type error in Logs Peering with Cisco 4500 in a Calico-enabled cluster fails with BGP notification error messages Dec 19, 2017
@WillieWookiee
Author

Dave, thank you very much for the thorough response, it really helps.

Instead of going as far as tcpdumps (which I will be happy to do if needed), I ran the debug on the Cisco side and believe I have confirmed what we suspected: the peers are colliding.

I get the following message:

Dec 19 18:24:51.528: %BGP-3-NOTIFICATION: sent to neighbor 10.1.105.65 passive 6/7 (Connection Collision Resolution) 0 bytes
Dec 19 18:24:51.529: %BGP-5-NBR_RESET: Neighbor 10.1.105.65 passive reset (Peer closed the session)
Dec 19 18:24:51.529: %BGP-5-ADJCHANGE: neighbor 10.1.105.65 passive Down Error during connection collision

A quick google search showed that to be the case:

https://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/5816-bgpfaq-5816.html#anc40

@danderson
Contributor

Aha, perfect! Yes, that debug log confirms our suspicion about session collisions. The reason I was confused is that this code/subcode is defined in RFC 4486, and not the base BGP spec RFC 4271, so I had never seen the BGP Cease errors before.

Okay, so that fully explains what you're seeing, and confirms that Calico and MetalLB are conflicting with each other. Unfortunately, there's no obvious easy fix for that :(. The closest I can offer is something that is very specific to your network architecture: is there another router that you could peer MetalLB with, that is not also a Calico peer?

For example, in my cluster layout, I have top of rack BGP routers, and they connect to a pair of core routers that connect to the rest of the world. In that setup, I would peer Calico with the ToRs (and have them propagate their routes to the core), and I would peer MetalLB with the core routers. The reasoning is that the Calico session are distributing routes internally to the cluster's network, so it should be peering with the first hop outside the machine. OTOH, MetalLB wants to attract traffic from outside the cluster, so it should peer with the "border" of the cluster, i.e. the core routers that connect to the rest of the world.

Another alternative, if your router supports it, might be to define some VRFs and do some hacky cross-VRF route propagation, so that there are 2 logical routers, and Calico and MetalLB are not peering with the "same" router.

But, both of these are just hacky workarounds :(. I will investigate options for Calico interop, and we'll see if we can come up with something better. Worst case, we need to at least document this incompatibility, since it's a pretty big deal... But I'm hoping that I can find a way to make Calico and MetalLB coexist instead.

I'm duping #117 to this issue, and I'll keep using this issue to track the investigation on calico compatibility. Pasting the bug text from #117 below...

MetalLB vs. Calico interop problems

Calico can be configured to peer with BGP routers, so that pod traffic routing between L3 network domains works. However, this puts Calico in conflict with MetalLB, because there can only be one node<>router BGP session, and Calico is consuming it.

We need at the very least some documentation about this:

  • State that MetalLB is not compatible with Calico when you're using Calico's external peering feature
  • Alternatively, provide some architectural guidance on how to peer both: Calico should peer with top-of-rack switches to distribute pod network routing data, and MetalLB should peer with the "datacenter edge" routers that are the interface between the cluster and the world. Requires a bunch of network diagrams and explanations to get across, probably a good candidate for an installation sub-article on the new website.

Separately, we should also investigate a technical solution. Is there a way to make MetalLB piggyback on Calico's BGP sessions? Can we somehow inject routes into the local Calico BGP speaker? If so, we could implement a new "calico" peer type, and teach MetalLB that for this peer type, it should talk to the local Calico daemon and inject its routes that way.

@danderson danderson changed the title Peering with Cisco 4500 in a Calico-enabled cluster fails with BGP notification error messages MetalLB cannot peer with BGP routers that Calico is already peering with Dec 19, 2017
@danderson
Contributor

I will investigate in more detail later today after work, but after a very quick reading of calico's documentation, one potential hack suggests itself...

Calico supports per-node BGP peering configurations. Assuming we can get the configuration to be acceptable in terms of the BGP spec, we could make MetalLB listen for BGP on a static host port (not 179), and create per-node BGP peerings in Calico to peer with MetalLB. Basically, make calico's bgpd on each node peer with localhost:1234, so that MetalLB can inject its routes into Calico that way.

Open questions:

  • BGP peering with localhost is notoriously tricky, because the router can incorrectly believe that it's peering with itself. Calico uses the GoBGP codebase, which should not have this problem... But it needs to be tested.
  • Does the Calico node daemon run the full BGP convergence algorithm? IOW, if we have a peering chain of MetalLB<>Calico<>external router, will Calico propagate the routes from MetalLB to the external router? In theory it should, because Calico advertises itself as "we just make your k8s nodes look like a regular BGP router", so unlike MetalLB it should be implementing the full BGP convergence/redistribution algorithm.
  • Can we make this configuration automagic? Given appropriate RBAC rules, we could give MetalLB the permission to create new calico bgpPeer objects, so some component of MetalLB could automatically reconfigure the cluster to peer with MetalLB. This adds a bunch of complexity, and the "magic" may not be welcome by cluster/network admins who want explicit control over what happens to their network.
  • Alternatively, is it possible to define the MetalLB peering as a "global" Calico BGP peer, with a peer address of 127.0.0.1? Again, in theory, this should just work: global peer will apply to all nodes, and all nodes will just connect to a different MetalLB instance on each machine. That should be fine... But maybe Calico has some sanity checks that prevent this. This would be much nicer from an admin perspective, because we can just tell Calico cluster operators "here's one BGP peer object that you should add to your Calico config, and voila, MetalLB just works!"
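For concreteness, the "global peer at 127.0.0.1" idea from the last bullet would look roughly like this in calicoctl's v3 resource format. This is a sketch of the proposal, not tested config; the resource name and AS number are placeholders:

```yaml
# Hypothetical global Calico BGPPeer pointing every node at the
# MetalLB speaker on localhost. Omitting "node" makes the peer
# apply cluster-wide; the AS number is a placeholder.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: metallb-local
spec:
  peerIP: 127.0.0.1
  asNumber: 64512
```

Whether Calico's validation accepts a loopback peer address is exactly the open question above.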

@WillieWookiee
Author

Thanks again for the detailed analysis. What about having MetalLB containers attain a different IP than the node and just let calico use the node's IP? This might take some configuration options on the cluster, but if the Cisco knew about MetalLB as a different IP, that would solve everything.

Unfortunately, I don't have another internal L3 device that I can peer with. I am already peering Calico with my core routers.

Another hacky method would be to introduce a BGP software router between MetalLB and the core router. Then it would be MetalLB<>Linux-basedBGPPeer<>Cisco core router.

Just some brainstorming here.

@danderson
Contributor

What about having MetalLB containers attain a different IP than the node and just let calico use the node's IP?

Interesting thought! It would be possible, but would require a bunch of completely custom k8s node configuration (adding more IPs to the node). This makes MetalLB much harder to deploy, because suddenly choosing to use MetalLB has implications on how you provision your machines. It's definitely possible, but I'd like to keep it as a last resort, and try to have a solution that's more "zero config".

Unfortunately, I don't have another internal L3 device that I can peer with. I am already peering Calico with my core routers.

Yeah, that's what I figured, and I suspect many Calico users are in your situation. Out of curiosity, what does this peering buy you? Does your core router redistribute these routes to ToRs? I'm trying to get a picture of your network topology, so I understand why you're exposing Calico's pod network to your physical network.

Another hacky method would be to introduce a BGP software router between MetalLB and the core router. Then it would be MetalLB<>Linux-basedBGPPeer<>Cisco core router.

Interesting idea! I know at least one other MetalLB user who does something similar, they run a BIRD instance on each node, and peer it with both MetalLB and their upstream network. They're doing it for different reasons than Calico, but that model already works from MetalLB's perspective.

In a perfect world, I would still like to try and make MetalLB peer directly with Calico, so that we don't have to run yet-another-BGP router on the cluster (more CPU/memory overhead), but it's definitely an option. If direct Calico integration doesn't work out, that's probably the next best thing.

@WillieWookiee
Author

WillieWookiee commented Dec 19, 2017

Out of curiosity, what does this peering buy you? Does your core router redistribute these routes to ToRs?

We are exposing pods so that when we use a database solution that replicates, something like Cockroach DB, all of the instances of Cockroach can communicate directly between each other.
Maybe there is a different way to do this with K8s and I would be open to suggestions, but that kind of gets outside the scope of this issue.

@nyaxt
Contributor

nyaxt commented Jan 5, 2018

We've successfully deployed MetalLB on k8s+calico cluster w/ a hacky workaround.
Instead of letting MetalLB join the iBGP mesh directly, we pointed the MetalLB speakers at a (separate) BGP route reflector, so they avoid having the same IP address pairs in the peering graph.

@WillieWookiee
Author

We've successfully deployed MetalLB on k8s+calico cluster w/ a hacky workaround.
Instead of letting MetalLB join the iBGP directly, we pointed MetalLB speakers to a (separate) BGP route reflector, so that they would avoid having same IP addrs pairs in the peering graph.

Yes, that's what I mentioned above as a possible workaround. Good to know that it does work.

@danderson danderson added this to the v0.3.0 milestone Jan 7, 2018
@danderson
Contributor

Romana is (somewhat) in the same boat: one of the configurations it supports uses BIRD route publishers to peer with the datacenter network and announce the cluster network: https://github.com/romana/romana/wiki/Romana-route-publisher

The addon supports providing custom BIRD config snippets, and I'm told by Romana folks that the route agent is configured such that it should redistribute advertisements just fine if metallb peers and injects routes... So, assuming BIRD is okay with localhost<>localhost peering, we should be golden there too.

@danderson danderson changed the title MetalLB cannot peer with BGP routers that Calico is already peering with MetalLB cannot peer with BGP routers that Calico or Romana are already peering with Jan 12, 2018
@danderson
Contributor

Finally getting around to this...

Configured a Calico peer of 127.0.0.1, and told MetalLB to peer with 127.0.0.1... And it almost works! So far the only objection Calico has is that it's using the same router ID as MetalLB, so it thinks it's talking to itself.

Adding a config option to set MetalLB's router ID, let's see how this goes...

@danderson
Contributor

Making some progress. The peering with Calico is still pretty unstable right now because Calico is actively trying to connect, and therefore sometimes connects to itself and rejects the connection. This triggers error backoff, so it becomes increasingly difficult for metallb to successfully connect. Hopefully Calico has some way to specify custom target BGP ports, which would fix this.

Second problem: when the connection establishes, BIRD marks metallb-originated routes as unreachable, and so doesn't propagate them to other peers. It looks like it's marking the routes unreachable because the next-hop is an IP of a local interface (i.e. the node IP), and BIRD decides that this means the route is unreachable, for some reason...

@danderson
Contributor

Bad news, I think there's no way to make MetalLB cooperate with Calico in the way I imagined.

  1. Calico does not support BGP peering options such as target port number or passive mode. This makes it impossible to avoid the race condition where calico connects to itself and enters error backoff, which means that the peering will be very unreliable. I can't fix this on MetalLB's side, Calico has to implement configuration options for BGP peering.
  2. BIRD marks routes received with a next-hop of a local interface IP as unreachable. This might not be a problem because Calico always sets next-hop-self when exporting, but it might mess up the local routing table. I would have investigated this more, but see the next point...
  3. Calico filters out all routes that it doesn't control when exporting to peers. projectcalico/calicoctl#1138 theoretically addresses this by allowing the specification of custom BGP filters, but then projectcalico/calico#292 ("Document use of custom bird filters") ruins that by saying they don't want to document the feature, and don't want to support it.

The real killer is the 3rd problem, but the 1st also makes peering with Calico really unreliable, sadly. So, there is no technical solution for making MetalLB work well with Calico :(

We can still make things work by implementing what @WillieWookiee suggested, and documenting how cluster operators can create an additional IP for each node, and use that IP for MetalLB peering. We need a small change to support setting the source IP in internal/bgp, but that's feasible at least.

Next up, I'm going to install Romana and see if that works any better.

@danderson
Contributor

Good news! Romana pretty much Just Works. You have to add the route publisher addon, and configure it just right, but when done right, MetalLB will peer with the local route publisher and redistribute routes to the upstream peers.

So, Romana support is just a question of cleaning the configs a bit and documenting how to set it up.

@danderson
Contributor

New documentation for how to run MetalLB and Romana together is at https://master--metallb.netlify.com/configuration/romana/ . It'll go to the live website in the next release.

As far as Calico is concerned... Sadly, right now all I can do is document the mediocre workarounds listed in this bug, we can't do anything as clean as Romana in the current state of the world. I'll file upstream bugs with Calico to document what we need from them, so hopefully we can do something better in the future.

@ryholt

ryholt commented Jul 29, 2020

Well I'll be damned... it works. :)

BGPConfiguration

apiVersion: v1
items:
- apiVersion: crd.projectcalico.org/v1
  kind: BGPConfiguration
  metadata:
    annotations:
      projectcalico.org/metadata: '{"uid":"985f849b-fc4e-4514-b9d6-4acfb0e6d6d3","creationTimestamp":"2020-07-29T14:00:36Z"}'
    creationTimestamp: "2020-07-29T14:00:37Z"
    generation: 1
    name: default
    resourceVersion: "2720013"
    selfLink: /apis/crd.projectcalico.org/v1/bgpconfigurations/default
    uid: 654e14bf-7520-4a94-8862-b2a61d36c76b
  spec:
    serviceClusterIPs:
    - cidr: 10.43.0.0/16
    serviceExternalIPs:
    - cidr: 10.45.0.0/16
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Service config:

apiVersion: v1
kind: Service
metadata:
  annotations:
    helm.fluxcd.io/antecedent: default:helmrelease/plex
    metallb.universe.tf/allow-shared-ip: plex
  creationTimestamp: "2020-07-24T11:10:01Z"
  labels:
    app.kubernetes.io/instance: plex
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: plex
    app.kubernetes.io/version: 1.19.1.2645-ccb6eb67e
    helm.sh/chart: plex-1.6.1
  name: plex-tcp
  namespace: default
  resourceVersion: "2749251"
  selfLink: /api/v1/namespaces/default/services/plex-tcp
  uid: 187f8b59-cab4-4151-a7d0-df4a92733126
spec:
  clusterIP: 10.43.129.214
  externalIPs:
  - 10.45.100.100
  externalTrafficPolicy: Local
  healthCheckNodePort: 31466
  ports:
  - name: pms
    nodePort: 30499
    port: 32400
    protocol: TCP
    targetPort: pms
  - name: http
    nodePort: 32708
    port: 80
    protocol: TCP
    targetPort: pms
  - name: https
    nodePort: 32287
    port: 443
    protocol: TCP
    targetPort: pms
  - name: plex-dlna
    nodePort: 31848
    port: 1900
    protocol: TCP
    targetPort: plex-dlna
  selector:
    app.kubernetes.io/instance: plex
    app.kubernetes.io/name: plex
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 10.20.50.155

After initial deployment, routes updated as expected on router:

ryan@vyos:~$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         PROVIDER    0.0.0.0         UG        0 0          0 bond0.4000
10.9.18.0       0.0.0.0         255.255.255.0   U         0 0          0 bond0.1
10.20.0.0       0.0.0.0         255.255.0.0     U         0 0          0 bond0.20
10.30.0.0       0.0.0.0         255.255.0.0     U         0 0          0 bond0.30
10.40.0.0       0.0.0.0         255.255.255.0   U         0 0          0 wg1
10.43.0.0       10.20.10.10     255.255.0.0     UG        0 0          0 bond0.20
10.43.21.128    10.20.10.10     255.255.255.192 UG        0 0          0 bond0.20
10.43.67.128    10.20.10.15     255.255.255.192 UG        0 0          0 bond0.20
10.43.74.128    10.20.10.17     255.255.255.192 UG        0 0          0 bond0.20
10.43.94.192    10.20.10.11     255.255.255.192 UG        0 0          0 bond0.20
10.43.106.163   10.20.10.10     255.255.255.255 UGH       0 0          0 bond0.20
10.43.117.90    10.20.10.12     255.255.255.255 UGH       0 0          0 bond0.20
10.43.129.214   10.20.10.10     255.255.255.255 UGH       0 0          0 bond0.20
10.43.144.128   10.20.10.12     255.255.255.192 UG        0 0          0 bond0.20
10.43.192.64    10.20.10.16     255.255.255.192 UG        0 0          0 bond0.20
10.45.0.0       10.20.10.10     255.255.0.0     UG        0 0          0 bond0.20
10.45.100.100   10.20.10.10     255.255.255.255 UGH       0 0          0 bond0.20

Cordoned the node, deleted the pod, plex starts up on other node, routes propagated as expected in router:

ryan@vyos:~$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         PROVIDER    0.0.0.0         UG        0 0          0 bond0.4000
10.9.18.0       0.0.0.0         255.255.255.0   U         0 0          0 bond0.1
10.20.0.0       0.0.0.0         255.255.0.0     U         0 0          0 bond0.20
10.30.0.0       0.0.0.0         255.255.0.0     U         0 0          0 bond0.30
10.40.0.0       0.0.0.0         255.255.255.0   U         0 0          0 wg1
10.43.0.0       10.20.10.10     255.255.0.0     UG        0 0          0 bond0.20
10.43.21.128    10.20.10.10     255.255.255.192 UG        0 0          0 bond0.20
10.43.67.128    10.20.10.15     255.255.255.192 UG        0 0          0 bond0.20
10.43.74.128    10.20.10.17     255.255.255.192 UG        0 0          0 bond0.20
10.43.94.192    10.20.10.11     255.255.255.192 UG        0 0          0 bond0.20
10.43.106.163   10.20.10.15     255.255.255.255 UGH       0 0          0 bond0.20
10.43.117.90    10.20.10.12     255.255.255.255 UGH       0 0          0 bond0.20
10.43.129.214   10.20.10.15     255.255.255.255 UGH       0 0          0 bond0.20
10.43.144.128   10.20.10.12     255.255.255.192 UG        0 0          0 bond0.20
10.43.192.64    10.20.10.16     255.255.255.192 UG        0 0          0 bond0.20
10.45.0.0       10.20.10.10     255.255.0.0     UG        0 0          0 bond0.20
10.45.100.100   10.20.10.15     255.255.255.255 UGH       0 0          0 bond0.20

It appears I'll need to flip from LoadBalancerIP to NodePort per requirements here:

Services must be configured with the correct service type (“Cluster” or “Local”) for your implementation. For externalTrafficPolicy: Local, the service must be type LoadBalancer or NodePort.

@gclawes
Contributor

gclawes commented Jul 29, 2020

You had to manually specify the externalIP when you configured the service? Will it work if the metallb controller assigns an IP?

@ryholt

ryholt commented Jul 29, 2020

Manually in the sense that I needed to specify the IP in the helm deployment and it didn’t come out of a pool.

Leaving MetalLB in the mix wouldn't solve the issue with BGP, and it may have asymmetric routing problems.

@Elegant996

Works well as long as you don't need externalTrafficPolicy: Local, per your notes. Just wanted to reiterate that. Thanks!

@carpenike

Ran into an SNAT problem that requires a feature gate config for Kube proxy in 1.18. Opened a ticket for k3s -- k3s-io/k3s#2090

@Elegant996

Also up for Cherry Pick in 1.17, see kubernetes/kubernetes#90536. This would seemingly remove the need for a bare metal LB or am I missing something?

Do TCP and UDP still function correctly if you assign the same IP to two services? MetalLB offers this through metallb.universe.tf/allow-shared-ip: "true" at present, but this would seemingly remove LBs.

@champtar
Contributor

champtar commented Aug 2, 2020

Also up for Cherry Pick in 1.17, see kubernetes/kubernetes#90536. This would seemingly remove the need for a bare metal LB or am I missing something?

ExternalIPs configure the K8S dataplane, but you still need something to tell the network to send the packets to the node (ARP, BGP, OSPF, ...)

@carpenike

Seems to remove the need for a LB altogether. I haven't had any issue with tcp/udp port sharing.

@Elegant996

Not quite, I found out that the LB should be used for service health checks as External IPs do not perform them. Instead it just throws you at one of the endpoints without checking.

That being said, when I read this over I realized something. The point of the MetalLB controller is to create the LoadBalancer IP. So if we can route to the service through an External IP, then the controller can route to it as well making it accessible through the LoadBalancer IP.

Give this a go, install MetalLB but completely remove the speaker daemonset. You should still be able to route your ingress controller on the LoadBalancer IP without issue.

Forget BGP with MetalLB; let Calico handle it and just use MetalLB for your LoadBalancer IPs :) This completely removes the need for peering and accomplishes the same goals. Perhaps this is sufficient to close the issue?

@ryholt

ryholt commented Aug 5, 2020

I think per #114 (comment) above, the ExternalIP and LoadBalancerIP are not 100% the same, and there's the concern that the BGP configuration won't actually pick up the LoadBalancerIP assigned by MetalLB.

@adamdunstan

@Elegant996 I don't have Calico running, and this behavior depends on kube-proxy: are you using kube-proxy from Calico, iptables, or ipvs? It's correct that MetalLB just "attracts" traffic. The controller does nothing more than allocate IP addresses; if you run it without a speaker, the external IP address will still get allocated. In addition to the speaker watching for it, kube-proxy is too. In iptables mode a filter is added to pre-routing so traffic for that address can be forwarded; in ipvs mode the address is added to the ipvs interface.

I'm not sure how Calico populates its routing table, but I would guess it reads the interfaces and imports the kernel routes; if so, they will be advertised by the Calico router (which I think is BIRD). So the behavior is more of a side effect than intended. It should not work with kube-proxy in iptables mode, but can work with either ipvs or, I guess, Calico's kube-proxy...

@Elegant996

Elegant996 commented Aug 5, 2020

@adamdunstan Using kube-proxy in ipvs mode. Is there any downside to this method? Seems like a good solution otherwise.

@adamdunstan

@Elegant996 As I guessed. With the caveat that I haven't looked at how Calico configures BIRD, I assume it imports the interfaces from ipvs0 and advertises those routes. If I'm correct, you will be getting routes to all of the addresses that ipvs has attached. This may be what you want, but it will include endpoints and the kube API. You may want to modify the BIRD configuration (I think it's in a ConfigMap) to filter only the external addresses, so it advertises just those.

Make sure that ipvs is configured with the strict_arp flag. It's a bit of a confusing name; it just means that the ipvs interface should not answer ARP requests, otherwise every node will answer for those addresses locally. That doesn't really matter for routed destinations, but could cause you some confusion later on. Hope I've been helpful!
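For reference, strict ARP lives in the kube-proxy configuration; a minimal sketch of the relevant fragment, assuming the standard kube-proxy ConfigMap format:

```yaml
# kube-proxy configuration fragment: ipvs mode with strict ARP,
# so the ipvs dummy interface does not answer ARP for service IPs.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  strictARP: true
```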

@Elegant996

@adamdunstan Sounds good, FWIW the original thought actually came from this comment here regarding kube-router and MetalLB #160 (comment).

@fasaxc

fasaxc commented Sep 9, 2020

As of Calico v3.16, Calico now has support for setting the BGP port that we use. That might help if you need to peer twice to the same ToR from one node. https://docs.projectcalico.org/reference/resources/bgpconfig#spec
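Based on the linked BGPConfiguration spec, moving Calico off the default BGP port would look roughly like this. A sketch only, not tested here; the port value is an arbitrary example:

```yaml
# Sketch: default Calico BGPConfiguration using a non-standard
# BGP port, leaving 179 free for another speaker on the node.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  listenPort: 178
```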

@johananl
Member

Thanks for the heads up @fasaxc! There are upcoming changes to MetalLB which would allow users to specify the source IP MetalLB uses for BGP sessions. The destination port for BGP is already configurable.
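For the destination port, MetalLB's peer configuration already accepts it; a minimal sketch, with placeholder addresses and AS numbers:

```yaml
# Sketch of a MetalLB ConfigMap peering on a non-default BGP port.
# Addresses and AS numbers are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.0.0.1
      peer-asn: 64512
      my-asn: 64512
      peer-port: 178
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 192.0.2.0/24
```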

@salanki

salanki commented Nov 3, 2020

@caseydavenport

Another update here - in Calico v3.18, Calico will be capable of advertising LoadBalancer IPs allocated by the MetalLB controller without installing Speaker. projectcalico/confd#422

@ivan046

ivan046 commented Jan 5, 2021

Provide some setup guide for this please!

@gautvenk

gautvenk commented Mar 12, 2021

Another update here - in Calico v3.18, Calico will be capable of advertising LoadBalancer IPs allocated by the MetalLB controller without installing Speaker. projectcalico/confd#422

I was able to try Calico v3.18 (with BGP peering to a ToR switch) along with MetalLB v0.9.5 manifests. I deleted the speaker daemon set, and confirmed that external LB IPs were advertised via BIRD to the peer. Even though I don't use the speaker component, I had to specify "protocol" as layer2 or bgp in the config-map to ensure controller allocates IPs for LB. If the protocol requirement can be removed in config-map (provided we don't use speakers), it will look neat. Please let me know if I can file an issue for this.
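To illustrate the controller-only setup described above (no speaker deployed, but protocol still required by validation), the ConfigMap looks roughly like this; the pool here reuses the serviceExternalIPs range from earlier in the thread:

```yaml
# Controller-only MetalLB config sketch: "bgp" here only satisfies
# the controller's validation; no speaker runs, Calico advertises
# the LoadBalancer IPs.
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.45.0.0/16
```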

Thanks @caseydavenport and @salanki for much awaited projectcalico/confd#422.

@salanki

salanki commented Mar 12, 2021

Very happy to hear this @gautvenk. Thank you @caseydavenport for pushing this over the line.

@russellb
Contributor

Another update here - in Calico v3.18, Calico will be capable of advertising LoadBalancer IPs allocated by the MetalLB controller without installing Speaker. projectcalico/confd#422

Thanks! Based on this, I think we can close out this old issue. It seems ideal to me that the existing BGP daemon handles the advertisements, rather than trying to colocate the current MetalLB BGP speaker.

I was able to try Calico v3.18 (with BGP peering to a ToR switch) along with MetalLB v0.9.5 manifests. I deleted the speaker daemon set, and confirmed that external LB IPs were advertised via BIRD to the peer. Even though I don't use the speaker component, I had to specify "protocol" as layer2 or bgp in the config-map to ensure controller allocates IPs for LB. If the protocol requirement can be removed in config-map (provided we don't use speakers), it will look neat. Please let me know if I can file an issue for this.

Thanks @caseydavenport and @salanki for much awaited projectcalico/confd#422.

Please file a feature request issue for this (or a PR is even better!). Maybe we could support protocol: none in the config to be explicit that no speaker will be running, and we just want the controller for IPAM.
