MetalLB cannot peer with BGP routers that Calico is already peering with #114
Thank you very much for the report! As you'll see in this message, this is a rich bug report, containing at least 4 separate bugs/improvements :) So, the error you're getting is that the peer router is sending MetalLB a BGP NOTIFICATION message ("oops, there's a problem, I'm closing the connection") instead of BGP OPEN ("Hi, I'm a router and here's my capabilities").
Ease of use/debugging bugs
This points to a first bug: MetalLB does not decode the full notification, it aborts parsing as soon as it sees that the message is not the one it expected. This means we throw away the debugging information that the peer is sending us. Filed #115 to fix that. Second, you said that you're not sure how to find the MetalLB version you're using. Good point! If you're running a released version, you can look at the container image version (with e.g.
The peering bug
Now, about your actual issue... I think there's probably 2 separate things going on here.
Conflict with Calico
As you said, Calico is probably going to be an issue. AFAIK, BGP only allows 1 session between a pair of IPs. So if Calico is already peering, MetalLB cannot also peer with the router. I have a couple of potential answers to what we can do about that. Some are policy/documentation (peer calico with your "datacenter fabric" routers, and peer MetalLB with your "datacenter edge" routers), another might be a technical solution (make MetalLB peer with the calico BGP daemon on the node, and inject routes that way). I've filed #117 to investigate that more.
What error is the Cisco router sending?
The failure mode we would see with a Calico conflict doesn't seem to match what you're seeing. BGP has a resolution algorithm for conflicting BGP sessions (it has to, because the peering logic is not race-free, so it's common for routers to temporarily end up with >1 connection to a peer), but that algorithm iirc involves ungracefully closing the BGP session, not sending a NOTIFY. 
In fact, there doesn't seem to be any notification code that means "sorry, I have another BGP session for you already". So... The notification may be a separate interop issue with Cisco IOS specifically. To confirm this, we could do one of two things:
|
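To make the "decode the full notification" point concrete, here is a minimal Python sketch of reading one BGP message and, when it is a NOTIFICATION, keeping the error code, subcode and data instead of discarding them. It follows the message layout in RFC 4271 (16-byte marker, 2-byte length, 1-byte type). The function names `parse_bgp_message` and `build_notification` are made up for this illustration; this is not MetalLB's code.

```python
import struct

BGP_MARKER = b"\xff" * 16  # RFC 4271: marker field is all ones
MSG_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}
HEADER_LEN = 19  # 16-byte marker + 2-byte length + 1-byte type


def parse_bgp_message(raw: bytes) -> dict:
    """Parse a single BGP message; decode the body if it is a NOTIFICATION."""
    if len(raw) < HEADER_LEN:
        raise ValueError("message shorter than the 19-byte BGP header")
    marker, length, msg_type = struct.unpack("!16sHB", raw[:HEADER_LEN])
    if marker != BGP_MARKER:
        raise ValueError("bad BGP marker")
    if length < HEADER_LEN or length > len(raw):
        raise ValueError("length field does not match the data received")

    body = raw[HEADER_LEN:length]
    msg = {"type": MSG_TYPES.get(msg_type, f"UNKNOWN({msg_type})"), "length": length}
    if msg_type == 3:
        # NOTIFICATION body: error code, error subcode, then optional data.
        if len(body) < 2:
            raise ValueError("NOTIFICATION body too short")
        msg["error_code"] = body[0]
        msg["error_subcode"] = body[1]
        msg["data"] = body[2:]
    return msg


def build_notification(code: int, subcode: int, data: bytes = b"") -> bytes:
    """Hypothetical helper that builds a NOTIFICATION, used only to exercise the parser."""
    body = bytes([code, subcode]) + data
    return BGP_MARKER + struct.pack("!HB", HEADER_LEN + len(body), 3) + body


if __name__ == "__main__":
    print(parse_bgp_message(build_notification(6, 7)))
```

A speaker that expects OPEN but receives type 3 could log `error_code` and `error_subcode` from a parse like this before closing the connection, which is the debugging information the report was missing.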
Dave, thank you very much for the thorough response, it really helps. Instead of going as far as tcpdumps (which I will be happy to do if needed) I did the debug on the Cisco side and believe I have confirmed what we suspected that the peers were colliding. I get the following message: Dec 19 18:24:51.528: %BGP-3-NOTIFICATION: sent to neighbor 10.1.105.65 passive 6/7 (Connection Collision Resolution) 0 bytes A quick google search showed that to be the case: |
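The `6/7` in the Cisco log line above is an error code / subcode pair. As a rough sketch, the lookup below maps the pair to names using the error codes from RFC 4271 and the Cease subcodes from RFC 4486. The regular expression only targets the log format quoted in this thread and is an assumption about how such lines look in general; `parse_cisco_notification` and `describe_notification` are names invented for the example.

```python
import re

# BGP NOTIFICATION error codes (RFC 4271).
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

# Subcodes for the Cease error (RFC 4486).
CEASE_SUBCODES = {
    1: "Maximum Number of Prefixes Reached",
    2: "Administrative Shutdown",
    3: "Peer De-configured",
    4: "Administrative Reset",
    5: "Connection Rejected",
    6: "Other Configuration Change",
    7: "Connection Collision Resolution",
    8: "Out of Resources",
}


def describe_notification(code: int, subcode: int) -> str:
    """Return a human-readable name for a NOTIFICATION code/subcode pair."""
    name = ERROR_CODES.get(code, f"Unknown error {code}")
    if code == 6 and subcode in CEASE_SUBCODES:
        return f"{name}: {CEASE_SUBCODES[subcode]}"
    return f"{name} (subcode {subcode})"


def parse_cisco_notification(line: str):
    """Extract (neighbor, code, subcode) from a log line shaped like the one quoted above."""
    m = re.search(r"neighbor (\S+)\s.*?(\d+)/(\d+)\s\(", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


if __name__ == "__main__":
    log = ("Dec 19 18:24:51.528: %BGP-3-NOTIFICATION: sent to neighbor "
           "10.1.105.65 passive 6/7 (Connection Collision Resolution) 0 bytes")
    neighbor, code, subcode = parse_cisco_notification(log)
    print(neighbor, describe_notification(code, subcode))
```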
Aha, perfect! Yes, that debug log confirms our suspicion about session collisions. The reason I was confused is that this code/subcode is defined in RFC 4486, and not the base BGP spec RFC 4271, so I had never seen the BGP Cease errors before. Okay, so that fully explains what you're seeing, and confirms that Calico and MetalLB are conflicting with each other. Unfortunately, there's no obvious easy fix for that :(. The closest I can offer is something that is very specific to your network architecture: is there another router that you could peer MetalLB with, that is not also a Calico peer? For example, in my cluster layout, I have top of rack BGP routers, and they connect to a pair of core routers that connect to the rest of the world. In that setup, I would peer Calico with the ToRs (and have them propagate their routes to the core), and I would peer MetalLB with the core routers. The reasoning is that the Calico sessions are distributing routes internally to the cluster's network, so Calico should be peering with the first hop outside the machine. OTOH, MetalLB wants to attract traffic from outside the cluster, so it should peer with the "border" of the cluster, i.e. the core routers that connect to the rest of the world. Another alternative, if your router supports it, might be to define some VRFs and do some hacky cross-VRF route propagation, so that there are 2 logical routers, and Calico and MetalLB are not peering with the "same" router. But, both of these are just hacky workarounds :(. I will investigate options for Calico interop, and we'll see if we can come up with something better. Worst case, we need to at least document this incompatibility, since it's a pretty big deal... But I'm hoping that I can find a way to make Calico and MetalLB coexist instead. I'm duping #117 to this issue, and I'll keep using this issue to track the investigation on calico compatibility. Pasting the bug text from #117 below...
MetalLB vs. Calico interop problems
Calico can be configured to peer with BGP routers, so that pod traffic routing between L3 network domains works. However, this puts Calico in conflict with MetalLB, because there can only be one node<>router BGP session, and Calico is consuming it. We need at the very least some documentation about this:
Separately, we should also investigate a technical solution. Is there a way to make MetalLB piggyback on Calico's BGP sessions? Can we somehow inject routes into the local Calico BGP speaker? If so, we could implement a new "calico" peer type, and teach MetalLB that for this peer type, it should talk to the local Calico daemon and inject its routes that way. |
I will investigate in more detail later today after work, but after a very quick reading of calico's documentation, one potential hack suggests itself... Calico supports per-node BGP peering configurations. Assuming we can get the configuration to be acceptable in terms of the BGP spec, we could make MetalLB listen for BGP on a static host port (not 179), and create per-node BGP peerings in Calico to peer with MetalLB. Basically, make calico's bgpd on each node peer with localhost:1234, so that MetalLB can inject its routes into Calico that way. Open questions:
|
Thanks again for the detailed analysis. What about having MetalLB containers attain a different IP than the node and just let calico use the node's IP? This might take some configuration options on the cluster, but if the Cisco knew about MetalLB as a different IP, that would solve everything. Unfortunately, I don't have another internal L3 device that I can peer with. I am already peering Calico with my core routers. Another hacky method would be to introduce a BGP software router between MetalLB and the core router. Then it would be MetalLB<>Linux-basedBGPPeer<>Cisco core router. Just some brainstorming here. |
Interesting thought! It would be possible, but would require a bunch of completely custom k8s node configuration (adding more IPs to the node). This makes MetalLB much harder to deploy, because suddenly choosing to use MetalLB has implications on how you provision your machines. It's definitely possible, but I'd like to keep it as a last resort, and try to have a solution that's more "zero config".
Yeah, that's what I figured, and I suspect many Calico users are in your situation. Out of curiosity, what does this peering buy you? Does your core router redistribute these routes to ToRs? I'm trying to get a picture of your network topology, so I understand why you're exposing Calico's pod network to your physical network.
Interesting idea! I know at least one other MetalLB user who does something similar, they run a BIRD instance on each node, and peer it with both MetalLB and their upstream network. They're doing it for different reasons than Calico, but that model already works from MetalLB's perspective. In a perfect world, I would still like to try and make MetalLB peer directly with Calico, so that we don't have to run yet-another-BGP router on the cluster (more CPU/memory overhead), but it's definitely an option. If direct Calico integration doesn't work out, that's probably the next best thing. |
We are exposing pods so that when we use a database solution that replicates, something like Cockroach DB, all of the instances of Cockroach can communicate directly between each other. |
We've successfully deployed MetalLB on k8s+calico cluster w/ a hacky workaround. |
Yes, that's what I mentioned above as a possible workaround. Good to know that it does work. |
Romana is (somewhat) in the same boat: one of the configurations it supports uses BIRD route publishers to peer with the datacenter network and announce the cluster network: https://github.com/romana/romana/wiki/Romana-route-publisher The addon supports providing custom BIRD config snippets, and I'm told by Romana folks that the route agent is configured such that it should redistribute advertisements just fine if metallb peers and injects routes... So, assuming BIRD is okay with localhost<>localhost peering, we should be golden there too. |
Finally getting around to this... Configured a Calico peer of 127.0.0.1, and told MetalLB to peer with 127.0.0.1... And it almost works! So far the only objection Calico has is that it's using the same router ID as MetalLB, so it thinks it's talking to itself. Adding a config option to set MetalLB's router ID, let's see how this goes... |
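The router ID clash described above can be illustrated with a small Python sketch. A BGP OPEN message carries the sender's BGP Identifier (router ID) in its fixed 10-byte prefix (version, AS, hold time, identifier, optional parameter length, per RFC 4271), and a speaker that sees its own identifier in a peer's OPEN has reason to conclude it is talking to itself. The helper names, AS number and addresses below are hypothetical; this is neither BIRD's nor MetalLB's implementation.

```python
import ipaddress
import struct

OPEN_FIXED_FMT = "!BHH4sB"  # version, my AS, hold time, BGP identifier, opt param len
OPEN_FIXED_LEN = struct.calcsize(OPEN_FIXED_FMT)  # 10 bytes


def build_open_body(asn: int, hold_time: int, router_id: str) -> bytes:
    """Build the fixed part of an OPEN body with no optional parameters."""
    rid = ipaddress.IPv4Address(router_id).packed
    return struct.pack(OPEN_FIXED_FMT, 4, asn, hold_time, rid, 0)


def peer_router_id(open_body: bytes) -> str:
    """Extract the peer's BGP Identifier from an OPEN body as a dotted quad."""
    if len(open_body) < OPEN_FIXED_LEN:
        raise ValueError("OPEN body too short")
    _version, _asn, _hold, rid, _optlen = struct.unpack(
        OPEN_FIXED_FMT, open_body[:OPEN_FIXED_LEN]
    )
    return str(ipaddress.IPv4Address(rid))


def is_self_peering(local_router_id: str, open_body: bytes) -> bool:
    """True when the peer announces the same router ID as the local speaker."""
    return peer_router_id(open_body) == local_router_id


if __name__ == "__main__":
    # Both daemons on a node defaulting to the node IP as router ID (assumed values).
    body = build_open_body(64512, 90, "10.1.105.65")
    print(is_self_peering("10.1.105.65", body))  # same ID: looks like a loop
    print(is_self_peering("10.1.105.66", body))  # distinct ID: fine
```

This is why a configurable router ID helps: two speakers sharing one node can then present different identifiers to each other.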
Making some progress. The peering with Calico is still pretty unstable right now because Calico is actively trying to connect, and therefore sometimes connects to itself and rejects the connection. This triggers error backoff, so it becomes increasingly difficult for metallb to successfully connect. Hopefully Calico has some way to specify custom target BGP ports, which would fix this. Second problem: when the connection establishes, BIRD marks metallb-originated routes as unreachable, and so doesn't propagate them to other peers. It looks like it's marking the routes unreachable because the next-hop is an IP of a local interface (i.e. the node IP), and BIRD decides that this means the route is unreachable, for some reason... |
Bad news, I think there's no way to make MetalLB cooperate with Calico in the way I imagined.
The real killer is the 3rd problem, but the 1st also makes peering with Calico really unreliable, sadly. So, there is no technical solution for making MetalLB work well with Calico :( We can still make things work by implementing what @WillieWookiee suggested, and documenting how cluster operators can create an additional IP for each node, and use that IP for MetalLB peering. We need a small change to support setting the source IP in internal/bgp, but that's feasible at least. Next up, I'm going to install Romana and see if that works any better. |
Good news! Romana pretty much Just Works. You have to add the route publisher addon, and configure it just right, but when done right, MetalLB will peer with the local route publisher and redistribute routes to the upstream peers. So, Romana support is just a question of cleaning the configs a bit and documenting how to set it up. |
New documentation for how to run MetalLB and Romana together is at https://master--metallb.netlify.com/configuration/romana/ . It'll go to the live website in the next release. As far as Calico is concerned... Sadly, right now all I can do is document the mediocre workarounds listed in this bug; we can't do anything as clean as Romana in the current state of the world. I'll file upstream bugs with Calico to document what we need from them, so hopefully we can do something better in the future. |
Well I'll be damned... it works. :) BGPConfiguration
Service config:
After initial deployment, routes updated as expected on router:
Cordoned the node, deleted the pod, plex starts up on other node, routes propagated as expected in router:
It appears I'll need to flip from LoadBalancerIP to NodePort per requirements here:
|
You had to manually specify the externalIP when you configured the service? Will it work if the metallb controller assigns an IP? |
Manually in the sense that I needed to specify the IP in the helm deployment and it didn’t come out of a pool. Leaving metallb in the mix wouldn’t solve the issue with BGP, and it may have asynchronous routing problems. |
Works well so long as you don't need |
Ran into an SNAT problem that requires a feature gate config for Kube proxy in 1.18. Opened a ticket for k3s -- k3s-io/k3s#2090 |
Also up for Cherry Pick in 1.17, see kubernetes/kubernetes#90536. This would seemingly remove the need for a bare metal LB or am I missing something? Does TCP and UDP still function correctly if you assign the same IP to two services? MetalLB offers this through |
ExternalIPs configure the K8S dataplane, but you still need something to tell the network to send the packets to the node (ARP, BGP, OSPF, ...) |
Seems to remove the need for a LB altogether. I haven't had any issue with tcp/udp port sharing. |
Not quite. I found out that the LB should be used for service health checks, as External IPs do not perform them; instead, traffic is just sent to one of the endpoints without checking. That being said, when I read this over I realized something. The point of the MetalLB controller is to create the LoadBalancer IP. So if we can route to the service through an External IP, then the controller can route to it as well, making it accessible through the LoadBalancer IP. Give this a go: install MetalLB but completely remove the speaker daemonset. You should still be able to route your ingress controller on the LoadBalancer IP without issue. Forget BGP with MetalLB; let Calico handle it and just use MetalLB for your LoadBalancer IP :) This completely removes the need for peering and accomplishes the same goals. Perhaps this is sufficient to close the issue? |
I think per #114 (comment) above, the ExternalIP and LoadBalancerIP are not 100% the same and there's the concern that the BGP configuration won't actually pick up the LoadBalancerIP assigned by metallb. |
@Elegant996 I don't have Calico running. This behavior depends on kube-proxy: are you using kube-proxy from Calico, iptables or ipvs? It's correct that MetalLB just "attracts" traffic. The controller does nothing more than allocate IP addresses; if you run it without a speaker, the external IP address will still get allocated. The speaker looks for that address, and so does kube-proxy. In iptables mode a filter is added to pre-routing so traffic from that address can be forwarded; in ipvs mode it's added to the ipvs interface. Not sure how Calico is adding to its routing table, but I would guess that it's reading the interfaces and importing the kernel routes; if so, they will be advertised by the Calico router (which is BIRD, I think). So the behavior is more of a side effect than intended. It should not work with kube-proxy in iptables mode, but can work in either ipvs or, I guess, Calico's kube-proxy... |
@adamdunstan Using kube-proxy in ipvs mode. Is there any downside to this method? Seems like a good solution otherwise. |
@Elegant996 As I guessed. With the caveat that I haven't looked at how Calico is configuring BIRD, I assume that it's importing the interfaces from ipvs0 and advertising those routes. If I am correct, you will be getting routes to all of the addresses that IPVS has attached. This may be what you want, but it will include endpoints and kubeapi. You may want to modify the BIRD configuration (I think it's in a configmap) to filter only the external addresses, so that it advertises only those. Make sure that ipvs is configured with the strict_arp flag. It's a bit of a confusing name; it just means that the IPVS interface should not answer ARP requests. Otherwise every node will answer for those addresses locally, which doesn't really matter for routed destinations but could cause you some confusion later on. Hope I have been helpful. |
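The filtering idea above ("advertise only the external addresses") boils down to a prefix-membership test. The Python sketch below shows that logic only; it is not BIRD configuration, and the address list and pool CIDR are made-up values standing in for whatever IPVS has attached and whatever range the MetalLB controller allocates from.

```python
import ipaddress


def filter_external(addresses, pools):
    """Keep only the addresses that fall inside one of the load-balancer pools."""
    networks = [ipaddress.ip_network(p) for p in pools]
    return [
        addr
        for addr in addresses
        if any(ipaddress.ip_address(addr) in net for net in networks)
    ]


if __name__ == "__main__":
    # Hypothetical addresses bound to the IPVS interface: cluster IPs plus one LB IP.
    attached = ["10.96.0.1", "10.96.0.10", "192.168.10.5"]
    # Hypothetical pool that the MetalLB controller allocates from.
    lb_pools = ["192.168.10.0/24"]
    print(filter_external(attached, lb_pools))
```

An export filter in the routing daemon would apply the same test per route, so that cluster-internal service addresses stay off the upstream peers.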
@adamdunstan Sounds good, FWIW the original thought actually came from this comment here regarding kube-router and MetalLB #160 (comment). |
As of Calico v3.16, Calico now has support for setting the BGP port that we use. That might help if you need to peer twice to the same ToR from one node. https://docs.projectcalico.org/reference/resources/bgpconfig#spec |
Thanks for the heads up @fasaxc! There are upcoming changes to MetalLB which would allow users to specify the source IP MetalLB uses for BGP sessions. The destination port for BGP is already configurable. |
Another update here - in Calico v3.18, Calico will be capable of advertising LoadBalancer IPs allocated by the MetalLB controller without installing Speaker. projectcalico/confd#422 |
Provide some setup guide for this please! |
I was able to try Calico v3.18 (with BGP peering to a ToR switch) along with MetalLB v0.9.5 manifests. I deleted the speaker daemon set, and confirmed that external LB IPs were advertised via BIRD to the peer. Even though I don't use the speaker component, I had to specify "protocol" as layer2 or bgp in the config-map to ensure the controller allocates IPs for LB. If the protocol requirement could be removed from the config-map (when speakers are not used), it would look neater. Please let me know if I can file an issue for this. Thanks @caseydavenport and @salanki for the much-awaited projectcalico/confd#422. |
Very happy to hear this @gautvenk. Thank you @caseydavenport for pushing this over the line. |
Thanks! Based on this, I think we can close out this old issue. It seems ideal to me that the existing BGP daemon would handle advertisements vs trying to colocate the current MetalLB bgp speaker.
Please file a feature request issue for this (or a PR is even better!). Maybe we could support
Is this a bug report or a feature request?:
Question actually.
What happened:
Can't get MetalLB to peer with my core router.
What you expected to happen:
Peering is expected to happen and for me to see the routes for each node in the routing table.
How to reproduce it (as minimally and precisely as possible):
Setup MetalLB and peer it with a Cisco L3 routing device.
Anything else we need to know?:
I am not sure if this is something related to the Cisco side or the MetalLB side. I also have calico peering with the same Cisco device with the same IP address and that could be the problem, but I wanted to verify. I am not sure that it is a bug.
Getting this in the log:
{"log":"E1213 22:09:35.960710 1 bgp.go:48] read OPEN from "10.1.105.1:179": message type is not OPEN, got 3, want 1\n","stream":"stderr","time":"2017-12-13T22:09:35.961076973Z"}
Makes me think the connection to calico is hijacking the metallb connection.
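For readers decoding the log line above: the numbers in "got 3, want 1" are BGP message type codes from RFC 4271. The short Python sketch below translates them. The regular expression is written against the exact wording of the error in this report and may not match other MetalLB versions; `explain_open_error` is a name invented for the example.

```python
import re

# BGP message type codes (RFC 4271).
MSG_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def explain_open_error(message: str):
    """Translate 'message type is not OPEN, got X, want Y' into type names."""
    m = re.search(r"message type is not OPEN, got (\d+), want (\d+)", message)
    if m is None:
        return None
    got, want = int(m.group(1)), int(m.group(2))
    got_name = MSG_TYPES.get(got, f"UNKNOWN({got})")
    want_name = MSG_TYPES.get(want, f"UNKNOWN({want})")
    return f"peer sent {got_name} where {want_name} was expected"


if __name__ == "__main__":
    log = ('E1213 22:09:35.960710 1 bgp.go:48] read OPEN from "10.1.105.1:179": '
           "message type is not OPEN, got 3, want 1")
    print(explain_open_error(log))
```

So the router did answer, but with a NOTIFICATION (it was closing the session) rather than an OPEN.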
Environment:
uname -a
): Linux 4.4.0-103-generic