GCP ILB support / support scope local routes to be configured #4109

Open
Tracked by #9249 ...
nberlee opened this issue Aug 22, 2021 · 2 comments
nberlee commented Aug 22, 2021

Feature Request

Talos should mimic the google-agent (watch the metadata for forwardedIPs and add them to the routing table with scope local) in order to support ILB, at least for the Kubernetes API endpoint.

The Problem

The only way to create a load balancer without a public internet address in GCP is to use an Internal Load Balancer (ILB). An ILB does not use a TCP proxy for load balancing. Instead, it forwards the packet directly to the backend without modification, and expects the backend to accept this packet and use direct server return (DSR).

This creates a few problems for Talos, or for any other VM that has no google-agent running.

  • It does not know that it needs to accept packets addressed to the load balancer's IP, and therefore drops them.
  • Return traffic must always use the load balancer's IP as its source IP. Otherwise, stateful firewalls such as the one in GCP run into problems: if the Talos node responds with its normal interface IP, the stateful firewall cannot match the session and tries to create a new one (which will probably be blocked, as this source/destination IP/port tuple is normally not in the firewall rules).
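
The session-matching problem can be illustrated with a toy model. This is only a sketch of how a generic connection tracker matches replies, not a description of how the GCP firewall is implemented; the `Packet` and `StatefulFirewall` names and all addresses are made up for the example.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class StatefulFirewall:
    """Toy connection tracker: a reply passes only if it mirrors a tracked flow."""

    def __init__(self):
        self.flows = set()

    def track_inbound(self, pkt):
        # Remember the inbound packet's 4-tuple as an established session.
        self.flows.add(pkt)

    def allows_reply(self, reply):
        # A reply belongs to a session when swapping its source and
        # destination yields exactly the tracked inbound 4-tuple.
        mirrored = Packet(reply.dst_ip, reply.dst_port, reply.src_ip, reply.src_port)
        return mirrored in self.flows


# Hypothetical addresses: client, ILB IP, and the node's own interface IP.
CLIENT, ILB_IP, NODE_IP = "10.0.1.10", "10.0.0.100", "10.0.0.2"

fw = StatefulFirewall()
fw.track_inbound(Packet(CLIENT, 40000, ILB_IP, 6443))

# Reply sourced from the ILB IP mirrors the session; one from the node IP does not.
reply_from_ilb_ip = Packet(ILB_IP, 6443, CLIENT, 40000)
reply_from_node_ip = Packet(NODE_IP, 6443, CLIENT, 40000)
```

In this model only `reply_from_ilb_ip` matches the tracked session, which is why the node has to answer from the load balancer's IP.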

The Solution

The IPs whose traffic the ILB can deliver to the Talos node are listed in the GCE metadata. Talos could watch this field and add or remove the corresponding routes on the interface with scope local.
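
As a rough sketch of that watch-and-reconcile loop (not the google-agent's or Talos's actual code), the logic could look like the following. The metadata path, the `wait_for_change` long poll, and the exact `ip route` arguments (`to local ... scope host ... proto 66`) are my assumptions about what the google-agent does and should be verified; the function names and the `eth0` default are hypothetical.

```python
import ipaddress
import json
import urllib.request

# Assumed metadata path for the first NIC; wait_for_change makes it a long poll.
FORWARDED_IPS_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/"
    "network-interfaces/0/forwarded-ips/?recursive=true&wait_for_change=true"
)


def fetch_forwarded_ips(url=FORWARDED_IPS_URL, timeout=120):
    """Block until the metadata value changes, then return the list of IPs."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode())


def normalize(ips):
    """Turn addresses into prefixes, e.g. '10.0.0.5' becomes '10.0.0.5/32'."""
    return {str(ipaddress.ip_network(ip, strict=False)) for ip in ips}


def plan_route_changes(desired, installed, dev="eth0"):
    """Return the ip(8) commands that bring installed local routes in line
    with the forwarded IPs reported by the metadata server."""
    want, have = normalize(desired), normalize(installed)
    cmds = [
        f"ip route add to local {p} scope host dev {dev} proto 66"
        for p in sorted(want - have)
    ]
    cmds += [
        f"ip route delete to local {p} scope host dev {dev} proto 66"
        for p in sorted(have - want)
    ]
    return cmds
```

A caller would loop over `fetch_forwarded_ips()`, compare the result with the routes it installed earlier, and apply whatever `plan_route_changes` returns. Only the diffing step is pure; in Talos the route changes would go through its own network controllers rather than shell commands.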

Conditions

This solution should only be active when:

  • It is running on a worker node (not needed for Cilium, but other CNIs may need it to support ILB). This solution does not hinder CNIs, as the google-agent also runs on GKE nodes.
  • It is running on a control-plane node and the local kube-apiserver has been bootstrapped successfully. Otherwise the node will never bootstrap, because it cannot reach a working Kubernetes API server on the load-balancing IP. @smira suggested starting this after kube-apiserver has started, which satisfies this condition.

Tried workarounds

  • After discussing this with @rsmitty and @smira in the #support Slack channel, we came up with a workaround using:

    - op: add
      path: /machine/network/interfaces
      value:
        - interface: eth0
          dhcp: true
        - interface: eth0
          cidr: <ILBIP>/32
    

    This works with non-strict firewall rules. However, it creates asymmetric traffic, as the return traffic has the DHCP address as its source IP instead of the load balancer's IP. (This currently blocks my whole project from moving forward.)

  • Using the Google cloud controller manager to create a LoadBalancer Service in Kubernetes, and thereby an ILB for the Kubernetes API, only addresses the problem of accepting traffic for the load balancer's IP. The CNI (Cilium in my case) now takes care of that IP, so the node accepts the traffic. However, as kube-apiserver runs on the host network, return traffic does not flow back through the CNI, and the source port and source IP are not rewritten. This creates asymmetric traffic, because the stateful firewall cannot match the source IP and source port (6443 instead of the NodePort used by the GCE cloud controller manager).

  • Another solution would be a YAML patch containing the same route that the google-agent would create dynamically. However, there is currently no way to create a route with scope local in Talos. (If there were, it would really help me out here.)

Example code

To show how urgently I need this feature, I started writing some code. Although I can write simple Go code, I do not feel at all comfortable adding the hooks where they are needed.

It also needs scope local route support in the Talos route functions, but it does provide all the metadata and logic parts. See nberlee@d5f6cad

However, feel free to totally rewrite it.


sergelogvinov commented May 23, 2022

Azure and Oracle internal load balancers use DSR too.

One option to solve this is to assign the VIP address on all control plane nodes manually (IP alias).
But this can only be done after the etcd join event, because otherwise Talos cannot find its neighbors through the VIP address. (The discovery service is off in this case.)

To have Talos do this automatically:
at boot time, wait for the etcd join event and then assign the VIP address to the lo interface.
And, outside the cloud case, probably run this to disable ARP announcements:

ifconfig lo:0 inet VIP netmask 255.255.255.255 up

echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce

In the machine config, add a parameter like:

network:
  interfaces:
    - interface: eth0
      vip: 
        dsr: true
        ip: <IP>

All control plane nodes will then have the VIP address at the same time, and the external LB will work fine (after health checks, of course).
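
The ordering described above can be sketched as a small planning function. This is purely illustrative: `dsr_vip_plan` and its parameters are hypothetical, and `ip addr add` and `sysctl -w` are used here as modern equivalents of the `ifconfig` and `/proc` commands shown earlier.

```python
import ipaddress


def dsr_vip_plan(vip, etcd_joined, in_cloud=False):
    """Return the commands to run for a DSR VIP, or nothing before etcd join."""
    ipaddress.IPv4Address(vip)  # raises ValueError on a malformed address

    # Before the etcd join the node must not own the VIP, otherwise it
    # could not reach its peers through the load balancer.
    if not etcd_joined:
        return []

    steps = [f"ip addr add {vip}/32 dev lo"]
    if not in_cloud:
        # Outside the cloud, keep lo from answering or announcing ARP for the VIP.
        steps += [
            "sysctl -w net.ipv4.conf.lo.arp_ignore=1",
            "sysctl -w net.ipv4.conf.lo.arp_announce=2",
        ]
    return steps
```

For example, `dsr_vip_plan("10.0.0.100", etcd_joined=True, in_cloud=True)` yields only the address assignment, while the non-cloud case also includes the two ARP settings.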

@smira smira added this to the v1.4 milestone Jan 18, 2023

smira commented Feb 1, 2023

Priority of the issue is contingent on customer negotiations.

@smira smira modified the milestones: v1.4, v1.5 Apr 27, 2023
@smira smira modified the milestones: v1.5, v1.6 Aug 2, 2023
@smira smira modified the milestones: v1.6, v1.7 Dec 15, 2023
@smira smira modified the milestones: v1.7, v1.8 Apr 4, 2024
@smira smira modified the milestones: v1.8, v1.9 Aug 30, 2024