Egress Policy #667

jianjuns · 2020-04-29T22:46:26Z

Describe what you are trying to solve
Implement Egress and SNAT policies in Antrea: the ability to control the egress Nodes and SNAT IPs of Pod egress traffic (from Pods to the external network).

Describe the solution you have in mind
Just some high-level ideas for now.

Egress policy definition

EgressPolicy CRD
We might introduce an EgressPolicy CRD that:

- selects Pods, Namespaces, Services to apply the policy
  - We might prioritize selecting a single Namespace or Service.
- defines SNAT strategy, e.g.:
  - using a specified IP
  - allocating a dedicated IP from an IP pool
- potentially supports other egress policies, e.g.:
  - egressing from a specified Node
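
As a rough illustration of the fields listed above, here is a minimal Python sketch of what such a policy object might carry. It is not an Antrea type: the class name `EgressPolicySketch`, all field names, and the rule that exactly one of `egress_ip` or `ip_pool` must be set are assumptions made only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EgressPolicySketch:
    """Hypothetical in-memory shape of an egress policy (not a real CRD)."""

    name: str
    # Label selectors for the Pods / Namespaces the policy applies to.
    pod_selector: Dict[str, str] = field(default_factory=dict)
    namespace_selector: Dict[str, str] = field(default_factory=dict)
    # SNAT strategy: either a specified IP, or the name of an IP pool
    # from which a dedicated IP would be allocated.
    egress_ip: Optional[str] = None
    ip_pool: Optional[str] = None
    # Optional: egress from a specified Node.
    egress_node: Optional[str] = None

    def validate(self) -> None:
        # Assumption for this sketch: exactly one SNAT strategy is chosen.
        if (self.egress_ip is None) == (self.ip_pool is None):
            raise ValueError("set exactly one of egress_ip or ip_pool")


policy = EgressPolicySketch(
    name="web-egress",
    pod_selector={"app": "web"},
    egress_ip="10.10.0.5",
)
policy.validate()  # passes: a single SNAT strategy is set
```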

IPPool CRD
We could add a CRD to define an IP pool. Besides SNAT, it could be used for Pod IPAM too.

NodePool CRD
We could add a CRD to define the set of Nodes that can act as egress Nodes. For simplicity, we might start from a single egress Node pool.
There could be multiple interfaces on a Node. We might need to support configuring which interface to use for egress.

Egress IP management

Discovery of Node IPs
Antrea Agent auto-discovers all interfaces and their IPs, and probably saves the information to a CRD like NodeInfo.
Users can then use any of the discovered IPs to define an EgressPolicy.

Auto IP assignment by Controller
Antrea Controller can automatically assign a SNAT IP to a Node from a configured NodePool.

HA and failover
When the SNAT IP is assigned to Nodes by Controller, we might further support failover of the SNAT IP - moving the SNAT IP to a new Node when the current Node fails. There could be two possible approaches:

Decision by Controller
When the Node fails, Controller should move the IP to another available Node.

To avoid conflicts between the old and new egress Nodes (e.g. the old Node loses its connection to the K8s API or Controller, but is still active and can serve egress traffic in the datapath), we might introduce a conflict detection mechanism. For example, the new Node could ping the old Node (at the SNAT IP) to see whether it is still active and reachable.
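
The conflict check could look roughly like the sketch below. The function name `safe_to_take_over`, the `probe` callback, and the retry count are all hypothetical; `probe` stands in for whatever reachability test (for example an ICMP ping of the SNAT IP) the new Node would actually run.

```python
from typing import Callable


def safe_to_take_over(
    snat_ip: str,
    probe: Callable[[str], bool],
    attempts: int = 3,
) -> bool:
    """Return True only if every probe of the SNAT IP goes unanswered.

    `probe(ip)` must return True when something still answers on `ip`.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for _ in range(attempts):
        if probe(snat_ip):
            # The old egress Node still holds the IP: do not claim it.
            return False
    return True


# Usage with stand-in probes (a real probe would send packets).
print(safe_to_take_over("10.10.0.5", lambda ip: False))  # old Node silent
print(safe_to_take_over("10.10.0.5", lambda ip: True))   # old Node still answers
```

Probing several times before claiming the IP trades a slower failover for a lower chance of two Nodes serving the same SNAT IP at once.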

Limitations:

- If the IP is assigned by Controller, the SNAT IP cannot fail over to another Node when Controller or the K8s API is down.
- In SNAT IP failover, existing connections will be broken, as we do not replicate connection state.

Active/standby Nodes
Controller selects a pair of active/standby Nodes for each SNAT IP. Active/standby Nodes use a distributed protocol to decide the active Node and even replicate connection state.
One possible solution in Linux is to leverage conntrackd and keepalived.

If we can assume all SNAT IPs are reachable from every Node, the implementation could be simpler: the source Nodes need not know which Node the SNAT IP is on, and can just tunnel or route the packets to the SNAT IP. If this is not the case (for example, SNAT IPs are in a separate network from the Node network, and are assigned to extra NICs of a specific set of Nodes that act as egress Nodes), we need some way to notify all Nodes about the current active egress Node for a SNAT IP. That could go through Controller or the K8s API, which again requires Controller or the K8s API to be available in the failover case, or through another distributed protocol.

Data path design

The source Node forwards the egress packets to the egress Node (through a tunnel, or through routing in noEncap or hybrid mode), and the egress Node SNATs the packets with the assigned SNAT IP.

rangar · 2020-05-20T23:09:03Z

Have you considered the multiple interfaces/networks that Nodes can be connected to? Will I be able to pick which interface/network I can SNAT from?

jianjuns · 2020-05-21T01:03:25Z

> Have you considered the multiple interfaces/networks that Nodes can be connected to? Will I be able to pick which interface/network I can SNAT from?

Yes, I mentioned multiple interfaces in the "NodePool CRD" section. But like the other ideas described in the proposal, there is no detailed design yet, and we might need to look into the details.

github-actions · 2020-11-18T00:07:40Z

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

tnqn · 2021-01-27T14:18:44Z

@jianjuns I'm working on this and have some questions I want to discuss with you:

1. About the multiple interfaces/networks a Node can connect to: is it a valid use case to consider? I thought the egress policy applies to all external traffic, so the egress interface should always be the one that has the default route? If there are multiple networks, then the policy should have something like DestinationCIDR? We will have to configure routes if they don't match the user-configured policy (destinationCIDR => interface / egress IP).
2. Since users can use PodSelector and NamespaceSelector to select Pods, a Pod may be selected by multiple policies, and it seems difficult to prevent this as a Pod could be created after the policy. What do we do in this case? Is randomly SNATing to one of them acceptable, as this shouldn't be a valid use case? Another way is to make the EgressPolicy map 1:1 to a Namespace, but I guess you wouldn't sacrifice the flexibility for the unusual case.
3. Is the NodePool CRD needed? Do you think we could just use a Node labelSelector to make the configuration easier? Typically users can label certain Nodes as egress Nodes and select them via NodeSelector in the policy.

jianjuns · 2021-01-28T01:17:58Z

1. I am thinking about the case where you have a subset of Nodes for egress, which have extra NICs on a different physical subnet. If we allocate SNAT IPs to Nodes, then it seems we might need to configure routes too, but for the 1st version, if we just assume IPs are configured manually by users, then we can assume routes are correctly configured too.
2. This is a valid point. Unless we introduce priority, it seems we can only randomly select an IP. Another choice is to fall back to Service and Namespace annotation, but then as you said we lose the flexibility.
   There is an upstream proposal which also proposes label selector: sig-network: Add egress-source-ip-support KEP kubernetes/enhancements#1105
3. You mean to select Nodes in the SNAT policy CRD? Basically I am trying to separate IP management from SNAT, so SNAT policies can be independent of IP management. And probably we should consider making the IP management part generic, so it can be shared by other features like L4 LB (assuming we might implement LB type Services too).

tnqn · 2021-01-28T03:41:14Z

1. I mean the Nodes that can access the external network should have a default route on one of their NICs, right? So we don't need to care about how many NICs they have and could always assign the SNAT IP to the NIC with the default route? If this is a reasonable assumption, we could configure the IP automatically instead of asking the user to do it.
2. Then maybe let's first use labelSelector and assume overlapping policies are not normal.
3. Yes, I mean how users select the Nodes for a policy. I see you proposed the NodePool CRD and wondered if we can just use labelSelector. For example, a user could label certain Nodes with "egress: true", then configure the egress policy's nodeSelector: egress=true. I feel it might be easier to use and implement.
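
To make the labelSelector idea concrete, here is a small Python sketch of equality-based label matching. It only illustrates the selection logic; the function name `select_egress_nodes` and the plain-dict representation of Nodes and labels are assumptions for this example, not Antrea or Kubernetes API types.

```python
from typing import Dict, List


def select_egress_nodes(
    nodes: Dict[str, Dict[str, str]],
    selector: Dict[str, str],
) -> List[str]:
    """Return the names of Nodes whose labels contain every selector pair."""
    return sorted(
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


# Hypothetical cluster: only node-a carries the label egress=true.
nodes = {
    "node-a": {"egress": "true", "zone": "z1"},
    "node-b": {"egress": "false"},
    "node-c": {},
}
print(select_egress_nodes(nodes, {"egress": "true"}))
```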

jianjuns · 2021-01-28T04:05:11Z

1. Right, for now we might assume there is a default route. Later we can consider using different SNAT IPs for different destinations, and then it is not necessarily a default route. But I would still do IP configuration later (in my mind we need to support auto IP->Node assignment together).
2. Ok.
3. In your proposal, does the user need to duplicate the label selector for every SNAT policy? And how do we define that for Services when we support L4 LB? I would either have a separate (group) CRD to select Nodes, or select Nodes in the IPPool CRD. But we can decide that when we support auto IP assignment?

jianjuns · 2021-02-19T04:57:55Z

In the first version, we require users to manually configure SNAT IPs on the Nodes. In a SNATPolicy, a particular SNAT IP can be specified for the selected Pods, and antrea-controller will publish the SNATPolicy to the Nodes on which the selected Pods run.
On the Node, antrea-agent will realize the SNATPolicy with OVS flows and iptables rules. If the SNAT IP is not present on the local Node, the packets to be SNAT'd will be tunneled to the SNAT Node, with the SNAT IP as the tunnel destination IP. On the SNAT Node, the tunnel destination IP will be directly used as the SNAT IP.
On the SNAT Node, an iptables rule will be added to perform the SNAT with the specified SNAT IP, but which SNAT IP to use for a given packet is controlled by the OVS flows. The OVS flows will mark each packet that needs to be SNAT'd with the integer ID allocated for its SNAT IP, and the corresponding iptables SNAT rule matches the packet mark.

The OVS flow changes include:

table 31
// SNAT flows for Windows
- priority=210 ip,-new+trk,snatCTMARK,from_uplink macRewriteMark,goto:40 (SNAT return traffic)
+ priority=210 ip,-new+trk,snatCTMARK,from_uplink,nw_dst=localSubnet macRewriteMark,goto:40 (SNAT return traffic - remote packets will be handled by L3Fwd flows, so no need to set the macRewrite MAC)

table 70
// Reuse these Windows SNAT flows to skip packets that do not need SNAT
+priority=200 ip,from_local,nw_dst=localSubnet goto:80
+priority=200 ip,from_local,nw_dst=nodeIP goto:80
+priority=200 ip,from_local,nw_dst=gatewayCTMark goto:80

// Send packets for external network to the SNAT table
+priority=190 ip,from_local goto:71
+priority=190 ip,macRewriteMark mod_dl_dst:gw0_mac,goto:71 (traffic tunneled from remote Nodes)

+table 71 (snatTable. ttlDecTable is moved to table 72)
// Windows flows: load SNAT IP to a register (probably share the endpointIPReg and endpointIPv6XXReg)
priority=200 ip,+new+trk,in_port=local_pods snatRequiredMark(snat_ip),goto:80 (SNAT for local Pods, matching in_ports)
priority=200 ip,+new+trk,tun_dst=snat_ip snatRequiredMark(tun_dst),goto:80 (SNAT for remote Pods, matching tun_dst)
priority=190 ip,+new+trk snatRequiredMark(node_ip),goto:80 (default SNAT IP)

// Linux: mark the packet with an integer ID allocated for each SNAT IP
priority=200 ip,+new+trk,in_port=local_pods mark(snat_id),goto:80 (SNAT for local Pods)
priority=200 ip,+new+trk,tun_dst=snat_ip mark(snat_id),goto:80 (SNAT for remote Pods)

// common: tunnel packets that need SNAT on a remote Node, with the SNAT IP as the outer destination
priority=200 ip,in_port=local_pods mod_dl_src:gw0_mac,mod_dl_dst:vMAC,snat_ip->NXM_NX_TUN_IPV4_DST,goto:72
priority=0 goto_table:80

+table 72 (ttlDecTable)

table 105
// Windows: perform SNAT with the SNAT IP saved in the register
+priority=200 ip,+new+trk,snatRequiredMark ct(commit,table=110,zone=65520,nat(src=snat_ip),snatCTMark)

iptables rules:
iptables -t nat -A POSTROUTING -m mark --mark snat_id -j SNAT --to-source snat_ip
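
As an illustration of the "integer ID per SNAT IP" scheme on Linux, the Python sketch below allocates a mark for each SNAT IP and builds the matching iptables arguments in the shape of the rule above. The class `SnatMarkAllocator`, the function `iptables_snat_args`, the mark range 1..255, and the example IPs are all assumptions for this sketch; it only builds the argument list and does not run iptables.

```python
from typing import Dict, List


class SnatMarkAllocator:
    """Hands out one small integer mark per SNAT IP (hypothetical helper)."""

    def __init__(self, max_marks: int = 255) -> None:
        self._max = max_marks
        self._ip_to_mark: Dict[str, int] = {}

    def allocate(self, snat_ip: str) -> int:
        # Reuse the existing mark if this IP was already seen.
        if snat_ip in self._ip_to_mark:
            return self._ip_to_mark[snat_ip]
        used = set(self._ip_to_mark.values())
        # Start at 1: mark 0 is what an unmarked packet carries.
        for mark in range(1, self._max + 1):
            if mark not in used:
                self._ip_to_mark[snat_ip] = mark
                return mark
        raise RuntimeError("no free SNAT mark")

    def release(self, snat_ip: str) -> None:
        self._ip_to_mark.pop(snat_ip, None)


def iptables_snat_args(mark: int, snat_ip: str) -> List[str]:
    """Build the argument list for the per-IP SNAT rule."""
    return [
        "iptables", "-t", "nat", "-A", "POSTROUTING",
        "-m", "mark", "--mark", hex(mark),
        "-j", "SNAT", "--to-source", snat_ip,
    ]


allocator = SnatMarkAllocator()
for ip in ["10.10.0.5", "10.10.0.6"]:
    print(" ".join(iptables_snat_args(allocator.allocate(ip), ip)))
```

Keeping the IP-to-mark mapping in one place means the OVS flow that sets the mark and the iptables rule that matches it are derived from the same ID.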

jianjuns · 2021-02-19T04:58:32Z

@tnqn : let me know what you think ^^
I tested the flows and iptables rules already.

tnqn · 2021-02-23T15:29:51Z

@jianjuns the proposal LGTM. One question about configuring the SNAT IP and publishing the SNATPolicy:
It looks like in the first version the SNATPolicy doesn't have any Node information. How does a Node know that it is a SNAT Node, and which IPs are SNAT IPs, if the SNATPolicy is only pushed to Nodes that run the selected Pods? Or, if it's still required to specify a Node in the SNATPolicy or there is a NodePool CRD, is it really necessary to ask users to configure the SNAT IP manually?

jianjuns · 2021-02-23T17:22:47Z

@tnqn : I think each Node can discover all local IPs, and based on that decide whether to perform SNAT locally or tunnel to the SNAT IP.

Do you have extra thoughts on the 1st version scope, like should we do IPv4 only or not?

tnqn · 2021-02-24T02:00:03Z

I assume the agent will need to treat all local IPs as potential SNAT IPs and, for each of them, configure OpenFlow entries, allocate mark IDs, and configure iptables rules. Could that lead to a lot of unnecessary configuration? For example, when kube-proxy IPVS mode is used, all Service IPs are configured on a network interface, and there might be other cases in production that we are not aware of.

jianjuns · 2021-02-24T02:03:38Z

Do we have some way to filter IPs and assume we have a reasonable set of IPs to care about?

Another way is to watch all SNAT policies and so know which IPs can be used.
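
A minimal sketch of that second option, assuming the agent can see both the SNAT IPs named in policies and the IPs configured locally: only IPs named by some policy are considered, and each is classified as "SNAT here" or "tunnel to a remote Node". The function name `plan_snat` and the example addresses are hypothetical.

```python
from ipaddress import ip_address
from typing import Iterable, Set, Tuple


def plan_snat(
    policy_ips: Iterable[str],
    local_ips: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Split policy SNAT IPs into (handled locally, tunneled to a remote Node).

    Local IPs that no policy refers to (e.g. Service IPs that kube-proxy
    IPVS mode puts on an interface) are simply ignored.
    """
    local = {ip_address(ip) for ip in local_ips}
    wanted = {ip_address(ip) for ip in policy_ips}
    local_snat = {str(ip) for ip in wanted & local}
    remote_snat = {str(ip) for ip in wanted - local}
    return local_snat, remote_snat


local_snat, remote_snat = plan_snat(
    policy_ips=["10.10.0.5", "10.10.0.9"],
    local_ips=["192.168.1.2", "10.10.0.5", "10.96.0.1"],
)
print(sorted(local_snat), sorted(remote_snat))
```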

tnqn · 2021-02-24T02:51:22Z

I think the IP filtering approach might not be clean and could become complex to adapt to all scenarios. Using SNATPolicy as the source of truth sounds good to me.

Is this the first version scope in your mind:

- User needs to configure SNAT IPs manually and reasonably (no duplicates, none missing)
- User configures a SNATPolicy with PodSelector and SNAT IP
- No failover if the Node that holds the SNAT IP crashes
- Encap mode only
- dual-stack?

jianjuns · 2021-02-24T04:46:21Z

Yes, that is what I am thinking. Do you think it can save some work if we start from IPv4 and Linux (not much to support Windows too)?

tnqn · 2021-02-24T05:10:03Z

I feel the main work to support IPv6 is testing, as the design doesn't sound address-family specific, while I'm not sure how much extra work is needed to support Windows, given that it doesn't use iptables to do SNAT. I would lean toward supporting dual-stack, Linux only, in the first version.

jianjuns · 2021-02-24T05:36:31Z

Windows is even easier? As we do SNAT with OVS only. But the same as IPv6, there will be work for testing. If our target is 0.14, I think we can just do IPv4 on Linux.

tnqn · 2021-02-24T05:52:11Z

Sure, IPv4 on Linux sounds good to me.

github-actions · 2021-08-24T00:27:56Z

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

antoninbas · 2021-08-24T00:34:44Z

I am going to close this. Other more specialized issues can be created to address gaps in the implementation: Windows, noEncap, etc.

jianjuns added the proposal label Apr 29, 2020

github-actions bot added the lifecycle/stale label Nov 18, 2020

tnqn removed the lifecycle/stale label Jan 25, 2021

tnqn assigned tnqn and jianjuns Jan 27, 2021

jianjuns mentioned this issue Feb 22, 2021: Refactor Windows SNAT flows for SNAT policy implementation #1892 (Merged)

tnqn mentioned this issue Mar 1, 2021: Egress Policy v1alpha1 implementation #1924 (Closed, 5 tasks)

jianjuns mentioned this issue Mar 3, 2021: Add Egress CRD types #1433 (Merged)

tnqn mentioned this issue Mar 25, 2021: Add iptables interface for implementing Egress #1998 (Merged)

jianjuns mentioned this issue Apr 27, 2021: Support automatic failover for Egress #2128 (Closed, 6 tasks)

github-actions bot added the lifecycle/stale label Aug 24, 2021

antoninbas closed this as completed Aug 24, 2021
