Egress Policy v1alpha1 implementation #1924

tnqn · 2021-03-01T09:53:33Z

Describe what you are trying to solve
This proposal summarizes the first alpha version of the Egress feature. Please see #667 for the complete proposal.

In v1alpha1, we require users to manually configure SNAT IPs on the Nodes. In an Egress, a particular SNAT IP can be specified for the selected Pods, and antrea-controller will publish the selected Pods of Egresses to the Nodes on which the selected Pods run.

There will be some limiations in the first version: encap mode is the only supported traffic mode. Some features and scenarios, e.g. HA, dual-stack and windows are not supported.

Describe how your solution impacts user flows

User configures secondary IPs that can be used as SNAT IPs to Nodes' network interfaces.
User configures EgressPolicy (a CRD API) which selects specific Pods and the IP they should be translated to when accessing external addresses.

Describe the main design/architecture of your solution

API change

An user-facing API will be introduced. The object schema will be like below:

type EgressPolicy struct {
	metav1.TypeMeta `json:",inline"`
	// Standard metadata of the object.
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Specification of the desired behavior of EgressPolicy.
	Spec EgressPolicySpec `json:"spec"`
}

// EgressPolicySpec defines the desired state for EgressPolicy.
type EgressPolicySpec struct {
	// AppliedTo selects Pods to which the policy will be applied.
	AppliedTo AppliedTo
	// EgressIP specifies the SNAT IP address for the selected Pods.
	EgressIP string
}

// AppliedTo defines the workloads to which a policy is applied.
type AppliedTo struct {
	// Select Pods matched by this selector. If set with NamespaceSelector,
	// Pods are matched from Namespaces matched by the NamespaceSelector;
	// otherwise, Pods are matched from all Namespaces.
	// +optional
	PodSelector *metav1.LabelSelector `json:"podSelector,omitempty"`
	// Select all Pods from Namespaces matched by this selector, as
	// workloads in To/From fields. If set with PodSelector,
	// Pods are matched from Namespaces matched by the NamespaceSelector.
	// +optional
	NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"`
}

Egress's Pod selection is calculated by antrea-controller and transmitted to antrea-agent via a controlplane API EgressGroup. This is mainly to avoid redundant Pod watching and group calculation when resolving "AppliedTo".
A Egress's corresponding EgressGroup will use the same name for agent to identify, like Service and Endpoint resource.

type EgressGroup struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	// GroupMembers is a list of resources selected by this group.
	GroupMembers []GroupMember
}

type EgressGroupPatch struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	AddedGroupMembers   []GroupMember
	RemovedGroupMembers []GroupMember
}

Control Plane

antrea-controller

antrea-controller watches the Egress resource from Kubernetes API, creates the EgressGroup resources. EgressGroup API in the controlplane API group will provide list, get, and watch interface for agents to consume.

antrea-agent

antrea-agent watches the above EgressGroup API and Egress API, then:
For each Egress, it checks whether the EgressIP is configured on the Node it runs on. If yes, it allocates a locally-unique ID (usage mentioned in the "Data plane" section below) for this IP and configures corresponding openflow rules and iptables rules to enforce SNAT for specific traffic. Otherwise it does nothing.
For each Pod in EgressGroup, it checks whether the associated EgressIP is local or not. If local, it configures specific openflow rules to forward the traffic coming from the Pod to the gateway interface with specific mark set. If remote, it configures specific openflow rules to forward the traffic to the tunnel interface with specific tunnel destination set.

Data Plane

(Copied from #667 (comment))
On the Node, antrea-agent will realize the SNATPolicy with OVS flows and iptables rules. If the SNAT IP is not present on the local Node, the packets to be SNAT'd will be tunneled to the SNAT Node using the SNAT IP to be the tunnel destination IP. On the SNAT Node, the tunnel destination IP will be directly used as the SNAT IP.
On the SNAT Node, an iptables rule will be added to perform the SNAT with the specified SNAT IP, but which SNAT IP to use for a given packet is controlled by the OVS flows. The OVS flows will mark a packet that needs to be SNAT'd with a SNAT IP with the corresponding integer ID, and the corresponding iptables SNAT rule matches the packet MARK.

The OVS flow changes include:

table 31
// SNAT flows for Windows
- priority=210 ip,-new+trk,snatCTMARK,from_uplink macRewriteMark,goto:40 (SNAT return traffic)
+ priority=210 ip,-new+trk,snatCTMARK,from_uplink,nw_dst=localSubnet macRewriteMark,goto:40 (SNAT return traffic - remote packets will be handled by L3Fwd flows, so no need to set the macRewrite MAC)

table 70
// Reuse these Windows SNAT flows to skip packets need not SNAT
+priority=200 ip,from_local,nw_dst=localSubnet goto:80
+priority=200 ip,from_local,nw_dst=nodeIP goto:80
+priority=200 ip,from_local,nw_dst=gatewayCTMark goto:80

// Send packets for external network to the SNAT table
+priority=190 ip,from_local goto:71
+priority=190 ip,macRewriteMark mod_dl_dst:gw0_mac,goto:71 (traffic tunneled from remote Nodes)

+table 71 (snatTable. ttlDecTable is moved to table 72)
// Windows flows: load SNAT IP to a register (probably share the endpointIPReg and endpointIPv6XXReg)
priority=200 ip,+new+trk,in_port=local_pods snatRequiredMark(snat_ip),goto:80 (SNAT for local Pods, matching in_ports)
priority=200 ip,+new+trk,tun_dst=snat_ip snatRequiredMark(tun_dst),goto:80 (SNAT for remote Pods, matching tun_dst)
priority=190 ip,+new+trk snatRequiredMark(node_ip),goto:80 (default SNAT IP)

// Linux: mark the packet with an integer ID allocated for each SNAT IP
priority=200 ip,+new+trk,in_port=local_pods mark(snat_id),goto:80 (SNAT for local Pods)
priority=200 ip,+new+trk,tun_dst=snat_ip mark(snat_id),goto:80 (SNAT for remote Pods)

// common: tunnel packets need to SNAT on a remote Node with the SNAT IP to be the outer destination
priority=200 ip,in_port=local_pods mod_dl_src:gw0_mac,mod_dl_dst:vMAC,snat_ip->NXM_NX_TUN_IPV4_DST,goto:72
priority=0 goto_table:80

+table 72 (ttlDecTable)

table 105
// Windows: perform SNAT with the SNAT IP saved in the register
+priority=200 ip,+new+trk,snatRequiredMark ct(commit,table=110,zone=65520,nat(src=snat_ip),snatCTMark)

iptables rules:
iptables -t nat -A POSTROUTING -m mark --mark snat_id -j SNAT --to-source snat_ip

Work breakdown

new APIs for EgressPolicy Add Egress CRD types #1433
antrea-controller calculates EgressGroups and provides its List, Get, and Watch APIs Add EgressGroup API and Controller #1965
iptables initialization and interface Add iptables interface for implementing Egress #1998
openflow initialization and interface Add OVS flows for implementing Egress #1969
antrea-agent gets Policies and Groups, and invokes dataplane interfaces Add controller to antrea-agent for implementing Egress #2026

Alternative solutions that you considered
NONE

Test plan
Add E2E tests to verify specific Pods are translated to specific IP when accessting an http server deployed "outside" the cluster (it could be a host-network Pod running on a Node that is different from the Egress Node.

Additional context
Any other relevant information.

The text was updated successfully, but these errors were encountered:

tnqn · 2021-03-01T09:57:55Z

@jianjuns I created this issue to track the design change and progress of the first version of the feature, copied some content we have discussed in #667 here. Feel free to update it directly if you have any ideas on details or names.

jianjuns · 2021-03-02T00:43:51Z

Thanks for the details of version 1.

A question for controlplane API: could we reuse AppliedToGroup instead of adding a new EgressGroup?

jianjuns · 2021-03-02T00:45:01Z

And for AppliedTo why do not have ClusterGroup and Service reference there? Is it for simplification of 1st version?

vicky-liu · 2021-03-02T07:14:26Z

cc @ceclinux to take a look at work breakdown.

tnqn · 2021-03-02T16:50:17Z

@jianjuns

A question for controlplane API: could we reuse AppliedToGroup instead of adding a new EgressGroup?

I thought about this but didn't find real benefits to do so so switched to another way that could reduce code redundancy and grouping caculation between all kinds of groups, including clustergroups, appliedtogroups, addressgroups and egressgroups.
Some cons of reusing AppliedToGroup I thought of:

When AppliedToGroups for NetworkPolicy and EgressPolicy are mixed in a single API, the AppliedToGroups for one policy will have to wake up another policy's event handler unnecessarily.
The AppliedToGroup for EgressPolicy may use different strategy to dispatch to agents. Egress nodes may need to know all members of a given group, instead of the ones that are running on it (when we want to support noencap mode or the case that SNAT IP is not reachable from non egress nodes).
Currently the AppliedToGroup is coupled with NetworkPolicyController on both controller and agent side, extracting them for another vertical to reuse is more complex than extracting the grouping logic to a separate module and have different group APIs consuming it. In the latter way, the API is business aware while the grouping process is generic.
My PoC of the latter approach is close to finish, and I have verified it could improve the performance of networkpolicy controller greatly as well as reducing code redundancy. I may push the PR for review in 1 or 2 days.

And for AppliedTo why do not have ClusterGroup and Service reference there? Is it for simplification of 1st version?

I copied the struct from your PR. Supporting ClusterGroup and Service reference should be ok, I don't think of big effort introduced by them. Feel free to add them to the design if you think they should be in 1st version.

jianjuns · 2021-03-02T17:23:57Z

But from understanding/troubleshooting perspective, it is much better to use a single type, and map a single ClusterGroup to a single AddressGroup or AppliedToGroup.
If we think too much work to refactor NetworkPolicyController, can we still reusing the same AppliedToGroup type, but create another set of AppliedToGroups in this release?

I think better to support ClusterGroup and Service reference too. I can update my PR with my ideas.

tnqn · 2021-03-03T02:36:21Z

I think you mean having another API path but use the same struct, like "/v1alpha1/egressgroups" will get the new set of AppliedToGroup. However, clientset code is generated based on the name of the struct or the "resourceName" tag of the struct. I think it won't work if we use same struct in same API group as the paths in the generated clientset will be exactly same.

And what do you think the first and second problems I mentioned above, especially the second? I think the group for egresspolicy has more difference from AppliedToGroup for NetworkPolicy: it needs to include podIP information and dispatched to all egress nodes, which is more like AddressGroup for Egress Node but AppliedToGroup for non Egress Node.
With these differences, do you think it's still worth to consider EgressNode as AppliedToGroup?

tnqn · 2021-03-17T17:52:29Z

@jianjuns Given the fact that all agents need to watch all Egresses and there shouldn't be overlapped groups for Egress, I found there is no much value to have a controlplane Egress API as we could just create a EgressGroup with the same name as the Egress resource (just like Service and Endpoint), then use the Egress's name to get its group on agent side to save many code (the controller in antrea-controller can focus on syncing EgressGroup, antrea-agent can leverage Egress Informer), let me know if you have concern on this. This is the code on antrea-controller side: 178405b

jianjuns · 2021-03-17T17:59:23Z

I am fine to watch Egresses directly from K8s API for now. We can decide what to do later (when we have another solution to discover/assign SNAT IPs).

tnqn · 2021-04-07T17:15:47Z

All code changes have been merged, closing

tnqn added the kind/design Categorizes issue or PR as related to design. label Mar 1, 2021

tnqn added this to the Antrea v0.14.0 release milestone Mar 1, 2021

jianjuns mentioned this issue Mar 18, 2021

Add OVS flows for implementing Egress #1969

Merged

This was referenced Mar 25, 2021

Add iptables interface for implementing Egress #1998

Merged

Add EgressGroup API and Controller #1965

Merged

This was referenced Apr 6, 2021

Add controller to antrea-agent for implementing Egress #2026

Merged

Add CRD resource for Egress feature #2040

Closed

Document Egress feature #2041

Merged

tnqn closed this as completed Apr 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Egress Policy v1alpha1 implementation #1924

Egress Policy v1alpha1 implementation #1924

tnqn commented Mar 1, 2021 •

edited

Loading

tnqn commented Mar 1, 2021

jianjuns commented Mar 2, 2021

jianjuns commented Mar 2, 2021

vicky-liu commented Mar 2, 2021

tnqn commented Mar 2, 2021

jianjuns commented Mar 2, 2021

tnqn commented Mar 3, 2021

tnqn commented Mar 17, 2021

jianjuns commented Mar 17, 2021

tnqn commented Apr 7, 2021

Egress Policy v1alpha1 implementation #1924

Egress Policy v1alpha1 implementation #1924

Comments

tnqn commented Mar 1, 2021 • edited Loading

API change

Control Plane

antrea-controller

antrea-agent

Data Plane

Work breakdown

tnqn commented Mar 1, 2021

jianjuns commented Mar 2, 2021

jianjuns commented Mar 2, 2021

vicky-liu commented Mar 2, 2021

tnqn commented Mar 2, 2021

jianjuns commented Mar 2, 2021

tnqn commented Mar 3, 2021

tnqn commented Mar 17, 2021

jianjuns commented Mar 17, 2021

tnqn commented Apr 7, 2021

tnqn commented Mar 1, 2021 •

edited

Loading