`docs/aci/aci_bgp_design.md`
In ACI software releases prior to 6.1(2), it is required to either:
This is required because routes generated by nodes directly connected to the anchor nodes are preferred over routes from nodes not directly connected to the anchor nodes, and this could lead to nodes not being utilized.
For example: if we expose a service with IP `1.1.1.1` via BGP, anchor `leaf101` will only install one ECMP path for `1.1.1.1/32` through `192.168.2.1`, because the locally attached route is preferred, even though multiple nodes could be providing this service.
The [Advanced Design](../../advanced_design) outlined in this document has undergone scalability testing to ensure its effectiveness and reliability under various conditions. However, it is important to note that these tests were conducted based on general scenarios and assumptions. As each organization's needs and architecture are unique, we strongly recommend conducting your own testing to validate the design's performance and suitability within your specific environment.
Custom testing will help identify any potential issues that may arise due to unique architectural elements or specific use cases pertinent to your organization. By doing so, you can ensure that the solution meets your performance expectations and integrates seamlessly with your existing systems.
The Advanced design has currently been tested with:
- 350 Node Kubernetes Cluster
- Each Node is advertising 250 BGP /32 Service Routes.
- All Nodes peer to 2 ACI Border Leaves
- The ACI Fabric is composed of 4 leaves
- Clients are accessing the service via:
  - An L3OUT
  - An EPG/ESG
## Generic ACI Scale
In the context of the Isovalent and Cisco DC Fabrics design, these are the metrics we need to keep in mind:
{: .note }
These are the metrics for ACI 6.1(2); if you are using a different version, refer to the [Verified Scalability Guide](https://www.cisco.com/c/en/us/support/cloud-systems-management/application-policy-infrastructure-controller-apic/tsd-products-support-series-home.html) for your release.
- Floating L3Out: maximum of 6 anchor and 32 non-anchor nodes
- IPs per MAC = 4096
- BFD neighbors: 2,000 sessions using these minimum BFD timers: minTx:300, minRx:300, multiplier:3
- Number of BGP neighbors: 2,000 per leaf with up to 70,000 external prefixes with a single path; 20,000 per fabric
- Shared L3Out (when leaking between VRFs): 2,000 IPv4 prefixes
- External EPGs per L3Out: 250 per L3Out, 600 fabric-wide
- Number of ESGs per Fabric = 10000
- Number of ESGs per VRF = 4000
- Number of ESGs per tenant = 4000
- Number of L3 IP Selectors per leaf = 5000
- Number of IP Longest Prefix Match (LPM) entries: 20,000 IPv4 with the default dual-stack profile. Worth noting that changing the profile reduces the number of supported ECMP paths.
## Conducted Tests
### Adding/Removing BGP Peers
**Test:**
Removing and then adding the label that enables the node for BGP peering.
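For reference, this is a minimal sketch of the label-driven peering this test exercises, assuming the BGP cluster configuration selects nodes via a hypothetical `bgp: "enabled"` label; the API group/version, label name, ASNs, and peer address are placeholders to adapt to your environment:

```yaml
# Hedged sketch: only nodes carrying the label below establish BGP peering.
# Removing and re-adding this label on a node reproduces the test.
apiVersion: isovalent.com/v1alpha1          # assumed enterprise API group/version
kind: IsovalentBGPClusterConfig
metadata:
  name: aci-bgp
spec:
  nodeSelector:
    matchLabels:
      bgp: "enabled"                        # hypothetical label that enables peering
  bgpInstances:
  - name: instance-65100
    localASN: 65100                         # placeholder cluster ASN
    peers:
    - name: aci-anchor-leaf101
      peerASN: 65000                        # placeholder ACI fabric ASN
      peerAddress: 192.168.2.254            # placeholder anchor leaf peer address
      peerConfigRef:
        name: aci-peer-config
```

Removing the label with `kubectl label node <node> bgp-` tears the session down; re-adding it brings the peering back.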
**Impact:**
None. This is expected thanks to Maglev: even if a node still receives the traffic, it will simply forward it on to the correct POD.
### Reloading a Kubernetes Node
**Test:**
This test is conducted by gracefully reloading a node.
**Impact:**
Minimal. Traffic can be dropped during routing table re-convergence, i.e. if traffic is sent to the reloading node before it is removed as a valid next hop.
This issue can be minimized by first removing the node from BGP Peering.
### Cilium Upgrade
The BFD and BGP processes run in the Cilium POD. Restarting the Cilium POD for any reason will result in the BFD adjacency going down; however, thanks to BGP Graceful Restart the impact is minimal.
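Graceful restart is enabled in the BGP peer configuration. Below is a minimal sketch, assuming the OSS `CiliumBGPPeerConfig` schema; adjust it to the enterprise peer configuration CRD if that is what your deployment uses, and size the timer to your expected restart window:

```yaml
# Hedged sketch: keep routes in the fabric while the Cilium BGP speaker restarts.
# The restartTimeSeconds value is illustrative.
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: aci-peer-config
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 120   # how long peers retain routes after the session drops
```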
**Test:**
Restarting the Cilium POD (or upgrading Cilium).
**Impact:**
Minimal.
### Reloading an Anchor Node
**Test:**
Reloading an Anchor Node from the CLI without Sending BFD Down Messages
**Impact:**
None, aside from potential in-flight packet loss.
The routing tables are not impacted by this, as all the next hops are the Kubernetes node IPs.
`docs/aci/advanced_design.md`
Centralized Routing
{: .note }
This design requires ACI 6.1(2) or above as the Propagate Next Hop and Ignore IGP Metric features are both needed.
## Cilium Egress design
When it comes to the Cilium Egress design, there are two options we can evaluate based on our requirements.
### Egress IP advertisement Over BGP (Preferred Option)
Cilium can advertise the Egress IP over BGP. This can be done easily by setting `advertisementType: EgressGateway` in the `IsovalentBGPAdvertisement` CRD.
We can then use ACI external EPGs to classify the egress traffic and apply contracts to it.
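As an illustration, a minimal `IsovalentBGPAdvertisement` sketch for this; the API group/version, the selector label, and the optional community value are assumptions to align with your own peer configuration:

```yaml
# Hedged sketch: advertise the Egress Gateway IPs over BGP.
apiVersion: isovalent.com/v1alpha1          # assumed enterprise API group/version
kind: IsovalentBGPAdvertisement
metadata:
  name: egress-gateway-ips
  labels:
    advertise: egress-gateway               # hypothetical label matched by the peer config
spec:
  advertisements:
  - advertisementType: EgressGateway
    attributes:
      communities:
        standard: ["65000:200"]             # optional, purely illustrative tagging
```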
If ACI external EPG scalability is an issue and a Firewall is required anyway, we recommend using a single External EPG matching on the whole `EgressGateway` subnet and leveraging Service Graph redirection to send the traffic to the Firewall.
This option keeps the design extremely simple and clean: all the nodes are identical and connect to ACI via a single L3OUT.
### Egress IP and ESGs
We can harness the capabilities of ACI Endpoint Security Groups (ESGs) to develop an efficient network design with the following structure:
* Dedicated ESGs for Egress Gateway Traffic: The nodes performing egress will be configured with an additional subnet that can then be classified into ESGs.
* Cilium Egress Gateway Policies: Implement Cilium Egress Gateway policies to associate specific namespaces with designated gateway nodes, each with a fixed egress IP address (see the sketch after this list). This mapping ensures consistent and predictable IP addresses for the outbound cluster traffic.
* ESG Classification on Egress IPs: Apply ESG classification to the egress IPs to streamline network management and policy enforcement, enhancing the security and control over outbound traffic at a namespace level.
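A minimal sketch of such a policy, shown here with the OSS `CiliumEgressGatewayPolicy` CRD for illustration; the namespace, node label, and egress IP are placeholders, and the enterprise egress gateway CRD can be substituted if that is what your deployment uses:

```yaml
# Hedged sketch: pin all pods of namespace-a to a fixed egress IP on a
# designated gateway node, so the IP can be classified into a dedicated ESG.
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: namespace-a-egress
spec:
  selectors:
  - podSelector:
      matchLabels:
        io.kubernetes.pod.namespace: namespace-a   # every pod in namespace-a
  destinationCIDRs:
  - "0.0.0.0/0"                                    # traffic leaving the cluster
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"                     # hypothetical label on the egress node
    egressIP: 192.168.100.10                       # fixed IP that the ESG classifies on
```

Because the egress IP is fixed and namespace-scoped, the ACI contract applied to its ESG effectively becomes a per-namespace policy for outbound traffic.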
It is important to note that this design specifically addresses traffic leaving the cluster. Internal cluster traffic will remain unaffected by these configurations. This ensures that while outbound traffic is tightly controlled and secured, cluster-local communications continue to operate without interruption.

Egress Gateway traffic flows
#### Egress Nodes Requirements
The Egress nodes will be configured with two interfaces. One interface for the node and a dedicated interface for egress:
* The node interface will be placed behind the L3Out to simplify node-to-node communication.
* It is not required for the `egress nodes` to establish BGP peering if they are only used for Egress traffic.
* The egress interface will be connected to an EPG and will be used for the egress gateway feature for POD initiated traffic.
For nodes with multiple interfaces it is fundamental to ensure that the kubelet’s node-ip is set correctly on each node. In this design this must be the interface placed behind the ACI L3Out.
Cilium does not have the ability to select which interface is used for pod to pod E/W routing and will use the kubelet’s node IP interface.
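As an illustration, on kubeadm-based clusters the node-ip can be pinned at join time; kubeadm itself, the API version, and all addresses below are assumptions, and other installers have their own way of passing `--node-ip` to the kubelet:

```yaml
# Hedged sketch (kubeadm assumed): pin the kubelet node-ip to the interface
# placed behind the ACI L3Out. Addresses and token are placeholders.
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    node-ip: "192.168.2.11"                        # IP on the L3Out-facing interface
discovery:
  bootstrapToken:
    apiServerEndpoint: "k8s-api.example.local:6443"
    token: "abcdef.0123456789abcdef"
    unsafeSkipCAVerification: true                 # illustration only; verify the CA in production
```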
##### Routing Considerations
By default, traffic received on the egress nodes from the EPG would be returned to the client via the L3Out interface, resulting in traffic drops.
To ensure return traffic is routed back to the EPG we can:
* the service IP pool
is going to use route table 100, thus ensuring that traffic will be sent back to the L3Out, which preserves routing symmetry.
Regardless of the design choice, the only other consideration is how many `egress nodes` to deploy and whether to dedicate them only for this purpose.
Ideally, the design should have a minimum of two `egress nodes` distributed between two pairs of leaves. This will provide redundancy in case of `egress node` or ACI leaf failure, or during upgrades.
Depending on the cluster scale and application requirements, dedicated `egress nodes` could be beneficial for the same reasons discussed for the `ingress nodes`.
{: .note }
A single egress node can be configured with multiple IP addresses, enabling it to support multiple POD identities. This configuration allows us to efficiently reuse the same node across different namespaces. For example, IP-A can be associated with Namespace A, while IP-B can be linked to Namespace B, and so forth.
## Design trade offs
This design aims to provide you with an easy and highly scalable design; however, there are some trade offs:
1. External services can only be advertised as “Cluster Scope”: this requirement is imposed by Maglev. This drawback is, however, of minor consequence thanks to Direct Server Return.
2. Potential bottlenecks for egress traffic
3. The node IP is "hidden" behind the L3Out, so there is a small loss of visibility compared to the Simplicity First design.
4. (Depending on the Egress design choice) Ingress nodes have two interfaces with different route tables, which adds additional complexity.
For issue (1) there is no solution. Issue (2) can be easily addressed with either vertical or horizontal scaling.