EVPN VxLAN Multihoming HLD #1622
The following are functional requirements for EVPN VxLAN Multihoming:

1. Support All-EVPN based active-active access redundancy with up to 4 VTEPs.
Is "up to 4 VTEPs" just a software limitation, or an ASIC hardware limitation?
The software data structures need to have some fixed size; the limit is primarily for that. It is not a hardware limitation.
In the example output, L2 NHID group 536870913 has two member nexthops - 268435458 and 268435459 - which resolve to single path VTEP nexthops.
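The relationship in the example output can be pictured as two small lookup tables. The following is a minimal sketch, not the actual FRR or SONiC data structures: the group and member IDs come from the example output, while the VTEP addresses, the function names, and the use of CRC32 as a stand-in for the ASIC hash are assumptions made for illustration.

```python
import zlib

# Single-path L2 nexthops: NHID -> remote VTEP IP (addresses are hypothetical).
L2_NEXTHOPS = {
    268435458: "10.0.0.1",
    268435459: "10.0.0.4",
}

# L2 nexthop group: group NHID -> member NHIDs (IDs taken from the example output).
L2_NH_GROUPS = {
    536870913: [268435458, 268435459],
}


def resolve_group(group_id):
    """Return the VTEP IPs that the members of an L2 NHID group resolve to."""
    return [L2_NEXTHOPS[nhid] for nhid in L2_NH_GROUPS[group_id]]


def pick_vtep(group_id, flow_key):
    """Pick one member VTEP for a flow; CRC32 stands in for the ASIC hash."""
    members = resolve_group(group_id)
    return members[zlib.crc32(flow_key.encode()) % len(members)]
```

A given flow key always maps to the same member, so packets of one flow stay on one VTEP while different flows spread across the group.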
#### 2.2.4.2 BUM traffic handling
Which component is responsible for installing the tc rules in the kernel?
Ideally it should be FRR (zebra).
# 3 Design

## 3.1 Overview
Suggest adding an architecture diagram to help explain the software design.
- Known unicast traffic with Destination MACs pointing to the above bridgeport type will be load balanced among the remote VTEPs.

```
typedef enum _sai_l2_ecmp_group_attr_t
```
It is noted that the L2 ECMP group has almost the same attributes as sainexthopgroup.
Why not reuse the sainexthopgroup object and add an attribute to distinguish L2 and L3 ECMP groups?
@philo-micas, yes, that makes sense. The SAI subsection is updated.
Achieve sub-second convergence in the following failure scenarios:
1. LAG link down
2. LAG link up
3. VTEP restart
How is sub-second convergence expected for a VTEP restart? Is Fastboot being referred to?
Sub-second convergence is out of scope for this design. Removed this section.
![](images/evpn_mh_unicast_forwarding.PNG)

In the diagram above, the H2 host MAC address H2-MAC is advertised by Vtep-1 and/or Vtep-4 in an EVPN Type-2 route that also carries the Ethernet-Segment identifier (ES-ID) of PortChannel1. On Vtep-5, H2-MAC undergoes special handling because the Type-2 route carries an ES-ID: H2-MAC is installed against the L2 NHG corresponding to the ES-ID. Even if the Type-2 route for H2-MAC is received from only one of Vtep-1 or Vtep-4, the traffic towards H2 will still be load-balanced between Vtep-1 and Vtep-4. Later, if the PortChannel1 interface goes down on, say, Vtep-1, Vtep-1 will withdraw the Type-1 (AD-per-ES) route, and this will result in Vtep-5 updating the L2 NHG for the ES-ID. This re-balancing of traffic does not wait for the individual MAC route updates to arrive and be processed. This update of the L2 NHG on Multihomed ES link failures is referred to as "Fast Convergence" in RFC 7432.
Wouldn't a MAC-to-ESI association in SAI be needed for mass withdrawal?
ESI-based MACs will be installed with an L2NHID. If the remote ESI membership changes, the L2NHID will be updated in the hardware. There is no need for SAI to be aware of the ESI or of a MAC-to-ESI association.
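The indirection described in this reply can be sketched as a toy model in which MAC entries hold only an L2 NHG ID and the VTEP set lives in the group. This is an illustrative sketch under stated assumptions: the class name, method names, group ID, and VTEP addresses are all hypothetical and do not correspond to SAI or orchagent objects.

```python
class L2ForwardingState:
    """Toy model: MACs reference an L2 NHG ID; the group holds the VTEP set."""

    def __init__(self):
        self.groups = {}     # L2 NHG ID -> set of remote VTEP IPs
        self.mac_table = {}  # MAC -> L2 NHG ID

    def add_es_member(self, nhg_id, vtep):
        """A VTEP advertised a Type-1 (AD-per-ES) route for this ES."""
        self.groups.setdefault(nhg_id, set()).add(vtep)

    def withdraw_es_member(self, nhg_id, vtep):
        """A VTEP withdrew its Type-1 route: only the group is touched."""
        self.groups.get(nhg_id, set()).discard(vtep)

    def install_mac(self, mac, nhg_id):
        self.mac_table[mac] = nhg_id

    def vteps_for_mac(self, mac):
        """Resolve a MAC to its current set of VTEPs through the group."""
        return sorted(self.groups[self.mac_table[mac]])


# Hypothetical usage: two VTEPs share one ES, then one of them withdraws.
state = L2ForwardingState()
state.add_es_member(1, "10.0.0.1")
state.add_es_member(1, "10.0.0.4")
state.install_mac("00:00:00:00:00:02", 1)
state.withdraw_es_member(1, "10.0.0.1")
```

After the withdraw, every MAC that points at group 1 resolves to the remaining VTEP without any per-MAC update, which is why no MAC-to-ESI mapping is needed below the group level in this model.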
 * @default false
 * @validonly SAI_BRIDGE_PORT_ATTR_TYPE == SAI_BRIDGE_PORT_TYPE_PORT
 */
SAI_BRIDGE_PORT_ATTR_NON_DF,
Is SAI_BRIDGE_PORT_ATTR_NON_DF for network-facing ports or access-facing ports?
It is for access-facing ports (ESI-configured interfaces).
How does SAI know whether a port is part of an Ethernet segment or not? Do we need one more attribute to define that? I think SAI can't depend on the SAI_BRIDGE_PORT_ATTR_NON_DF attribute, since it is "false" both for Ethernet-segment ports and for single-homed switches.
My second doubt is: does this "default value of false" cause any micro-loops in the network?
Consider a case for the PO2 configuration:
- When an access port with an Ethernet segment enabled is added to a VLAN, the bridge port is created with SAI_BRIDGE_PORT_ATTR_NON_DF == false, since DF election is still in progress.
- The above condition is the same on all 4 VTEPs. Now all the nodes which are part of PO2 (VTEP-1, 2, 3, 4) will start flooding all the BUM traffic.
- After the DF election is completed, the NON-DF nodes will be updated with SAI_BRIDGE_PORT_ATTR_NON_DF = true.
- Until then, traffic is flooded by all VTEPs. Will this momentarily flooded traffic not cause transient loops in the network?
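The window described in this scenario can be modelled with a few lines. This is only a sketch of the concern, assuming that a bridge port floods BUM traffic onto the Ethernet segment unless its NON_DF attribute is true, and arbitrarily assuming that VTEP-1 wins the DF election; the function and variable names are hypothetical.

```python
def floods_bum_to_access(non_df):
    """Assumed behaviour: a bridge port floods BUM traffic unless marked non-DF."""
    return not non_df


def flooding_vteps(non_df_by_vtep):
    """Return the VTEPs that currently flood BUM traffic onto the shared ES."""
    return sorted(
        vtep for vtep, non_df in non_df_by_vtep.items()
        if floods_bum_to_access(non_df)
    )


VTEPS = ["VTEP-1", "VTEP-2", "VTEP-3", "VTEP-4"]

# Before DF election completes: every bridge port still has the default (false).
before_election = {vtep: False for vtep in VTEPS}

# After DF election: assume VTEP-1 wins; the others are programmed as non-DF.
after_election = {vtep: vtep != "VTEP-1" for vtep in VTEPS}
```

In this model `flooding_vteps(before_election)` lists all four VTEPs, while `flooding_vteps(after_election)` lists only the elected one, which is the transient duplication window the question is about.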
### 1.1.3 Platform requirements
Support EVPN Multihoming on platforms having VxLAN capabilities.

### 1.1.4 Scalability Requirements
The solution is vendor agnostic; it is difficult to claim scaling performance and convergence in isolation from platform details and ASIC capabilities. Please remove sections 1.1.4 and 1.1.5, and section 8.
Agreed. Removed.
# 9 Unit Test
EVPN MH will require separate test plan documentation; please move this to a test plan submission.
Agreed. Removed from here for now.
# 8 Scalability
To be removed for the same reason stated in section 1
Agreed. Removed.
![image](https://github.com/hasan-brcm/SONiC/assets/56742004/4bd442a9-5275-4fb1-8ccf-a46083c57bf9)

In the diagram above, the H2 host MAC address H2-MAC is advertised by Vtep-1 and/or Vtep-4 in an EVPN Type-2 route that also carries the Ethernet-Segment identifier (ES-ID) of PortChannel1. On Vtep-5, H2-MAC undergoes special handling because the Type-2 route carries an ES-ID: H2-MAC is installed against the L2 NHG corresponding to the ES-ID. Even if the Type-2 route for H2-MAC is received from only one of Vtep-1 or Vtep-4, the traffic towards H2 will still be load-balanced between Vtep-1 and Vtep-4. Later, if the PortChannel1 interface goes down on, say, Vtep-1, Vtep-1 will withdraw the Type-1 (AD-per-ES) route, and this will result in Vtep-5 updating the L2 NHG for the ES-ID. This re-balancing of traffic does not wait for the individual MAC route updates to arrive and be processed. This update of the L2 NHG on Multihomed ES link failures is referred to as "Fast Convergence" in RFC 7432.
After Vtep-4 receives this update, it installs the MAC against Po1 and also advertises a Type-2 route for the same MAC with Proxy=1. Suppose Vtep-4 then receives the MAC from H2 locally. In this case, would Vtep-4 update the proxy bit towards both the remote VTEPs and Vtep-1?
If the MAC is learnt locally on Vtep-4, the Type-2 route will be re-advertised by Vtep-4 with proxy=0.
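The proxy-bit transition discussed in this exchange can be sketched as a small state holder. This is a minimal illustration of the behaviour stated in the thread (Proxy=1 while the MAC is known only through the ES peer, proxy=0 once it is learnt locally); the class and method names are hypothetical and do not reflect FRR's implementation.

```python
from dataclasses import dataclass


@dataclass
class EsMacEntry:
    """A MAC on a multihomed ES as seen by one VTEP (hypothetical model)."""

    mac: str
    local_active: bool = False  # True once the MAC is learnt on the local ES link

    @property
    def proxy(self):
        """Proxy flag to carry in this VTEP's Type-2 route for the MAC."""
        return 0 if self.local_active else 1

    def on_local_learn(self):
        """Record a local learn; return True if the route must be re-advertised."""
        changed = not self.local_active
        self.local_active = True
        return changed
```

In this sketch, an entry created from the ES peer's route starts with `proxy == 1`; the first local learn flips it to 0 and signals a re-advertisement, and repeated local learns do not trigger further updates.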
The MCLAG configuration will not be allowed if ESI is configured on one or more interface(s).

### 2.2.8 ARP/ND suppression
ARP/ND suppression will be supported in EVPN Multihoming scenarios. The VTEP will respond to the ARP/ND requests received on local access ports only for the ARP/ND installed against remote VTEPs. ARP/ND response will not be generated by the VTEP if ARP/ND is installed on the local Multihomed ES, even though the ARP/ND learning happened on the remote VTEP multihoming a given ES.
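The suppression rule above reduces to a check on where the neighbor entry is installed. The following is a minimal sketch under stated assumptions: the table layout, the location labels (`remote_vtep`, `local_es`), the sample addresses, and the treatment of unknown targets (no proxy reply) are all hypothetical and not taken from the HLD.

```python
REMOTE_VTEP = "remote_vtep"  # neighbor installed against a remote VTEP
LOCAL_ES = "local_es"        # neighbor installed on the local multihomed ES


def should_proxy_reply(neighbor_table, target_ip):
    """Return True if the VTEP should answer the ARP/ND request itself."""
    location = neighbor_table.get(target_ip)
    # Only neighbors installed against remote VTEPs are answered locally.
    # Entries on the local multihomed ES, and unknown targets, are not.
    return location == REMOTE_VTEP


# Hypothetical neighbor table on one VTEP.
neighbors = {
    "192.0.2.10": REMOTE_VTEP,
    "192.0.2.20": LOCAL_ES,
}
```

With this table, a request for 192.0.2.10 would be answered by the VTEP, while a request for 192.0.2.20 would not, even if that entry was originally learnt by the peer VTEP that multihomes the same ES.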
Similar to my previous question: initially, VTEP1 learns the ARP. But later, the ARP packets go to VTEP4 due to a hash change locally on the host. In this case, how would VTEP4 handle these ARP packets?
ARP rides on the MAC. If the MAC is locally active on Vtep-4, the corresponding Type-2 routes will be advertised with proxy=0.
After VTEP1 learns ARPs, it would use BGP to send this Type-2 route to VTEP4. How does VTEP4 install this ARP entry in the kernel, given that the entry should not time out unless VTEP1 withdraws it?
Once VTEP4 learns the same ARP entry locally, would the kernel be able to flip the previously remote-learnt ARP entry to a local entry? If the local ARP entry later times out, which module detects it and reinstalls the remote-learnt ARP entry?
Converged HLD posted as #1702, closing this PR.
First draft of EVPN VxLAN Multihoming high level design.