Active-active dual ToR link manager is an evolution of the active-standby dual ToR link manager. Both ToRs are expected to handle traffic in normal scenarios. For consistency, we will keep using the term "standby" to refer to inactive links or ToRs.
Rev | Date | Author | Change Description |
---|---|---|---|
0.1 | 05/23/22 | Jing Zhang | Initial version |
0.2 | 12/02/22 | Longxiang Lyu | Add Traffic Forwarding section |
0.3 | 12/08/22 | Longxiang Lyu | Add BGP update delay section |
0.4 | 12/13/22 | Longxiang Lyu | Add skip ACL section |
0.5 | 04/10/23 | Longxiang Lyu | Add command line section |
This document provides the high-level design of the SONiC dual ToR solution, supporting the active-active setup.
3 SONiC ToR Controlled Solution
- 3.1 IP Routing
- 3.2 DB Schema Changes
- 3.3 Linkmgrd
- 3.4 Orchagent
- 3.5 Transceiver Daemon
- 3.6 State Transition Flow
- 3.7 Traffic Forwarding
- 3.8 Further Enhancement
- 3.9 Command Line
Each row contains a number of racks; each rack has 2 ToRs, and each row has 8 Tier One (T1) network devices. Each server has a NIC connected to the 2 ToRs with 100 Gbps DAC cables.
In this design:
- Both upper ToR (labeled as UT0) and lower ToR (labeled as LT0) will advertise the same IP to upstream T1s; each T1 will see 2 available next hops for the VLAN.
- Both UT0 and LT0 are expected to carry traffic in normal scenarios.
- The software stack on the server host will see a 200 Gbps NIC.
In our cluster setup, as the smart y-cable is replaced, some complexity is transferred to the server NIC.
Note that this complexity can also be handled by active-active smart cables, or any other deployment, as long as it meets the requirements below.
- Server NIC is responsible for delivering southbound (tier 0 device to server) traffic from either uplink to applications running on the server host.
- ToRs present the same IP and same MAC to the server on both links.
- Server NIC is responsible for dispensing northbound (server to tier 0) traffic between the two active links at the IO stream (5-tuple) level. Each stream will be dispatched to one of the 2 uplinks until link state changes.
- Server should provide support for ToR to control traffic forwarding, and follow this control when dispensing traffic.
- gRPC is introduced for this requirement.
- Each ToR will have a well-known IP. Server NIC should dispatch gRPC replies towards these IPs to the corresponding uplinks.
- Server NIC should avoid sending traffic through unhealthy links when detecting a link state down.
- Server should replicate the following northbound traffic to both ToRs:
- Specified ICMP replies (for probing link health status)
- ARP propagation
- IPv6 router solicitation, neighbor solicitation and neighbor advertisements
Check the pseudocode below for details of the IO scheduling contract.

```
// gRPC Response
if (ethertype == IPv4 && DestIP == Loopback3_Port0_IPv4) or
   (ethertype == IPv6 && DestIP == Loopback3_Port0_IPv6) {
    if (Port0.LinkState == Up)
        Send to Port 0
    else
        Drop
} else if (ethertype == IPv4 && DestIP == Loopback3_Port1_IPv4) or
          (ethertype == IPv6 && DestIP == Loopback3_Port1_IPv6) {
    if (Port1.LinkState == Up)
        Send to Port 1
    else
        Drop
}
// ARP
else if (ethertype == ARP)
    Duplicate to both ports
// ICMP Heartbeat Probing
else if ((ethertype == IPv4 && DestIP == Loopback2_IPv4 && IPv4.Protocol == ICMP) or
         (ethertype == IPv6 && DestIP == Loopback2_IPv6 && IPv6.Protocol == ICMPv6))
    Duplicate to all active ports
// IPv6 router solicitation, neighbor solicitation and neighbor advertisements
else if (ethertype == IPv6 && IPv6.Protocol == ICMPv6 && ICMPv6.Type in [133, 135, 136])
    Duplicate to both ports
else if (gRPC status == "Port 0 disabled" || Port0.LinkState == Down)
    Send to Port 1
else if (gRPC status == "Port 1 disabled" || Port1.LinkState == Down)
    Send to Port 0
// Other Traffic
else
    Send packet on either port
```
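To make the contract concrete, here is a minimal Python sketch of the same dispatch logic. All names (`Packet`, the loopback/heartbeat IP values, the port indices) are illustrative assumptions, not part of any real NIC API; a real NIC would also hash the 5-tuple to pick a link per flow.

```python
# Illustrative sketch of the NIC IO scheduling contract above.
# All names and addresses are hypothetical.
from dataclasses import dataclass

GRPC_IPS = {"10.1.0.32": 0, "10.1.0.33": 1}   # Loopback3 IP -> port index (example values)
HEARTBEAT_IPS = {"10.1.0.34"}                 # Loopback2 IPs used for ICMP probing (example)
ND_TYPES = {133, 135, 136}                    # IPv6 RS, NS, NA

@dataclass
class Packet:
    ethertype: str          # "IPv4" | "IPv6" | "ARP"
    dst_ip: str = ""
    protocol: str = ""      # "ICMP" | "ICMPv6" | ...
    icmpv6_type: int = 0

def dispatch(pkt, link_up, admin_active):
    """Return the list of port indices the packet is sent to ([] == drop)."""
    # gRPC replies are pinned to the uplink that owns the destination loopback IP
    if pkt.dst_ip in GRPC_IPS:
        port = GRPC_IPS[pkt.dst_ip]
        return [port] if link_up[port] else []
    # ARP and IPv6 ND are duplicated to both ports
    if pkt.ethertype == "ARP":
        return [0, 1]
    if pkt.ethertype == "IPv6" and pkt.protocol == "ICMPv6" and pkt.icmpv6_type in ND_TYPES:
        return [0, 1]
    # ICMP heartbeats are duplicated to all up ports
    if pkt.dst_ip in HEARTBEAT_IPS and pkt.protocol in ("ICMP", "ICMPv6"):
        return [p for p in (0, 1) if link_up[p]]
    # Regular traffic avoids down links and links the ToR set to standby
    usable = [p for p in (0, 1) if link_up[p] and admin_active[p]]
    return usable[:1]  # pick one link per flow; real NICs hash the 5-tuple
```

For example, `dispatch(Packet("ARP"), [True, True], [True, True])` duplicates to both ports, while a gRPC reply toward a down uplink is dropped.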
- Introduce active-active mode into MUX state machine.
- Probe to determine if link is healthy or not.
- Signal NIC if ToR is switching active or standby.
- Rescue when peer ToR failure occurs.
- Unblock traffic when cable control channel is unreachable.
Both T0s are up and functioning and both the server NIC connections are up and functioning.
- Control Plane

  UT0 and LT0 will advertise the same VLAN (IPv4 and IPv6) to upstream T1s. Each T1 will see 2 available next hops for the VLAN. T1s advertise to T2s as normal.
- Data Plane
- Traffic to the server
- Traffic lands on any of the T1s by ECMP from T2s.
- T1 forwards traffic to either of the T0s by ECMP.
- T0 sends the traffic to the server and NIC delivers traffic up the stack.
- Traffic from the server to outside the cluster
- NIC determines which link to use and sends all the packets on a flow using the same link.
- T0 sends the traffic to the T1 by ECMP.
- Traffic from the server to within the cluster
- NIC determines which link to use and sends all the packets on a flow using the same link.
- T0 sends the traffic to destination server if T0 has learnt the MAC address of the destination server.
Both T0s are up and functioning, and some server NICs are connected to only 1 ToR (due to a cable issue, or the cable being taken out for maintenance).
- Control Plane

  No change from the normal case.
- Data Plane
- Traffic to the server
- Traffic lands on any of the T1s by ECMP from T2s.
- T1 forwards traffic to either of the T0s by ECMP.
- If T0 does not have the downlink to the server, T0 will send the traffic to the peer T0 over IPinIP encap via T1s.
- T0 sends the traffic to the server and NIC delivers traffic up the stack.
- Traffic from the server to outside the cluster
- T0 will signal to NIC which side to use.
- NIC determines which link to use and sends all the packets on a flow using the same link. If server NIC has only 1 connection up, all traffic will be on this connection.
- T0 sends the traffic to the T1 by ECMP
- Traffic from the server to within the cluster
- T0 will signal to NIC which side to use.
- NIC determines which link to use and sends all the packets on a flow using the same link. If the server NIC has only 1 connection up, all traffic will be on this connection.
- If T0 does not have the downlink to the server, T0 will send the traffic to the peer T0 over IPinIP encap via T1s.
- T0 sends the traffic to the server.
Only 1 T0 is up and functioning.
- Control Plane

  Only 1 T0 will advertise the VLAN (IPv4 and IPv6) to upstream T1s.
- Data Plane
- Traffic to the server
- Traffic lands on any of the T1s by ECMP from T2s.
- T1 forwards traffic to either of the T0s by ECMP. If one T0 is down, T1 forwards traffic to the healthy one.
- T0 sends the traffic to the server.
- Traffic from the server to outside the cluster
- T0 will signal to NIC which side to use.
- T0 sends the traffic to the T1 by ECMP.
- Traffic from the server to within the cluster
- T0 will signal to NIC which side to use.
- T0 sends the traffic to the server.
Highlights of commonalities and differences with Active-Standby:
 | Active-Standby | Active-Active | Implication |
---|---|---|---|
Server uplink view | Single IP, single MAC | ||
Standby side receive traffic | Forward it to active ToR through IPinIP tunnel via T1 | ||
T0 to T1 control plane | Advertise same set of routes | ||
T1 to T0 Traffic | ECMP | ||
Southbound traffic | From either side | ||
Northbound traffic | All is duplicated to both ToRs. | NIC determines which side to forward the traffic. | Orchagent doesn't need to drop packets on standby side. |
Bandwidth | Up to 1 link | Up to 2 links | T1 and above devices see more throughput from server. |
Cable Control | I2C | gRPC over DAC cables | Control plane and data plane now share the same link. |
- New field in `MUX_CABLE` table to determine cable type

```
MUX_CABLE|PORTNAME:
    cable_type: active-standby|active-active
```

- New table to invoke transceiver daemon to query server side forwarding state

```
FORWARDING_STATE_COMMAND|PORTNAME:
    command: probe|set_active_self|set_standby_self|set_standby_peer

FORWARDING_STATE_RESPONSE|PORTNAME:
    response: active|standby|unknown|error
    response_peer: active|standby|unknown|error
```

- New table for transceiver daemon to write peer link state to linkmgrd

```
PORT_TABLE_PEER|PORTNAME:
    oper_status: up|down
```

- New table to invoke transceiver daemon to set peer's server side forwarding state

```
HW_FORWARDING_STATE_PEER|PORTNAME:
    state: active|standby|unknown
```

- New table for transceiver daemon to write peer's server side forwarding state to linkmgrd

```
HW_MUX_CABLE_TABLE_PEER|PORTNAME:
    state: active|standby|unknown
```
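The schema can be visualized as plain key-to-hash mappings. The following sketch builds a few example entries as Python dicts and validates the allowed values; the helper functions and the port name are illustrative, not part of any SONiC library.

```python
# Illustrative: build DB entries as key -> {field: value} hashes mirroring
# the schema above. Helper names and port values are hypothetical.
def mux_cable_entry(port, cable_type):
    assert cable_type in ("active-standby", "active-active")
    return {f"MUX_CABLE|{port}": {"cable_type": cable_type}}

def forwarding_state_command(port, command):
    assert command in ("probe", "set_active_self", "set_standby_self", "set_standby_peer")
    return {f"FORWARDING_STATE_COMMAND|{port}": {"command": command}}

def forwarding_state_response(port, response, response_peer):
    allowed = ("active", "standby", "unknown", "error")
    assert response in allowed and response_peer in allowed
    return {f"FORWARDING_STATE_RESPONSE|{port}": {
        "response": response, "response_peer": response_peer}}

db = {}
db.update(mux_cable_entry("Ethernet4", "active-active"))
db.update(forwarding_state_command("Ethernet4", "probe"))
db.update(forwarding_state_response("Ethernet4", "active", "standby"))
```

In SONiC these hashes would live in Redis (config DB, app DB, or state DB); the dict here just illustrates the key and field layout.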
Linkmgrd determines a ToR's / link's readiness for use.
Linkmgrd will keep the link prober design from active-standby mode for monitoring link health status. Link prober will send ICMP packets and listen to ICMP response packets. ICMP packets will contain payload information about the ToR. ICMP replies will be duplicated to both ToRs from the server, hence a ToR can monitor the health status of its peer ToR as well.
Link Prober will report 4 possible states:
- LinkProberUnknown: Serves as the initial state. This state is also reached when no ICMP reply is received.
- LinkProberActive: Indicates that linkmgrd receives ICMP replies containing the ID of the current ToR.
- LinkProberPeerUnknown: Indicates that linkmgrd did not receive ICMP replies containing the ID of the peer ToR. Hence, there is a chance that the peer ToR's link is currently down.
- LinkProberPeerActive: Indicates that linkmgrd receives ICMP replies containing the ID of the peer ToR; in other words, the peer ToR's links appear to be active.
By default, the heartbeat probing interval is 100 ms. It takes 3 consecutive lost link prober packets to determine that a link is unhealthy. A server issue can also cause link prober packet loss, but the ToR won't distinguish it from a link issue.
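The loss-detection rule above can be sketched as a simple counter. The class below is illustrative (the interval and threshold come from the text; the class itself is not linkmgrd code):

```python
# Illustrative sketch of the link prober's loss detection: a link is declared
# unhealthy after 3 consecutive missed heartbeats (default interval 100 ms).
class LinkProber:
    PROBE_INTERVAL_MS = 100
    LOSS_THRESHOLD = 3

    def __init__(self):
        self.missed = 0
        self.healthy = True

    def on_probe_result(self, reply_received: bool) -> bool:
        if reply_received:
            self.missed = 0
            self.healthy = True
        else:
            self.missed += 1
            if self.missed >= self.LOSS_THRESHOLD:
                # A server-side issue is indistinguishable from a link issue here
                self.healthy = False
        return self.healthy
```

With the defaults, a link is declared unhealthy roughly 300 ms after the last received heartbeat reply.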
ICMP Probing Format
The source MAC will be the ToR's SVI MAC address. The Ethernet destination will be the well-known MAC address. The source IP will be the ToR's Loopback IP; the destination IP will be the SoC's IP address, which is introduced as a field in minigraph.
Linkmgrd also adopts TLV (Type-Length-Value) as the encoding schema in the payload for additional information elements, including cookie, version, ToR GUID, etc.
When a link is down, linkmgrd will receive a notification from SWSS based on the kernel message from netlink. This notification is used to determine if the ToR is healthy.
Admin Forwarding State
ToRs will signal the NIC whether the link is active / standby; we will call this active / standby state the admin forwarding state. It is up to the NIC to determine which link to use if both are active, but it should never choose a standby link. This logic gives the ToR more control over traffic forwarding.
Operational Forwarding State
The server side should maintain an operational forwarding state as well. When a link is down, the admin forwarding state will eventually be updated to standby. But before that, if the server side detects link down, it should stop sending traffic through this link even if the admin state is active. In this way, we ensure the ToRs have control over traffic forwarding, and also guarantee immediate reaction when link state is down.
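The combination of the two states can be sketched as a single predicate (a hypothetical helper, not a real NIC API): a link carries traffic only when the ToR-set admin state is active AND the locally observed link is up.

```python
# Illustrative: a link is usable only if the ToR-controlled admin state is
# "active" AND the NIC's operational view says the link is up.
def link_usable(admin_state: str, link_up: bool) -> bool:
    return admin_state == "active" and link_up
```

This keeps the ToR in control (admin state) while letting the NIC react immediately to a local link-down event (operational state).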
Active-active state transition logic is simplified compared to active-standby. In active-standby, linkmgrd makes mux toggle decisions based on the y-cable direction, while for active-active, the two links are more independent. Linkmgrd will only make state transition decisions based on health indicators.
To be more specific, if the link prober indicates active AND the link state appears to be up, linkmgrd should determine the link's forwarding state as active; otherwise, it should be standby.
Linkmgrd also provides a rescue mechanism when the peer can't switch to standby for some reason, e.g. link failures. If the link prober doesn't receive the peer's heartbeat response AND the self ToR is in a healthy active state, linkmgrd should determine the peer link to be standby.
When the control channel is unreachable, the ToR won't block traffic forwarding, but it will periodically check the gRPC server's health. It will make sure the server side's admin forwarding state aligns with linkmgrd's decision.
If the default route to T1 is missing, the dual ToR system can suffer from northbound packet loss, hence linkmgrd also monitors default route state. If the default route is missing, linkmgrd will stop sending ICMP probing requests and fake an unhealthy status. This functionality can be disabled as well; the details are included in default_route.
To summarize the state transition decisions discussed above, and the corresponding gRPC actions to take, we have the decision table below:

Default Route to T1 | Link State | Link Prober (Self) | Link Prober (Peer) | Link Manager State | gRPC Action (Self) | gRPC Action (Peer)
---|---|---|---|---|---|---
Available | Up | Active | Active | Active | Set to active | No-op
Available | Up | Active | Unknown | Active | Set to active | Set to standby
Available | Up | Unknown | * | Standby | Set to standby | No-op
Available | Down | * | * | Standby | Set to standby | No-op
Missing | * | * | * | Standby | Set to standby | No-op
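The decision logic can also be expressed as a pure function. This is a sketch with hypothetical names; it mirrors the rules described above: self goes standby when the default route is missing, the link is down, or self probing is unknown, and the peer is forced to standby only when self is healthy active while the peer's heartbeats are lost (the rescue case).

```python
# Sketch of linkmgrd's forwarding-state decision. Names and values illustrative.
def decide(default_route: str, link_state: str, prober_self: str, prober_peer: str):
    """Return (self_state, grpc_action_self, grpc_action_peer)."""
    # Any local health failure forces self to standby; the peer is left alone.
    if default_route != "available" or link_state != "up" or prober_self != "active":
        return ("standby", "set_standby", "no-op")
    # Self is healthy active and the peer's heartbeats are seen: nothing to fix.
    if prober_peer == "active":
        return ("active", "set_active", "no-op")
    # Rescue: self is healthy active but the peer's heartbeats are lost.
    return ("active", "set_active", "set_standby")
```

For example, `decide("missing", "up", "active", "active")` yields `("standby", "set_standby", "no-op")`, matching the last row of the table.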
- Link Prober Packet Loss Statistics

  The link prober will by default send a heartbeat packet every 100 ms; the packet loss statistics can be a good measure of system health. An incremental feature is to collect the packet loss counts, start time, and end time. The collected data is stored and updated in state DB. Users can check and reset it through the CLI.
- Support for Detachment

  Users can configure linkmgrd to a certain mode so it won't switch to active / standby based on health indicators. Users can also configure linkmgrd to a mode so it won't modify the peer's forwarding state. This support will be useful for maintenance, upgrade, and testing scenarios.
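The packet loss statistics feature could be sketched as a small accumulator. This class is illustrative only; in the real feature these fields (loss count, start time, end time) live in state DB and are reset via the CLI.

```python
import time

# Illustrative accumulator for link prober packet loss statistics:
# loss count, collection start time, and last-loss time, resettable by a user.
class LossStats:
    def __init__(self, now=None):
        self.start_time = now if now is not None else time.time()
        self.end_time = self.start_time
        self.loss_count = 0

    def record_loss(self, now=None):
        self.loss_count += 1
        self.end_time = now if now is not None else time.time()

    def reset(self, now=None):
        self.__init__(now)
```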
Orchagent will create the tunnel at initialization and add / remove routes to forward traffic to the peer ToR via this tunnel when linkmgrd switches state to standby / active.
Check below for an example of the config DB entry and tunnel utilization when LT0's link is having issues.
Major components of Orchagent for this IPinIP tunnel are MuxCfgOrch, TunnelOrch, and MuxOrch.
- MuxCfgOrch

  MuxCfgOrch listens to config DB entries to populate the port to server IP mapping to MuxOrch.
- TunnelOrch

  TunnelOrch will subscribe to the `MUX_TUNNEL` table and create the tunnel, tunnel termination, and decap entry. This tunnel object is created at initialization, and is used as the nexthop object by MuxOrch for programming routes via SAI_NEXT_HOP_TYPE_TUNNEL_ENCAP.
- MuxOrch

  MuxOrch will listen to state changes from linkmgrd and, at a high level, do the following:
  - Enable / disable neighbor entries.
  - Add / remove tunnel routes.
In the active-active design, we will use gRPC for cable control and to signal the NIC whether the ToRs are active. The SoC will run a gRPC server. Linkmgrd will determine the server side forwarding state based on link prober status and link state. Then linkmgrd can invoke the transceiver daemon to update the NIC on whether the ToRs are active or not through gRPC calls.
Currently defined gRPC services between SoC and ToRs related to linkmgrd cable control:
- DualToRActive
- Query forwarding state of ports for both peer and self ToR;
- Query server side link state of ports for both peer and self ToR;
- Set forwarding states of ports for both peer and self ToR;
- GracefulRestart
- Shutdown / restart notification from SoC to ToR.
The following UML sequence illustrates the state transition when the linkmgrd state moves to active. The flow will be similar for moving to standby.
The following shows the traffic forwarding behaviors:
- Both ToRs are active.
- One ToR is active while the other ToR is standby.
There is a scenario where, if the upper ToR enters standby when its peer (the lower ToR) is already in standby state, all downstream I/O from the upper ToR will be forwarded through the tunnel to the peer ToR (the lower ToR), as will the control plane gRPC traffic from the transceiver daemon. As the lower ToR is in standby, that tunneled I/O will be blackholed, and the NIC will never know that the upper ToR has entered standby in this case.
To solve this issue, we want the control plane gRPC traffic from the transceiver daemon to be forwarded directly via the local devices. This differentiates the control plane traffic to the NIC IPs from dataplane traffic, whose forwarding behavior honors the mux state and is forwarded to the peer active ToR via the tunnel when the port goes to standby.
The following shows the traffic forwarding behavior when the lower ToR is active while the upper ToR is standby. Now, gRPC traffic from the standby ToR (upper ToR) is forwarded to the NIC directly. The downstream dataplane traffic to the upper ToR is directed to the tunnel to the active lower ToR.
When orchagent is notified to change to standby, it will re-program both the ASIC and the kernel to let both control plane and data plane traffic be forwarded via the tunnel. To achieve the design proposed above, MuxOrch will now be changed to skip notifying Tunnelmgrd if the neighbor address is the NIC IP address, so Tunnelmgrd will not re-program the kernel route in this case and the gRPC traffic to the NIC IP address from the transceiver daemon will be forwarded directly.
The following UML diagram shows this change when Linkmgrd state moves to standby:
The current failover strategy can smoothly handle link failure cases, but if one of the ToRs crashes and T1 still sends traffic to the crashed ToR, we will see packet loss.
A further improvement in the rescue scenario is that, when detecting the peer's unhealthy status, the local ToR advertises more specific routes (i.e. a longer prefix), so that traffic from T1 doesn't go to the crashed ToR at all.
For server graceful restart, we already have the gRPC service defined in 3.5.1. An indicator of ongoing server servicing should be defined based on that notification, so the ToR can avoid upgrades in the meantime. Vice versa, we can also define gRPC APIs to notify the server when a ToR upgrade is ongoing.
When the BGP neighbors are started on an active-active T0 switch, the T0 will try to establish BGP sessions with its connected T1 switches. After the BGP sessions are established, the T0 will exchange routes with those T1s. T1 switches usually have more routes than the T0, so T1 switches take more time to process outbound routes before sending updates. The consequence is that, after BGP session establishment, T1 switches could receive BGP updates from the T0 before the T0 receives any BGP updates from the T1s. There will be a period in which those T1s have routes learnt from the T0 but the T0 has no routes learnt from the T1s (the T0 has no default routes). In this period, those T1s could send downstream traffic to this T0; as stated in 3.3.5, the T0 is still in standby state, so it will try to forward the traffic via the tunnel. As the T0 has no default route in this period, that traffic will be blackholed.
So for active-active T0s, a BGP update delay of 10 seconds is introduced to the BGP configuration to postpone sending BGP updates after BGP session establishment. In this case, the T0 can learn routes from the T1s before the T1s learn any routes from the T0. So by the time a T1 can send any downstream traffic to the T0, the T0 will have its default routes ready.
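If the routing stack is FRR, the delay could look like the following fragment. This is illustrative only: the exact `update-delay` syntax varies across FRR versions, the ASN is made up, and on SONiC such configuration would be generated from templates rather than written by hand.

```
router bgp 65100
 ! Illustrative: postpone sending BGP updates for 10 s after session
 ! establishment, giving the T0 time to learn default routes from the T1s.
 update-delay 10
```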
Previously, at a high level, when the mux port came to standby, MuxOrch added an ingress ACL to drop packets on the mux port, and when the mux port came to active, MuxOrch removed the ingress ACL. As described in [3.6], MuxOrch acts as an intermediate agent between linkmgrd and the transceiver daemon. Before the NIC receives the gRPC request to toggle to standby, the ingress drop ACL has already been programmed by MuxOrch. In this period, the server NIC still regards this ToR as active and could send upstream traffic to this ToR, but that upstream traffic would be dropped by the installed ingress drop ACL rule.
A change to skip the installation of the ingress drop ACL rule when toggling to standby is introduced to forward the upstream traffic on a best-effort basis. Though the mux port is already in the standby state in this period, the removal of the ingress drop ACL allows the upstream traffic to reach the ToR and possibly be forwarded by the ToR.
This part only covers the command lines and options for active-active dualtor.
`show mux status` returns the mux status for mux ports:
- `PORT`: mux port name
- `STATUS`: current mux status, could be either `active` or `standby`
- `SERVER_STATUS`: the mux status read from the mux server as the result of the last toggle
  - `active`: mux server returned `active` as the result of the last toggle to `active`
  - `standby`: mux server returned `standby` as the result of the last toggle to `standby`
  - `unknown`: the last toggle failed to switch the mux server status, or failed to read the status from the mux server
  - `error`: the last toggle failed to switch the orchagent status
- `HEALTH`: mux port health
  - `healthy`: the ToR could receive link probe replies from the mux server; the following conditions must all be satisfied for a mux port to be `healthy`:
    - port status is `up`
    - could receive replies for self link probes
    - current mux status (`STATUS`) matches server status (`SERVER_STATUS`), or server status is `unknown`
    - default route to T1s is present
  - `unhealthy`: any of the above `healthy` conditions is broken
- `HWSTATUS`: checks if the current mux status matches the server status
  - `consistent`: `STATUS` matches `SERVER_STATUS`
  - `inconsistent`: `STATUS` doesn't match `SERVER_STATUS`
  - `absent`: `SERVER_STATUS` is not present
- `LAST_SWITCHOVER_TIME`: last switchover timestamp
```
$ show mux status
PORT        STATUS    SERVER_STATUS    HEALTH    HWSTATUS    LAST_SWITCHOVER_TIME
----------  --------  ---------------  --------  ----------  ---------------------------
Ethernet4   active    active           healthy   consistent  2023-Mar-27 07:57:43.314674
Ethernet8   active    active           healthy   consistent  2023-Mar-27 07:59:33.227819
```
`show mux config` returns the mux configurations:
- `SWITCH_NAME`: peer switch hostname
- `PEER_TOR`: peer switch loopback address
- `PORT`: mux port name
- `state`: mux mode configuration
  - `auto`: enable failover logic for both self and peer
  - `manual`: disable failover logic for both self and peer
  - `active`: if the current mux status is not `active`, toggle the mux to `active` once, then work in `manual` mode
  - `standby`: if the current mux status is not `standby`, toggle the mux to `standby` once, then work in `manual` mode
  - `detach`: enable failover logic only for self
- `ipv4`: mux server ipv4 address
- `ipv6`: mux server ipv6 address
- `cable_type`: mux cable type, `active-active` for active-active dualtor
- `soc_ipv4`: SoC ipv4 address
```
$ show mux config
SWITCH_NAME        PEER_TOR
-----------------  ----------
lab-switch-2       10.1.0.33

port        state    ipv4             ipv6               cable_type     soc_ipv4
----------  -------  ---------------  -----------------  -------------  ---------------
Ethernet4   auto     192.168.0.2/32   fc02:1000::2/128   active-active  192.168.0.3/32
Ethernet8   auto     192.168.0.4/32   fc02:1000::4/128   active-active  192.168.0.5/32
```
`show mux tunnel-route` returns tunnel routes that have been created for mux ports.
For each mux port, there can be 3 entries: `server_ipv4`, `server_ipv6`, and `soc_ipv4`. For each entry, if the tunnel route is created in `kernel` or `asic`, you will see `added` in the command output; if not, you will see `-`. If no tunnel route is created for any of the 3 entries, the mux port won't show in the command output.
- Usage:

```
show mux tunnel-route [OPTIONS] <port_name>
show muxcable tunnel-route <port_name>
```

- Options:

```
--json    display the output in json format
```
- Example

```
$ show mux tunnel-route Ethernet44
PORT        DEST_TYPE    DEST_ADDRESS       kernel    asic
----------  -----------  -----------------  --------  ------
Ethernet44  server_ipv4  192.168.0.22/32    added     added
Ethernet44  server_ipv6  fc02:1000::16/128  added     added
Ethernet44  soc_ipv4     192.168.0.23/32    -         added
```
`config mux mode` configures the operational mux mode for the specified port.

```
# config mux mode <operation_status> <port_name>
```

The argument `<operation_status>` is chosen from: `active`, `auto`, `manual`, `standby`, `detach`.
TBD