Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inconsistent mux state #92

Merged
merged 2 commits into from
Jun 28, 2022

Conversation

lolyu
Copy link
Contributor

@lolyu lolyu commented Jun 27, 2022

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • New feature
  • Doc/Design
  • Unit test

Approach

What is the motivation for this PR?

This issue occurs when running config load_minigraph to load new configs at both ToRs.
After write_standby.py, the muxorch will try to direct all traffic downstream to the SoC IP and server IP to the tunnel, but xcvrd might fail to set the adminforwardingstate of the port via gRPC because the gRPC channel could not be established because the other ToR is in standby state.
After linkmgrd starts to run and receives heartbeats from itself, it will try to toggle to active[toggle#1], but xcvrd might not be able to make the hardware toggle at the moment, so linkmgrd will mux wait. Also, because the mux probe table is initialized with Unknown state, linkmgrd will handle the initial mux probe state to have the composite states (active, unknown, up) and tries to probe the mux state.
As the muxorch has been toggled to active[toggle#1], the gRPC channel will be established at some point after, xcvrd will be able to answer the mux probe, so the linkmgrd will be able to change into (active, active, up) state.
But the toggling to active[toggle#1] is only finished half way, the mux status in STATE_DB:MUX_CABLE_TABLE is not updated, so show mux status will show unknown for those ports.

Signed-off-by: Longxiang Lyu lolv@microsoft.com

How did you do it?

When the linkmgrd changes into states (active, active, up) and has the original mux state as unknown, it will toggle the mux to active again to have those DB tables updated:

linkmgrd  -> APP_DB:MUX_CABLE_TABLE -> swss -> APP_DB:HW_MUX_CABLE_TABLE -> xcvrd 
xcvrd -> STATE_DB:HW_MUX_CABLE_TABLE -> swss -> STATE_DB:MUX_CABLE_TABLE -> linkmgrd  

How did you verify/test it?

On dualtor-mixed topo with icmp_responder running, do config load_minigraph on both ToRs, verify the show mux status on both ToRs:

root@demo-switch:~# show mux status
PORT        STATUS    HEALTH     HWSTATUS    LAST_SWITCHOVER_TIME
----------  --------  ---------  ----------  ---------------------------
Ethernet4   active    healthy    consistent
Ethernet8   active    healthy    consistent
Ethernet12  active    healthy    consistent
Ethernet16  active    healthy    consistent
Ethernet20  active    healthy    consistent
Ethernet24  active    healthy    consistent
Ethernet28  active    healthy    consistent
Ethernet32  active    healthy    consistent
Ethernet36  active    healthy    consistent
Ethernet40  active    healthy    consistent
Ethernet44  active    healthy    consistent
Ethernet48  active    healthy    consistent

Any platform specific information?

Documentation

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
@lolyu lolyu marked this pull request as ready for review June 27, 2022 13:50
@lolyu lolyu requested review from zjswhhh and yxieca June 27, 2022 13:50
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
@yxieca yxieca merged commit 6b5d739 into sonic-net:master Jun 28, 2022
yxieca pushed a commit that referenced this pull request Jun 28, 2022
What is the motivation for this PR?
This issue occurs when running config load_minigraph to load new configs at both ToRs.
After write_standby.py, the muxorch will try to direct all traffic downstream to the SoC IP and server IP to the tunnel, but xcvrd might fail to set the adminforwardingstate of the port via gRPC because the gRPC channel could not be established because the other ToR is in standby state.
After linkmgrd starts to run and receives heartbeats from itself, it will try to toggle to active[toggle#1], but xcvrd might not be able to make the hardware toggle at the moment, so linkmgrd will mux wait. Also, because the mux probe table is initialized with Unknown state, linkmgrd will handle the initial mux probe state to have the composite states (active, unknown, up) and tries to probe the mux state.
As the muxorch has been toggled to active[toggle#1], the gRPC channel will be established at some point after, xcvrd will be able to answer the mux probe, so the linkmgrd will be able to change into (active, active, up) state.
But the toggling to active[toggle#1] is only finished half way, the mux status in STATE_DB:MUX_CABLE_TABLE is not updated, so show mux status will show unknown for those ports.

Signed-off-by: Longxiang Lyu lolv@microsoft.com

How did you do it?
When the linkmgrd changes into states (active, active, up) and has the original mux state as unknown, it will toggle the mux to active again to have those DB tables updated:

linkmgrd  -> APP_DB:MUX_CABLE_TABLE -> swss -> APP_DB:HW_MUX_CABLE_TABLE -> xcvrd 
xcvrd -> STATE_DB:HW_MUX_CABLE_TABLE -> swss -> STATE_DB:MUX_CABLE_TABLE -> linkmgrd  
How did you verify/test it?
On dualtor-mixed topo with icmp_responder running, do config load_minigraph on both ToRs, verify the show mux status on both ToRs.

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
zjswhhh added a commit to sonic-net/sonic-buildimage that referenced this pull request Jul 12, 2022
[master][sonic-linkmgrd] submodule update

58d8aae Longxiang Lyu Sat Jul 2 10:14:50 2022 +0800 Enforce switch after config mux to active (sonic-net/sonic-linkmgrd#95)
600df46 Longxiang Lyu Thu Jun 30 15:09:10 2022 +0800 Add unittest to verify mux toggle active (sonic-net/sonic-linkmgrd#94)
400b1b8 gregshpit Wed Jun 29 21:32:45 2022 +0300 For Sonic cross-compilation build. CC variable is used as gcc compiler. CXX variable is used as g++ compiler. (sonic-net/sonic-linkmgrd#91)
a516668 Jing Zhang Tue Jun 28 11:07:23 2022 -0700 Use Vlan MAC as src MAC for link prober by default (sonic-net/sonic-linkmgrd#93)
6b5d739 Longxiang Lyu Tue Jun 28 22:46:12 2022 +0800 Fix inconsistent mux state (sonic-net/sonic-linkmgrd#92)
9265497 Jing Zhang Fri Jun 24 09:10:12 2022 -0700 Remove exception throwing when initializing missing loopback interface (sonic-net/sonic-linkmgrd#90)

sign-off: Jing Zhang zhangjing@microsoft.com
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants