Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIC simulator not receiving SetAdminForwardingPortState request for 2 min after peer config reload #2485

Closed
zjswhhh opened this issue Aug 26, 2022 · 7 comments · Fixed by sonic-net/sonic-platform-daemons#287
Assignees

Comments

@zjswhhh
Copy link
Contributor

zjswhhh commented Aug 26, 2022

Description

Steps to reproduce the issue:

  1. Config reload ToR
  2. Check syslog, expect to see ycable timeout log for RPC requests:
    NOTICE pmon#ycable[28]: rpc timeout exceeded for port= StatusCode.DEADLINE_EXCEEDED timeout = 0.5
  3. Check NIC simulator, expect to miss SetAdminForwardingPortState request.

Describe the results you received:

Test environment

Image 20220531.05

admin@xxxxxxxxxxxxx-1:~$ show mux config Ethernet32
SWITCH_NAME        PEER_TOR
-----------------  ----------
xxxxxxxxxxxxx-2  10.1.0.33
port        state    ipv4             ipv6               cable_type     soc_ipv4
----------  -------  ---------------  -----------------  -------------  ---------------
Ethernet32  auto     192.168.0.16/32  fc02:1000::10/128  active-active  192.168.0.17/32

Config reload timestamp

16:45:10 config_reload.config_reload              L0086 INFO   | reloading config_db

Traffic disruption

>       pytest_assert(len(failures) == 0, '\n' + '\n'.join(failures))
E       Failed: 
E       Traffic on server 192.168.0.18 was disrupted for 133.606751919s. Maximum allowed disruption: 10s

Logs on Peer ToR

Linkmgrd asked for toggle, ycabled sent request but timeout.

admin@xxxxxxxxxxxxx-2:~$ sudo zgrep Ethernet32 /var/log/syslog | grep "16:45:"
Aug 25 16:45:12.660641 xxxxxxxxxxxxx-2 WARNING pmon#ycable[32]: Error: trying to post mux info without presence of port Ethernet32
Aug 25 16:45:12.661343 xxxxxxxxxxxxx-2 WARNING mux#linkmgrd: MuxManager.cpp:275 addOrUpdatePeerLinkState: Ethernet32: peer link state unknown
Aug 25 16:45:21.354329 xxxxxxxxxxxxx-2 WARNING mux#linkmgrd: link_manager/LinkManagerStateMachineActiveActive.cpp:471 handlePeerStateChange: Ethernet32: Received peer link prober event, new state: PeerUnknown
Aug 25 16:45:21.354445 xxxxxxxxxxxxx-2 WARNING mux#linkmgrd: link_manager/LinkManagerStateMachineActiveActive.cpp:848 switchPeerMuxState: Ethernet32: Switching peer MUX state to 'Standby'
Aug 25 16:45:24.354383 xxxxxxxxxxxxx-2 WARNING mux#linkmgrd: link_manager/LinkManagerStateMachineActiveActive.cpp:1048 handlePeerMuxWaitTimeout: Ethernet32: xcvrd timed out responding to linkmgrd peer mux state, current peer mux state: Standby
Aug 25 16:45:26.155087 xxxxxxxxxxxxx-2 NOTICE pmon#ycable[32]: calling RPC for hw mux_cable set state ispeer = True port Ethernet32 portid 1 read_side 0 state requested standby
Aug 25 16:45:26.655906 xxxxxxxxxxxxx-2 NOTICE pmon#ycable[32]: response was none hw_mux_cable_table_grpc_notification Ethernet32 
Aug 25 16:45:26.655952 xxxxxxxxxxxxx-2 WARNING pmon#ycable[32]: ERR: Got a change event for updating gRPC but could not toggle the mux-direction for port Ethernet32 state from unknown to standby, writing unknown
Aug 25 16:45:26.667567 xxxxxxxxxxxxx-2 WARNING mux#linkmgrd: link_manager/LinkManagerStateMachineActiveActive.cpp:281 handlePeerMuxStateNotification: Ethernet32: state db mux state: Unknown

Describe the results you expected:

Additional information you deem important:

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
@zjswhhh
Copy link
Contributor Author

zjswhhh commented Sep 2, 2022

Issue should be address in PR below:
sonic-net/sonic-platform-daemons#287

@zjswhhh zjswhhh transferred this issue from sonic-net/sonic-mgmt Sep 7, 2022
@zjswhhh zjswhhh closed this as completed Sep 7, 2022
@lolyu
Copy link
Contributor

lolyu commented Sep 19, 2022

test_upper_tor_config_reload_upstream still fails, it looks like the issue persists, could you please take a look?

@lolyu lolyu reopened this Sep 19, 2022
@lolyu
Copy link
Contributor

lolyu commented Sep 19, 2022

syslog snippet:

Sep 19 06:41:06.727145 svcstr-xxxx-acs-2 NOTICE pmon#ycable[27]: calling RPC for hw mux_cable set state ispeer = True port Ethernet28 portid 1 read_side 0 state requested standby 
Sep 19 06:41:07.229007 svcstr-xxxx-acs-2 NOTICE pmon#ycable[27]: rpc timeout exceeded for port= StatusCode.DEADLINE_EXCEEDED timeout = 0.5 
Sep 19 06:41:07.229007 svcstr-xxxx-acs-2 NOTICE pmon#ycable[27]: response was none hw_mux_cable_table_grpc_notification Ethernet28 
Sep 19 06:41:07.229077 svcstr-xxxx-acs-2 WARNING pmon#ycable[27]: recieved an error state while parsing response hw mux for port 
Sep 19 06:41:07.229128 svcstr-xxxx-acs-2 WARNING pmon#ycable[27]: ERR: Got a change event for updating gRPC but could not toggle the mux-direction for port Ethernet28 state from standby to standby, writing unknown 
Sep 19 06:41:07.230342 svcstr-xxxx-acs-2 WARNING mux#linkmgrd: link_manager/LinkManagerStateMachineActiveActive.cpp:281 handlePeerMuxStateNotification: Ethernet28: state db mux state: Unknown 

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Oct 12, 2022

Weird that:

  • load_minigraph --> NOT reproducible
  • load_minigraph + config save + config reload -> reproducible
  • config reload -> reproducible.

@zjswhhh zjswhhh self-assigned this Oct 14, 2022
@zjswhhh
Copy link
Contributor Author

zjswhhh commented Oct 14, 2022

Some mode findings:
For config reload on dualToR, arp table is cached (#1465), this is different from load_minigraph. If I do config reload with --disable_arp_cache the issue is NOT reproducible any more.

Considering that when swss restores the table (sonic-net/sonic-buildimage#6912), arp requests will be sent, and in active-active scenarios, arp response is duplicated to both sides.

Per offline sync-up with @vdahiya12 there is a chance tcp connection is re-established per arp entry update.

This seems to be the root cause. Will continue looking.

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Oct 27, 2022

Root caused the issue:

  • arp cache uses fast-reboot-dump script to dump arp entries

arp_cache_cmd = '/usr/local/bin/fast-reboot-dump.py -t {}'.format(cache_dir)

  • fast-reboot-dump send arp request uses DEVICE MAC as src mac address

send_garp_nd(neighbor_entries, map_mac_ip_per_vlan)

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Oct 31, 2022

Fix merged: #2460

@zjswhhh zjswhhh closed this as completed Oct 31, 2022
@zjswhhh zjswhhh transferred this issue from sonic-net/sonic-platform-daemons Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants