Warm-reboot: teamd warm restart caused neighbor deleted and learned again. #2600

Open
jipanyang opened this issue Feb 23, 2019 · 2 comments
@jipanyang
Collaborator

A warm restart of teamd interrupts the Linux neighbor entries: they are deleted and then learned again.

This may disrupt neighbor restore processing even during a system warm reboot.

Steps to reproduce the issue:

root@vlab-01:/host/warmboot/teamd# show warm_restart config
name    enable    timer_name    timer_duration
------  --------  ------------  ----------------
system  true      NULL          NULL
teamd   true      NULL          NULL
root@vlab-01:/host/warmboot/teamd# teamshow
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available, S - selected, D - deselected
  No.  Team Dev         Protocol     Ports
-----  ---------------  -----------  --------------
 0001  PortChannel0001  LACP(A)(Up)  Ethernet112(S)
 0002  PortChannel0002  LACP(A)(Up)  Ethernet116(S)
 0003  PortChannel0003  LACP(A)(Up)  Ethernet120(S)
 0004  PortChannel0004  LACP(A)(Up)  Ethernet124(S)
root@vlab-01:/host/warmboot/teamd# 
root@vlab-01:/host/warmboot/teamd# 
root@vlab-01:/host/warmboot/teamd# docker exec -i teamd pkill -USR1 teamd
root@vlab-01:/host/warmboot/teamd# 
root@vlab-01:/host/warmboot/teamd# ls -l
total 16
-rw------- 1 root root 124 Feb 23 23:02 Ethernet112
-rw------- 1 root root 124 Feb 23 23:02 Ethernet116
-rw------- 1 root root 124 Feb 23 23:02 Ethernet120
-rw------- 1 root root 124 Feb 23 23:02 Ethernet124
root@vlab-01:/host/warmboot/teamd# systemctl restart teamd

Describe the results you received:


2019-02-23.23:02:12.913622|NEIGH_TABLE:PortChannel0001:fc00::72|DEL
2019-02-23.23:02:12.913681|NEIGH_TABLE:PortChannel0001:10.0.0.57|DEL
2019-02-23.23:02:12.949428|LAG_TABLE:PortChannel0001|SET|mtu:9100
2019-02-23.23:02:12.951454|PORT_TABLE:Ethernet112|SET|mtu:9100
2019-02-23.23:02:12.958725|NEIGH_TABLE:PortChannel0002:10.0.0.59|DEL
2019-02-23.23:02:12.958774|NEIGH_TABLE:PortChannel0002:fc00::76|DEL
2019-02-23.23:02:13.000059|LAG_TABLE:PortChannel0002|SET|mtu:9100
2019-02-23.23:02:13.000761|PORT_TABLE:Ethernet116|SET|mtu:9100
2019-02-23.23:02:13.009988|NEIGH_TABLE:PortChannel0003:fc00::7a|DEL
2019-02-23.23:02:13.010062|NEIGH_TABLE:PortChannel0003:10.0.0.61|DEL
2019-02-23.23:02:13.042046|LAG_TABLE:PortChannel0003|SET|mtu:9100
2019-02-23.23:02:13.042903|PORT_TABLE:Ethernet120|SET|mtu:9100
2019-02-23.23:02:13.049285|NEIGH_TABLE:PortChannel0004:10.0.0.63|DEL
2019-02-23.23:02:13.049356|NEIGH_TABLE:PortChannel0004:fc00::7e|DEL
2019-02-23.23:02:13.084162|LAG_TABLE:PortChannel0004|SET|mtu:9100
2019-02-23.23:02:13.084728|PORT_TABLE:Ethernet124|SET|mtu:9100
2019-02-23.23:02:13.533543|NEIGH_TABLE:PortChannel0004:fc00::7e|SET|neigh:52:54:00:e5:5b:02|family:IPv6
2019-02-23.23:02:13.597808|NEIGH_TABLE:PortChannel0003:10.0.0.61|SET|neigh:52:54:00:31:7b:48|family:IPv4
2019-02-23.23:02:13.636639|NEIGH_TABLE:PortChannel0002:10.0.0.59|SET|neigh:52:54:00:ed:7a:ca|family:IPv4
2019-02-23.23:02:14.686387|NEIGH_TABLE:PortChannel0004:10.0.0.63|SET|neigh:52:54:00:e5:5b:02|family:IPv4
2019-02-23.23:02:15.115361|NEIGH_TABLE:PortChannel0002:fc00::76|SET|neigh:52:54:00:ed:7a:ca|family:IPv6
2019-02-23.23:02:15.118383|NEIGH_TABLE:PortChannel0001:fc00::72|SET|neigh:52:54:00:30:12:7f|family:IPv6
2019-02-23.23:02:15.121553|NEIGH_TABLE:PortChannel0001:10.0.0.57|SET|neigh:52:54:00:30:12:7f|family:IPv4
2019-02-23.23:02:15.212709|NEIGH_TABLE:PortChannel0003:fc00::7a|SET|neigh:52:54:00:31:7b:48|family:IPv6
2019-02-23.23:03:25.231184|LAG_TABLE:PortChannel0001|SET|admin_status:up|oper_status:up
2019-02-23.23:03:25.231393|LAG_TABLE:PortChannel0003|SET|admin_status:up|oper_status:up
2019-02-23.23:03:25.231491|LAG_TABLE:PortChannel0002|SET|admin_status:up|oper_status:up
2019-02-23.23:03:25.231564|LAG_TABLE:PortChannel0004|SET|admin_status:up|oper_status:up
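
The delete-then-relearn pattern in the records above can be checked mechanically. The following is a minimal sketch, assuming each record has the `timestamp|TABLE:key|OP|...` layout shown in this issue; the function names `parse_record` and `find_relearned_neighbors` are hypothetical helpers, not part of SONiC.

```python
from datetime import datetime


def parse_record(line):
    """Split one record into (timestamp, table, key, op); return None if it does not match."""
    parts = line.strip().split("|")
    if len(parts) < 3 or ":" not in parts[1]:
        return None
    # Split only on the first colon: the key itself may contain colons (IPv6).
    table, key = parts[1].split(":", 1)
    try:
        ts = datetime.strptime(parts[0], "%Y-%m-%d.%H:%M:%S.%f")
    except ValueError:
        return None
    return ts, table, key, parts[2]


def find_relearned_neighbors(lines):
    """Return (key, gap_seconds) for each NEIGH_TABLE key that was DEL'd and later SET."""
    pending = {}    # key -> timestamp of the most recent unmatched DEL
    relearned = []
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        ts, table, key, op = rec
        if table != "NEIGH_TABLE":
            continue
        if op == "DEL":
            pending[key] = ts
        elif op == "SET" and key in pending:
            gap = (ts - pending.pop(key)).total_seconds()
            relearned.append((key, gap))
    return relearned


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python3 check_relearn.py < records.txt
    for key, gap in find_relearned_neighbors(sys.stdin):
        print(f"{key} was deleted and relearned after {gap:.3f}s")
```

An empty result for the restart window would match the expected behavior; any entry in the result means a neighbor was removed and relearned.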

Describe the results you expected:
No impact on existing ARP and ND entries.
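
One way to check this expectation on the kernel side is to compare neighbor snapshots taken before and after the teamd restart. The sketch below assumes the snapshots are the text output of `ip neigh show`, with lines such as `10.0.0.57 dev PortChannel0001 lladdr 52:54:00:30:12:7f REACHABLE`; `parse_ip_neigh` and `diff_neighbors` are hypothetical helper names.

```python
def parse_ip_neigh(output):
    """Map (ip, dev) -> lladdr for entries that carry a link-layer address."""
    entries = {}
    for line in output.splitlines():
        tokens = line.split()
        if "dev" not in tokens or "lladdr" not in tokens:
            continue  # skips blank lines and FAILED/INCOMPLETE entries without a MAC
        ip = tokens[0]
        dev = tokens[tokens.index("dev") + 1]
        mac = tokens[tokens.index("lladdr") + 1]
        entries[(ip, dev)] = mac
    return entries


def diff_neighbors(before, after):
    """Return (missing, changed): entries lost, and entries whose MAC differs."""
    b, a = parse_ip_neigh(before), parse_ip_neigh(after)
    missing = sorted(k for k in b if k not in a)
    changed = sorted(k for k in b if k in a and a[k] != b[k])
    return missing, changed
```

A snapshot comparison only catches entries that are still missing or changed when the second snapshot is taken. A neighbor that is deleted and relearned between the two snapshots will not show up here, so this complements rather than replaces inspecting the NEIGH_TABLE records.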

@yxieca
Contributor

yxieca commented Sep 12, 2019

We made several changes to address this issue, including extending the timeout value for neighbor restore. I believe @prsunny recently made a change that finally addresses it.

@prsunny can you confirm and close this issue?

@prsunny
Contributor

prsunny commented Sep 12, 2019

I'm not sure whether this issue is fixed by sonic-net/sonic-swss#1040. That PR addressed VLAN-based neighbor restore, but this issue seems to be about port channels. Need to debug more!

vadymhlushko-mlnx added a commit to vadymhlushko-mlnx/sonic-buildimage that referenced this issue Jan 24, 2023
Update sonic-utilities submodule pointer to include the following:
* fba87f4 Revert ([sonic-net#2599](sonic-net/sonic-utilities#2599))
* d6d7ab3 [warm-reboot] Use kexec_file_load instead of kexec_load when available ([sonic-net#2608](sonic-net/sonic-utilities#2608))
* db4683d fix show techsupport error ([sonic-net#2597](sonic-net/sonic-utilities#2597))
* 3d8e9c6 [GCU] Prohibit removal of PFC_WD POLL_INTERVAL field ([sonic-net#2545](sonic-net/sonic-utilities#2545))
* 163e766 [techsupport] include APPL_STATE_DB dump ([sonic-net#2607](sonic-net/sonic-utilities#2607))
* 8703773 YANG Validation for ConfigDB Updates: RADIUS_SERVER ([sonic-net#2604](sonic-net/sonic-utilities#2604))
* c2d746d Remove TODO comment which is no longer relevant ([sonic-net#2600](sonic-net/sonic-utilities#2600))
* f09da99 [show] Add bgpraw to show run all ([sonic-net#2537](sonic-net/sonic-utilities#2537))
* 39ac564 Extend fast-reboot STATE_DB entry timer ([sonic-net#2577](sonic-net/sonic-utilities#2577))

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
vadymhlushko-mlnx added a commit to vadymhlushko-mlnx/sonic-buildimage that referenced this issue Jan 25, 2023
Update sonic-utilities submodule pointer to include the following:
* f4f857e [GCU] Ignore bgpraw in GCU applier ([sonic-net#2623](sonic-net/sonic-utilities#2623))
* b5ac600 [muxcable][config] Add support to enable/disable ceasing to be an advertisement interface when  service is stopped ([sonic-net#2622](sonic-net/sonic-utilities#2622))
* 981f953 [chassis][voq] Add show fabric reachability command. ([sonic-net#2528](sonic-net/sonic-utilities#2528))
* fba87f4 Revert ([sonic-net#2599](sonic-net/sonic-utilities#2599))
* d6d7ab3 [warm-reboot] Use kexec_file_load instead of kexec_load when available ([sonic-net#2608](sonic-net/sonic-utilities#2608))
* db4683d fix show techsupport error ([sonic-net#2597](sonic-net/sonic-utilities#2597))
* 3d8e9c6 [GCU] Prohibit removal of PFC_WD POLL_INTERVAL field ([sonic-net#2545](sonic-net/sonic-utilities#2545))
* 163e766 [techsupport] include APPL_STATE_DB dump ([sonic-net#2607](sonic-net/sonic-utilities#2607))
* 8703773 YANG Validation for ConfigDB Updates: RADIUS_SERVER ([sonic-net#2604](sonic-net/sonic-utilities#2604))
* c2d746d Remove TODO comment which is no longer relevant ([sonic-net#2600](sonic-net/sonic-utilities#2600))
* f09da99 [show] Add bgpraw to show run all ([sonic-net#2537](sonic-net/sonic-utilities#2537))
* 39ac564 Extend fast-reboot STATE_DB entry timer ([sonic-net#2577](sonic-net/sonic-utilities#2577))

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
qiluo-msft pushed a commit that referenced this issue Jan 27, 2023
Includes below commits
```
0d5e68f5a [GCU] Ignore bgpraw table in GCU operation (#2628)
22757b1f3 Add interface link-training command into the CLI doc (#2257)
f4f857e10 [GCU] Ignore bgpraw in GCU applier (#2623)
b5ac60036 [muxcable][config] Add support to enable/disable ceasing to be an advertisement interface when `radv` service is stopped (#2622)
981f9531e [chassis][voq] Add "show fabric reachability" command. (#2528)
fba87f43f Revert (#2599)
d6d7ab37f [warm-reboot] Use kexec_file_load instead of kexec_load when available (#2608)
db4683d40 fix show techsupport error (#2597)
3d8e9c62d [GCU] Prohibit removal of PFC_WD POLL_INTERVAL field (#2545)
163e766cc [techsupport] include APPL_STATE_DB dump (#2607)
8703773eb YANG Validation for ConfigDB Updates: RADIUS_SERVER (#2604)
c2d746d4f Remove TODO comment which is no longer relevant (#2600)
f09da9983 [show] Add bgpraw to show run all (#2537)
39ac5641b Extend fast-reboot STATE_DB entry timer (#2577)
```
StormLiangMS added a commit that referenced this issue Feb 17, 2023
Why I did it
Submodule advances:
sonic-utilities

8e8e6088 - [202211][dhcp_relay] Remove add field of vlanid to DHCP_RELAY table while adding vlan ([201811 sub-module] advance sub-modules: utilities, swss, swss-common #2679) (16 hours ago) [Yaqiang Zhu]
1400fb94 - [GCU] Ignore bgpraw in GCU applier (Fix sfputil indexing for 7170-Q59S20 #2623) (15 hours ago) [jingwenxie]
f76a6364 - [vlan] Refresh dhcpv6_relay config while adding/deleting a vlan ([sonic-py-swsssdk] Update submodule #2660) (15 hours ago) [Yaqiang Zhu]
7849e18d - [db_migrator] make LOG_LEVEL_DB migration more robust (Mellanox platform: attach queues 2 and 6 to lossy profile using generic buffer template #2651) (16 hours ago) [Stepan Blyshchak]
c7df6dfa - Fixed a bug in "show vnet routes all" causing screen overrun. (Add hook to allow customizing link cable lengths #2644) (16 hours ago) [siqbal1986]
a5505f02 - show logging CLI support for logs stored in tmpfs (Traceback error seen while issuing show interface commands with if_names #2641) (16 hours ago) [mihirpat1]
bbacb91a - [system-health] Fix issue: show system-health CLI crashes (Updating deb package for platform and sai #2635) (16 hours ago) [Junchao-Mellanox]
8d724024 - [sai_failure_dump]Invoking dump during SAI failure ([dockers]: Upgrade LLDP docker to stretch build #2633) (16 hours ago) [Sudharsan Dhamal Gopalarathnam]
3c3be526 - Add transceiver info CLI support to show output from TRANSCEIVER_INFO for ZR ([submodule]: Update sonic-sairedis pointer #2630) (16 hours ago) [mihirpat1]
37f41666 - [show] add support for gRPC show commands for active-active ([bitmap-vnet]: Bitmap vnet test image [DO NOT MERGE] #2629) (16 hours ago) [vdahiya12]
b06d7fe4 - [show_bfd] add local discriminator in show bfd command ([Pmon] Selectively load pmon container daemons #2625) (16 hours ago) [Baorong Liu]
6adcd3e8 - [GCU] Ignore bgpraw table in GCU operation ([Mellanox] Fix SAI version #2628) (16 hours ago) [jingwenxie]
c65bdc35 - [muxcable][config] Add support to enable/disable ceasing to be an advertisement interface when radv service is stopped (Add knob in ConfigDB to enable/disable telemetry container #2622) (16 hours ago) [Jing Zhang]
91e9457f - Add Transceiver PM basic CLI support to show output from TRANSCEIVER_PM table for ZR ([201803] Restart SwSS, syncd and dependent services if a critical process in syncd container exits #2615) (16 hours ago) [longhuan-cisco]
54cc8c5a - Remove TODO comment which is no longer relevant (Warm-reboot: teamd warm restart caused neighbor deleted and learned again.  #2600) (16 hours ago) [Lior Avramov]
6891b4fb - Making 'show feature autorestart' more resilient to missing auto_restart config in CONFIG_DB ([submodule] update mellanox hw-mgmgt pointer (V.2.0.0061) #2592) (16 hours ago) [kartik-arista]
1e8bea37 - [storyteller] add link prober state change to story teller ([sonic-buildimage] New feature managementVRF(L3mdev) #2585) (16 hours ago) [Jing Zhang]
7481a20f - Extend fast-reboot STATE_DB entry timer ([submodule]: update sonic-swss-common, sonic-py-swsssdk, sonic-snmpagent #2577) (16 hours ago) [Aryeh Feigin]
0e08701c - [sonic_installer] use /etc/resolv.conf from the host when migrating packages (Set a rate limit on syslog messages from all Docker containers #2573) (16 hours ago) [Stepan Blyshchak]
06096780 - Fixed admin state config CLI for Backport interfaces (Prior to install a new ONIE SONiC image, delete all partitions except EFI/ONIE #2557) (16 hours ago) [anamehra]
9f1f13e4 - [show] Add bgpraw to show run all (Fixed typo on paragraph #40 #2537) (16 hours ago) [jingwenxie]
98bc8bd2 - [chassis][voq] Add "show fabric reachability" command. ([ntp]: Build 4.2.6 locally. #2528) (16 hours ago) [jfeng-arista]
3a50b63f - Preserve copp tables through DB migration ([docker-radvd]: upgrade docker radvd to stretch based #2524) (16 hours ago) [Aryeh Feigin]
28f6b127 - [masic] 'show interfaces counters' reminds to use '-d all' option to check for internal links (solve dependency issue #2466) (16 hours ago) [wenyiz2021]
15026e14 - suppport multi asic for show queue counter ([dockers] Prevent old supervisord messages from gettting re-logged to syslog #2439) (16 hours ago) [zhixzhu]
2d773e17 - [masic support] 'show run bgp' support for multi-asic (lo address not synced to the asic #2427) (16 hours ago) [wenyiz2021]
sonic-swss

4f304bc - [EVPN]Handling race condition when remote VNI arrives before tunnel map entry ([sonic-quagga] Function defect, do NOT cancel route while connect IP down #2642) (15 hours ago) [Sudharsan Dhamal Gopalarathnam]
34fc615 - [sai_failure_dump]Invoking dump during SAI failure (Add hook to allow customizing link cable lengths #2644) (15 hours ago) [Sudharsan Dhamal Gopalarathnam]
b817695 - [autoneg]Fixing adv interface types to be set when AN is disabled (Fix issue with platform file path name #2638) (15 hours ago) [Sudharsan Dhamal Gopalarathnam]
ab36bd4 - [bfdorch] add local discriminator to state DB ([bitmap-vnet]: Bitmap vnet test image [DO NOT MERGE] #2629) (15 hours ago) [Baorong Liu]
6343471 - Remove TODO comments that are no longer relevant (Add knob in ConfigDB to enable/disable telemetry container #2622) (15 hours ago) [Lior Avramov]
2b1869c - [refactor]Refactoring sai handle status (Rollback kernel submodule update. #2621) (15 hours ago) [Sudharsan Dhamal Gopalarathnam]
c41a1b7 - Fix issue ARP entry is out of sync between kernel and APPL_DB after warm reboot if the ARP entry is updated more than once during warm reboot in PFC watchdog warm reboot test #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL ([sub module] advance sonic-utilities sub module for 201811 branch #2619) (15 hours ago) [Stephen Sun]
da0cf7a - Changed the BFD default detect multiplier to 10x ("failed to load plugin io.containerd.snapshotter..." seen during linux boot up #2614) (15 hours ago) [siqbal1986]
13b5adf - [vstest] Only collect stdout of orchagent_restart_check in vstest ([submodules] update swss and utilities pointers #2597) (15 hours ago) [bingwang-ms]
2b9d94d - Avoid aborting orchagent when setting TUNNEL attributes (build failing for PLATFORM=p4 #2591) (15 hours ago) [Stephen Sun]
99b7d3b - Only collect stdout of orchagent_restart_check in vstest ( [saibcm-modules]: import new bcm modules #2578) (15 hours ago) [bingwang-ms]
5209c42 - dereg acl-rule counters during acl-table del ([201803] Set a rate limit on syslog messages from all Docker containers #2574) (15 hours ago) [Vivek]
ae68054 - Fixed set mtu for deleted subintf due to late notification ([vs]: Add option to specify platform name for DVS orchagent #2571) (15 hours ago) [EdenGri]
ab13dfa - Remove TODO comments which are no longer needed (support set timezone in ConfigDB #2568) (15 hours ago) [Junchao-Mellanox]
a3545cf - Modify coppmgr mergeConfig to support preserving copp tables through reboot. (Added new SN3700/SN3700C Mellanox platforms #2548) (15 hours ago) [Aryeh Feigin]
be16e79 - Use github code scanning instead of LGTM ([201803] [services] Restart SwSS service upon unexpected critical process exit #2546) (15 hours ago) [Liu Shilong]
63c0234 - Updated handling of VRF_VNI mapping and VLAN_VNI mapping for same VNI ID (Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABL… #2538) (15 hours ago) [Tapash Das]
4844111 - Fix potential risks ([mlnx] Fix sai xml path for boxer platform #2516) (15 hours ago) [Liran-Ar]
6420808 - [p4orch]: PINS Extension tables support ([build] When generating image version, handle case where current commit has no reachable tags #2506) (15 hours ago) [svshah-intel]
sonic-swss-common

1badd46 - Increase the netlink buffer size from 3MB to 16MB. (arp_update doesn't sleep 300 between each execution #739) (14 hours ago) [KISHORE KUNAL]
6555057 - Refactor eventpublisher deinit ([acl] Add default deny rule for l3 table #734) (14 hours ago) [Zain Budhwani]
f4d6de7 - Use github code scanning instead of LGTM ([sonic-quagga]:update submodule #718) (14 hours ago) [Liu Shilong]
sonic-linux-kernel

74f9a8f - Update linux kernel for hw-mgmt V.7.0020.4104 (Move template files to /usr/share/sonic/templates #305) (14 hours ago) [Stephen Sun]
6365701 - Fixes for emmc unreliability ([build_debian.sh]: Integrate system dump script #270) (14 hours ago) [Samuel Angebault]
How I did it
How to verify it