-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DUT takes more than 7 seconds to finish update ip v6 neighbor #2414
Comments
@prsunny could you please provide some kind of estimation when this issue can be addressed? |
When I tested manually on another platform, I see that the neighbor entry is updated immediately in ASIC DB. Please note the swss and sairedis rec. Can you check the same manually and verify?
SWSS Rec
SAIREDIS rec Syslog:
Notes: This is on the latest master, with T0 topology with BGP sessions up. |
I also tested on the Nov branch, and observed the same result when changing the Neighbor MAC address manually. |
@prsunny thanks for sharing the test result. do you know the flow for the neighbor entry update? I would like to further investigate on our platform. |
@prsunny This issue observed from time to time on our testbed. In one certain test change the neighbor mac two times
but only after 10s syncd start to handle the second time change, seems that the first time change was "lost"
Is it the flow that neighsyncd get neighbor change via RTLINK from kernel and then trigger syncd to finish all the change? |
@prsunny we cont to see this issue on testbed from time to time. these failures i believe are real and even if it succeed there is something wrong in it. |
@prsunny
I've also add a snippet of code which queries APPL_DB to determine whether the updated mac address has been written into APPL_DB. Having tested this case on the image, I find:
According to the above points, I tend to consider that neighbor update notification has been backlogged in somewhere in orchagent. |
Remove the workaround in neighbour_mac_noptf, which temporarily adjust the log level to DEBUG for orchagent, neighsync and nbrmgrd ahead of nd test and recover to INFO after that. This workaround was introduced in order to provide detailed debug information for the issue sonic-net/sonic-buildimage#2414 (DUT takes more than 7 seconds to finish update ip v6 neighbor). However, we found those information can hardly help. Signed-off-by: Stephen Sun <stephens@mellanox.com>
@prsunny @keboliu @stephenxs So first in the test goes IPv4 neighbor MAC change test which passes. |
i think your assumption is reasonable. and to test with a static neighbor is also a good idea. |
Closing this as the fix is made as part of referenced commit #1026 |
Update sonic-swss submodule pointer to include the following: * 8eea92e [202205][counters] Revert PR sonic-net#2432 for the buffer queue/pg counters improvement ([sonic-net#2462](sonic-net/sonic-swss#2462)) * 5d8636a [202205] Enhance orchagent and buffer manager in error handling (sonic-net#2414) ([sonic-net#2449](sonic-net/sonic-swss#2449)) * aa22237 [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (sonic-net#2392) ([sonic-net#2455](sonic-net/sonic-swss#2455)) * 04ce7be check state_db for po before sending ARP/ND pkts (sonic-net#2444) ([sonic-net#2450](sonic-net/sonic-swss#2450)) * f0138a2 [portmgr] Fixed the orchagent crash due to late arrival of notif (sonic-net#2431) ([sonic-net#2451](sonic-net/sonic-swss#2451)) * 7cfde48 Change the log messages in addKernelNeigh/Route from ERROR to INFO ([sonic-net#2437](sonic-net/sonic-swss#2437)) * 2c5116e [202205][counters] Improve performance by polling only configured ports buffer queue/pg counters ([sonic-net#2432](sonic-net/sonic-swss#2432)) Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
Update sonic-swss submodule pointer to include the following: * 8eea92e [202205][counters] Revert PR #2432 for the buffer queue/pg counters improvement ([#2462](sonic-net/sonic-swss#2462)) * 5d8636a [202205] Enhance orchagent and buffer manager in error handling (#2414) ([#2449](sonic-net/sonic-swss#2449)) * aa22237 [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (#2392) ([#2455](sonic-net/sonic-swss#2455)) * 04ce7be check state_db for po before sending ARP/ND pkts (#2444) ([#2450](sonic-net/sonic-swss#2450)) * f0138a2 [portmgr] Fixed the orchagent crash due to late arrival of notif (#2431) ([#2451](sonic-net/sonic-swss#2451)) * 7cfde48 Change the log messages in addKernelNeigh/Route from ERROR to INFO ([#2437](sonic-net/sonic-swss#2437)) * 2c5116e [202205][counters] Improve performance by polling only configured ports buffer queue/pg counters ([#2432](sonic-net/sonic-swss#2432))
Include: df92fb7 Improve verbosity level and provide more info in the log (sonic-net#2472) e81ed20 [intfmgr]: Enable `accept_untracked_na` kernel param (sonic-net#2436) 24d29f1 [orchdaemon]: Fixed sairedis record file rotation (sonic-net#2299) b8ee07d [build] add missing package libyang-dev in lgtm.yml (sonic-net#2475) e46dd29 [crm] Fix issue with continues EXCEEDED and CLEAR logs for ACL group/table counters (sonic-net#2463) b61d24c [doc]: Update README.md (sonic-net#2456) b9ade5d [orchagent] Fix issue: ip prefix shall be inited even if VRF/VNET is not ready (sonic-net#2461) f0f1eb4 Revert "[counters] Improve performance by polling only configured ports buffer queue/pg counters (sonic-net#2360)" (sonic-net#2458) 3d757a8 [ci][asan] add DVS tests run with ASAN (sonic-net#2441) 04fbc8e [ci] Only when test stage succeeded or succeededwithissues, PR run Gcov (sonic-net#2460) 7cc035f [orchagent]: Publish identified events via structured-events channel (sonic-net#2446) efa0f01 [QoS] Enforce drop probability only for colors whose WRED are enabled (sonic-net#2422) 05c5c2f [swss] Replace memset functions (sonic-net#2423) 9ff993d Modified the test file to remove click commands and do the REDIS-DB u… (sonic-net#2264) 9e376af Install libyang in azure pipeline. (sonic-net#2445) c1eb99a check state_db for po before sending ARP/ND pkts (sonic-net#2444) 43cc486 [portmgr] Fixed the orchagent crash due to late arrival of notif (sonic-net#2431) b62c716 Enhance orchagent and buffer manager in error handling (sonic-net#2414) 13bda3c [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (sonic-net#2392) 0ccb315 Revert "[VS Test] Skip failing subport tests (sonic-net#2370)" (sonic-net#2421) ac8a83f [UT] [Portsyncd] Added Unit Tests for portsyncd (sonic-net#2297) 83a186a Change the log messages in addKernelNeigh/Route from ERROR to INFO (sonic-net#2437) 9c23389 [BFD]Clean up state_db BFD entries on swss restart (sonic-net#2434) d41aebf EntityBulker SIGSEGV when create_entry attr_count 0 (sonic-net#2224) f52a7b1 Fix the Fec Mode Setting of gbsyncd (sonic-net#2430) 8cc0a45 [neighsyncd] Enabling ipv4 link local entries for non-dualtor (sonic-net#2427) 5624e87 Revert "[ci][asan] add DVS tests run with ASAN (sonic-net#2363)" (sonic-net#2433) a26b26a Dynamic port configuration - add port buffer cfg to the port ref counter (sonic-net#2194) 486939a tlm_teamd: Filter portchannel subinterface events from STATE_DB LAG_TABLE (sonic-net#2408) a4b8992 [counters] Improve performance by polling only configured ports buffer queue/pg counters (sonic-net#2360) 4aaeec9 added support for Xsight platform (sonic-net#2426) ca9edca [ci][asan] add DVS tests run with ASAN (sonic-net#2363) dec4570 Handle dual ToR neighbor miss scenario (sonic-net#2151) 9eb4422 Upstream new development on p4orch (sonic-net#2237) e9be2c0 [lgtm] Fix dependency (sonic-net#2419) c0168f3 [muxorch] Returning true if nbr in skip_neighbor_ in isNeighborActive() (sonic-net#2415) cfcf3d8 [macsec]: Set MTU for MACsec (sonic-net#2398) 8346034 Delete Invalid if condition in intfsorch.cpp (sonic-net#2411) Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Include: df92fb7 Improve verbosity level and provide more info in the log (#2472) e81ed20 [intfmgr]: Enable `accept_untracked_na` kernel param (#2436) 24d29f1 [orchdaemon]: Fixed sairedis record file rotation (#2299) b8ee07d [build] add missing package libyang-dev in lgtm.yml (#2475) e46dd29 [crm] Fix issue with continues EXCEEDED and CLEAR logs for ACL group/table counters (#2463) b61d24c [doc]: Update README.md (#2456) b9ade5d [orchagent] Fix issue: ip prefix shall be inited even if VRF/VNET is not ready (#2461) f0f1eb4 Revert "[counters] Improve performance by polling only configured ports buffer queue/pg counters (#2360)" (#2458) 3d757a8 [ci][asan] add DVS tests run with ASAN (#2441) 04fbc8e [ci] Only when test stage succeeded or succeededwithissues, PR run Gcov (#2460) 7cc035f [orchagent]: Publish identified events via structured-events channel (#2446) efa0f01 [QoS] Enforce drop probability only for colors whose WRED are enabled (#2422) 05c5c2f [swss] Replace memset functions (#2423) 9ff993d Modified the test file to remove click commands and do the REDIS-DB u… (#2264) 9e376af Install libyang in azure pipeline. (#2445) c1eb99a check state_db for po before sending ARP/ND pkts (#2444) 43cc486 [portmgr] Fixed the orchagent crash due to late arrival of notif (#2431) b62c716 Enhance orchagent and buffer manager in error handling (#2414) 13bda3c [Everflow/ERSPAN] Set correct destination port and mac address when the nexthop is updated for ERSPAN mirror destination (#2392) 0ccb315 Revert "[VS Test] Skip failing subport tests (#2370)" (#2421) ac8a83f [UT] [Portsyncd] Added Unit Tests for portsyncd (#2297) 83a186a Change the log messages in addKernelNeigh/Route from ERROR to INFO (#2437) 9c23389 [BFD]Clean up state_db BFD entries on swss restart (#2434) d41aebf EntityBulker SIGSEGV when create_entry attr_count 0 (#2224) f52a7b1 Fix the Fec Mode Setting of gbsyncd (#2430) 8cc0a45 [neighsyncd] Enabling ipv4 link local entries for non-dualtor (#2427) 5624e87 Revert "[ci][asan] add DVS tests run with ASAN (#2363)" (#2433) a26b26a Dynamic port configuration - add port buffer cfg to the port ref counter (#2194) 486939a tlm_teamd: Filter portchannel subinterface events from STATE_DB LAG_TABLE (#2408) a4b8992 [counters] Improve performance by polling only configured ports buffer queue/pg counters (#2360) 4aaeec9 added support for Xsight platform (#2426) ca9edca [ci][asan] add DVS tests run with ASAN (#2363) dec4570 Handle dual ToR neighbor miss scenario (#2151) 9eb4422 Upstream new development on p4orch (#2237) e9be2c0 [lgtm] Fix dependency (#2419) c0168f3 [muxorch] Returning true if nbr in skip_neighbor_ in isNeighborActive() (#2415) cfcf3d8 [macsec]: Set MTU for MACsec (#2398) 8346034 Delete Invalid if condition in intfsorch.cpp (#2411) Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Related work items: sonic-net#2151, sonic-net#2194, sonic-net#2224, sonic-net#2237, sonic-net#2264, sonic-net#2281, sonic-net#2286, sonic-net#2297, sonic-net#2299, sonic-net#2305, sonic-net#2325, sonic-net#2335, sonic-net#2338, sonic-net#2341, sonic-net#2343, sonic-net#2347, sonic-net#2350, sonic-net#2355, sonic-net#2356, sonic-net#2358, sonic-net#2360, sonic-net#2363, sonic-net#2367, sonic-net#2368, sonic-net#2370, sonic-net#2374, sonic-net#2392, sonic-net#2398, sonic-net#2408, sonic-net#2414, sonic-net#2415, sonic-net#2419, sonic-net#2421, sonic-net#2422, sonic-net#2423, sonic-net#2426, sonic-net#2427, sonic-net#2430, sonic-net#2431, sonic-net#2433, sonic-net#2434, sonic-net#2436, sonic-net#2437, sonic-net#2441, sonic-net#2444, sonic-net#2445, sonic-net#2446, sonic-net#2456, sonic-net#2458, sonic-net#2460, sonic-net#2461, sonic-net#2463, sonic-net#2472, sonic-net#2475, sonic-net#11877, sonic-net#12024, sonic-net#12065, sonic-net#12097, sonic-net#12130, sonic-net#12209, sonic-net#12217, sonic-net#12244, sonic-net#12251, sonic-net#12255, sonic-net#12276, sonic-net#12284
What I did Enhance orchagent and buffer manager Buffer manager: do not insert buffer queue into cache if the profile is illegal, which prevents an empty string from being inserted into APPL_DB during initialization. orchagent: handle the case that a field referencing other objects is an empty string. There had been such logic that was broken by a PR last year. Signed-off-by: Stephen Sun stephens@nvidia.com Why I did it Enhance the error handling logic. In most cases, a user will not encounter such scenarios in a production environment because it's the front-ends' (eg. CLI) responsibility to identify the wrong configuration and prevent them from being inserted to CONFIG_DB. However, in some cases, like a wrong config_db.json composed and copied to the switch, front-ends can not prevent that. How I verified it Manual and mock tests. Details if related For the improvement in buffer manager: previously, the logic was: declare a reference portQueue to m_portQueueLookup[port][queues] and then assign fvValue(i) to portQueue.running_profile_name But [] operation on C++ map has a side-effect -- it will insert a new element into the map if there wasn't one. In case the validation check in checkBufferProfileDirection failed and there was not one in the map, the portQueue.running_profile_name will keep empty. This is not what we want. In case there was an item configured in the map, we should not remove it on failure because we want to prevent the user from being affected by misconfiguration and alert user to correct the error. There is log in checkBufferProfileDirection Now it is improved in this way: Avoid using reference and initialize m_portQueueLookup[port][queues] only if there is a valid egress profile configured
b8173428 - [GCU] Add Sample Unit Test for RDMA Headroom Pool Size Tuning ([device/celestica] blacklist gpio_ich kernel module on haliburton #2692) (5 hours ago) [isabelmsft] 6f84aae7 - Add begin logs to config reload/config minigraph/warm-reboot/fast-reboot (RPC syncd docker image does not start RPC server automatically #2694) (27 hours ago) [Junchao-Mellanox] e98011f8 - Revert "Secure upgrade ([sonic-cfggen] Multi-key should be in form of (a,b) instead of 'a|b' #2337)" (Add Broadcom XLR/GTS ((BCM9COMX2XMC) support for Tomahawk switch (BCM956960K) #2675) (34 hours ago) [StormLiangMS] eda4e91b - [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (DUT takes more than 7 seconds to finish update ip v6 neighbor #2414) (4 days ago) [vdahiya12] 54e26359 - Replace pickle by json (Add autoneg to 7170-Q59S20 #2636) (4 days ago) [Mai Bui]
Why I did it 8c7ddf56 - [warm/fast-reboot] Backup logs from tmpfs to disk during fast/warm shutdown ([swss]: update swss docker to stretch #2714) (3 hours ago) [Vaibhav Hemant Dixit] f2a31b30 - [ci] Fix pipeline issue caused by sonic-slave-* change. ([201803] Modify Debian apt repos to reflect changes made by maintainers #2709) (3 hours ago) [Liu Shilong] 586ecf0e - [dhcp_relay] Fix dhcp_relay restart error while add/del vlan ([thrift] add a patch to revert THRIFT-3650 #2688) (3 hours ago) [Yaqiang Zhu] 07b0ef4c - [portstat CLI] don't print reminder if use json format ([devices] add new accton platform minipack. #2670) (3 hours ago) [wenyiz2021] 48d3d3ef - [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (DUT takes more than 7 seconds to finish update ip v6 neighbor #2414) (3 hours ago) [vdahiya12] How I did it How to verify it
b8173428 - [GCU] Add Sample Unit Test for RDMA Headroom Pool Size Tuning ([device/celestica] blacklist gpio_ich kernel module on haliburton sonic-net#2692) (5 hours ago) [isabelmsft] 6f84aae7 - Add begin logs to config reload/config minigraph/warm-reboot/fast-reboot (RPC syncd docker image does not start RPC server automatically sonic-net#2694) (27 hours ago) [Junchao-Mellanox] e98011f8 - Revert "Secure upgrade ([sonic-cfggen] Multi-key should be in form of (a,b) instead of 'a|b' sonic-net#2337)" (Add Broadcom XLR/GTS ((BCM9COMX2XMC) support for Tomahawk switch (BCM956960K) sonic-net#2675) (34 hours ago) [StormLiangMS] eda4e91b - [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (DUT takes more than 7 seconds to finish update ip v6 neighbor sonic-net#2414) (4 days ago) [vdahiya12] 54e26359 - Replace pickle by json (Add autoneg to 7170-Q59S20 sonic-net#2636) (4 days ago) [Mai Bui]
…14048) For sonic-platform-daemons following commits are added to the submodule dd8fbae (HEAD -> 202012, origin/202012) [ycabled] add more coverage to ycabled; add minor name change for vendor API CLI return key-values pairs (#338) 846555e [thermalctld] fix some redundant removal of state DB tables (#315) 3d92fb9 Use github code scanning instead of LGTM (#316) For sonic-utilities the following commits are added in this PR to the submodule git log --oneline 39cdb49c..202012 ec4c6ea5 (HEAD -> 202012, origin/202012) [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (#2414) (#2704) 03ef272e [202012][vlan] Remove add field of vlanid to DHCP_RELAY table while adding vlan (#2681) e00a81ac [202012][dhcp-relay] Add support for dhcp_relay config cli (#2640) 274184e1 [vlan] Refresh dhcpv6_relay config while adding/deleting a vlan (#2660) (#2668 #### Why I did it updating the submodule of sonic-platform-daemons, sonic-utilities #### How I did it updated the submodule
Description
Steps to reproduce the issue:
run sonic-mgmt test "neighbour_mac_noptf", in this test it use command "ip -6 neigh change xxxxxx" to change the ipv6 neighbor mac address and then verify the ASIC DB to see whether the change is success or not.
Describe the results you received:
sometimes the test failed due to the mac address in the ASIC DB table "KEYS ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY*" not change yet, ASIC DB only been updated after about 7 seconds after issue the ip neigh change command.
Describe the results you expected:
the ansible test expecting DB will be updated in 2 seconds, 7 seconds is too long.
Additional information you deem important (e.g. issue happens only occasionally):
when the issue happened, there was no time consuming task running on the DUT, only ansible test is running on it.
Output of
show version
:syslog for this issue, can see that ip neighbor change command was issued at 06:20:05, ASIC DB was updated at 06:20:11 and 06:20:12, in total it took around 7 seconds:
The text was updated successfully, but these errors were encountered: