Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNMP-SUBAGENT] CPU Spike because of redundant and flooded keyspace notifications #8293

Closed
vivekrnv opened this issue Jul 30, 2021 · 3 comments · Fixed by sonic-net/sonic-snmpagent#230
Labels
Triaged this issue has been triaged

Comments

@vivekrnv
Copy link
Contributor

vivekrnv commented Jul 30, 2021

Description

Recently, an extremely similar bug LINK is seen in the snmp-subagent process which was related to blocking calls and fix is ported already.

This is of similar nature i.e very high CPU Usage, but this occurs occasionally i.e around 20-30 % of the time.

Steps to reproduce the issue:

  1. Load an image which has the fix for LINK
  2. Monitor the CPU Usage for the snmp-subagent process

Describe the results you received:

Snapshot of where the process spends the time

Collecting samples from 'python3 -m sonic_ax_impl' (python v3.7.3)
Total Samples 466600
GIL: 100.00%, Active: 100.00%, Threads: 2

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                               
100.00% 100.00%    3862s     3862s   update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:539)
  0.00%   0.00%   585.3s    585.3s   <listcomp> (sonic_ax_impl/mibs/ieee802_1ab.py:555)
  0.00%   0.00%   109.8s    695.1s   _update_per_namespace_data (sonic_ax_impl/mibs/ieee802_1ab.py:555)
  0.00%   0.00%   32.10s    33.01s   get_all (swsscommon/swsscommon.py:1663)
  0.00%   0.00%    9.04s    11.21s   get_message (swsscommon/swsscommon.py:1726)
  0.00%   0.00%    4.14s     5.57s   match (re.py:173)
  0.00%   0.00%    3.51s     3.51s   _parse_octet (ipaddress.py:1159)
  0.00%   0.00%    3.05s     3.05s   _worker (concurrent/futures/thread.py:78)
  0.00%   0.00%    2.84s     2.84s   __getitem__ (swsscommon/swsscommon.py:345)
  0.00%   0.00%    2.76s    37.93s   get_all (swsscommon/swsscommon.py:1706)
  0.00%   0.00%    1.59s    26.93s   _update_per_namespace_data (sonic_ax_impl/mibs/ieee802_1ab.py:546)
  0.00%   0.00%    1.47s     6.11s   _ip_int_from_string (ipaddress.py:1138)
  0.00%   0.00%    1.41s     6.93s   ip2byte_tuple (ax_interface/util.py:109)
  0.00% 100.00%    1.31s     3477s   _update_per_namespace_data (sonic_ax_impl/mibs/ieee802_1ab.py:552)
  0.00%   0.00%    1.22s     2.26s   poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:82)
  0.00%   0.00%    1.14s     1.14s   _swig_setattr_nondynamic (swsscommon/swsscommon.py:55)
  0.00%   0.00%    1.11s     1.11s   get_index_from_str (swsssdk/port_util.py:58)
  0.00%   0.00%    1.06s    41.10s   update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:504)
  0.00%   0.00%   0.940s    12.15s   poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:77)
  0.00%   0.00%   0.910s    38.85s   dbs_get_all (sonic_ax_impl/mibs/__init__.py:628)
  0.00%   0.00%   0.860s    0.860s   _compile (re.py:273)
  0.00%   0.00%   0.730s    0.730s   _swig_setattr_nondynamic (swsscommon/swsscommon.py:56)
  0.00%   0.00%   0.710s    0.710s   __bool__ (swsscommon/swsscommon.py:331)
  0.00%   0.00%   0.630s    0.630s   get_index_from_str (swsssdk/port_util.py:55)
  0.00%   0.00%   0.590s     7.17s   __init__ (ipaddress.py:1301)
  0.00%   0.00%   0.560s    0.560s   _compile (re.py:276)
  0.00%   0.00%   0.540s     9.13s   poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:90)
  0.00%   0.00%   0.530s    0.530s   keys (swsscommon/swsscommon.py:354)
  0.00%   0.00%   0.480s     4.02s   get_subtype (sonic_ax_impl/mibs/ieee802_1ab.py:595)
  0.00%   0.00%   0.460s     2.59s   _swig_setattr (swsscommon/swsscommon.py:71)
  0.00%   0.00%   0.460s     3.08s   <lambda> (swsscommon/swsscommon.py:317)
  0.00%   0.00%   0.450s    0.450s   _parse_octet (ipaddress.py:1168)
  0.00%   0.00%   0.410s     1.12s   poll_lldp_entry_updates (sonic_ax_impl/mibs/ieee802_1ab.py:78)
  0.00%   0.00%   0.360s     5.94s   get_index_from_str (swsssdk/port_util.py:56)
  0.00%   0.00%   0.340s     4.44s   ip_address (ipaddress.py:44)
  0.00%   0.00%   0.330s    0.330s   dbs_get_all (sonic_ax_impl/mibs/__init__.py:630)
  0.00%   0.00%   0.310s    0.310s   get_index_from_str (swsssdk/port_util.py:48)
  0.00%   0.00%   0.300s    0.300s   _ip_int_from_string (ipaddress.py:1133)
  0.00%   0.00%   0.300s    0.300s   _parse_octet (ipaddress.py:1172)
  0.00%   0.00%   0.290s    0.290s   lldp_entry_table (sonic_ax_impl/mibs/__init__.py:143)
  0.00%   0.00%   0.290s    0.580s   packed (ipaddress.py:1306)
  0.00%   0.00%   0.270s    0.270s   v4_int_to_packed (ipaddress.py:137)
  0.00%   0.00%   0.270s    0.270s   get_index_from_str (swsssdk/port_util.py:52)
  0.00%   0.00%   0.270s    0.270s   dbs_get_all (sonic_ax_impl/mibs/__init__.py:622)

Traige:

On further troubleshooting, issue was traced down to LLDPRemManAddrUpdater MIB inside the snmp-subagent. This MIB unlike most of the others uses redis key_space notifications to update it's internal cache. It subscribes to the notifications for "LLDP_ENTRY_TABLE*". These keys on the other hand are del & set for every 10 secs. Check lldp_syncd logic for more info.

Each of the set operation on these keys are result in 11 notifications, These (12xnum_interfacesxtime/10) notifications accumulate inside the buffer and once the asyncio schedules the LLDPRemManAddrUpdater MIB, the update_date logic fetches the key-value pairs from redis and updates the internal cache once for every notification seen resulting in the spike

All of these older notifications are redundant and can be skipped.

I've thus tested this prototype, which seemed to get down the CPU usage. If deemed fit, this can be used as a good starting point towards the full solution.

diff --git a/src/sonic_ax_impl/mibs/ieee802_1ab.py b/src/sonic_ax_impl/mibs/ieee802_1ab.py
index 8526352..6d31b97 100644
--- a/src/sonic_ax_impl/mibs/ieee802_1ab.py
+++ b/src/sonic_ax_impl/mibs/ieee802_1ab.py
@@ -96,6 +96,19 @@ def poll_lldp_entry_updates(pubsub):
         return ret
     return data, interface, if_index

+def get_latest_update(pubsub):
+    """
+    Fetches the latest notification recorded on a lldp entry
+    """
+    latest_update_map = {}
+    while True:
+        data, interface, if_index = poll_lldp_entry_updates(pubsub)
+        if not data:
+            break
+        latest_update_map[interface] = (data, if_index)
+    return latest_update_map
+
 def parse_sys_capability(sys_cap):
     return bytearray([int (x, 16) for x in sys_cap.split()])

@@ -542,18 +555,16 @@ class LLDPRemManAddrUpdater(MIBUpdater):
         """
         Listen to updates in APP DB, update local cache
         """
-        while True:
-            data, interface, if_index = poll_lldp_entry_updates(pubsub)
-
-            if not data:
-                break
-
+        event_cache = get_latest_update(pubsub)
+        for interface in event_cache.keys():
+            data = event_cache[interface][0]
+            if_index = event_cache[interface][1]
+
             if "set" in data:
                 self.update_rem_if_mgmt(if_index, interface)
             elif "del" in data:
-                # some remote data about that neighbor is gone, del it and try to query again
+                # if del is the latest notification, then just delete it from the local cache
                 self.if_range = [sub_oid for sub_oid in self.if_range if sub_oid[0] != if_index]
-                self.update_rem_if_mgmt(if_index, interface)

     def update_data(self):
         for i in range(len(self.db_conn)):

Note: This event-notification logic is also seen in LocPortUpdater MIB but this issue is not seen there because the PORT table is not updated with the frequency that the LLDP_ENTRY is updated with. So, having this fixed as well might also help.

Describe the results you expected:

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@qiluo-msft
Copy link
Collaborator

Thanks for the report, analysis and proposal.

These keys on the other hand are del & set for every 10 secs. Check lldp_syncd logic for more info

Is it possible to improve upstream lldp_syncd first? If it only apply the delta into Redis, does snmpagent work well?

@vivekrnv
Copy link
Contributor Author

It is possible to optimize lldp_syncd but we would still get atleast num_lldp_neigh notifications for every 10 secs. So, snmp-sub agent will still have to be modified

@qiluo-msft
Copy link
Collaborator

Please submit a PR based on "Traige", we can continue discuss on the PR context.

@zhangyanzhao zhangyanzhao added the Triaged this issue has been triaged label Aug 4, 2021
qiluo-msft pushed a commit to sonic-net/sonic-snmpagent that referenced this issue Aug 5, 2021
…230)

**- What I did**

Fixes [#8293](sonic-net/sonic-buildimage#8293)

**- How I did it**

Accumulated all the older notifications and did act only upon the latest notification discarding the others
qiluo-msft pushed a commit to sonic-net/sonic-snmpagent that referenced this issue Aug 5, 2021
…230)

**- What I did**

Fixes [#8293](sonic-net/sonic-buildimage#8293)

**- How I did it**

Accumulated all the older notifications and did act only upon the latest notification discarding the others
judyjoseph pushed a commit to sonic-net/sonic-snmpagent that referenced this issue Nov 14, 2021
…230)

**- What I did**

Fixes [#8293](sonic-net/sonic-buildimage#8293)

**- How I did it**

Accumulated all the older notifications and did act only upon the latest notification discarding the others
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Triaged this issue has been triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants