[SNMP-SUBAGENT] [202012] Observed a unusual spike in CPU usage because of blocking=True argument #8202

vivekrnv · 2021-07-16T22:51:06Z

Description

Steps to reproduce the issue:

Load any 202012 image which was was built with this change PR
Disable counterpoll, if not already disabled and reload the switch
Wait for the SNMP Container to be up.

Describe the results you received:

Check the CPU usage of the SNMP docker and snmp-subagent process

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
24b286433cb7        mgmt-framework      0.01%               108.2MiB / 7.689GiB   1.37%               0B / 0B             26.5MB / 69.6kB     19
78e05cfe9404        snmp                100.66%             68.76MiB / 7.689GiB   0.87%               0B / 0B             0B / 139kB          9
f88c433f5867        telemetry           0.89%               141.4MiB / 7.689GiB   1.80%               0B / 0B             36MB / 86kB         34
36437a3448cd        dhcp_relay          0.03%               30.47MiB / 7.689GiB   0.39%               0B / 0B             795kB / 73.7kB      6
ea34908916a6        radv                0.03%               29.96MiB / 7.689GiB   0.38%               0B / 0B             561kB / 77.8kB      6
dcdd0f45efb8        pmon                0.94%               212MiB / 7.689GiB     2.69%               0B / 0B             72.4MB / 1.21MB     16
8fa306e016cc        lldp                6.52%               59.86MiB / 7.689GiB   0.76%               0B / 0B             7.14MB / 127kB      11
e010a19edc0b        syncd               4.44%               813.2MiB / 7.689GiB   10.33%              0B / 0B             34.9MB / 69.6kB     41
6fc14c869be8        teamd               0.04%               33MiB / 7.689GiB      0.42%               0B / 0B             5.78MB / 90.1kB     12
fedda16a902e        swss                0.13%               54.29MiB / 7.689GiB   0.69%               0B / 0B             17.4MB / 246kB      37
84aaea73fd3f        bgp                 0.05%               71.48MiB / 7.689GiB   0.91%               0B / 0B             16.8MB / 5.83MB     22
29fbbe40ea00        database            5.78%               93.08MiB / 7.689GiB   1.18%               0B / 0B             40.8MB / 98.3kB     12

admin@sonic:~$ top
top - 08:06:58 up  4:47,  2 users,  load average: 2.00, 1.67, 1.13
Tasks: 306 total,   2 running, 302 sleeping,   0 stopped,   2 zombie
%Cpu(s): 15.9 us,  2.4 sy,  0.0 ni, 81.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   7873.9 total,   2911.0 free,   3284.7 used,   1678.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   4232.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  74572 root      20   0  153692  57784  13900 R 100.0   0.7  12:56.35 python3
   7819 root      20   0 1330816 822260  46216 S  24.2  10.2  49:31.42 sx_sdk

admin@sonic:~$ ps -aux | grep 74572
root       74572 99.1  0.7 153692 57784 pts/0    Rl   07:53  13:29 python3 -m sonic_ax_impl
admin      75908  0.0  0.0   6692   828 pts/1    S+   08:07   0:00 grep 74572

After doing some profiling,

GIL: 0.00%, Active: 100.00%, Threads: 2

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   319.0s    319.1s   get_all (swsscommon/swsscommon.py:1661)
  0.00%   0.00%    2.89s     2.89s   update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:542)

It is very clear that the get_all method of swsscommon.Sonicv2Connector is taking a significant amount of time, when used with blocking set to True (waits for around 60 sec if the data is not present in the Redis). And this seems to busy wait as inferred from the logic here LINK and thus the reason for CPU Usage spike.

This behavior is particular seen after that commit because, that commit essentially delays the initialization of FLEX_COUNTERS and thereby the COUNTERS:oid* until the counterpoll is enabled.

But the MIB classes in the snmp-subagent periodically poll these DB with blocking set to True.

As a fix, we can set the blocking set to False in all these MIB classes. This shouldn't affect how they operate as they periodically poll the redis anyway and if the data is missed at this instant, that can eventually be fetched within the next 3 to 7 second. {5 is the default period with a random delay of -2,2}

Describe the results you expected:

Shouldn't see the CPU Usage spike.

Output of `show version`:

(paste your output here)

Output of `show techsupport`:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

The text was updated successfully, but these errors were encountered:

qiluo-msft · 2021-07-20T12:02:31Z

How long is the "CPU Usage spike" starting from process invoke? If the timeframe is reasonable, does snmpagent work well after that as before?

zhangyanzhao · 2021-07-21T15:17:40Z

@qiluo-msft is checking

Fixes sonic-net/sonic-buildimage#8202 - Fix pubsub channel timeout unit - Reset context err before next redisGetReply, otherwise later redisGetReply early return on no data, and burn CPU core in a loop.

vivekrnv mentioned this issue Jul 17, 2021

[202012] CPU Usage Spike Issue Handled vivekrnv/sonic-snmpagent#1

Closed

zhangyanzhao added the Triaged this issue has been triaged label Jul 21, 2021

qiluo-msft mentioned this issue Jul 21, 2021

Fix DBInterface blocking operations sonic-net/sonic-swss-common#505

Merged

qiluo-msft closed this as completed in sonic-net/sonic-swss-common#505 Jul 22, 2021

vivekrnv mentioned this issue Jul 30, 2021

[SNMP-SUBAGENT] CPU Spike because of redundant and flooded keyspace notifications #8293

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SNMP-SUBAGENT] [202012] Observed a unusual spike in CPU usage because of blocking=True argument #8202

[SNMP-SUBAGENT] [202012] Observed a unusual spike in CPU usage because of blocking=True argument #8202

vivekrnv commented Jul 16, 2021

qiluo-msft commented Jul 20, 2021

zhangyanzhao commented Jul 21, 2021

[SNMP-SUBAGENT] [202012] Observed a unusual spike in CPU usage because of blocking=True argument #8202

[SNMP-SUBAGENT] [202012] Observed a unusual spike in CPU usage because of blocking=True argument #8202

Comments

vivekrnv commented Jul 16, 2021

Description

Steps to reproduce the issue:

Describe the results you received:

Describe the results you expected:

Output of show version:

Output of show techsupport:

Additional information you deem important (e.g. issue happens only occasionally):

qiluo-msft commented Jul 20, 2021

zhangyanzhao commented Jul 21, 2021

Output of `show version`:

Output of `show techsupport`: