Skip to content

Commit 0c399dc

Browse files
authored
Fix snmp agent not-responding issue when high CPU utilization (#346)
<!-- Please make sure you've read and understood our contributing guidelines; https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md ** Make sure all your commits include a signature generated with `git commit -s` ** If this is a bug fix, make sure your description includes "fixes #xxxx", or "closes #xxxx" Please provide the following information: --> Fix sonic-net/sonic-buildimage#21314 The SNMP agent and MIB updaters are basic on Asyncio/Coroutine, the mib updaters share the same Asyncio event loop with the SNMP agent client. Hence during the updaters executing, the agent client can't receive/respond to new requests. When the CPU utilization is high (In some stress test we make CPU 100% utilization), the updates are slow, and this causes the snmpd request to be timeout because the agent got suspended during updating. **- What I did** Decrease the MIB update frequency when the update execution is slow. pros: 1. The snmp request can success even if 100% CPU utilization. 2. The snmpd request seldomly fails due to timeout, combined with sonic-net/sonic-buildimage#21316 , we have 4*5 = 20s time windows for the SNMP agent to wait for the MIB updates finish and respond to snmpd request. 3. Relief the CPU cost when CPU is high, the can avoid CPU becomes more crowded. cons: 1. Tested on pizzabox (4600c), the updaters are very fast, generally finished within 0.001~0.02s, the chagne won't actually affect the frequency and interval. 2. On Cisco chassis, the update of SNMP data could be delayed for 10~20s(at most 60s in extreme situation). Per my oberservation, most of the updater finishes within 0.5s. But for 1.a ciscoSwitchQosMIB.QueueStatUpdater, generally finishs in 0.5-2s, expected to be delayed to 5-20s 1.b PhysicalTableMIBUpdater, generally finishs in 0.5-1.5s, expected to be delayed to 5-1.5s 1.c ciscoPfcExtMIB.PfcPrioUpdater, generally finishs in 0.5-3s, expected to be delayed to 5-30s **- How I did it** In get_next_update_interval, we compute the interval based on current execution time. Roughly, we make the 'update interval'/'update execution time' >= UPDATE_FREQUENCY_RATE(10) More specifically, if the execution time is 2.000001s, we sleep 21s before next update round. And the max interval won't be longer than MAX_UPDATE_INTERVAL(60s). **- How to verify it** Test on Cisco chassis, test_snmp_cpu.py which triggers 100% CPU utilization test whether snmp requests work well. **- Description for the changelog** <!-- Write a short (one line) summary that describes the changes in this pull request for inclusion in the changelog: -->
1 parent c5301b2 commit 0c399dc

File tree

11 files changed

+330
-10
lines changed

11 files changed

+330
-10
lines changed

src/ax_interface/agent.py

+2-2
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@
88

99

1010
class Agent:
11-
def __init__(self, mib_cls, update_frequency, loop):
11+
def __init__(self, mib_cls, enable_dynamic_frequency, update_frequency, loop):
1212
if not type(mib_cls) is MIBMeta:
1313
raise ValueError("Expected a class with type: {}".format(MIBMeta))
1414

@@ -20,7 +20,7 @@ def __init__(self, mib_cls, update_frequency, loop):
2020
self.stopped = asyncio.Event()
2121

2222
# Initialize our MIB
23-
self.mib_table = MIBTable(mib_cls, update_frequency)
23+
self.mib_table = MIBTable(mib_cls, enable_dynamic_frequency, update_frequency)
2424

2525
# containers
2626
self.socket_mgr = SocketManager(self.mib_table, self.run_enabled, self.loop)

src/ax_interface/constants.py

+5
Original file line numberDiff line numberDiff line change
@@ -90,3 +90,8 @@ class PduTypes(int, Enum):
9090

9191

9292
DEFAULT_PDU_TIMEOUT = 5
93+
94+
# MIBUpdater rate: Interval/Execution time
95+
UPDATE_FREQUENCY_RATE = 10
96+
# MIBUpdater max update interval
97+
MAX_UPDATE_INTERVAL = 60

src/ax_interface/mib.py

+42-2
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,12 @@
11
import asyncio
22
import bisect
33
import random
4+
from datetime import datetime
45

56
from . import logger, util
67
from .constants import ValueType
78
from .encodings import ValueRepresentation
9+
from .util import get_next_update_interval
810

911
"""
1012
Update interval between update runs (in seconds).
@@ -16,6 +18,11 @@
1618
"""
1719
DEFAULT_REINIT_RATE = 60
1820

21+
"""
22+
Disable dynamic frequency by default
23+
"""
24+
DEFAULT_ENABLE_DYNAMIC_FREQUENCY = False
25+
1926

2027
class MIBUpdater:
2128
"""
@@ -25,13 +32,15 @@ class MIBUpdater:
2532
def __init__(self):
2633
self.run_event = asyncio.Event()
2734
self.frequency = DEFAULT_UPDATE_FREQUENCY
35+
self.enable_dynamic_frequency = DEFAULT_ENABLE_DYNAMIC_FREQUENCY
2836
self.reinit_rate = DEFAULT_REINIT_RATE // DEFAULT_UPDATE_FREQUENCY
2937
self.update_counter = self.reinit_rate + 1 # reinit_data when init
3038

3139
async def start(self):
3240
# Run the update while we are allowed
3341
redis_exception_happen = False
3442
while self.run_event.is_set():
43+
start = datetime.now()
3544
try:
3645
# reinit internal structures
3746
if self.update_counter > self.reinit_rate:
@@ -57,9 +66,36 @@ async def start(self):
5766
# Any unexpected exception or error, log it and keep running
5867
logger.exception("MIBUpdater.start() caught an unexpected exception during update_data()")
5968

69+
if self.enable_dynamic_frequency:
70+
"""
71+
On SONiC device with huge interfaces
72+
for example RP card on ethernet chassis, including backend ports, 600+ interfaces
73+
The update_data function could be very slow, especially when 100% CPU utilization.
74+
for example ciscoSwitchQosMIB.QueueStatUpdater, uses 1-3 seconds on normal state.
75+
uses 3-8 seconds on 100% CPU utilization state.
76+
We use Asyncio/Coroutine as the basic framework,
77+
the mib updaters share the same asyncio event loop with the SNMP agent client.
78+
Hence during the updaters executing, the agent client can't receive/respond to new requests,
79+
80+
The high frequency and the long execution time
81+
causes the SNMP request to be timed out on High CPU utilization.
82+
The stable frequency(generally with default value 5s)
83+
doesn't works well on this huge interfaces situation.
84+
when the execution time is long,
85+
wait for longer time to give back the control of asyncio event loop to SNMP agent
86+
"""
87+
execution_time = (datetime.now() - start).total_seconds()
88+
next_frequency = get_next_update_interval(execution_time, self.frequency)
89+
90+
if next_frequency > self.frequency:
91+
logger.debug(f"MIBUpdater type[{type(self)}] slow update detected, "
92+
f"update execution time[{execution_time}], next_frequency[{next_frequency}]")
93+
else:
94+
next_frequency = self.frequency
95+
6096
# wait based on our update frequency before executing again.
6197
# randomize to avoid concurrent update storms.
62-
await asyncio.sleep(self.frequency + random.randint(-2, 2))
98+
await asyncio.sleep(next_frequency + random.randint(-2, 2))
6399

64100
def reinit_data(self):
65101
"""
@@ -275,10 +311,13 @@ class MIBTable(dict):
275311
Simplistic LUT for Get/GetNext OID. Interprets iterables as keys and implements the same interfaces as dict's.
276312
"""
277313

278-
def __init__(self, mib_cls, update_frequency=DEFAULT_UPDATE_FREQUENCY):
314+
def __init__(self, mib_cls,
315+
enable_dynamic_frequency=DEFAULT_ENABLE_DYNAMIC_FREQUENCY,
316+
update_frequency=DEFAULT_UPDATE_FREQUENCY):
279317
if type(mib_cls) is not MIBMeta:
280318
raise ValueError("Supplied object is not a MIB class instance.")
281319
super().__init__(getattr(mib_cls, MIBMeta.KEYSTORE))
320+
self.enable_dynamic_frequency = enable_dynamic_frequency
282321
self.update_frequency = update_frequency
283322
self.updater_instances = getattr(mib_cls, MIBMeta.UPDATERS)
284323
self.prefixes = getattr(mib_cls, MIBMeta.PREFIXES)
@@ -296,6 +335,7 @@ def start_background_tasks(self, event):
296335
tasks = []
297336
for updater in self.updater_instances:
298337
updater.frequency = self.update_frequency
338+
updater.enable_dynamic_frequency = self.enable_dynamic_frequency
299339
updater.run_event = event
300340
fut = asyncio.ensure_future(updater.start())
301341
fut.add_done_callback(MIBTable._done_background_task_callback)

src/ax_interface/util.py

+29
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
import ipaddress
2+
import math
23
import re
34

45
from ax_interface import constants
@@ -108,3 +109,31 @@ def ip2byte_tuple(ip):
108109
"""
109110
return tuple(i for i in ipaddress.ip_address(ip).packed)
110111

112+
113+
def get_next_update_interval(execution_time, static_frequency):
114+
"""
115+
>>> get_next_update_interval(0.4, 5)
116+
5
117+
>>> get_next_update_interval(0.87, 5)
118+
9
119+
>>> get_next_update_interval(18.88, 5)
120+
60
121+
122+
123+
:param static_frequency: Static frequency, generally use default value 5
124+
:param execution_time: The execution time of the updater
125+
:return: the interval before next update
126+
127+
We expect the rate of 'update interval'/'update execution time' >= UPDATE_FREQUENCY_RATE(10)
128+
Because we're using asyncio/Coroutines, the update execution blocks SNMP proxy service and other updaters.
129+
Generally we expect the update to be quick and the execution time/interval time < 0.25
130+
Given the static_frequency == 5,
131+
if the execution_time < 0.5,
132+
the update interval is(for example) 1.1s
133+
It sleeps 1.1s * 10 = 11s before run next update
134+
135+
"""
136+
frequency_based_on_execution_time = math.ceil(execution_time * constants.UPDATE_FREQUENCY_RATE)
137+
frequency_based_on_execution_time = min(frequency_based_on_execution_time, constants.MAX_UPDATE_INTERVAL)
138+
139+
return max(static_frequency, frequency_based_on_execution_time)

src/sonic_ax_impl/__main__.py

+4-3
Original file line numberDiff line numberDiff line change
@@ -5,11 +5,12 @@
55
import shutil
66
import sys
77

8-
import sonic_py_common.util
98
from swsscommon import swsscommon
109

1110
import ax_interface
1211
import sonic_ax_impl
12+
13+
from .utils.arg_parser import process_options
1314
from . import mibs
1415

1516
LOG_FORMAT = "snmp-subagent [%(name)s] %(levelname)s: %(message)s"
@@ -70,7 +71,7 @@ def logging_level_to_swss_level(log_level):
7071
sys.exit(0)
7172

7273
# import command line arguments
73-
args = sonic_py_common.util.process_options("sonic_ax_impl")
74+
args = process_options("sonic_ax_impl")
7475

7576
# configure logging. If debug '-d' is specified, logs to stdout at designated level. syslog/INFO otherwise.
7677
log_level = log_level_sdk = args.get('log_level')
@@ -110,4 +111,4 @@ def logging_level_to_swss_level(log_level):
110111

111112
from .main import main
112113

113-
main(update_frequency=args.get('update_frequency'))
114+
main(enable_dynamic_frequency=args.get('enable_dynamic_frequency'), update_frequency=args.get('update_frequency'))

src/sonic_ax_impl/main.py

+2-2
Original file line numberDiff line numberDiff line change
@@ -54,14 +54,14 @@ def shutdown(signame, agent):
5454
shutdown_task = event_loop.create_task(agent.shutdown())
5555

5656

57-
def main(update_frequency=None):
57+
def main(enable_dynamic_frequency=False, update_frequency=None):
5858
global event_loop
5959

6060
try:
6161
Namespace.init_sonic_db_config()
6262

6363
# initialize handler and set update frequency (or use the default)
64-
agent = ax_interface.Agent(SonicMIB, update_frequency or DEFAULT_UPDATE_FREQUENCY, event_loop)
64+
agent = ax_interface.Agent(SonicMIB, enable_dynamic_frequency, update_frequency or DEFAULT_UPDATE_FREQUENCY, event_loop)
6565

6666
# add "shutdown" signal handlers
6767
# https://docs.python.org/3.5/library/asyncio-eventloop.html#set-signal-handlers-for-sigint-and-sigterm

src/sonic_ax_impl/utils/__init__.py

Whitespace-only changes.

src/sonic_ax_impl/utils/arg_parser.py

+39
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
from __future__ import print_function
2+
import sys
3+
from getopt import getopt
4+
5+
6+
def usage(script_name):
7+
print('Usage: python ', script_name,
8+
'-t [host] -p [port] -s [unix_socket_path] -d [logging_level] -f [update_frequency] -r [enable_dynamic_frequency] -h [help]')
9+
10+
11+
def process_options(script_name):
12+
"""
13+
Process command line options
14+
"""
15+
options, remainders = getopt(sys.argv[1:], "t:p:s:d:f:rh", ["host=", "port=", "unix_socket_path=", "debug=", "frequency=", "enable_dynamic_frequency", "help"])
16+
17+
args = {}
18+
for (opt, arg) in options:
19+
try:
20+
if opt in ('-d', '--debug'):
21+
args['log_level'] = int(arg)
22+
elif opt in ('-t', '--host'):
23+
args['host'] = arg
24+
elif opt in ('-p', '--port'):
25+
args['port'] = int(arg)
26+
elif opt in ('-s', '--unix_socket_path'):
27+
args['unix_socket_path'] = arg
28+
elif opt in ('-f', '--frequency'):
29+
args['update_frequency'] = int(arg)
30+
elif opt in ('-r', '--enable_dynamic_frequency'):
31+
args['enable_dynamic_frequency'] = True
32+
elif opt in ('-h', '--help'):
33+
usage(script_name)
34+
sys.exit(0)
35+
except ValueError as e:
36+
print('Invalid option for {}: {}'.format(opt, e))
37+
sys.exit(1)
38+
39+
return args

tests/test_agent.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,6 @@ async def delayed_shutdown(self, agent):
2222

2323
def test_agent_loop(self):
2424
event_loop = asyncio.get_event_loop()
25-
agent = ax_interface.Agent(SonicMIB, 5, event_loop)
25+
agent = ax_interface.Agent(SonicMIB, False, 5, event_loop)
2626
event_loop.create_task(self.delayed_shutdown(agent))
2727
event_loop.run_until_complete(agent.run_in_event_loop())

0 commit comments

Comments
 (0)