SNMP: snmpwalk is slow and can timeout #404

Open
talyacs opened this issue Jan 17, 2023 · 21 comments
Comments

@talyacs

talyacs commented Jan 17, 2023

Hi.

Running snmpwalk takes a long time to complete, and if there is a large amount of data it can also time out.
By looking at the clixon logs I saw that most of the delay happens when the walk reaches ifTestTable.

We suspect that the general delay is caused by traversing the YANG files in real time, and that the ifTestTable delay is caused by ignoring the deprecated status, which is, according to the documentation, currently not implemented in clixon.

@olofhagsand
Member

This is typically best observed by running clixon-snmp with the IFMIB walk in a profiler.
Also, it is unclear to me whether one should (always) skip deprecated nodes in a walk; RFC 7950 is unclear:

   o  "current" means that the definition is current and valid.

   o  "deprecated" indicates an obsolete definition, but it permits
      new/continued implementation in order to foster interoperability
      with older/existing implementations.

   o  "obsolete" means that the definition is obsolete and SHOULD NOT be
      implemented and/or can be removed from implementations.

One could add such an SNMP-specific option to skip deprecated YANG nodes.
However, I suspect it may have to do with a suboptimal (non-linear) complexity of the interface traversal.
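As an illustration of the policy question (a toy model, not clixon's yang_stmt API): a table walk could treat node status per RFC 7950, with "obsolete" always skipped and "deprecated" skipped only behind an option.

```c
#include <assert.h>

/* Illustration only: toy model of status handling during a table walk.
 * Real clixon YANG objects are not used here. */
enum status { ST_CURRENT, ST_DEPRECATED, ST_OBSOLETE };

struct node {
    const char *name;
    enum status status;
};

/* Return the index of the next node at or after i that the walk should
 * visit, or -1 at end. "obsolete" SHOULD NOT be implemented (RFC 7950);
 * whether "deprecated" is skipped is a policy choice (skip_deprecated). */
static int
walk_next(const struct node *nodes, int n, int i, int skip_deprecated)
{
    for (; i < n; i++) {
        if (nodes[i].status == ST_OBSOLETE)
            continue;
        if (skip_deprecated && nodes[i].status == ST_DEPRECATED)
            continue;
        return i;
    }
    return -1;
}
```

With skip_deprecated off, deprecated tables such as ifTestTable would still be traversed (and still cost time); with it on, they are passed over.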

@talyacs
Author

talyacs commented Jan 17, 2023

I'm currently trying to reproduce this issue on a smaller setup, I'll update with more details if needed when done.

@olofhagsand
Member

I extended the clixon regression test test/test_snmp_ifmib.sh by just adding eight more interfaces, with no significant delay:

olof@alarik> time /usr/local/bin/snmpwalk -c public -v2c localhost  IF-MIB::ifTable > /dev/null
real    0m1,500s
user    0m0,026s
sys     0m0,019s

That is 1.5 s on an old x86. In other words, I cannot reproduce it.
Waiting for more info: e.g., a smaller setup (see above) or something based on the existing test_snmp_ifmib.sh?

@shmuelhazan
Contributor

shmuelhazan commented Jan 29, 2023

I did some more research into that. It looks like it is a combination of multiple issues:

  1. Sending a GET_NEXT request to net-snmp will cause snmp_table_getnext to be called for each possible leaf in the table; if it does not exist, snmp_table_getnext will return -1, causing net-snmp to call clixon_snmp_table_handler again, requesting the next registered OID.
  2. snmp_table_getnext tries to match the requested OID in O(N*M), where N is the number of entries in the table and M is the number of columns. Each snmp_table_getnext call from item 1 above hits the worst case, since it must loop over all the possibilities to confirm that the requested OID does not exist.
  3. According to multiple profiling runs we did (with perf and callgrind), most of the CPU time inside snmp_table_getnext is spent in yang_find, which is called between 3K and 10K times per snmp_table_getnext call (for IF-MIB.yang).
  4. Each call to snmp_table_getnext also triggers a get RPC to the clixon_backend, which can trigger statedata callbacks that increase the time even more.
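The quadratic blow-up described in items 1 and 2 can be reproduced with a self-contained toy model (not the clixon code itself): each GET_NEXT scans every cell of the table, and a full walk issues roughly one GET_NEXT per cell, so total work grows with the square of the table size.

```c
#include <assert.h>

/* Illustration only: counts OID comparisons for a full walk of a toy
 * table using a linear scan per GET_NEXT, as snmp_table_getnext is
 * described above. Cells are modeled as integers 0..N*M-1 in lexical
 * OID order. */
static long
walk_comparisons(int rows, int cols)
{
    long total = 0;
    int ncells = rows * cols;
    int cur = -1;             /* the walk starts before the first cell */
    while (cur < ncells) {
        /* one GET_NEXT: scan every cell for the smallest one > cur */
        int next = ncells;    /* "end of table" sentinel */
        for (int c = 0; c < ncells; c++) {
            total++;
            if (c > cur && c < next)
                next = c;
        }
        cur = next;
    }
    return total;            /* (ncells + 1) scans of ncells cells each */
}
```

Doubling the number of rows quadruples the comparison count; for, say, 100 interfaces across the ifTable columns, this puts the walk in the millions of comparisons before any backend RPC cost is counted.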

@olofhagsand
Member

This patch hardcodes the prefix:oid usage instead of using a generic function. It removes the large number of yang_find calls.
Can you please try it out?

diff --git a/apps/snmp/snmp_lib.c b/apps/snmp/snmp_lib.c
index 6ccb1734..e9dc962d 100644
--- a/apps/snmp/snmp_lib.c
+++ b/apps/snmp/snmp_lib.c
@@ -335,6 +335,44 @@ snmp_yang_type_get(yang_stmt  *ys,
     return retval;
 }
 
+/*! Specialized and optimized variant of yang_extension_value
+ */
+static int
+yang_extension_value2(yang_stmt *ys,
+                      char      *id,
+                      int       *exist,
+                      char     **value)
+{
+    int        retval = -1;
+    yang_stmt *yext;
+    cg_var    *cv;
+
+    if (ys == NULL){
+        clicon_err(OE_YANG, EINVAL, "ys is NULL");
+        goto done;
+    }
+    if (exist)
+        *exist = 0;
+    yext = NULL; /* This loop gets complicated in the case the extension is augmented */
+    while ((yext = yn_each(ys, yext)) != NULL) {
+        if (yang_keyword_get(yext) != Y_UNKNOWN)
+            continue;
+        if (strcmp(yang_argument_get(yext), id) != 0)
+            continue;
+        break;
+    }
+    if (yext != NULL){ /* Found */
+        if (exist)
+            *exist = 1;
+        if (value &&
+            (cv = yang_cv_get(yext)) != NULL)
+            *value = cv_string_get(cv);
+    }
+    retval = 0;
+ done:
+    return retval;
+}
+
 /*! Given a YANG node, return SMIv2 oid extension as OID 
  * @param[in]  yn        Yang node
  * @param[out] objid     OID vector, assume allocated with MAX_OID_LEN > oidlen
@@ -361,8 +399,10 @@ yangext_oid_get(yang_stmt *yn,
     }
     else
         yref = yn;
-    /* Get OID from table /list  */
-    if (yang_extension_value(yref, "oid", IETF_YANG_SMIV2_NS, &exist, &oidstr) < 0)
+    /* Get OID from table /list  
+     * @note: prefix hardcoded to smiv2:
+     */
+    if (yang_extension_value2(yref, "smiv2:oid", &exist, &oidstr) < 0)
         goto done;
     if (exist == 0 || oidstr == NULL){
         clicon_debug(1, "OID not found as SMIv2 yang extension of %s", yang_argument_get(yref));

@shmuelhazan
Contributor

shmuelhazan commented Jan 30, 2023

@olofhagsand I see a significant performance improvement:
Many snmp_table_getnext calls now do not call yang_find at all; others call it far fewer times (from under 100 up to about 1K).
In terms of actual time, snmpwalk on one of our machines used to take ~19s; after the patch it takes ~7s.

olofhagsand added a commit that referenced this issue Feb 1, 2023
@olofhagsand
Member

Fixed by optimizing yang_extension_value() into a more specialized version, yang_extension_value_opt(), specific to the SNMP code.
The optimization relies on the fact that no SMI extensions (e.g. smiv2:oid) are augmented; they reside in the original YANG (i.e. the YANG generated from SMI).
Please verify the patch.

@talyacs
Author

talyacs commented Feb 7, 2023

I verified this patch with the latest commit and we did get an improvement (from ~19 seconds to ~7 seconds).
I think this is reasonable enough to close this issue.

@dima1308

@olofhagsand we are currently testing SNMP performance.
We saw an improvement, but sometimes the response still comes slowly. In the attached capture below, the delay between the SNMP request and the response is 4.41 seconds, which is very close to the timeout (5 seconds is the default). Is it possible to improve the performance more?
[screenshot: packet capture showing the 4.41 s request/response delay]

@dima1308

dima1308 commented Aug 7, 2023

@olofhagsand we are getting SNMP timeouts when running snmpwalk. May I ask you to look at it?

@olofhagsand
Member

@dima1308 Has there been some change in behavior or is it the same result as in Feb 13?

@dima1308

dima1308 commented Aug 7, 2023

@dima1308 Has there been some change in behavior or is it the same result as in Feb 13?

@olofhagsand No changes in the behavior. The problem is that now we have larger setups where the issue is reproduced all the time:
[attached screenshot]

@olofhagsand
Member

olofhagsand commented Aug 7, 2023

Can one control the timeout as a workaround? Who sets the timeout?
"port unreachable" is pretty low-level. Is this due to a timeout?

@dima1308

dima1308 commented Aug 7, 2023

"port unreachable" is pretty low-level. Is this due to a timeout?

Yes, if I increase the timeout on snmpwalk (using the -t command-line argument), the timeout issue does not happen.

Can one control the timeout as a workaround? Who sets the timeout?

In some cases, when customers use our in-house management stations, the issue does not occur (our management station does not do an snmpwalk).
But in most cases, our equipment is integrated into larger systems controlled by third-party management stations (and then we have no easy way to know what the management station does).
Sometimes we can tell customers to increase the timeout; in other cases, customers use the management system as a black box and do not know how to change timeouts.

@shmuelhazan
Contributor

I did some more research into that. It looks like it is a combination of multiple issues:

  1. Sending a GET_NEXT request to net-snmp will cause snmp_table_getnext to be called for each possible leaf in the table; if it does not exist, snmp_table_getnext will return -1, causing net-snmp to call clixon_snmp_table_handler again, requesting the next registered OID.
  2. snmp_table_getnext tries to match the requested OID in O(N*M), where N is the number of entries in the table and M is the number of columns. Each snmp_table_getnext call from item 1 above hits the worst case, since it must loop over all the possibilities to confirm that the requested OID does not exist.
  3. According to multiple profiling runs we did (with perf and callgrind), most of the CPU time inside snmp_table_getnext is spent in yang_find, which is called between 3K and 10K times per snmp_table_getnext call (for IF-MIB.yang).
  4. Each call to snmp_table_getnext also triggers a get RPC to the clixon_backend, which can trigger statedata callbacks that increase the time even more.

What is written here is still correct (apart from 3). Maybe it would be best to begin with the low-hanging fruit, like caching "get" calls for, say, a second.
Also, in snmp_table_getnext, it might be possible to cache the data in a hashmap or similar in order to speed up the lookups, which are currently linear.
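A minimal sketch of the "cache get calls for a second" idea (illustration only; a real implementation in clixon would need to key the cache on the request and invalidate it on configuration changes):

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Illustration only: a one-entry cache with a short TTL for backend
 * "get" results. Toy fixed-size buffer; no request keying. */
struct get_cache {
    char   data[256];  /* cached reply */
    time_t stamp;      /* when it was stored; 0 = empty */
    int    ttl;        /* seconds the entry stays valid */
};

/* Return the cached data if still fresh, else NULL (caller re-fetches). */
static const char *
cache_lookup(const struct get_cache *c, time_t now)
{
    if (c->stamp == 0 || now - c->stamp >= c->ttl)
        return NULL;
    return c->data;
}

static void
cache_store(struct get_cache *c, const char *data, time_t now)
{
    strncpy(c->data, data, sizeof(c->data) - 1);
    c->data[sizeof(c->data) - 1] = '\0';
    c->stamp = now;
}
```

With a 1 s TTL, a walk that issues hundreds of GET_NEXTs in quick succession would pay for only one backend get RPC per second, at the price of serving data that may be up to a second stale.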

@olofhagsand
Member

Yes, I think caching is a good way forward. However, caching is seldom trivial, due to stale entries, timeouts, memory usage, etc.

@dima1308

@olofhagsand Are there any plans to improve SNMP performance?

In our system, when NMS stations run SNMP queries, CPU usage spikes to 100%, making SNMP unusable.

@olofhagsand
Member

No current plans.

@olofhagsand
Member

What is required as a first step is profiling of a repeatable, specific use case. I see some profiling was done earlier; is it still valid?

@dima1308

@olofhagsand It remains valid. In the thread above, you can find an analysis of the failure. From what I understand, the issue stems from the implementation of the SNMP get-next command, which consumes substantial CPU resources. During standard SNMP operations (such as get-bulk or walk), the get-next command is executed frequently, causing the CPU utilization to spike to 100% and making SNMP usage impractical. Could you please address this issue?

@olofhagsand
Member

Added a cache for getnext. Please verify.

dima1308 pushed a commit to dima1308/clixon that referenced this issue Dec 19, 2024