From 007cdb6cf613d59a344c8fdeabecb9c424409227 Mon Sep 17 00:00:00 2001
From: Andrew Smith <als5ev@virginia.edu>
Date: Thu, 20 Jan 2022 11:02:40 -0800
Subject: [PATCH] Longhaul: Tolerance adjustments  (#5916)

## Overview
I split #5869 into two PRs. The first (#5902) has been merged. This is the second.

Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. I have added tolerances for the broker-enabled case too, just so we know we aren't causing any big regressions.

A lot of these failures already have associated PBIs. For these cases I have linked the PBI/bugs below in the tolerance description section.

In order to do this there were these steps:
1. Logic specific to each test report type to determine pass result from report data (i.e. test events + test conditions).
2. Modified tests as needed. No new tests added as I was thinking it is fine to verify manually. The unit tests are super valuable for this test framework when dealing with report generators (i.e. the classes that have complex logic to deal with multiple queues of test events). But since I didn't touch those I didn't think it was worth the opportunity cost to add further testing.

I currently have longhaul and nested longhaul running and will wait to make sure the tolerances behave as expected before merge.

## Tolerance Descriptions

**Custom Mqtt Telemetry Tolerance**
Overview:
- Sometimes we see message loss here so need a tolerance.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: We pass the tests as long as we get back 80% of messages. We sometimes have message loss, especially in middle->lower node telemetry.

PBI Required:
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**IoTHub Message Tolerance**
Overview:
- This only affects the broker. Sometimes we see message loss here so need a tolerance. When this problem occurs it affects < 10 messages.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: Pass the test if we receive 99% of messages.

PBI Required:
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**Twin Test Tolerance**
Overview:
- Sometimes we see some twin updates are missing in nested edge, but have still released in the past. This problem seems worse for non-broker vs broker-enabled, so it seems like a product issue we should followup on. I have only added a tolerance for the broker case so as not to mask this important issue.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]:  Two issues. 1) Missing some desired property updates coming through the sdk registered callback function. 2) Failing to update reported properties from the client. For both of these issues, we pass the tests if it happens less than 0.5% of the time.

PBI Required:
- No PBI needed to unwind this tolerance since it only affects the broker-enabled case. However it looks like there is a product issue for the non-broker case. I didn't add a tolerance here, as I think first we should do some investigation. It is worth running longhaul also on 1.2 to confirm behavior is the same. We logged a bug for a prior release [here](https://msazure.visualstudio.com/One/_workitems/edit/10039969), but now the problem may be more severe. We will need logs to investigate further.

**Direct Method Tolerance**
Overview:
- A lot of these tolerances existed previously. For resource errors and transient errors, the tolerance seemed too high so I made the tests behave more strictly. For DeviceNotFound exceptions, we are looking at the test conditions to specify the exact tolerance.

Tolerance description:
- [All-Cases] Status code 0: Fail the tests if we get > 1 in 1000 direct methods with status code 0.
- [All-Cases] Resource error: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Unauthorized: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Transient error: Fail the tests if we get > 1 in 1000 direct methods with transient error.
- [All-Cases] NotImplemented: Fail the tests if we get > 1 in 1000 direct methods with NotImplemented error.
- [Nested-Edge] [Broker-Enabled] DeviceNotFound: Fail the tests if we get > 1 in 400 direct methods with DeviceNotFound
- [Nested-Edge] [Non-Broker] DeviceNotFound: Fail the tests if we get > 1 in 250 direct methods with DeviceNotFound
- [Single-Node] DeviceNotFound: Fail the tests if we get > 1 in 200 direct methods with DeviceNotFound

PBI Required:
- Yes, only for the fact that in nested edge, broker-enabled gets way less DeviceNotFound exceptions compared to non-broker.  The bug we logged during a prior release exists [here](https://msazure.visualstudio.com/One/_workitems/edit/10149346). We will need logs to investigate further.

## Pipeline Confirmation
Here are the pipelines I have ran with these changes to confirm this will not regress any of our infra:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427019&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427258&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431644&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431669&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49431614&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49431628&view=results


## Azure IoT Edge PR checklist:
---
 .../Reports/CountingReport.cs                 | 45 ++++++++--
 .../LongHaul/DirectMethodLongHaulReport.cs    | 69 +++++++++++++--
 .../LegacyTwin/LegacyTwinReportGenerator.cs   | 83 +++++++++++++++++--
 3 files changed, 175 insertions(+), 22 deletions(-)

diff --git a/test/modules/TestResultCoordinator/Reports/CountingReport.cs b/test/modules/TestResultCoordinator/Reports/CountingReport.cs
index d15c5e5973c..148f3c7fde0 100644
--- a/test/modules/TestResultCoordinator/Reports/CountingReport.cs
+++ b/test/modules/TestResultCoordinator/Reports/CountingReport.cs
@@ -117,6 +117,13 @@ public CountingReport(
 
         public override bool IsPassed => this.IsPassedHelper();
 
+        // Tolerances are needed due to a combination of false-positive failures and real product-issues.
+        // Connectivity tolerances:
+        // - [All-Cases]: Fail the tests if we have > 20% missing C2D messages
+        // - [Nested-Edge] [Broker-Enabled]: Fail tests if we have > 10% missing custom mqtt messages
+        // Longhaul tolerances:
+        // - [Nested-Edge] [Broker-Enabled]: Fail the tests if we have > 20% missing custom mqtt messages
+        // - [Nested-Edge] [Broker-Enabled]: Fail the tests if we have > 1% missing iothub messages
         bool IsPassedHelper()
         {
             return this.TotalExpectCount > 0 && this.TotalDuplicateExpectedResultCount == 0 && this.EventHubSpecificReportComponents.Match(
@@ -126,19 +133,39 @@ bool IsPassedHelper()
                 },
                 () =>
                 {
-                    // Product issue for C2D messages connected to edgehub over mqtt.
-                    // We should remove this failure tolerance when fixed.
-                    if (this.TestDescription.Contains(C2dTestDescription))
+                    if (this.TestMode == TestMode.Connectivity)
                     {
-                        return ((double)this.TotalMatchCount / this.TotalExpectCount) > .8d;
-                    }
-                    else if (this.TestDescription == GenericMqttTelemetryTestDescription && this.TestMode == TestMode.Connectivity)
-                    {
-                        return ((double)this.TotalMatchCount / this.TotalExpectCount) > .9d;
+                        // Product issue for C2D messages connected to edgehub.
+                        if (this.TestDescription.Contains(C2dTestDescription))
+                        {
+                            return ((double)this.TotalMatchCount / this.TotalExpectCount) > .8d;
+                        }
+                        // Product issue for custom mqtt telemetry.
+                        else if (this.Topology == Topology.Nested && this.MqttBrokerEnabled && this.TestDescription == GenericMqttTelemetryTestDescription)
+                        {
+                            return ((double)this.TotalMatchCount / this.TotalExpectCount) > .9d;
+                        }
+                        else
+                        {
+                            return this.TotalExpectCount == this.TotalMatchCount;
+                        }
                     }
                     else
                     {
-                        return this.TotalExpectCount == this.TotalMatchCount;
+                        // Product issue for custom mqtt telemetry.
+                        if (this.Topology == Topology.Nested && this.MqttBrokerEnabled && this.TestDescription.Contains(GenericMqttTelemetryTestDescription))
+                        {
+                            return ((double)this.TotalMatchCount / this.TotalExpectCount) > .8d;
+                        }
+                        // Product issue for messages when broker is enabled.
+                        else if (this.Topology == Topology.Nested && this.MqttBrokerEnabled && this.TestDescription.Contains(MessagesTestDescription))
+                        {
+                            return ((double)this.TotalMatchCount / this.TotalExpectCount) > .99d;
+                        }
+                        else
+                        {
+                            return this.TotalExpectCount == this.TotalMatchCount;
+                        }
                     }
                 });
         }
diff --git a/test/modules/TestResultCoordinator/Reports/DirectMethod/LongHaul/DirectMethodLongHaulReport.cs b/test/modules/TestResultCoordinator/Reports/DirectMethod/LongHaul/DirectMethodLongHaulReport.cs
index 4fe3fa830dd..8b00c435f46 100644
--- a/test/modules/TestResultCoordinator/Reports/DirectMethod/LongHaul/DirectMethodLongHaulReport.cs
+++ b/test/modules/TestResultCoordinator/Reports/DirectMethod/LongHaul/DirectMethodLongHaulReport.cs
@@ -76,6 +76,16 @@ public DirectMethodLongHaulReport(
 
         public override bool IsPassed => this.IsPassedHelper();
 
+        // Tolerances are needed because sometimes test has a combination of false-positive failures.
+        // Here is a description of these tolerances:
+        // - [All-Cases] Status code 0: Fail the tests if we get > 1 in 1000 direct methods with status code 0.
+        // - [All-Cases] Resource error: Fail the tests if we get > 1 in 1000 direct methods with resource error.
+        // - [All-Cases] Unauthorized: Fail the tests if we get > 1 in 1000 direct methods with resource error.
+        // - [All-Cases] Transient error: Fail the tests if we get > 1 in 1000 direct methods with transient error.
+        // - [All-Cases] NotImplemented: Fail the tests if we get > 1 in 1000 direct methods with NotImplemented error.
+        // - [Nested-Edge] [Broker-Enabled] DeviceNotFound: Fail the tests if we get > 1 in 400 direct methods with DeviceNotFound
+        // - [Nested-Edge] [Non-Broker] DeviceNotFound: Fail the tests if we get > 1 in 250 direct methods with DeviceNotFound
+        // - [Single-Node] DeviceNotFound: Fail the tests if we get > 1 in 200 direct methods with DeviceNotFound
         bool IsPassedHelper()
         {
             if (this.Other.Sum(x => x.Value) > 0)
@@ -85,20 +95,65 @@ bool IsPassedHelper()
             }
 
             bool senderAndReceiverSuccessesPass = this.SenderSuccesses <= this.ReceiverSuccesses;
+            long allStatusCount = this.SenderSuccesses + this.StatusCodeZero + this.Unauthorized + this.DeviceNotFound + this.TransientError + this.ResourceError + this.NotImplemented + this.Other.Sum(x => x.Value);
+
+            double statusCodeZeroThreshold;
+            double unauthorizedThreshold;
+            double deviceNotFoundThreshold;
+            double transientErrorThreshold;
+            double resourceErrorThreshold;
+            double notImplementedThreshold;
 
             // The SDK does not allow edgehub to de-register from iothub subscriptions, which results in DirectMethod clients sometimes receiving status code 0.
             // Github issue: https://github.com/Azure/iotedge/issues/681
             // We expect to get this status sometimes because of edgehub restarts, but if we receive too many we should fail the tests.
             // TODO: When the SDK allows edgehub to de-register from subscriptions and we make the fix in edgehub, then we can fail tests for any status code 0.
-            long allStatusCount = this.SenderSuccesses + this.StatusCodeZero + this.Other.Sum(x => x.Value);
-            bool statusCodeZeroBelowThreshold = (this.StatusCodeZero == 0) || (this.StatusCodeZero < ((double)allStatusCount / 1000));
-            bool unauthorizedBelowThreshold = (this.Unauthorized == 0) || (this.Unauthorized < ((double)allStatusCount / 1000));
-            bool deviceNotFoundBelowThreshold = (this.DeviceNotFound == 0) || (this.DeviceNotFound < ((double)allStatusCount / 100));
-            bool transientErrorBelowThreshold = (this.TransientError == 0) || (this.TransientError < ((double)allStatusCount / 100));
-            bool resourceErrorBelowThreshold = (this.ResourceError == 0) || (this.ResourceError < ((double)allStatusCount / 100));
+            statusCodeZeroThreshold = (double)allStatusCount / 1000;
+
+            // Sometimes transient network/resource errors are caught necessitating a tolerance.
+            transientErrorThreshold = (double)allStatusCount / 1000;
+            resourceErrorThreshold = (double)allStatusCount / 1000;
+
+            // Sometimes iothub returns Unauthorized or NotImplemented that then later recovers.
+            // Only occurs with broker enabled, so only apply tolerance in this case.
+            if (this.MqttBrokerEnabled)
+            {
+                unauthorizedThreshold = (double)allStatusCount / 1000;
+                notImplementedThreshold = (double)allStatusCount / 1000;
+            }
+            else
+            {
+                unauthorizedThreshold = (double)allStatusCount / double.MaxValue;
+                notImplementedThreshold = (double)allStatusCount / double.MaxValue;
+            }
+
+            // DeviceNotFound typically happens when EdgeHub restarts and is offline.
+            // For different test suites this happens at different rates.
+            // 1) Single node runs arm devices, so this tolerance is a bit lenient.
+            // 2) Nested non-broker has some product issue where we need some tolerance.
+            // 3) Nested broker-enabled is the most stable.
+            if (this.Topology == Topology.SingleNode && !this.MqttBrokerEnabled)
+            {
+                deviceNotFoundThreshold = (double)allStatusCount / 200;
+            }
+            else if (this.Topology == Topology.Nested && !this.MqttBrokerEnabled)
+            {
+                deviceNotFoundThreshold = (double)allStatusCount / 250;
+            }
+            else
+            {
+                deviceNotFoundThreshold = (double)allStatusCount / 400;
+            }
+
+            bool statusCodeZeroBelowThreshold = (this.StatusCodeZero == 0) || (this.StatusCodeZero < statusCodeZeroThreshold);
+            bool unauthorizedBelowThreshold = (this.Unauthorized == 0) || (this.Unauthorized < unauthorizedThreshold);
+            bool deviceNotFoundBelowThreshold = (this.DeviceNotFound == 0) || (this.DeviceNotFound < deviceNotFoundThreshold);
+            bool transientErrorBelowThreshold = (this.TransientError == 0) || (this.TransientError < transientErrorThreshold);
+            bool resourceErrorBelowThreshold = (this.ResourceError == 0) || (this.ResourceError < resourceErrorThreshold);
+            bool notImplementedBelowThreshold = (this.NotImplemented == 0) || (this.NotImplemented < notImplementedThreshold);
 
             // Pass if below the thresholds, and sender and receiver got same amount of successess (or receiver has no results)
-            return statusCodeZeroBelowThreshold && unauthorizedBelowThreshold && deviceNotFoundBelowThreshold && transientErrorBelowThreshold && senderAndReceiverSuccessesPass;
+            return statusCodeZeroBelowThreshold && unauthorizedBelowThreshold && deviceNotFoundBelowThreshold && transientErrorBelowThreshold && senderAndReceiverSuccessesPass && notImplementedBelowThreshold;
         }
     }
 }
diff --git a/test/modules/TestResultCoordinator/Reports/LegacyTwin/LegacyTwinReportGenerator.cs b/test/modules/TestResultCoordinator/Reports/LegacyTwin/LegacyTwinReportGenerator.cs
index c971ab093f5..d25051258ff 100644
--- a/test/modules/TestResultCoordinator/Reports/LegacyTwin/LegacyTwinReportGenerator.cs
+++ b/test/modules/TestResultCoordinator/Reports/LegacyTwin/LegacyTwinReportGenerator.cs
@@ -11,6 +11,8 @@ namespace TestResultCoordinator.Reports.LegacyTwin
 
     sealed class LegacyTwinReportGenerator : ITestResultReportGenerator
     {
+        const double BigToleranceProportion = .005;
+        const double LittleToleranceProportion = .001;
         static readonly ILogger Logger = ModuleUtil.CreateLogger(nameof(LegacyTwinReportGenerator));
         readonly string trackingId;
 
@@ -47,15 +49,9 @@ public async Task<ITestResultReport> CreateReportAsync()
         {
             Logger.LogInformation($"Start to generate report by {nameof(LegacyTwinReportGenerator)} for Sources [{this.SenderSource}] ");
             IDictionary<int, int> results = new Dictionary<int, int>();
-            bool isPassed = true;
             while (await this.SenderTestResults.MoveNextAsync())
             {
                 int status = int.Parse(this.SenderTestResults.Current.Result.Substring(0, 3));
-                if (status > 299)
-                {
-                    isPassed = false;
-                }
-
                 if (results.ContainsKey(status))
                 {
                     results[status] = results[status] + 1;
@@ -66,6 +62,8 @@ public async Task<ITestResultReport> CreateReportAsync()
                 }
             }
 
+            bool isPassed = this.IsPassed(results);
+
             var report = new LegacyTwinReport(
                 this.TestDescription,
                 this.trackingId,
@@ -77,5 +75,78 @@ public async Task<ITestResultReport> CreateReportAsync()
             Logger.LogInformation($"Successfully finished creating LegacyTwinReport for Source [{this.SenderSource}]");
             return report;
         }
+
+        // Tolerances are needed due to a combination of false-positive failures and real product-issues.
+        // - [Nested-Edge] [Broker-Enabled]: Sometimes we get an excessive amount of 501 and 504 status codes.
+        //   Fail the tests if we have > 0.1% of either code.
+        //     + (501) We don't receive some desired properties in module-registered twin desired property callback
+        //     + (504) Module cannot make reported property update
+        bool IsPassed(IDictionary<int, int> statusCodesToCount)
+        {
+            bool isPassed = true;
+            int totalResults = statusCodesToCount.Sum(x => x.Value);
+
+            if (totalResults == 0)
+            {
+                return false;
+            }
+
+            if (this.Topology == Topology.Nested && this.MqttBrokerEnabled)
+            {
+                // See TwinTester/StatusCode.cs for reference.
+                int[] bigToleranceStatusCodes = { };
+                int[] littleToleranceStatusCodes = { 501, 504 };
+                isPassed = this.GeneratePassResult(statusCodesToCount, bigToleranceStatusCodes, littleToleranceStatusCodes);
+            }
+            else
+            {
+                List<int> statusCodes = statusCodesToCount.Keys.ToList();
+                IEnumerable<int> failingStatusCodes = statusCodes.Where(s =>
+                {
+                    string statusCode = s.ToString();
+                    return !statusCode.StartsWith("2");
+                });
+
+                isPassed = failingStatusCodes.Count() == 0;
+            }
+
+            return isPassed;
+        }
+
+        bool GeneratePassResult(IDictionary<int, int> statusCodesToCount, int[] bigToleranceStatusCodes, int[] littleToleranceStatusCodes)
+        {
+            int totalResults = statusCodesToCount.Sum(x => x.Value);
+            foreach (KeyValuePair<int, int> statusCodeToCount in statusCodesToCount)
+            {
+                int statusCode = statusCodeToCount.Key;
+                int statusCodeCount = statusCodeToCount.Value;
+
+                // ignore the status codes indicating some success
+                if (statusCode.ToString().StartsWith("2"))
+                {
+                    continue;
+                }
+                else if (bigToleranceStatusCodes.Contains(statusCode))
+                {
+                    if ((double)statusCodeCount / totalResults > BigToleranceProportion)
+                    {
+                        return false;
+                    }
+                }
+                else if (littleToleranceStatusCodes.Contains(statusCode))
+                {
+                    if ((double)statusCodeCount / totalResults > LittleToleranceProportion)
+                    {
+                        return false;
+                    }
+                }
+                else
+                {
+                    return false;
+                }
+            }
+
+            return true;
+        }
     }
 }