Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestResultCoordinator: Plumb down test context into reports #5902

Merged
merged 5 commits into from
Dec 8, 2021

Conversation

and-rewsmith
Copy link
Contributor

@and-rewsmith and-rewsmith commented Dec 7, 2021

Please replace this line with your PR description and read PR checklist below
Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation.

The first step in this direction is to plumb down two pieces of information into the test reports:

  1. Whether or not the broker is enabled (MqttBrokerEnabled flag)
  2. Whether topology is single node (Topology enum). This used to come from the deployment file, but there was a lot of duplication there and it is simpler if it comes from an environment variable.

Tests running to confirm these changes work:

  1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49707570&view=results
  2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49707599&view=results
  3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49707606&view=results
  4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49707634&view=results
  5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49707665&view=results
  6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49712331&view=results

Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

General Guidelines and Best Practices

  • I have read the contribution guidelines.
  • Title of the pull request is clear and informative.
  • Description of the pull request includes a concise summary of the enhancement or bug fix.

Testing Guidelines

  • Pull request includes test coverage for the included changes.
  • Description of the pull request includes
    • concise summary of tests added/modified
    • local testing done.

Draft PRs

  • Open the PR in Draft mode if it is:
    • Work in progress or not intended to be merged.
    • Encountering multiple pipeline failures and working on fixes.

Note: We use the kodiakhq bot to merge PRs once the necessary checks and approvals are in place. When it merges a PR, kodiakhq converts the PR title to the commit title, PR description to the commit description, and squashes all the commits in the PR to a single commit. The net effect is that entire PR becomes a single commit. Please follow the best practices mentioned here for the PR title and description

@and-rewsmith and-rewsmith changed the title TestResultCoordinator: Plumb down contextual information into reports TestResultCoordinator: Plumb down test context into reports Dec 7, 2021
@and-rewsmith and-rewsmith marked this pull request as ready for review December 7, 2021 18:57
@kodiakhq kodiakhq bot merged commit 6b9a85f into Azure:master Dec 8, 2021
@and-rewsmith and-rewsmith mentioned this pull request Dec 8, 2021
6 tasks
kodiakhq bot pushed a commit that referenced this pull request Jan 20, 2022
## Overview
I split #5869 into two PRs. The first (#5902) has been merged. This is the second.

Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. I have added tolerances for the broker-enabled case too, just so we know we aren't causing any big regressions.

A lot of these failures already have associated PBIs. For these cases I have linked the PBI/bugs below in the tolerance description section.

In order to do this there were these steps:
1. Logic specific to each test report type to determine pass result from report data (i.e. test events + test conditions).
2. Modified tests as needed. No new tests added as I was thinking it is fine to verify manually. The unit tests are super valuable for this test framework when dealing with report generators (i.e. the classes that have complex logic to deal with multiple queues of test events). But since I didn't touch those I didn't think it was worth the opportunity cost to add further testing.

I currently have longhaul and nested longhaul running and will wait to make sure the tolerances behave as expected before merge.

## Tolerance Descriptions

**Custom Mqtt Telemetry Tolerance** 
Overview:
- Sometimes we see message loss here so need a tolerance.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: We pass the tests as long as we get back 80% of messages. We sometimes have message loss, especially in middle->lower node telemetry. 

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**IoTHub Message Tolerance**
Overview:
- This only affects the broker. Sometimes we see message loss here so need a tolerance. When this problem occurs it affects < 10 messages.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: Pass the test if we receive 99% of messages.

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**Twin Test Tolerance**
Overview:
- Sometimes we see some twin updates are missing in nested edge, but have still released in the past. This problem seems worse for non-broker vs broker-enabled, so it seems like a product issue we should followup on. I have only added a tolerance for the broker case so as not to mask this important issue.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]:  Two issues. 1) Missing some desired property updates coming through the sdk registered callback function. 2) Failing to update reported properties from the client. For both of these issues, we pass the tests if it happens less than 0.5% of the time.

PBI Required:
- No PBI needed to unwind this tolerance since it only affects the broker-enabled case. However it looks like there is a product issue for the non-broker case. I didn't add a tolerance here, as I think first we should do some investigation. It is worth running longhaul also on 1.2 to confirm behavior is the same. We logged a bug for a prior release [here](https://msazure.visualstudio.com/One/_workitems/edit/10039969), but now the problem may be more severe. We will need logs to investigate further.

**Direct Method Tolerance**
Overview:
- A lot of these tolerances existed previously. For resource errors and transient errors, the tolerance seemed too high so I made the tests behave more strictly. For DeviceNotFound exceptions, we are looking at the test conditions to specify the exact tolerance.

Tolerance description:
- [All-Cases] Status code 0: Fail the tests if we get > 1 in 1000 direct methods with status code 0.
- [All-Cases] Resource error: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Unauthorized: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Transient error: Fail the tests if we get > 1 in 1000 direct methods with transient error.
- [All-Cases] NotImplemented: Fail the tests if we get > 1 in 1000 direct methods with NotImplemented error.
- [Nested-Edge] [Broker-Enabled] DeviceNotFound: Fail the tests if we get > 1 in 400 direct methods with DeviceNotFound
- [Nested-Edge] [Non-Broker] DeviceNotFound: Fail the tests if we get > 1 in 250 direct methods with DeviceNotFound
- [Single-Node] DeviceNotFound: Fail the tests if we get > 1 in 200 direct methods with DeviceNotFound

PBI Required:
- Yes, only for the fact that in nested edge, broker-enabled gets way less DeviceNotFound exceptions compared to non-broker.  The bug we logged during a prior release exists [here](https://msazure.visualstudio.com/One/_workitems/edit/10149346). We will need logs to investigate further.

## Pipeline Confirmation
Here are the pipelines I have ran with these changes to confirm this will not regress any of our infra:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427019&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427258&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431644&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431669&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49431614&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49431628&view=results
 

## Azure IoT Edge PR checklist:
and-rewsmith added a commit to and-rewsmith/iotedge that referenced this pull request Jan 27, 2022
## Overview
I split Azure#5869 into two PRs. The first (Azure#5902) has been merged. This is the second.

Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. I have added tolerances for the broker-enabled case too, just so we know we aren't causing any big regressions.

A lot of these failures already have associated PBIs. For these cases I have linked the PBI/bugs below in the tolerance description section.

In order to do this there were these steps:
1. Logic specific to each test report type to determine pass result from report data (i.e. test events + test conditions).
2. Modified tests as needed. No new tests added as I was thinking it is fine to verify manually. The unit tests are super valuable for this test framework when dealing with report generators (i.e. the classes that have complex logic to deal with multiple queues of test events). But since I didn't touch those I didn't think it was worth the opportunity cost to add further testing.

I currently have longhaul and nested longhaul running and will wait to make sure the tolerances behave as expected before merge.

## Tolerance Descriptions

**Custom Mqtt Telemetry Tolerance** 
Overview:
- Sometimes we see message loss here so need a tolerance.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: We pass the tests as long as we get back 80% of messages. We sometimes have message loss, especially in middle->lower node telemetry. 

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**IoTHub Message Tolerance**
Overview:
- This only affects the broker. Sometimes we see message loss here so need a tolerance. When this problem occurs it affects < 10 messages.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: Pass the test if we receive 99% of messages.

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**Twin Test Tolerance**
Overview:
- Sometimes we see some twin updates are missing in nested edge, but have still released in the past. This problem seems worse for non-broker vs broker-enabled, so it seems like a product issue we should followup on. I have only added a tolerance for the broker case so as not to mask this important issue.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]:  Two issues. 1) Missing some desired property updates coming through the sdk registered callback function. 2) Failing to update reported properties from the client. For both of these issues, we pass the tests if it happens less than 0.5% of the time.

PBI Required:
- No PBI needed to unwind this tolerance since it only affects the broker-enabled case. However it looks like there is a product issue for the non-broker case. I didn't add a tolerance here, as I think first we should do some investigation. It is worth running longhaul also on 1.2 to confirm behavior is the same. We logged a bug for a prior release [here](https://msazure.visualstudio.com/One/_workitems/edit/10039969), but now the problem may be more severe. We will need logs to investigate further.

**Direct Method Tolerance**
Overview:
- A lot of these tolerances existed previously. For resource errors and transient errors, the tolerance seemed too high so I made the tests behave more strictly. For DeviceNotFound exceptions, we are looking at the test conditions to specify the exact tolerance.

Tolerance description:
- [All-Cases] Status code 0: Fail the tests if we get > 1 in 1000 direct methods with status code 0.
- [All-Cases] Resource error: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Unauthorized: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Transient error: Fail the tests if we get > 1 in 1000 direct methods with transient error.
- [All-Cases] NotImplemented: Fail the tests if we get > 1 in 1000 direct methods with NotImplemented error.
- [Nested-Edge] [Broker-Enabled] DeviceNotFound: Fail the tests if we get > 1 in 400 direct methods with DeviceNotFound
- [Nested-Edge] [Non-Broker] DeviceNotFound: Fail the tests if we get > 1 in 250 direct methods with DeviceNotFound
- [Single-Node] DeviceNotFound: Fail the tests if we get > 1 in 200 direct methods with DeviceNotFound

PBI Required:
- Yes, only for the fact that in nested edge, broker-enabled gets way less DeviceNotFound exceptions compared to non-broker.  The bug we logged during a prior release exists [here](https://msazure.visualstudio.com/One/_workitems/edit/10149346). We will need logs to investigate further.

## Pipeline Confirmation
Here are the pipelines I have ran with these changes to confirm this will not regress any of our infra:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427019&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427258&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431644&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431669&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49431614&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49431628&view=results
 

## Azure IoT Edge PR checklist:
and-rewsmith added a commit to and-rewsmith/iotedge that referenced this pull request Jan 28, 2022
***Please replace this line with your PR description and read PR checklist below***
Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. 

The first step in this direction is to plumb down two pieces of information into the test reports:
1. Whether or not the broker is enabled (`MqttBrokerEnabled` flag)
2. Whether topology is single node (`Topology` enum). This used to come from the deployment file, but there was a lot of duplication there and it is simpler if it comes from an environment variable.

Tests running to confirm these changes work:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49707570&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49707599&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49707606&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49707634&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49707665&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49712331&view=results


## Azure IoT Edge PR checklist:
kodiakhq bot pushed a commit that referenced this pull request Jan 29, 2022
## Overview
Cherry pick #5902 #5916. Tested via nested longhaul pipeline.

I split #5869 into two PRs. The first (#5902) has been merged. This is the second.

Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. I have added tolerances for the broker-enabled case too, just so we know we aren't causing any big regressions.

A lot of these failures already have associated PBIs. For these cases I have linked the PBI/bugs below in the tolerance description section.

In order to do this there were these steps:
1. Logic specific to each test report type to determine pass result from report data (i.e. test events + test conditions).
2. Modified tests as needed. No new tests added as I was thinking it is fine to verify manually. The unit tests are super valuable for this test framework when dealing with report generators (i.e. the classes that have complex logic to deal with multiple queues of test events). But since I didn't touch those I didn't think it was worth the opportunity cost to add further testing.

I currently have longhaul and nested longhaul running and will wait to make sure the tolerances behave as expected before merge.

## Tolerance Descriptions

**Custom Mqtt Telemetry Tolerance** 
Overview:
- Sometimes we see message loss here so need a tolerance.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: We pass the tests as long as we get back 80% of messages. We sometimes have message loss, especially in middle->lower node telemetry. 

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**IoTHub Message Tolerance**
Overview:
- This only affects the broker. Sometimes we see message loss here so need a tolerance. When this problem occurs it affects < 10 messages.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: Pass the test if we receive 99% of messages.

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**Twin Test Tolerance**
Overview:
- Sometimes we see some twin updates are missing in nested edge, but have still released in the past. This problem seems worse for non-broker vs broker-enabled, so it seems like a product issue we should followup on. I have only added a tolerance for the broker case so as not to mask this important issue.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]:  Two issues. 1) Missing some desired property updates coming through the sdk registered callback function. 2) Failing to update reported properties from the client. For both of these issues, we pass the tests if it happens less than 0.5% of the time.

PBI Required:
- No PBI needed to unwind this tolerance since it only affects the broker-enabled case. However it looks like there is a product issue for the non-broker case. I didn't add a tolerance here, as I think first we should do some investigation. It is worth running longhaul also on 1.2 to confirm behavior is the same. We logged a bug for a prior release [here](https://msazure.visualstudio.com/One/_workitems/edit/10039969), but now the problem may be more severe. We will need logs to investigate further.

**Direct Method Tolerance**
Overview:
- A lot of these tolerances existed previously. For resource errors and transient errors, the tolerance seemed too high so I made the tests behave more strictly. For DeviceNotFound exceptions, we are looking at the test conditions to specify the exact tolerance.

Tolerance description:
- [All-Cases] Status code 0: Fail the tests if we get > 1 in 1000 direct methods with status code 0.
- [All-Cases] Resource error: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Unauthorized: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Transient error: Fail the tests if we get > 1 in 1000 direct methods with transient error.
- [All-Cases] NotImplemented: Fail the tests if we get > 1 in 1000 direct methods with NotImplemented error.
- [Nested-Edge] [Broker-Enabled] DeviceNotFound: Fail the tests if we get > 1 in 400 direct methods with DeviceNotFound
- [Nested-Edge] [Non-Broker] DeviceNotFound: Fail the tests if we get > 1 in 250 direct methods with DeviceNotFound
- [Single-Node] DeviceNotFound: Fail the tests if we get > 1 in 200 direct methods with DeviceNotFound

PBI Required:
- Yes, only for the fact that in nested edge, broker-enabled gets way less DeviceNotFound exceptions compared to non-broker.  The bug we logged during a prior release exists [here](https://msazure.visualstudio.com/One/_workitems/edit/10149346). We will need logs to investigate further.

## Pipeline Confirmation
Here are the pipelines I have ran with these changes to confirm this will not regress any of our infra:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427019&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427258&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431644&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431669&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49431614&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49431628&view=results
 

## Azure IoT Edge PR checklist:

***Please replace this line with your PR description and read PR checklist below***

## Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

### General Guidelines and Best Practices
- [x] I have read the [contribution guidelines](https://github.com/azure/iotedge#contributing).
- [x] Title of the pull request is clear and informative.
- [x] Description of the pull request includes a concise summary of the enhancement or bug fix.

### Testing Guidelines
- [x] Pull request includes test coverage for the included changes.
- Description of the pull request includes 
	- [x] concise summary of tests added/modified
	- [x] local testing done.  

### Draft PRs
- Open the PR in `Draft` mode if it is:
	- Work in progress or not intended to be merged.
	- Encountering multiple pipeline failures and working on fixes.

_Note: We use the kodiakhq bot to merge PRs once the necessary checks and approvals are in place. When it merges a PR, kodiakhq converts the PR title to the commit title, PR description to the commit description, and squashes all the commits in the PR to a single commit. The net effect is that entire PR becomes a single commit. Please follow the best practices mentioned [here](https://chris.beams.io/posts/git-commit/#:~:text=The%20seven%20rules%20of%20a%20great%20Git%20commit,what%20and%20why%20vs.%20how%20For%20example%3A%20) for the PR title and description_
lt72 pushed a commit to lt72/iotedge that referenced this pull request Jan 31, 2022
## Overview
I split Azure#5869 into two PRs. The first (Azure#5902) has been merged. This is the second.

Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. I have added tolerances for the broker-enabled case too, just so we know we aren't causing any big regressions.

A lot of these failures already have associated PBIs. For these cases I have linked the PBI/bugs below in the tolerance description section.

In order to do this there were these steps:
1. Logic specific to each test report type to determine pass result from report data (i.e. test events + test conditions).
2. Modified tests as needed. No new tests added as I was thinking it is fine to verify manually. The unit tests are super valuable for this test framework when dealing with report generators (i.e. the classes that have complex logic to deal with multiple queues of test events). But since I didn't touch those I didn't think it was worth the opportunity cost to add further testing.

I currently have longhaul and nested longhaul running and will wait to make sure the tolerances behave as expected before merge.

## Tolerance Descriptions

**Custom Mqtt Telemetry Tolerance** 
Overview:
- Sometimes we see message loss here so need a tolerance.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: We pass the tests as long as we get back 80% of messages. We sometimes have message loss, especially in middle->lower node telemetry. 

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**IoTHub Message Tolerance**
Overview:
- This only affects the broker. Sometimes we see message loss here so need a tolerance. When this problem occurs it affects < 10 messages.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: Pass the test if we receive 99% of messages.

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**Twin Test Tolerance**
Overview:
- Sometimes we see some twin updates are missing in nested edge, but have still released in the past. This problem seems worse for non-broker vs broker-enabled, so it seems like a product issue we should followup on. I have only added a tolerance for the broker case so as not to mask this important issue.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]:  Two issues. 1) Missing some desired property updates coming through the sdk registered callback function. 2) Failing to update reported properties from the client. For both of these issues, we pass the tests if it happens less than 0.5% of the time.

PBI Required:
- No PBI needed to unwind this tolerance since it only affects the broker-enabled case. However it looks like there is a product issue for the non-broker case. I didn't add a tolerance here, as I think first we should do some investigation. It is worth running longhaul also on 1.2 to confirm behavior is the same. We logged a bug for a prior release [here](https://msazure.visualstudio.com/One/_workitems/edit/10039969), but now the problem may be more severe. We will need logs to investigate further.

**Direct Method Tolerance**
Overview:
- A lot of these tolerances existed previously. For resource errors and transient errors, the tolerance seemed too high so I made the tests behave more strictly. For DeviceNotFound exceptions, we are looking at the test conditions to specify the exact tolerance.

Tolerance description:
- [All-Cases] Status code 0: Fail the tests if we get > 1 in 1000 direct methods with status code 0.
- [All-Cases] Resource error: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Unauthorized: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Transient error: Fail the tests if we get > 1 in 1000 direct methods with transient error.
- [All-Cases] NotImplemented: Fail the tests if we get > 1 in 1000 direct methods with NotImplemented error.
- [Nested-Edge] [Broker-Enabled] DeviceNotFound: Fail the tests if we get > 1 in 400 direct methods with DeviceNotFound
- [Nested-Edge] [Non-Broker] DeviceNotFound: Fail the tests if we get > 1 in 250 direct methods with DeviceNotFound
- [Single-Node] DeviceNotFound: Fail the tests if we get > 1 in 200 direct methods with DeviceNotFound

PBI Required:
- Yes, only for the fact that in nested edge, broker-enabled gets way less DeviceNotFound exceptions compared to non-broker.  The bug we logged during a prior release exists [here](https://msazure.visualstudio.com/One/_workitems/edit/10149346). We will need logs to investigate further.

## Pipeline Confirmation
Here are the pipelines I have ran with these changes to confirm this will not regress any of our infra:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427019&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427258&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431644&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431669&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49431614&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49431628&view=results
 

## Azure IoT Edge PR checklist:
damonbarry pushed a commit to damonbarry/iotedge that referenced this pull request Apr 15, 2022
***Please replace this line with your PR description and read PR checklist below***
Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. 

The first step in this direction is to plumb down two pieces of information into the test reports:
1. Whether or not the broker is enabled (`MqttBrokerEnabled` flag)
2. Whether topology is single node (`Topology` enum). This used to come from the deployment file, but there was a lot of duplication there and it is simpler if it comes from an environment variable.

Tests running to confirm these changes work:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49707570&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49707599&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49707606&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49707634&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49707665&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49712331&view=results


## Azure IoT Edge PR checklist:
damonbarry pushed a commit to damonbarry/iotedge that referenced this pull request Apr 15, 2022
## Overview
I split Azure#5869 into two PRs. The first (Azure#5902) has been merged. This is the second.

Currently longhaul pass results don't indicate the current release criteria. I am updating the tolerances for our tests in order to make it easier on the release manager to know what is releasable vs what requires further investigation. I have added tolerances for the broker-enabled case too, just so we know we aren't causing any big regressions.

A lot of these failures already have associated PBIs. For these cases I have linked the PBI/bugs below in the tolerance description section.

In order to do this there were these steps:
1. Logic specific to each test report type to determine pass result from report data (i.e. test events + test conditions).
2. Modified tests as needed. No new tests added as I was thinking it is fine to verify manually. The unit tests are super valuable for this test framework when dealing with report generators (i.e. the classes that have complex logic to deal with multiple queues of test events). But since I didn't touch those I didn't think it was worth the opportunity cost to add further testing.

I currently have longhaul and nested longhaul running and will wait to make sure the tolerances behave as expected before merge.

## Tolerance Descriptions

**Custom Mqtt Telemetry Tolerance** 
Overview:
- Sometimes we see message loss here so need a tolerance.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: We pass the tests as long as we get back 80% of messages. We sometimes have message loss, especially in middle->lower node telemetry. 

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**IoTHub Message Tolerance**
Overview:
- This only affects the broker. Sometimes we see message loss here so need a tolerance. When this problem occurs it affects < 10 messages.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]: Pass the test if we receive 99% of messages.

PBI Required: 
- No. We are rewriting the current version of the broker so not worth fixing, but the tolerance is good just to make sure nothing regresses.

Notes:
- When the new broker is running off of master branch, then this tolerance should be removed so we assure the new broker functions properly.

**Twin Test Tolerance**
Overview:
- Sometimes we see some twin updates are missing in nested edge, but have still released in the past. This problem seems worse for non-broker vs broker-enabled, so it seems like a product issue we should followup on. I have only added a tolerance for the broker case so as not to mask this important issue.

Tolerance description:
- [Nested-Edge] [Broker-Enabled]:  Two issues. 1) Missing some desired property updates coming through the sdk registered callback function. 2) Failing to update reported properties from the client. For both of these issues, we pass the tests if it happens less than 0.5% of the time.

PBI Required:
- No PBI needed to unwind this tolerance since it only affects the broker-enabled case. However it looks like there is a product issue for the non-broker case. I didn't add a tolerance here, as I think first we should do some investigation. It is worth running longhaul also on 1.2 to confirm behavior is the same. We logged a bug for a prior release [here](https://msazure.visualstudio.com/One/_workitems/edit/10039969), but now the problem may be more severe. We will need logs to investigate further.

**Direct Method Tolerance**
Overview:
- A lot of these tolerances existed previously. For resource errors and transient errors, the tolerance seemed too high so I made the tests behave more strictly. For DeviceNotFound exceptions, we are looking at the test conditions to specify the exact tolerance.

Tolerance description:
- [All-Cases] Status code 0: Fail the tests if we get > 1 in 1000 direct methods with status code 0.
- [All-Cases] Resource error: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Unauthorized: Fail the tests if we get > 1 in 1000 direct methods with resource error.
- [All-Cases] Transient error: Fail the tests if we get > 1 in 1000 direct methods with transient error.
- [All-Cases] NotImplemented: Fail the tests if we get > 1 in 1000 direct methods with NotImplemented error.
- [Nested-Edge] [Broker-Enabled] DeviceNotFound: Fail the tests if we get > 1 in 400 direct methods with DeviceNotFound
- [Nested-Edge] [Non-Broker] DeviceNotFound: Fail the tests if we get > 1 in 250 direct methods with DeviceNotFound
- [Single-Node] DeviceNotFound: Fail the tests if we get > 1 in 200 direct methods with DeviceNotFound

PBI Required:
- Yes, only for the fact that in nested edge, broker-enabled gets way less DeviceNotFound exceptions compared to non-broker.  The bug we logged during a prior release exists [here](https://msazure.visualstudio.com/One/_workitems/edit/10149346). We will need logs to investigate further.

## Pipeline Confirmation
Here are the pipelines I have ran with these changes to confirm this will not regress any of our infra:
1. Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427019&view=results
2. Nested Longhaul: https://msazure.visualstudio.com/One/_build/results?buildId=49427258&view=results
3. Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431644&view=results
4. Nested Connectivity: https://msazure.visualstudio.com/One/_build/results?buildId=49431669&view=results
5. End to End: https://dev.azure.com/msazure/One/_build/results?buildId=49431614&view=results
6. Nested End-to-End : https://msazure.visualstudio.com/One/_build/results?buildId=49431628&view=results
 

## Azure IoT Edge PR checklist:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants