system_health/test_system_health.py::test_service_checker_with_process_exit fails with "AssertionError: ... is not recorded" #7832

Closed
kartik-arista opened this issue Mar 23, 2023 · 2 comments

Comments

@kartik-arista
Contributor

Description

system_health/test_system_health.py::test_service_checker_with_process_exit has started to fail in the latest sonic-mgmt runs. This seems to be fallout from sonic-net/sonic-buildimage#13497.

Steps to reproduce the issue:

1. Run the test.

Describe the results you received:

duthosts = [<MultiAsicSonicHost cmp210-3>, <MultiAsicSonicHost cmp210-4>, <MultiAsicSonicHost cmp210-5>, <MultiAsicSonicHost cmp210>], enum_rand_one_per_hwsku_hostname = 'cmp210'

    @pytest.mark.disable_loganalyzer
    def test_service_checker_with_process_exit(duthosts, enum_rand_one_per_hwsku_hostname):
        duthost = duthosts[enum_rand_one_per_hwsku_hostname]
        wait_system_health_boot_up(duthost)
        with ConfigFileContext(duthost, os.path.join(FILES_DIR, IGNORE_DEVICE_CHECK_CONFIG_FILE)):
            processes_status = duthost.all_critical_process_status()
            containers = [x for x in list(processes_status.keys()) if "syncd" not in x and "database" not in x and
                          "bgp" not in x and "swss" not in x]
            logging.info('Test containers: {}'.format(containers))
            random.shuffle(containers)
            for container in containers:
                running_critical_process = processes_status[container]['running_critical_process']
                if not running_critical_process:
                    continue

                critical_process = random.sample(running_critical_process, 1)[0]
                with ProcessExitContext(duthost, container, critical_process):
                    # use wait_until to check if SYSTEM_HEALTH_INFO has expected content
                    # avoid waiting for too long or DEFAULT_INTERVAL is not long enough to refresh db
                    category = '{}:{}'.format(container, critical_process)
                    expected_value = "'{}' is not running".format(critical_process)
                    result = wait_until(WAIT_TIMEOUT, 10, 2, check_system_health_info, duthost, category, expected_value)
>                   assert result == True, '{} is not recorded'.format(critical_process)
E                   AssertionError: tlm_teamd is not recorded

Describe the results you expected:

Test should pass.

The root cause is that

expected_value = "'{}' is not running".format(critical_process)

no longer matches the string stored by the service health checker in STATE_DB. Adjusting the expected string to match the new one gets the test passing again.

Additional information you deem important:

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
@kartik-arista
Contributor Author

If someone wants to work around this issue, the following diff

diff --git a/tests/system_health/test_system_health.py b/tests/system_health/test_system_health.py
index 427dbb3a..60fc0fd3 100644
--- a/tests/system_health/test_system_health.py
+++ b/tests/system_health/test_system_health.py
@@ -134,7 +134,7 @@ def test_service_checker_with_process_exit(duthosts, enum_rand_one_per_hwsku_hos
                 # use wait_until to check if SYSTEM_HEALTH_INFO has expected content
                 # avoid waiting for too long or DEFAULT_INTERVAL is not long enough to refresh db
                 category = '{}:{}'.format(container, critical_process)
-                expected_value = "'{}' is not running".format(critical_process)
+                expected_value = "Process '{}' in container '{}' is not running".format(critical_process, container)
                 result = wait_until(WAIT_TIMEOUT, 10, 2, check_system_health_info, duthost, category, expected_value)
                 assert result == True, '{} is not recorded'.format(critical_process)
                 summary = redis_get_field_value(duthost, STATE_DB, HEALTH_TABLE_NAME, 'summary')
                 

will fix it. PR to follow.
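
For anyone who wants to see the mismatch without a DUT, here is a minimal standalone sketch. It assumes the value stored in STATE_DB has the form used in the diff above. The real check_system_health_info helper is not shown in this issue, so `matches` below is a hypothetical stand-in that assumes a substring comparison, and the `teamd` container name is an assumed example.

```python
def old_expected(critical_process):
    # Expected string the test used before the change.
    return "'{}' is not running".format(critical_process)


def new_expected(critical_process, container):
    # Expected string taken from the workaround diff above.
    return "Process '{}' in container '{}' is not running".format(critical_process, container)


def matches(stored_value, expected_value):
    # Hypothetical stand-in for the comparison done by the test helper;
    # a substring check is an assumption, not the confirmed implementation.
    return expected_value in stored_value


# Assumed example values: tlm_teamd running in a container named teamd.
stored = new_expected("tlm_teamd", "teamd")

old_ok = matches(stored, old_expected("tlm_teamd"))
new_ok = matches(stored, new_expected("tlm_teamd", "teamd"))
print("old expected string matches:", old_ok)
print("new expected string matches:", new_ok)
```

Under these assumptions the old string does not match even with a lenient substring check, because the new message puts "in container ..." between the process name and "is not running".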

@kartik-arista
Contributor Author

Found that this was already fixed by PR #7649 and is marked for cherry-pick to 202205, so we are covered. I am closing this issue.
