test_pfcwd_wb is flaky due to missing log #8490

Closed
bingwang-ms opened this issue Jun 3, 2023 · 7 comments

@bingwang-ms
Collaborator

Description

The test case test_pfcwd_wb is flaky on the SN2700 platform.
It fails because the log below does not show up after warm-reboot.

NOTICE swss#orchagent: :- setWarmStartState: orchagent warm start state changed to initialized

In fact, the warm-reboot completes, and the PFC watchdog is triggered after warm-reboot as expected.
The missing log is used to locate the start point after warm-reboot.
Because the expected log for the most recent warm-reboot is missing, LogAnalyzer keeps searching the whole syslog until it finds the pattern, and ends up matching the marker left by an earlier warm-reboot. As a result, more log messages are parsed and more PFC watchdog logs are found than expected.
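
The effect described above can be illustrated with a minimal sketch. This is not the LogAnalyzer implementation; the function name `storms_since_last_marker` and the shortened sample lines are hypothetical, and only the marker text and the storm message wording are taken from this issue. It assumes the analysis window starts right after the most recent marker found in the log.

```python
import re

# Marker and storm wording copied from the issue; everything else is illustrative.
START_MARKER = "setWarmStartState: orchagent warm start state changed to initialized"
STORM_PATTERN = re.compile(r"PFC Watchdog detected PFC storm on port (\S+),")


def storms_since_last_marker(lines):
    """Return the ports named in storm lines that follow the most recent marker.

    If the newest warm-reboot never logged the marker, the most recent marker
    belongs to an older run, so storm lines from that older run are counted too.
    """
    start = 0
    for index, line in enumerate(lines):
        if START_MARKER in line:
            start = index + 1  # window begins just after the latest marker
    ports = []
    for line in lines[start:]:
        match = STORM_PATTERN.search(line)
        if match:
            ports.append(match.group(1))
    return ports


# Hypothetical, shortened syslog lines for two consecutive warm-reboots.
marker = "NOTICE swss#orchagent: :- " + START_MARKER
storm = "NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port {}, queue index 4"
older_run = [marker, storm.format("Ethernet8"), storm.format("Ethernet68")]
latest_run = [marker, storm.format("Ethernet68"), storm.format("Ethernet8")]

with_marker = storms_since_last_marker(older_run + latest_run)
without_marker = storms_since_last_marker(older_run + latest_run[1:])
print(len(with_marker), len(without_marker))
```

With the marker present, only the two storm lines of the latest run fall inside the window; with the marker dropped, the window reaches back to the older run and the count doubles.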

LogAnalyzerError: Log analyzer expected 2 messages but found only 1
expected_match: 6
expected_missing_match: 0
match: 0

Expected Messages:
Jun  2 18:43:39.938873 str-msn2700-01 NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet8, queue index 4, queue id 0x150000000004e9 and port id 0x10000000004e2.

Jun  2 18:44:53.270169 str-msn2700-01 NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet68, queue index 4, queue id 0x1500000000016e and port id 0x1000000000167.

Jun  2 18:48:53.983053 str-msn2700-01 NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet8, queue index 4, queue id 0x150000000004e9 and port id 0x10000000004e2.

Jun  2 18:50:03.763184 str-msn2700-01 NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet68, queue index 4, queue id 0x1500000000016e and port id 0x1000000000167.

Jun  2 18:54:06.819480 str-msn2700-01 NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet68, queue index 4, queue id 0x1500000000016e and port id 0x1000000000167.

Jun  2 18:54:06.956152 str-msn2700-01 NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet8, queue index 4, queue id 0x150000000004e9 and port id 0x10000000004e2.

There are two possible reasons for the missing log:

  1. The test_disable_rsyslog_rate_limit in test_pretest.py doesn't work as expected
  2. The rsyslog service is not fully ready after warm-reboot
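
One way to tell the two candidate causes apart might be to look for rsyslog's own rate-limit notices in the captured syslog. The sketch below assumes rsyslog reports drops with the wording "N messages lost due to rate-limiting"; that wording should be checked against the rsyslog version on the DUT. The helper name `count_rate_limited` and the sample lines are hypothetical.

```python
import re

# Assumed wording of rsyslog's rate-limit notice; verify on the target image.
RATE_LIMIT_PATTERN = re.compile(r"(\d+) messages lost due to rate-limiting")


def count_rate_limited(lines):
    """Sum the number of messages that rsyslog reports as dropped."""
    total = 0
    for line in lines:
        match = RATE_LIMIT_PATTERN.search(line)
        if match:
            total += int(match.group(1))
    return total


# Hypothetical sample lines, not taken from the failing testbed.
sample = [
    "rsyslogd: imuxsock[pid 42]: 17 messages lost due to rate-limiting",
    "NOTICE swss#orchagent: :- startWdActionOnQueue: PFC Watchdog detected PFC storm on port Ethernet8, queue index 4",
    "rsyslogd: imuxsock[pid 42]: 3 messages lost due to rate-limiting",
]
dropped = count_rate_limited(sample)
print(dropped)
```

A non-zero count after warm-reboot would point toward reason 1 (rate limiting still active); a zero count with the marker still missing would make reason 2 more plausible, though neither result is conclusive on its own.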

Steps to reproduce the issue:

  1. Run test case test_pfcwd_wb. The failure rate is around 20%.

Additional information you deem important:

**Output of `show version`:**
SONiC Software Version: SONiC.20220531.28
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: b80d2eaa1f
Build date: Fri May 26 23:43:46 UTC 2023
Built by: cloudtest@21946c6cc000002
@SavchukRomanLv
Contributor

@bingwang-ms does this reproduce only on SN2700, or do you see failures on other platforms? This matters because, if you are seeing it on other platforms, the TC can be modified.

@bingwang-ms
Collaborator Author

Hi @SavchukRomanLv, I have only seen this failure on SN2700. I also checked the test results on the SN4600 testbed for the past 30 days, and they were pretty stable.

@SavchukRomanLv
Contributor

Hi @bingwang-ms, can you please share which fanout was used? Thank you!

@bingwang-ms
Collaborator Author

Hi @SavchukRomanLv, the fanout switch for the testbed is an Arista-7260. But I don't think this issue is related to the leaf fanout, because we can confirm from the syslog that the PFC storm was triggered successfully after warm-boot. The only issue is that the log pattern NOTICE swss#orchagent: :- setWarmStartState: orchagent warm start state changed to initialized is missing after warm-boot. As a result, unexpected log messages are counted.

@SavchukRomanLv
Contributor

Hi @bingwang-ms. Most probably the message is not detected due to 15076, 11180. I see that 15080 was also included in the 202205 branch two days ago. Can you update the sonic-buildimage pointer and monitor whether the TC is still randomly failing?

@bingwang-ms
Collaborator Author

Thanks @SavchukRomanLv. I manually updated the file as PR#15080 did and ran the test 3 times; all runs passed. It looks like the issue can be addressed by PR#15080.
I will check the nightly test results once the new image is built and update later.
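
As a rough sanity check on why the nightly results matter, the sketch below estimates how often three consecutive passes would occur by chance. It assumes the runs are independent and that the failure rate is the approximate 20% quoted in the description; both are assumptions, not measurements from this thread.

```python
import math

failure_rate = 0.20  # approximate rate quoted in the issue description
pass_rate = 1.0 - failure_rate

# Chance that 3 independent runs all pass even if nothing was fixed.
p_three_passes = pass_rate ** 3

# Smallest number of consecutive passes that chance alone would produce
# less than 5% of the time under the same assumptions.
runs_needed = math.ceil(math.log(0.05) / math.log(pass_rate))

print(f"{p_three_passes:.3f}")
print(runs_needed)
```

Under these assumptions, three passes in a row happen about half the time without any fix, so a longer run of nightly passes gives a much stronger signal.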
