Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (Random test Failure) #1567

Closed
CEHENKLE opened this issue Nov 16, 2021 · 14 comments · Fixed by #1692 or #1813
Closed
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run

Comments

@CEHENKLE
Copy link
Member

CEHENKLE commented Nov 16, 2021

Describe the bug
Random Test Failure. Please dig in, figure out what's wrong :(

https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_1068.log
https://ci.opensearch.org/logs/ci/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/gradle_check_1068_reports.zip

@CEHENKLE CEHENKLE added bug Something isn't working untriaged labels Nov 16, 2021
@tlfeng
Copy link
Collaborator

tlfeng commented Nov 16, 2021

Most recently found in #1541

@CEHENKLE CEHENKLE added flaky-test Random test failure that succeeds on second run and removed untriaged labels Nov 16, 2021
@Poojita-Raj
Copy link
Contributor

Also found in #1584

@tlfeng
Copy link
Collaborator

tlfeng commented Dec 1, 2021

The failed test is introduced by commit f7e2984

After looking at all the test failure reports mentioned above, the error messages are all:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout" -Dtests.seed=xxx -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nb -Dtests.timezone=Australia/Sydney -Druntime.java=17

org.opensearch.monitor.fs.FsHealthServiceTests > testFailsHealthOnHungIOBeyondHealthyTimeout FAILED
    java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([xxx:xxx]:0)
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertTrue(Assert.java:52)
        at org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout(FsHealthServiceTests.java:239)

FsHealthServiceTests.java:239 pointed to the code:

The assertion statement leads to the test failure.

            assertTrue(
                waitUntil(
                    () -> fsHealthSrvc.getHealth().getStatus() == HEALTHY,
                    delayBetweenChecks + (2 * refreshInterval),
                    TimeUnit.MILLISECONDS
                )
            );

Note that in issue #1307, The last line shows a different line number:
org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout(FsHealthServiceTests.java:228), that is because "spotless" has not been applied on the code by the commit 996d33a, and it pointed to the same assertion statement.

A few seed shown in the console output: 521B23369D39EE22, 20CA4F502621468F, 20C97377870AC21B

Made some experiment to reproduce the failure, I tried different values for these variables:

public void testFailsHealthOnHungIOBeyondHealthyTimeout() throws Exception {
    long healthyTimeoutThreshold = randomLongBetween(500, 1000);
    long refreshInterval = randomLongBetween(500, 1000);
    long slowLogThreshold = randomLongBetween(100, 200);
    long delayBetweenChecks = 100;

I found that sometime the test "testFailsHealthOnHungIOBeyondHealthyTimeout" will be failed when running the whole tests in the class by ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests", instead of running only one test case by ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout"

Another finding:
Seeing from the gradle check log 1066 that, there are additional warning from the path tempDir-013 and tempDir-014, seems such additional warning is not shown in the output log for successful test runs.

[testFailsHealthOnHungIOBeyondHealthyTimeout] --> First monitor run
  1> [2021-11-13T11:28:45,018][WARN ][o.o.m.f.FsHealthService  ] [testFailsHealthOnHungIOBeyondHealthyTimeout] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-012/nodes/0] took [212ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:45,193][WARN ][o.o.m.f.FsHealthService  ] [testFailsHealthOnHungIOBeyondHealthyTimeout] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-013/nodes/0] took [174ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:45,380][WARN ][o.o.m.f.FsHealthService  ] [testFailsHealthOnHungIOBeyondHealthyTimeout] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-014/nodes/0] took [187ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:45,381][INFO ][o.o.m.f.FsHealthServiceTests] [testFailsHealthOnHungIOBeyondHealthyTimeout] --> Disrupt file system
  1> [2021-11-13T11:28:47,602][INFO ][o.o.m.f.FsHealthServiceTests] [testFailsHealthOnHungIOBeyondHealthyTimeout] --> Fix file system disruption
  1> [2021-11-13T11:28:47,670][WARN ][o.o.m.f.FsHealthService  ] [org.opensearch.monitor.fs.FsHealthServiceTests] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-012/nodes/0] took [1564ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:47,671][ERROR][o.o.m.f.FsHealthService  ] [org.opensearch.monitor.fs.FsHealthServiceTests] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-012/nodes/0] failed, took [1564ms] which is above the healthy threshold of [801ms]
  1> [2021-11-13T11:28:47,920][WARN ][o.o.m.f.FsHealthService  ] [org.opensearch.monitor.fs.FsHealthServiceTests] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-014/nodes/0] took [161ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:48,810][WARN ][o.o.m.f.FsHealthService  ] [org.opensearch.monitor.fs.FsHealthServiceTests] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-012/nodes/0] took [183ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:48,980][WARN ][o.o.m.f.FsHealthService  ] [org.opensearch.monitor.fs.FsHealthServiceTests] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-013/nodes/0] took [170ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:49,131][WARN ][o.o.m.f.FsHealthService  ] [org.opensearch.monitor.fs.FsHealthServiceTests] health check of [/var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/server/build/testrun/test/temp/org.opensearch.monitor.fs.FsHealthServiceTests_20C97377870AC21B-001/tempDir-014/nodes/0] took [150ms] which is above the warn threshold of [113ms]
  1> [2021-11-13T11:28:49,136][INFO ][o.o.m.f.FsHealthServiceTests] [testFailsHealthOnHungIOBeyondHealthyTimeout] after test
  2> REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout" -Dtests.seed=20C97377870AC21B -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nb -Dtests.timezone=Australia/Sydney -Druntime.java=17
  2> java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([20C97377870AC21B:A2BD2D7EF3F6B453]:0)
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertTrue(Assert.java:52)
        at org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout(FsHealthServiceTests.java:239)

@andrross
Copy link
Member

andrross commented Dec 7, 2021

@tlfeng Can you easily reproduce the failure?

@tlfeng
Copy link
Collaborator

tlfeng commented Dec 7, 2021

@tlfeng Can you easily reproduce the failure?

Hi Andrew, I couldn't. Although decreasing some of the time interval can caused the assertion failed, it's not likely the same case in CI.

long healthyTimeoutThreshold = 100; // randomLongBetween(500, 1000);
long refreshInterval = 50; // randomLongBetween(500, 1000);
long slowLogThreshold = randomLongBetween(100, 200);
long delayBetweenChecks = 20; // 100;

I guess there might be 3 possible causes:

  1. The wait time in the assertion statement is not enough for the code disruptFileSystemProvider.injectIODelay.set(false); ("Fix file system disruption") processing.
  2. The file system somehow is not back to normal after the test case testLoggingOnHungIO() runs, because it hung the file system, but maybe didn't restore properly (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/test/java/org/opensearch/monitor/fs/FsHealthServiceTests.java#L181).
    Note for 1&2: I'm not sure if "Fix file system disruption" is a necessary step in the test, because the test case testLoggingOnHungIO() written by Elastic didn't restore the hanging status manually in the test case.
  3. There is defect for the logic of FileSystemFsyncHungProvider (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/test/java/org/opensearch/monitor/fs/FsHealthServiceTests.java#L422), and it was modified in the commit f7e2984 recently.

@tlfeng
Copy link
Collaborator

tlfeng commented Dec 8, 2021

Hi @Bukhtawar, do you have any thoughts about the test failure?
It can not be easily reproduced, but it failed several times in the gradle check task in our Jenkins CI. The most recent seems on last Sunday in the daily "gradle check build".
I noticed your comment in #1307 (comment), but need your more opinions on the real cause of the test failure. Thank you! 😄

@Bukhtawar
Copy link
Collaborator

Adding more delay here should help I guess, assuming this is the only assertion failing

assertTrue(
waitUntil(
() -> fsHealthSrvc.getHealth().getStatus() == HEALTHY,
delayBetweenChecks + (2 * refreshInterval),
TimeUnit.MILLISECONDS
)

@andrross
Copy link
Member

andrross commented Dec 8, 2021

Maybe just increase that 2x multiplier in the waitUntil step to 3x or 4x to give the FsHealthService a couple more chances to observe a healthy state?

Also, super minor, but I'd also remove this assertion:

As long as the thing after this wait step is to do assertEquals(HEALTHY, fsHealth.getStatus()); then you're covering this assertion and a failure will give a much better error message.

@tlfeng
Copy link
Collaborator

tlfeng commented Dec 8, 2021

Thanks @Bukhtawar and @andrross for providing suggestions on resolving the test failure.

I realized that it's a bit duplicate to have double assertion statements to check the File System health status around the 2 waitUntil(). That's a good point 👍 😁

Further, I'm curious is it necessary to have the steps to restore the file system status? Or just a good habit to restore the status after a test case.
Such as:

logger.info("--> Fix file system disruption");
disruptFileSystemProvider.injectIODelay.set(false);
assertTrue(
waitUntil(
() -> fsHealthSrvc.getHealth().getStatus() == HEALTHY,
delayBetweenChecks + (2 * refreshInterval),
TimeUnit.MILLISECONDS
)
);
fsHealth = fsHealthSrvc.getHealth();
assertEquals(HEALTHY, fsHealth.getStatus());
assertEquals("health check passed", fsHealth.getInfo());
assertEquals(disruptedPathCount, disruptFileSystemProvider.getInjectedPathCount());

I noticed that in the test testLoggingOnHungIO(), it didn't restore the disrupted file system.

public void testLoggingOnHungIO() throws Exception {

@andrross
Copy link
Member

andrross commented Dec 8, 2021

This is the critical line that must happen after each test invocation:

The actual disrupted filesystem is an instance local to a single test invocation, so as long as that static teardown method is called to reset the underlying static variable in PathUtils, then it doesn't matter what is done to the instance.

@andrross
Copy link
Member

@tlfeng Just came across a failure of this test in an unrelated PR. I see the updated error message, so it appears the additional retry time didn't fix it in this case.

Gradle Check 1532

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout" -Dtests.seed=3ED0AD06FBD1DA83 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCach
eSize=64m -Djava.security.manager=allow" -Dtests.locale=en-CA -Dtests.timezone=Asia/Dubai -Druntime.java=17

WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/home/ubuntu/.gradle/wrapper/dists/gradle-7.3-all/dfe0rb4hnplvqkibhoc45how8/gradle-7.3/lib/plugins/gradle-testing-base-7.3.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release
org.opensearch.monitor.fs.FsHealthServiceTests > testFailsHealthOnHungIOBeyondHealthyTimeout FAILED
    java.lang.AssertionError: expected:<HEALTHY> but was:<UNHEALTHY>
        at __randomizedtesting.SeedInfo.seed([3ED0AD06FBD1DA83:BCA4F30F8F2DACCB]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.junit.Assert.assertEquals(Assert.java:144)
        at org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout(FsHealthServiceTests.java:243)

@dreamer-89
Copy link
Member

Reopened after another occurrence #5344 (comment)

@dreamer-89 dreamer-89 removed their assignment Dec 15, 2023
@anasalkouz
Copy link
Member

Seems old one, if someone can try to reproduce it. Otherwise, we can close it.

@dreamer-89 dreamer-89 assigned kartg and unassigned anasalkouz Dec 19, 2023
@dreamer-89
Copy link
Member

dreamer-89 commented Dec 20, 2023

Closing this issue as stale. Also, this test has not failed based on flaky report #5031 (comment).

@kartg kartg removed their assignment Dec 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run
Projects
None yet
8 participants