[CI] flaky test failure - org.opensearch.remotestore.RemoteIndexRecoveryIT.testRerouteRecovery #9580
Comments
There is another kind of error message:
The same error message appeared again in build No.23672:
The test has failed very frequently since it was first seen in #9448 (comment), build No.23397, on Aug. 24th at 5am PDT.
@sachinpkale is anyone looking into this?
Hi @anasalkouz, I noticed a PR has already been created to fix it: #9637
We started seeing this test fail due to a recent merge on main. I am working on fixing this.
This is easily reproducible after 20-25 iterations when running the test in a loop. I have also root-caused the current issue that leads to the test failure. I suspect there may be more flakiness beyond the current assertion failure, so I will run the test for more iterations to confirm. I will share the root cause and raise a PR accordingly.
…roject#9580 Signed-off-by: Ashish Singh <ssashish@amazon.com>
The issue has two causes, as seen in the stack traces:
The first issue happens because cluster state publication cleans up the index shard on the old primary node between two assertion checks. The current assertBusy uses exponential backoff between checks. Between two consecutive checks, the asserted condition becomes true and then the index shard itself is removed from the old node. Because the interval between checks is long, no check runs while the condition holds. The fix for this issue is to run the assertion checks at a fixed interval.
The second issue has the same cause, but in this case peer recovery has already completed and changed the recovery state of the index shard.
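To make the timing argument concrete, here is a minimal, self-contained sketch that simulates the two polling schedules against a condition that is only true for a short window. It does not call the OpenSearch test framework. The class name, the doubling schedule (starting at 1 ms), the 30 s timeout, the 100 ms fixed interval, and the 5000-5400 ms window are all hypothetical values chosen for illustration, not numbers taken from the test or from the actual fix.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation; not part of OpenSearch. All times are in simulated milliseconds. */
public final class PollingScheduleSketch {

    /** Check times when the sleep between checks doubles each time: 0, 1, 3, 7, 15, ... */
    public static List<Long> doublingSchedule(long timeoutMillis) {
        List<Long> times = new ArrayList<>();
        long now = 0;
        long sleep = 1; // assumed initial sleep of 1 ms
        while (now <= timeoutMillis) {
            times.add(now);
            now += sleep;
            sleep *= 2;
        }
        return times;
    }

    /** Check times when the sleep between checks is constant: 0, interval, 2 * interval, ... */
    public static List<Long> fixedSchedule(long intervalMillis, long timeoutMillis) {
        List<Long> times = new ArrayList<>();
        for (long now = 0; now <= timeoutMillis; now += intervalMillis) {
            times.add(now);
        }
        return times;
    }

    /** Returns true if at least one check falls inside [windowStart, windowEnd). */
    public static boolean observesWindow(List<Long> checkTimes, long windowStart, long windowEnd) {
        for (long t : checkTimes) {
            if (t >= windowStart && t < windowEnd) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long timeout = 30_000;    // assumed overall wait budget
        long windowStart = 5_000; // assumed time at which the asserted condition becomes true
        long windowEnd = 5_400;   // assumed time at which the shard is cleaned up again

        System.out.println("doubling schedule observes window: "
            + observesWindow(doublingSchedule(timeout), windowStart, windowEnd));
        System.out.println("fixed 100 ms schedule observes window: "
            + observesWindow(fixedSchedule(100, timeout), windowStart, windowEnd));
    }
}
```

With these assumed numbers, the doubling schedule checks at 4095 ms and next at 8191 ms, so it skips the window entirely, while the 100 ms schedule lands a check at 5000 ms. A fixed interval only helps when the window is at least as long as the interval.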
…roject#9580 (opensearch-project#11918) Signed-off-by: Ashish Singh <ssashish@amazon.com>
It looks like this test case was not fully fixed. The same test failed here: #8992 (comment)
@ashking94 it looks like your fix significantly reduced the likelihood of failure, but the test can still fail. RecoveryStats are point-in-time values, not cumulative, so the race condition can still happen between two fixed-interval checks.
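The distinction between point-in-time and cumulative values can be illustrated with a small sketch. The class below is hypothetical and is not the OpenSearch RecoveryStats class. It only shows why a poller that reads a current-value gauge can miss a recovery that starts and finishes between two reads, whereas a monotonically increasing counter would still show it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stats holder for illustration; not the OpenSearch RecoveryStats API. */
public final class PointInTimeVsCumulative {
    // Point-in-time gauge: number of recoveries running right now.
    private final AtomicInteger current = new AtomicInteger();
    // Cumulative counter: number of recoveries ever started; it never decreases.
    private final AtomicInteger totalStarted = new AtomicInteger();

    public void recoveryStarted() {
        current.incrementAndGet();
        totalStarted.incrementAndGet();
    }

    public void recoveryFinished() {
        current.decrementAndGet();
    }

    public int current() {
        return current.get();
    }

    public int totalStarted() {
        return totalStarted.get();
    }

    public static void main(String[] args) {
        PointInTimeVsCumulative stats = new PointInTimeVsCumulative();

        // The whole recovery happens between two polls of the test.
        stats.recoveryStarted();
        stats.recoveryFinished();

        // The next poll sees no evidence in the gauge, but the counter still records it.
        System.out.println("current=" + stats.current());
        System.out.println("totalStarted=" + stats.totalStarted());
    }
}
```

An assertion on the gauge needs a check to land while the recovery is in flight. An assertion on a cumulative value does not depend on poll timing.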
…roject#9580 (opensearch-project#11918) Signed-off-by: Ashish Singh <ssashish@amazon.com> Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Closing in favour of #14323
Seen in #9506 (comment), build No.23062