Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky test SegmentReplicationRemoteStoreIT.testPressureServiceStats #8827

Merged
merged 2 commits into from
Jul 24, 2023

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Jul 22, 2023

Description

This test was marked as non flaky recently, but I was able to catch it while testing a fix for #8059. This test occasionally fails on an assertion on segrep stats after a failover event. The assertion is checking that there are no stats returned right after the event because the new primary has no allocated replicas.

    java.lang.AssertionError: expected:<0> but was:<1>
        at __randomizedtesting.SeedInfo.seed([6FBC0417BD6F56E8:B70016A383AEB076]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.indices.replication.SegmentReplicationIT.testPressureServiceStats(SegmentReplicationIT.java:835)
        at java.****/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.****/java.lang.reflect.Method.invoke(Method.java:578)

This assertion trips occasionally because the former primary is not yet removed as an in-sync allocation id within the replication group, and is included in the result set. Further we are computing and returning checkpoint timers for a shard that would never catch up as its about to be removed.

This PR fixes this by excluding allocation ids that are marked in-sync but are unavailable in the replication group, meaning they do not have any routing entry.

Related Issues

#7592

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ds when computing replication stats.

Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

…. The API should still show a result for the replica even if it has not sync'd.

Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocation
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.classMethod

@codecov
Copy link

codecov bot commented Jul 22, 2023

Codecov Report

Merging #8827 (215f10c) into main (57d5e90) will increase coverage by 0.13%.
The diff coverage is 75.00%.

@@             Coverage Diff              @@
##               main    #8827      +/-   ##
============================================
+ Coverage     70.76%   70.89%   +0.13%     
- Complexity    57111    57155      +44     
============================================
  Files          4771     4771              
  Lines        270241   270251      +10     
  Branches      39500    39504       +4     
============================================
+ Hits         191237   191598     +361     
+ Misses        62846    62551     -295     
+ Partials      16158    16102      -56     
Impacted Files Coverage Δ
...org/opensearch/index/seqno/ReplicationTracker.java 69.09% <74.07%> (+0.41%) ⬆️
.../replication/checkpoint/ReplicationCheckpoint.java 87.71% <100.00%> (+2.00%) ⬆️

... and 465 files with indirect coverage changes

@mch2
Copy link
Member Author

mch2 commented Jul 22, 2023

  1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocation

This looks like a bug introduced with #8715 - looking at this and re-opening that issue.

@dreamer-89
Copy link
Member

This assertion trips occasionally because the former primary is not yet removed as an in-sync allocation id within the replication group, and is included in the result set.

What is the cluster state at this point ? Is the other replica active ?

@dreamer-89
Copy link
Member

@mch2 : How flaky was this test before this fix ? Does your fix completely fixes the flakyness of this issue ?

@mch2
Copy link
Member Author

mch2 commented Jul 24, 2023

What is the cluster state at this point ? Is the other replica active ?

Hmm I didn't get a dump of cluster state here but the new primary is the only allocated shard, there is no additional replica.

@mch2 : How flaky was this test before this fix ? Does your fix completely fixes the flakyness of this issue ?

This was one of the more rare ones according to #8279 - I ran it about 2-3k times on repeat after this change and I don't see it again, but I've seen this situation before where its hard to reproduce outside of CI.

The other test that is linked here testPrimaryRelocation I have a follow up PR to fix this one but the failure is unrelated to this fix.

@mch2 mch2 merged commit a3baa68 into opensearch-project:main Jul 24, 2023
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Jul 24, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 24, 2023
…ts (#8827)

* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix relocation IT relying on stats to determine if segrep has occured. The API should still show a result for the replica even if it has not sync'd.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit a3baa68)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2 pushed a commit that referenced this pull request Jul 25, 2023
…ts (#8827) (#8855)

* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.



* Fix relocation IT relying on stats to determine if segrep has occured. The API should still show a result for the replica even if it has not sync'd.



---------


(cherry picked from commit a3baa68)

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@mch2 mch2 deleted the 7592 branch July 26, 2023 18:20
baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
…ts (opensearch-project#8827)

* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix relocation IT relying on stats to determine if segrep has occured. The API should still show a result for the replica even if it has not sync'd.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…ts (opensearch-project#8827)

* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix relocation IT relying on stats to determine if segrep has occured. The API should still show a result for the replica even if it has not sync'd.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…ts (opensearch-project#8827)

* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix relocation IT relying on stats to determine if segrep has occured. The API should still show a result for the replica even if it has not sync'd.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ts (opensearch-project#8827)

* Fix ReplicationTracker to not include unavailable former primary shards when computing replication stats.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix relocation IT relying on stats to determine if segrep has occured. The API should still show a result for the replica even if it has not sync'd.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants