
[Segment Replication] [BUG] Deleted docs remain searchable on replica after refresh #4216

Closed
dreamer-89 opened this issue Aug 15, 2022 · 4 comments
Labels: bug, distributed framework, enhancement

Comments

@dreamer-89 (Member)

Describe the bug
Coming from #4185 (comment): deleted documents remain searchable on the replica even after a refresh/commit.

To Reproduce
Run the test below repeatedly (it fails roughly 3 out of 10 runs):


    public void testDeleteOperations() throws Exception {
        final String nodeA = internalCluster().startNode();
        final String nodeB = internalCluster().startNode();

        createIndex(INDEX_NAME);
        ensureGreen(INDEX_NAME);
        final int initialDocCount = scaledRandomIntBetween(0, 200);
        try (
            BackgroundIndexer indexer = new BackgroundIndexer(
                INDEX_NAME,
                "_doc",
                client(),
                -1,
                RandomizedTest.scaledRandomIntBetween(2, 5),
                false,
                random()
            )
        ) {
            indexer.start(initialDocCount);
            waitForDocs(initialDocCount, indexer);
            refresh(INDEX_NAME);
            waitForReplicaUpdate();

            // wait a short amount of time to give replication a chance to complete.
            assertHitCount(client(nodeA).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), initialDocCount);
            assertHitCount(client(nodeB).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), initialDocCount);

            final int additionalDocCount = scaledRandomIntBetween(0, 200);
            final int expectedHitCount = initialDocCount + additionalDocCount;
            indexer.start(additionalDocCount);
            waitForDocs(expectedHitCount, indexer);
            waitForReplicaUpdate();

            assertHitCount(client(nodeA).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), expectedHitCount);
            assertHitCount(client(nodeB).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), expectedHitCount);

            ensureGreen(INDEX_NAME);

            Set<String> ids = indexer.getIds();
            String id = ids.toArray()[0].toString();
            client(nodeA).prepareDelete(INDEX_NAME, id).setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE).get();

            refresh(INDEX_NAME);
            waitForReplicaUpdate();

            assertHitCount(client(nodeA).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), expectedHitCount - 1);
            assertHitCount(client(nodeB).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), expectedHitCount - 1);
        }
    }

Expected behavior
Deleted documents should be removed from replica shard.

Host/Environment (please complete the following information):

  • OS: iOS
@dreamer-89 (Member, Author)

Looking into this.

CC @mch2 @kartg

@dreamer-89 (Member, Author)

dreamer-89 commented Aug 16, 2022

This appears to be a general lag issue: the replica does get the updates if a wait of a few seconds is added (verified that adding a 2s thread sleep makes testDeleteOperations pass consistently across 100 runs). The generic lag is also evident from the fact that the waitForReplicaUpdate method occasionally fails while waiting for the replica to receive the latest generational file from the primary.
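
For illustration, a minimal sketch of where that wait would sit in testDeleteOperations (the 2-second value is the one from the experiment above, not a tuned constant):

    // Sketch of the 2s-sleep workaround: inserted after the delete and
    // waitForReplicaUpdate(), before the post-delete hit-count assertions.
    client(nodeA).prepareDelete(INDEX_NAME, id).setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE).get();
    refresh(INDEX_NAME);
    waitForReplicaUpdate();
    Thread.sleep(2000); // give the replica time to finish copying segment files
    assertHitCount(client(nodeB).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), expectedHitCount - 1);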

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.testDeleteOperations {seed=[56EC8F41C1863940:3EDA376F1387408A]}" -Dtests.seed=56EC8F41C1863940 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sl-SI -Dtests.timezone=America/Mazatlan -Druntime.java=17

java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([56EC8F41C1863940:3EDA376F1387408A]:0)
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.opensearch.indices.replication.SegmentReplicationIT.lambda$waitForReplicaUpdate$3(SegmentReplicationIT.java:574)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1049)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1022)

The existing waitForReplicaUpdate only ensures that the replica has the latest segment generation file. I think for now adding Thread.sleep(2000) should be sufficient, unless there is a hard-sync mechanism that forces the replica to have all files from the primary. @mch2: WDYT?
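
If a harder wait is preferred over a fixed sleep, one option is to poll until the replica's local hit count matches the primary's. A minimal sketch built on the assertBusy and client(...) helpers already used in this test class; waitForMatchingHitCounts is a hypothetical name, and java.util.concurrent.TimeUnit must be imported:

    // Hypothetical helper: instead of Thread.sleep(2000), poll until the
    // replica's local hit count catches up with the primary's.
    private void waitForMatchingHitCounts(String primaryNode, String replicaNode) throws Exception {
        assertBusy(() -> {
            final long primaryHits = client(primaryNode).prepareSearch(INDEX_NAME)
                .setSize(0)
                .setPreference("_only_local")
                .get()
                .getHits()
                .getTotalHits().value;
            final long replicaHits = client(replicaNode).prepareSearch(INDEX_NAME)
                .setSize(0)
                .setPreference("_only_local")
                .get()
                .getHits()
                .getTotalHits().value;
            assertEquals("replica has not caught up with primary", primaryHits, replicaHits);
        }, 30, TimeUnit.SECONDS);
    }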

@dreamer-89 (Member, Author)

dreamer-89 commented Aug 16, 2022

To identify the file diff, I added the code below to print the directory files on the primary and replica; interestingly, with this code in place the test never fails.

// After the delete, the replica should report expectedHitCount - 1; if it
// still reports expectedHitCount it is lagging, so dump the Lucene
// directory contents of both shards for comparison.
final TotalHits totalHits = client(nodeB).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get().getHits().getTotalHits();
if (totalHits.value == expectedHitCount) {
    logger.info("--> Replica lagging");
    final IndexShard replicaShard = getIndexShard(nodeB);
    for (String fileName : replicaShard.store().directory().listAll()) {
        logger.info("{}", fileName);
    }
    final IndexShard primaryShard = getIndexShard(nodeA);
    for (String fileName : primaryShard.store().directory().listAll()) {
        logger.info("{}", fileName);
    }
}
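
On top of the file listing, the segment generations could be compared directly. A hedged sketch, placed inside the same if block so it can reuse primaryShard and replicaShard; it assumes Store#readLastCommittedSegmentsInfo is accessible from the test, and note that it reads the on-disk segments_N file, so it may trail segments the replica holds only in memory:

    // Hypothetical extra check: compare the last committed segment generation
    // (Lucene SegmentInfos#getGeneration) on both shards. Reads the on-disk
    // segments_N file, so purely in-memory state is not reflected.
    final long primaryGen = primaryShard.store().readLastCommittedSegmentsInfo().getGeneration();
    final long replicaGen = replicaShard.store().readLastCommittedSegmentsInfo().getGeneration();
    logger.info("--> primary gen [{}], replica gen [{}]", primaryGen, replicaGen);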

@dreamer-89 (Member, Author)

Unrelated failures seen while running this test:

  1. This test rarely trips on doc-count assertions after indexing, which also points to an occasional delay in file sync even though the replica is on the latest segment generation. Interestingly, the test fails more often when there are delete operations.
[2022-08-16T21:16:57,018][INFO ][o.o.i.r.SegmentReplicationIT] [testDeleteOperations] [seed=[B45141644B6AF7EB:1E489223DC0997E2]] after test
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.testDeleteOperations {seed=[B45141644B6AF7EB:1E489223DC0997E2]}" -Dtests.seed=B45141644B6AF7EB -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-JO -Dtests.timezone=Zulu -Druntime.java=17

java.lang.AssertionError: Count is 160 hits but 174 was expected.  Total shards: 1 Successful shards: 1 & 0 shard failures:
	at __randomizedtesting.SeedInfo.seed([B45141644B6AF7EB:1E489223DC0997E2]:0)
	at org.junit.Assert.fail(Assert.java:89)
  2. This test occasionally trips due to an all-shards failure, which appears to be related to a test cleanup issue: [Segment Replication] [BUG] file handle leak in SegmentFileTransferHandler #4205.
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.testDeleteOperations {seed=[7EA71A364B32DA37:4AF4AC9FF7D5B24F]}" -Dtests.seed=7EA71A364B32DA37 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pt-BR -Dtests.timezone=Australia/Darwin -Druntime.java=17

org.opensearch.indices.replication.SegmentReplicationIT > testDeleteOperations {seed=[7EA71A364B32DA37:4AF4AC9FF7D5B24F]} FAILED
    Failed to execute phase [query], all shards failed
