Rollback a primary before recovering from translog #27804

dnhatn · 2017-12-14T04:25:20Z

Today we always recover a primary from the last commit point. However, with the new deletion policy we keep multiple commit points in the existing store, so we have a chance to find a good starting commit point. With a good starting commit point, we may be able to throw away stale operations. This PR rolls back a primary to a starting commit and then recovers from the translog.

Relates #10708

bleskes

Thanks Nhat. I don't really like the fact that we duplicate the logic in the CombinedDeletionPolicy. Instead, I'd try to make static methods in that class which can be reused.

Second, I don't think we should delete commits. Instead, I was thinking of using the ability to open an IndexWriter on a specific commit. For this we'd use the same commit we identify in the deletion policy. Wdyt?

dnhatn · 2017-12-16T04:01:26Z

> I don't think we should delete commits.

@bleskes I agree. I've updated the PR to let an engine open a starting commit point. Could you please take a look? Thank you.

bleskes

Thx Nhat. I left some questions.

bleskes · 2017-12-16T09:03:58Z

core/src/main/java/org/elasticsearch/index/shard/IndexShard.java

+                throw new AssertionError("unknown recovery type: [" + recoveryType + "]");
+        }
+        final IndexCommit startingCommit;
+        if (recoveryType == RecoverySource.Type.EXISTING_STORE) {


Why does this need to be outside the engine? Can't we do it in the engine constructor, as an invariant when opening an index (and translog)?

Yep, I will move it to the Engine.

bleskes · 2017-12-16T09:05:09Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

+        // Snapshotted commits may not have all its required translog.
+        final List<IndexCommit> recoverableCommits = new ArrayList<>();
+        for (IndexCommit commit : commits) {
+            if (minRetainedTranslogGen <= Long.parseLong(commit.getUserData().get(Translog.TRANSLOG_GENERATION_KEY))) {


why do we do this? isn't the logic in indexOfKeptCommits enough to deal with this?

In previous 6.x versions, we kept only the last commit and the translog for that commit. If we take a snapshot and then commit, we will have two commits but translog for the last commit only. During store recovery, if the max_seqno of the last commit is greater than the global checkpoint, the policy will pick the snapshotted commit even though it does not have its full translog.

I see. I think we should keep this class clean and put this prefiltering in the engine, if we open an index created before 6.2. This way it will be clear when we can remove it.
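The prefiltering discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `CommitInfo` and `RecoverableCommits` are hypothetical stand-ins (the real code works on Lucene `IndexCommit` objects and reads the generation from the commit user data), and it assumes a commit is recoverable when its translog generation is at least the minimum retained generation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Lucene IndexCommit: only the fields this sketch needs. */
final class CommitInfo {
    final String name;
    final long translogGen; // models the value stored under Translog.TRANSLOG_GENERATION_KEY

    CommitInfo(String name, long translogGen) {
        this.name = name;
        this.translogGen = translogGen;
    }
}

/** Hypothetical helper mirroring the loop in the diff hunk above. */
final class RecoverableCommits {
    /** Keeps only the commits whose required translog generation is still retained. */
    static List<CommitInfo> filter(List<CommitInfo> commits, long minRetainedTranslogGen) {
        final List<CommitInfo> recoverable = new ArrayList<>();
        for (CommitInfo commit : commits) {
            if (minRetainedTranslogGen <= commit.translogGen) {
                recoverable.add(commit);
            }
        }
        return recoverable;
    }
}
```

With a snapshotted commit at generation 1, a last commit at generation 3, and a minimum retained generation of 3, only the last commit survives the filter.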

bleskes · 2017-12-16T09:05:48Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

+                } else {
+                    assert commits.contains(startingIndexCommit) : "Existing commits must contain the starting commit; " +
+                        "startingCommit [" + startingIndexCommit + "], commits [" + commits + "]";
+                    commits.stream().filter(commit -> startingIndexCommit.equals(commit) == false).forEach(IndexCommit::delete);


why do we need special handling here and need the start commit point? can you explain?

I calculated the retained translog generations incorrectly; I will revert this change. ~~There is an issue with the local checkpoint; I will reach out to discuss with you~~ @bleskes.

dnhatn · 2017-12-17T22:09:22Z

@bleskes, I've moved the starting commit point to the InternalEngine and removed the special case in the CombinedDeletionPolicy. Would you please take another look? Thank you.

bleskes

Thx @dnhatn . I left some more suggestions

bleskes · 2017-12-21T15:46:45Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

+     * @param globalCheckpoint       the persisted global checkpoint from the translog, see {@link Translog#readGlobalCheckpoint(Path)}
+     * @param minRetainedTranslogGen the minimum translog generation is retained, see {@link Translog#readMinReferencedTranslogGen(Path)}
+     */
+    public static IndexCommit startingCommitPoint(List<IndexCommit> commits, long globalCheckpoint, long minRetainedTranslogGen)


Wondering: should we call this findSafeCommit? I suspect this will be useful later too.

bleskes · 2017-12-21T15:56:37Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

@@ -179,6 +182,9 @@ public InternalEngine(EngineConfig engineConfig) {
            mergeScheduler = scheduler = new EngineMergeScheduler(engineConfig.getShardId(), engineConfig.getIndexSettings());
            throttle = new IndexThrottle();
            try {
+                this.startingCommit = getStartingCommitPoint();


why do we need to make it a field? can't it be a parameter to createWriter ?

It is used in two other places: #recoverFromTranslogInternal and #createLocalCheckpointTracker.

I think it should be a parameter to createLocalCheckpointTracker and I left a comment on recoverFromTranslogInternal . The less state the better ;)

bleskes · 2017-12-21T16:04:00Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

        }
-        return localCheckpointTrackerSupplier.apply(maxSeqNo, localCheckpoint);


I like the final approach better: it gives a uniform return value (easier to understand) and makes sure these things are set.

bleskes · 2017-12-21T16:15:40Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

    private void recoverFromTranslogInternal() throws IOException {
        Translog.TranslogGeneration translogGeneration = translog.getGeneration();
        final int opsRecovered;
-        final long translogGen = Long.parseLong(lastCommittedSegmentInfos.getUserData().get(Translog.TRANSLOG_GENERATION_KEY));
+        final long translogGen = Long.parseLong(startingCommit.getUserData().get(Translog.TRANSLOG_GENERATION_KEY));


I think the lastCommittedSegmentInfo should be set to the opening commit, no?

dnhatn · 2017-12-21T18:02:50Z

@bleskes, I have addressed your feedback. Would you please give it another go? Thank you.

dnhatn · 2017-12-21T20:16:08Z

please test this.

bleskes

thx @dnhatn . I think we're very close. I left another bunch of minor comments

bleskes · 2017-12-22T14:35:20Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

@@ -90,12 +91,26 @@ private void updateTranslogDeletionPolicy(final IndexCommit minRequiredCommit, f
        translogDeletionPolicy.setMinTranslogGenerationForRecovery(minRequiredGen);
    }

+    /**
+     * Find a safe commit point from a list of existing commits based on the persisted global checkpoint from translog.
+     * The max seqno of a safe commit point should be at most the global checkpoint from the translog checkpoint.


can we add a huge warning that says that if the index was created before 6.2 we can't guarantee we'll find one and that in that case we return the oldest valid commit?
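For readers following the thread, the selection rule described in the Javadoc and the fallback requested in this comment can be sketched as below. This is an illustration under assumptions, not the merged implementation: `SeqNoCommit` and `SafeCommitSketch` are hypothetical names, commits are assumed to be ordered oldest first, and each commit is reduced to its max_seqno.

```java
import java.util.List;

/** Hypothetical stand-in for a Lucene IndexCommit; only max_seqno matters here. */
final class SeqNoCommit {
    final String name;
    final long maxSeqNo;

    SeqNoCommit(String name, long maxSeqNo) {
        this.name = name;
        this.maxSeqNo = maxSeqNo;
    }
}

final class SafeCommitSketch {
    /**
     * Returns the newest commit whose max_seqno is at most the supplied global checkpoint.
     * Assumes commits are ordered from oldest to newest.
     *
     * WARNING: for an index created before 6.2 there may be no such commit.
     * In that case this sketch falls back to the oldest commit in the list.
     */
    static SeqNoCommit findSafeCommit(List<SeqNoCommit> commits, long globalCheckpoint) {
        if (commits.isEmpty()) {
            throw new IllegalArgumentException("commit list must not be empty");
        }
        // Walk from the newest commit back to the oldest one.
        for (int i = commits.size() - 1; i >= 0; i--) {
            final SeqNoCommit commit = commits.get(i);
            if (commit.maxSeqNo <= globalCheckpoint) {
                return commit;
            }
        }
        return commits.get(0); // no safe commit found: oldest commit as a fallback
    }
}
```

Scanning from the newest commit keeps as much already-committed work as possible while still guaranteeing that nothing above the global checkpoint is baked into the starting point.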

bleskes · 2017-12-22T14:36:13Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

@@ -90,12 +91,26 @@ private void updateTranslogDeletionPolicy(final IndexCommit minRequiredCommit, f
        translogDeletionPolicy.setMinTranslogGenerationForRecovery(minRequiredGen);
    }

+    /**
+     * Find a safe commit point from a list of existing commits based on the persisted global checkpoint from translog.


persisted global checkpoint from the translog -> supplied global checkpoint.

bleskes · 2017-12-22T14:38:55Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

@@ -177,14 +180,17 @@ public InternalEngine(EngineConfig engineConfig) {
            mergeScheduler = scheduler = new EngineMergeScheduler(engineConfig.getShardId(), engineConfig.getIndexSettings());
            throttle = new IndexThrottle();
            try {
-                this.localCheckpointTracker = createLocalCheckpointTracker(localCheckpointTrackerSupplier);
+                final IndexCommit startingCommit = getStartingCommitPoint();
+                assert startingCommit == null || openMode == EngineConfig.OpenMode.OPEN_INDEX_AND_TRANSLOG :


I don't think this boolean does what you want? if start commit is null it will always pass.
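To make the point concrete: the asserted condition is an implication ("a starting commit implies OPEN_INDEX_AND_TRANSLOG"), so it is trivially true whenever the starting commit is null. The sketch below contrasts it with one possible stricter form. The stricter form is an assumption about what might have been intended, not what the PR ended up with, and the `OTHER` enum constant is a placeholder for any other open mode.

```java
final class StartingCommitAssertion {
    /** OPEN_INDEX_AND_TRANSLOG comes from the diff; OTHER is a placeholder for any other mode. */
    enum OpenMode { OPEN_INDEX_AND_TRANSLOG, OTHER }

    /** The condition as written in the diff: passes for every mode when startingCommit is null. */
    static boolean asWritten(Object startingCommit, OpenMode mode) {
        return startingCommit == null || mode == OpenMode.OPEN_INDEX_AND_TRANSLOG;
    }

    /** Hypothetical stricter variant: a starting commit exists exactly when we open index and translog. */
    static boolean bothDirections(Object startingCommit, OpenMode mode) {
        return (startingCommit != null) == (mode == OpenMode.OPEN_INDEX_AND_TRANSLOG);
    }
}
```

With a null starting commit and `OPEN_INDEX_AND_TRANSLOG`, `asWritten` returns true while `bothDirections` returns false, which is the case the comment above is pointing at.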

bleskes · 2017-12-22T14:40:18Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

@@ -243,8 +249,14 @@ private LocalCheckpointTracker createLocalCheckpointTracker(
                localCheckpoint = SequenceNumbers.NO_OPS_PERFORMED;
                break;
            case OPEN_INDEX_AND_TRANSLOG:
+                // When recovering from a previous commit point, we use the local checkpoint from that commit,
+                // but the max_seqno from the last commit. This allows use to throw away stale operations.


why do we need this distinction? can't we keep it conceptually simple? also we recover everything from the translog, so I'm not sure how it can happen that we throw away stuff?

If we use the local checkpoint from the last commit, operations having seqno less than or equal to the local checkpoint will be skipped -> we need to use the local checkpoint from the starting commit.

We fillSeqNoGaps until the max_seqno -> we need to use max_seqno from the last commit to have full history.

> We fillSeqNoGaps until the max_seqno -> we need to use max_seqno from the last commit to have full history.

We do so after we replay the translog, at which point we'll know what the max seqno is (it may be higher than the one in the last commit, but it can't be lower).
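The ordering argued for in this thread can be sketched as follows: replay only operations above the starting commit's local checkpoint, learn the max seqno from the replay itself, and only then fill the gaps. This is a simplified model, not engine code. `TranslogReplaySketch` and its method are hypothetical, and translog operations are reduced to bare sequence numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class TranslogReplaySketch {
    /** Outcome of a replay: which seqnos were applied and which gaps were filled with no-ops. */
    static final class Result {
        final List<Long> replayed = new ArrayList<>();
        final List<Long> filledWithNoOps = new ArrayList<>();
    }

    static Result replayAndFillGaps(long startingLocalCheckpoint, List<Long> translogSeqNos) {
        final Result result = new Result();
        final TreeSet<Long> processed = new TreeSet<>();
        long maxSeqNo = startingLocalCheckpoint;
        for (long seqNo : translogSeqNos) {
            if (seqNo <= startingLocalCheckpoint) {
                continue; // already covered by the starting commit, so it is skipped
            }
            if (processed.add(seqNo)) {
                result.replayed.add(seqNo);
            }
            maxSeqNo = Math.max(maxSeqNo, seqNo); // the max seqno is learned during replay
        }
        // Gaps are filled only after the replay, up to the max seqno seen.
        for (long seqNo = startingLocalCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            if (processed.contains(seqNo) == false) {
                result.filledWithNoOps.add(seqNo);
            }
        }
        return result;
    }
}
```

For a starting local checkpoint of 1 and translog seqnos 0, 1, 2, 4, 5, the sketch replays 2, 4 and 5 and fills seqno 3 with a no-op.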

bleskes · 2017-12-22T14:45:49Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

+                        recoverableCommits.add(commit);
+                    }
+                }
+                assert recoverableCommits.isEmpty() == false : "Unable to select a proper safe commit point; " +


I think you want to say that no commit point was found which could be recovered from the translog.

bleskes · 2017-12-22T14:47:20Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

+            // To avoid this issue, we only select index commits whose translog files are fully retained.
+            if (engineConfig.getIndexSettings().getIndexVersionCreated().before(Version.V_6_2_0)) {
+                final List<IndexCommit> recoverableCommits = new ArrayList<>();
+                final long minRetainedTranslogGen = Translog.readMinReferencedTranslogGen(translogPath);


Since we now open the translog before the writer, can we use the open translog for all this information rather than reading it statically? This will make sure that we use something that went through all the right validations.

bleskes · 2017-12-22T14:48:03Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

@@ -521,7 +559,11 @@ private ExternalSearcherManager createSearcherManager(SearchFactory externalSear
                final DirectoryReader directoryReader = ElasticsearchDirectoryReader.wrap(DirectoryReader.open(indexWriter), shardId);
                internalSearcherManager = new SearcherManager(directoryReader,
                        new RamAccountingSearcherFactory(engineConfig.getCircuitBreakerService()));
-                lastCommittedSegmentInfos = readLastCommittedSegmentInfos(internalSearcherManager, store);
+                if (openMode == EngineConfig.OpenMode.OPEN_INDEX_AND_TRANSLOG) {


Can you explain why we need this distinction?

I think if we open a commit other than the last commit, we should assign lastCommittedSegmentInfos from that commit. I can revert this change and add this logic to recoverFromTranslogInternal. What do you think?

bleskes · 2017-12-22T14:56:13Z

core/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java

@@ -1121,8 +1120,12 @@ public void testRenewSyncFlush() throws Exception {
    }

    public void testSyncedFlushSurvivesEngineRestart() throws IOException {
+        final LongSupplier inSyncGlobalCheckpointSupplier = () -> this.engine.getLocalCheckpointTracker().getCheckpoint();


why is this needed?

bleskes · 2017-12-22T14:57:51Z

core/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java

@@ -1092,6 +1086,7 @@ public void testAcquireIndexCommit() throws Exception {
    public void testSnapshotStore() throws IOException {
        final IndexShard shard = newStartedShard(true);
        indexDoc(shard, "test", "0");
+        shard.updateLocalCheckpointForShard(shard.shardRouting.allocationId().getId(), 0);


I presume you did this to make sure the global checkpoint advances. I wonder if we should fold it into indexDoc? It's sneaky.

bleskes · 2017-12-22T15:00:42Z

core/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java

+        flushShard(shard);
+        assertThat(getShardDocUIDs(shard), containsInAnyOrder("doc-0", "doc-1"));
+        // Simulate resync (without rollback): Noop #1, index #2
+        shard.markSeqNoAsNoop(1, "test");


I think you need a primary term of 2 here?

bleskes

LGTM. I left some minor suggestions. No need for another review cycle. I think the stats (+191, -75) show we got this to the essence. Thanks for all the iterations @dnhatn

bleskes · 2017-12-22T20:00:42Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

@@ -521,7 +550,7 @@ private ExternalSearcherManager createSearcherManager(SearchFactory externalSear
                final DirectoryReader directoryReader = ElasticsearchDirectoryReader.wrap(DirectoryReader.open(indexWriter), shardId);
                internalSearcherManager = new SearcherManager(directoryReader,
                        new RamAccountingSearcherFactory(engineConfig.getCircuitBreakerService()));
-                lastCommittedSegmentInfos = readLastCommittedSegmentInfos(internalSearcherManager, store);
+                lastCommittedSegmentInfos = store.readCommittedSegmentsInfo(indexWriter.getConfig().getIndexCommit());


add a comment that indexWriter.getConfig().getIndexCommit() can be null?

bleskes · 2017-12-22T20:01:21Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

        try {
            final IndexWriterConfig iwc = getIndexWriterConfig(create);
+            assert startingCommit == null || create == false : "Starting commit makes sense only when create=false";
+            iwc.setIndexCommit(startingCommit);


this should be part of getIndexWriterConfig?

bleskes · 2017-12-22T20:37:13Z

core/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java

@@ -2514,6 +2518,7 @@ private Mapping dynamicUpdate() {
    }

    public void testTranslogReplay() throws IOException {
+        final LongSupplier inSyncGlobalCheckpointSupplier = () -> this.engine.getLocalCheckpointTracker().getCheckpoint();


I'm wondering if we should make this the default?

Most of the time we prefer to deal with an in-sync global checkpoint, so making this the default makes sense to me. I will do it in a follow-up.

dnhatn · 2017-12-22T21:22:11Z

@bleskes, Thanks a lot for your review.

Keeping unsafe commits when opening an engine can be problematic: these commits are not safe at recovery time, but they can suddenly become safe in the future. The following issues can happen if unsafe commits are kept in onInit.

1. A replica can use an unsafe commit in peer recovery. This happens when a replica with a safe commit c1 (max_seqno=1) and an unsafe commit c2 (max_seqno=2) recovers from a primary with c1 (max_seqno=1). If a new document (seqno=2) is then added without flushing, the global checkpoint advances to 2, and the replica recovers again, it will use the unsafe commit c2 (max_seqno=2 <= gcp=2) as the starting commit for sequence-based recovery even though c2 contains a stale operation; the document with seqno=2 will not be replicated to the replica.

2. The min translog gen for recovery can go backwards in peer recovery. This happens when a replica has a safe commit c1 (local_checkpoint=1, recovery_translog_gen=1) and an unsafe commit c2 (local_checkpoint=2, recovery_translog_gen=2). The replica recovers from a primary, keeps c2 as the last commit, and sets last_translog_gen to 2. Flushing a new commit on the replica will then cause an exception because the new last commit c3 will have recovery_translog_gen=1. The recovery translog generation of a commit is calculated from the current local checkpoint, and the local checkpoint of c3 is 1 while the local checkpoint of c2 is 2.

3. A commit without translog can be used for recovery. An old index, created before multiple commits were introduced (v6.2), may not have a safe commit. If that index has a snapshotted commit without translog and an unsafe commit, the policy can consider the snapshotted commit a safe commit for recovery even though the commit does not have translog.

These issues can be avoided if the combined deletion policy keeps only the starting commit in onInit.

Relates #27804
Relates #28181
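The first issue, and the proposed remedy of keeping only the starting commit, can be illustrated with a small model. This is a sketch under assumptions rather than the deletion policy itself: `KeepOnlyStartingCommit`, `Commit`, and `pruneOnInit` are hypothetical names, and a commit is treated as safe when its max_seqno is at most the global checkpoint.

```java
import java.util.ArrayList;
import java.util.List;

final class KeepOnlyStartingCommit {
    /** Hypothetical stand-in for a Lucene IndexCommit. */
    static final class Commit {
        final String name;
        final long maxSeqNo;

        Commit(String name, long maxSeqNo) {
            this.name = name;
            this.maxSeqNo = maxSeqNo;
        }
    }

    /** Newest commit with max_seqno <= global checkpoint, else the oldest commit (list is oldest first). */
    static Commit safeCommit(List<Commit> commits, long globalCheckpoint) {
        for (int i = commits.size() - 1; i >= 0; i--) {
            if (commits.get(i).maxSeqNo <= globalCheckpoint) {
                return commits.get(i);
            }
        }
        return commits.get(0);
    }

    /** Models the remedy: when the engine opens, every commit except the starting one is dropped. */
    static List<Commit> pruneOnInit(List<Commit> commits, long globalCheckpointAtOpen) {
        final List<Commit> kept = new ArrayList<>();
        kept.add(safeCommit(commits, globalCheckpointAtOpen));
        return kept;
    }
}
```

With c1 (max_seqno=1) and c2 (max_seqno=2) both kept, a later global checkpoint of 2 selects the stale c2. If the list is pruned when the engine opens at a global checkpoint of 1, only c1 remains, so c2 can no longer be picked afterwards.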

dnhatn added :Distributed Indexing/Recovery :Sequence IDs WIP labels Dec 14, 2017

dnhatn requested a review from bleskes December 14, 2017 04:25

add assert recovery type

0aaee95

dnhatn requested a review from jasontedor December 14, 2017 04:41

dnhatn added 2 commits December 14, 2017 12:19

move prune files to store

6aa3200

cleanUnsafeCommits -> prepareStartingCommitPoint

7c11d37

dnhatn changed the title ~~Rollback primary before recovering from store~~ Rollback a primary before starting to recover from translog Dec 14, 2017

bleskes suggested changes Dec 15, 2017

dnhatn added 2 commits December 15, 2017 21:22

Merge branch 'master' into rollback-primary

d94b06e

open engine from a starting commit point

f6b5c58

dnhatn added v7.0.0 v6.2.0 and removed WIP labels Dec 16, 2017

bleskes reviewed Dec 16, 2017

dnhatn added 5 commits December 17, 2017 13:46

Do not pass startingIndex to Policy

db24f52

remove starting from engine config

63c3cd4

Merge branch 'master' into wip-rollback-primary

fba0784

Open a starting commit directly in engine

0a447e9

test: sync global checkpoint in engine tests

4c4a1c7

dnhatn added 4 commits December 19, 2017 12:38

Merge branch 'master' into rollback-primary

f412572

# Conflicts: # core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java # core/src/main/java/org/elasticsearch/index/store/Store.java # core/src/main/java/org/elasticsearch/indices/recovery/PeerRecoveryTargetService.java

Revert "test: sync global checkpoint in engine tests"

4892029

This reverts commit 4c4a1c7.

Update engine tests

40d5d24

use starting commit in TestTranslog

cebbe6d

bleskes reviewed Dec 21, 2017

dnhatn added 2 commits December 21, 2017 12:44

Merge branch 'master' into rollback-primary

ac01498

address feedbacks

1b1b984

assign lastCommittedSegmentInfos from a starting commit

9db0d41

bleskes suggested changes Dec 22, 2017

dnhatn added 2 commits December 22, 2017 13:58

more feedbacks

b6b5226

test: simplify the gcp

3b265e0

dnhatn added the >enhancement label Dec 22, 2017

bleskes approved these changes Dec 22, 2017

dnhatn added 2 commits December 22, 2017 15:58

Merge branch 'master' into rollback-primary

d5558d0

more feedbacks

2bd3354

move startingCommit close to openMode in IWC

b27ac02

dnhatn merged commit 6629f4a into elastic:master Dec 22, 2017

dnhatn deleted the rollback-primary branch December 22, 2017 23:25

dnhatn added the backport pending label Dec 22, 2017

dnhatn changed the title ~~Rollback a primary before starting to recover from translog~~ Rollback a primary before recovering from translog Dec 22, 2017

bleskes mentioned this pull request Dec 22, 2017

Add Sequence Numbers to write operations #10708

Closed

64 tasks

dnhatn removed the backport pending label Dec 25, 2017

dnhatn mentioned this pull request Jan 15, 2018

Open engine should keep only starting commit #28228

Merged

clintongormley added :Distributed Indexing/Engine and removed :Sequence IDs labels Feb 14, 2018

jpountz removed the :Distributed Indexing/Engine label Jan 29, 2019

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
