Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization #38368

Conversation

@original-brownbear (Member) commented Feb 4, 2019:


Note: I ran a few thousand iterations of the SnapshotResiliencyTests for these changes and they came back green.

@elasticmachine (Collaborator):

Pinging @elastic/es-distributed

@@ -680,14 +692,27 @@ public void applyClusterState(ClusterChangedEvent event) {
try {
if (event.localNodeMaster()) {
// We don't remove old master when master flips anymore. So, we need to check for change in master
if (event.nodesRemoved() || event.previousState().nodes().isLocalNodeElectedMaster() == false) {
processSnapshotsOnRemovedNodes(event);
final SnapshotsInProgress snapshotsInProgress = event.state().custom(SnapshotsInProgress.TYPE);
original-brownbear (Member Author):

Simplified the logic here a little to avoid the endless null-check nesting that makes it really hard to figure out which chain of conditions led to something being executed.
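To make the shape of that change concrete, here is a minimal sketch of the "flatten the nesting" idea, using simplified hypothetical stand-in types rather than the actual SnapshotsService/ClusterState classes:

```java
import java.util.List;

final class FlattenedNullChecksSketch {

    // Hypothetical, heavily simplified stand-ins for the real cluster-state types.
    record SnapshotsInProgress(List<String> entries) {}
    record ClusterState(boolean localNodeMaster, SnapshotsInProgress snapshotsInProgress) {}

    // Before: nested null checks obscure which chain of conditions got us here.
    static void applyNested(ClusterState state) {
        if (state != null) {
            if (state.localNodeMaster()) {
                SnapshotsInProgress inProgress = state.snapshotsInProgress();
                if (inProgress != null) {
                    if (inProgress.entries().isEmpty() == false) {
                        // ... act on the in-progress snapshot entries
                    }
                }
            }
        }
    }

    // After: pull the custom out once and bail out early, so each condition
    // reads on its own line.
    static void applyFlat(ClusterState state) {
        if (state == null || state.localNodeMaster() == false) {
            return;
        }
        final SnapshotsInProgress inProgress = state.snapshotsInProgress();
        if (inProgress == null || inProgress.entries().isEmpty()) {
            return;
        }
        // ... act on the in-progress snapshot entries
    }
}
```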

// 1. Completed snapshots
// 2. Snapshots in state INIT that the previous master failed to start
// 3. Snapshots in any other state that have all their shard tasks completed
snapshotsInProgress.entries().stream().filter(
original-brownbear (Member Author):

All snapshot ending happens here now (the filter is sketched below).

  1. This should prevent any future stale snapshots that have all their shards completed.
  2. It makes it much easier to reason about master failovers.
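The comment above enumerates three categories of entries that the master should finalize on every cluster state update. A minimal, hypothetical sketch of that filter (simplified entry model, not the real SnapshotsInProgress.Entry) could look like this:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class SnapshotEndingFilterSketch {

    enum State { INIT, STARTED, SUCCESS, FAILED }

    record ShardStatus(boolean completed) {}

    // Hypothetical simplified entry; the real one carries repository, snapshot id, etc.
    record Entry(String snapshot, State state, List<ShardStatus> shards) {
        boolean completed() {
            return state == State.SUCCESS || state == State.FAILED;
        }
    }

    // Entries to finalize: (1) completed snapshots, (2) INIT snapshots that this
    // node is not itself in the middle of initializing (i.e. left over from a
    // previous master), (3) snapshots in any other state whose shard tasks have
    // all completed.
    static List<Entry> entriesToEnd(List<Entry> entries, Set<String> initializingSnapshots) {
        return entries.stream()
            .filter(entry -> entry.completed()
                || (entry.state() == State.INIT && initializingSnapshots.contains(entry.snapshot()) == false)
                || (entry.state() != State.INIT && entry.shards().stream().allMatch(ShardStatus::completed)))
            .collect(Collectors.toList());
    }
}
```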

*/
private void removeFinishedSnapshotFromClusterState(ClusterChangedEvent event) {
original-brownbear (Member Author):

This is now automatically covered by the applyClusterState hook.

}
}
entries.add(updatedSnapshot);
} else if (snapshot.state() == State.INIT && initializingSnapshots.contains(snapshot.snapshot()) == false) {
original-brownbear (Member Author):

This should be more stable and easier to reason about. It's weird that we check newMaster against some version of the state and then "later on" run this code based on whether or not we failed over earlier.
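A tiny sketch of the distinction being made here, again with hypothetical simplified interfaces rather than the real ClusterChangedEvent: the failover question is answered against the event currently being applied instead of a flag computed against an earlier state.

```java
final class MasterFailoverCheckSketch {

    // Hypothetical simplified stand-ins for the real cluster-state types.
    interface Nodes { boolean isLocalNodeElectedMaster(); }
    interface ClusterState { Nodes nodes(); }
    interface ClusterChangedEvent { ClusterState state(); ClusterState previousState(); }

    // Brittle: a flag computed once against "some version of the state" and
    // consulted later, possibly against a different state.
    static boolean newMaster;

    // More stable: derive the answer from the exact event being processed.
    static boolean justBecameMaster(ClusterChangedEvent event) {
        return event.state().nodes().isLocalNodeElectedMaster()
            && event.previousState().nodes().isLocalNodeElectedMaster() == false;
    }
}
```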

return false;
private static boolean removedNodesCleanupNeeded(SnapshotsInProgress snapshotsInProgress, List<DiscoveryNode> removedNodes) {
// If at least one shard was running on a removed node - we need to fail it
return removedNodes.isEmpty() == false && snapshotsInProgress.entries().stream().flatMap(snapshot ->
original-brownbear (Member Author):

This could be simplified a lot further now too, since we're already cleaning up snapshots in SUCCESS and INIT state at the top level of applyClusterState.
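For illustration, here is a hedged sketch of what a check like removedNodesCleanupNeeded boils down to, written against simplified hypothetical types (the real method works on SnapshotsInProgress.Entry and DiscoveryNode): is any still-running shard snapshot assigned to a node that just left the cluster?

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class RemovedNodesCleanupSketch {

    // Hypothetical simplified stand-ins for the real types.
    record ShardSnapshotStatus(String nodeId, boolean completed) {}
    record Entry(List<ShardSnapshotStatus> shards) {}
    record DiscoveryNode(String id) {}

    static boolean removedNodesCleanupNeeded(List<Entry> entries, List<DiscoveryNode> removedNodes) {
        if (removedNodes.isEmpty()) {
            return false; // no node left the cluster, nothing to fail over
        }
        final Set<String> removedNodeIds =
            removedNodes.stream().map(DiscoveryNode::id).collect(Collectors.toSet());
        // If at least one shard was still running on a removed node we need to fail it.
        return entries.stream()
            .flatMap(entry -> entry.shards().stream())
            .filter(shard -> shard.completed() == false)
            .map(ShardSnapshotStatus::nodeId)
            .anyMatch(removedNodeIds::contains);
    }
}
```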

* @param failure failure reason or null if snapshot was successful
*/
private void endSnapshot(final SnapshotsInProgress.Entry entry, final String failure) {
private void endSnapshot(final SnapshotsInProgress.Entry entry) {
original-brownbear (Member Author):

Just one private method now; the potential failure message lives in the cluster state.
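A minimal sketch of that idea, with a hypothetical cut-down Entry (the real SnapshotsInProgress.Entry carries far more state): because the failure reason travels inside the cluster-state entry, finalization only needs the entry itself.

```java
final class EndSnapshotSketch {

    // Hypothetical simplified entry; "failure == null" means the snapshot succeeded.
    record Entry(String snapshot, String failure) {
        boolean failed() {
            return failure != null;
        }
    }

    // Single private method: success vs. failure is read off the entry instead
    // of being passed alongside it as a separate argument.
    static void endSnapshot(Entry entry) {
        if (entry.failed()) {
            System.out.println("finalizing failed snapshot " + entry.snapshot() + ": " + entry.failure());
        } else {
            System.out.println("finalizing successful snapshot " + entry.snapshot());
        }
    }
}
```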

SnapshotsStatusResponse status =
client.admin().cluster().prepareSnapshotStatus("repository").setSnapshots("snap").get();
assertThat(status.getSnapshots().iterator().next().getState(), equalTo(State.ABORTED));
} catch (Exception e) {
original-brownbear (Member Author):

This isn't necessary anymore; with this fix we'll never create a broken repository.

@original-brownbear changed the title from "Fix Concurrent Snapshot Ending" to "[WIP] Fix Concurrent Snapshot Ending" on Feb 4, 2019
@@ -156,9 +154,6 @@ public void clusterChanged(ClusterChangedEvent event) {
logger.info("--> got exception from race in master operation retries");
} else {
logger.info("--> got exception from hanged master", ex);
assertThat(cause, instanceOf(MasterNotDiscoveredException.class));
original-brownbear (Member Author):

The timing here has changed and in most cases we now run into

[2019-02-05T09:27:33,492][INFO ][o.e.d.SnapshotDisruptionIT] [testDisruptionOnSnapshotInitialization] --> got exception from hanged master
java.util.concurrent.ExecutionException: RemoteTransportException[[node_tm0][127.0.0.1:46407][cluster:admin/snapshot/create]]; nested: InvalidSnapshotNameException[[test-repo:test-snap-2] Invalid snapshot name [test-snap-2], snapshot with the same name already exists];

from the retries on the hanged master. I relaxed the assertion, as we did elsewhere for this case.
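Not the exact assertion from the test, but a sketch of the kind of relaxation described, assuming the standard Hamcrest matchers and the two Elasticsearch exception types involved:

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.anyOf;
import static org.hamcrest.Matchers.instanceOf;

import org.elasticsearch.discovery.MasterNotDiscoveredException;
import org.elasticsearch.snapshots.InvalidSnapshotNameException;

final class RelaxedCauseAssertionSketch {

    static void assertCause(Throwable cause) {
        // Depending on timing, the retries against the hanged master may surface an
        // InvalidSnapshotNameException instead of a MasterNotDiscoveredException,
        // so accept either one rather than asserting on a single type.
        assertThat(cause, anyOf(
            instanceOf(MasterNotDiscoveredException.class),
            instanceOf(InvalidSnapshotNameException.class)));
    }
}
```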

@original-brownbear changed the title from "[WIP] Fix Concurrent Snapshot Ending" to "Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization" on Feb 5, 2019
@original-brownbear (Member Author):

Jenkins run elasticsearch-ci/2

@original-brownbear (Member Author) commented Feb 5, 2019:

@ywelsch I tried to fix this in a shorter manner (i.e. without having to make wire protocol changes to SnapshotsInProgress), but I eventually decided against it:

  • The resiliency tests could not catch the problem here because it came from two concurrent threads in the snapshot worker pool interfering with each other (and obviously we only execute those sequentially in the deterministic task queue).
  • Simply deduplicating the endSnapshot calls introduces some new concurrency spots to think about, so I figured it's safer to go with this approach (essentially the one you took in your branch as well): it completely eliminates the risk of stuck snapshots where all shards have finished and gets away from the brittle if (newMaster)-style logic. Unfortunately, doing all ending of snapshots via the cluster state requires carrying the failure message in the state ...

Take a look when you have a chance (the diff isn't so large with whitespace ignored :)).

@original-brownbear (Member Author):

test failure is due to #38412

@original-brownbear merged commit 2f6afd2 into elastic:master on Feb 5, 2019
@original-brownbear deleted the fix-concurrent-snapshot-ending branch on February 5, 2019 at 15:44
@original-brownbear (Member Author):

@ywelsch thanks!

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 28, 2019
* Backport of various snapshot stability fixes from `master` to `6.7`
* Includes elastic#38368, elastic#38025 and elastic#37612
original-brownbear added a commit that referenced this pull request Mar 1, 2019
* Snapshot Stability Fixes

* Backport of various snapshot stability fixes from `master` to `6.7`
* Includes #38368, #38025 and #37612
original-brownbear added a commit that referenced this pull request Mar 4, 2019
* Backport of various snapshot stability fixes from `master` to `6.7` making the snapshot logic in `6.7` equivalent to that in `master` functionally
* Includes #38368, #38025 and #37612
kovrus added a commit to crate/crate that referenced this pull request Apr 24, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization
    (elastic/elasticsearch#38368)
kovrus added five more commits to crate/crate referencing this pull request on Apr 25 and Apr 26, 2019, and mergify bot pushed a commit to crate/crate on Apr 26, 2019, all with the same message as the commit above.
Labels
>bug, :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), v6.7.0, v7.0.0-beta1
Development

Successfully merging this pull request may close these issues.

testAbortedSnapshotDuringInitDoesNotStart fails with ClassCastException
4 participants