Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization #38368

Conversation

@original-brownbear (Member) commented Feb 4, 2019:


Note: I ran a few thousand iterations of the SnapshotResiliencyTests for these changes and they came back green.

@elasticmachine (Collaborator):

Pinging @elastic/es-distributed

@@ -680,14 +692,27 @@ public void applyClusterState(ClusterChangedEvent event) {
try {
if (event.localNodeMaster()) {
// We don't remove old master when master flips anymore. So, we need to check for change in master
if (event.nodesRemoved() || event.previousState().nodes().isLocalNodeElectedMaster() == false) {
processSnapshotsOnRemovedNodes(event);
final SnapshotsInProgress snapshotsInProgress = event.state().custom(SnapshotsInProgress.TYPE);
original-brownbear (Member Author):

Simplified the logic here a little to avoid the endless null-check nesting that makes it really hard to figure out which chain of conditions led to something being executed.
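To make the shape of that change concrete, here is a minimal sketch of the "flatten the nesting" idea, using simplified hypothetical stand-in types rather than the actual SnapshotsService/ClusterState classes:

```java
import java.util.List;

final class FlattenedNullChecksSketch {

    // Hypothetical, heavily simplified stand-ins for the real cluster-state types.
    record SnapshotsInProgress(List<String> entries) {}
    record ClusterState(boolean localNodeMaster, SnapshotsInProgress snapshotsInProgress) {}

    // Before: nested null checks obscure which chain of conditions got us here.
    static void applyNested(ClusterState state) {
        if (state != null) {
            if (state.localNodeMaster()) {
                SnapshotsInProgress inProgress = state.snapshotsInProgress();
                if (inProgress != null) {
                    if (inProgress.entries().isEmpty() == false) {
                        // ... act on the in-progress snapshot entries
                    }
                }
            }
        }
    }

    // After: pull the custom out once and bail out early, so each condition
    // reads on its own line.
    static void applyFlat(ClusterState state) {
        if (state == null || state.localNodeMaster() == false) {
            return;
        }
        final SnapshotsInProgress inProgress = state.snapshotsInProgress();
        if (inProgress == null || inProgress.entries().isEmpty()) {
            return;
        }
        // ... act on the in-progress snapshot entries
    }
}
```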

// 1. Completed snapshots
// 2. Snapshots in state INIT that the previous master failed to start
// 3. Snapshots in any other state that have all their shard tasks completed
snapshotsInProgress.entries().stream().filter(
original-brownbear (Member Author):

All snapshot ending happens here now (the filter is sketched below).

  1. This should prevent any future stale snapshots that have all their shards completed.
  2. It makes it much easier to reason about master failovers.
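The comment above enumerates three categories of entries that the master should finalize on every cluster state update. A minimal, hypothetical sketch of that filter (simplified entry model, not the real SnapshotsInProgress.Entry) could look like this:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class SnapshotEndingFilterSketch {

    enum State { INIT, STARTED, SUCCESS, FAILED }

    record ShardStatus(boolean completed) {}

    // Hypothetical simplified entry; the real one carries repository, snapshot id, etc.
    record Entry(String snapshot, State state, List<ShardStatus> shards) {
        boolean completed() {
            return state == State.SUCCESS || state == State.FAILED;
        }
    }

    // Entries to finalize: (1) completed snapshots, (2) INIT snapshots that this
    // node is not itself in the middle of initializing (i.e. left over from a
    // previous master), (3) snapshots in any other state whose shard tasks have
    // all completed.
    static List<Entry> entriesToEnd(List<Entry> entries, Set<String> initializingSnapshots) {
        return entries.stream()
            .filter(entry -> entry.completed()
                || (entry.state() == State.INIT && initializingSnapshots.contains(entry.snapshot()) == false)
                || (entry.state() != State.INIT && entry.shards().stream().allMatch(ShardStatus::completed)))
            .collect(Collectors.toList());
    }
}
```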

*/
private void removeFinishedSnapshotFromClusterState(ClusterChangedEvent event) {
original-brownbear (Member Author):

This is now automatically covered by the applyClusterState hook.

}
}
entries.add(updatedSnapshot);
} else if (snapshot.state() == State.INIT && initializingSnapshots.contains(snapshot.snapshot()) == false) {
original-brownbear (Member Author):

This should be more stable and easier to reason about. It's weird that we check newMaster against some version of the state and then "later on" run this code based on whether or not we failed over earlier.
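A tiny sketch of the distinction being made here, again with hypothetical simplified interfaces rather than the real ClusterChangedEvent: the failover question is answered against the event currently being applied instead of a flag computed against an earlier state.

```java
final class MasterFailoverCheckSketch {

    // Hypothetical simplified stand-ins for the real cluster-state types.
    interface Nodes { boolean isLocalNodeElectedMaster(); }
    interface ClusterState { Nodes nodes(); }
    interface ClusterChangedEvent { ClusterState state(); ClusterState previousState(); }

    // Brittle: a flag computed once against "some version of the state" and
    // consulted later, possibly against a different state.
    static boolean newMaster;

    // More stable: derive the answer from the exact event being processed.
    static boolean justBecameMaster(ClusterChangedEvent event) {
        return event.state().nodes().isLocalNodeElectedMaster()
            && event.previousState().nodes().isLocalNodeElectedMaster() == false;
    }
}
```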

return false;
private static boolean removedNodesCleanupNeeded(SnapshotsInProgress snapshotsInProgress, List<DiscoveryNode> removedNodes) {
// If at least one shard was running on a removed node - we need to fail it
return removedNodes.isEmpty() == false && snapshotsInProgress.entries().stream().flatMap(snapshot ->
original-brownbear (Member Author):

This could be simplified a lot further now too, since we're already cleaning up snapshots in SUCCESS and INIT state at the top level of applyClusterState.
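For illustration, here is a hedged sketch of what a check like removedNodesCleanupNeeded boils down to, written against simplified hypothetical types (the real method works on SnapshotsInProgress.Entry and DiscoveryNode): is any still-running shard snapshot assigned to a node that just left the cluster?

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class RemovedNodesCleanupSketch {

    // Hypothetical simplified stand-ins for the real types.
    record ShardSnapshotStatus(String nodeId, boolean completed) {}
    record Entry(List<ShardSnapshotStatus> shards) {}
    record DiscoveryNode(String id) {}

    static boolean removedNodesCleanupNeeded(List<Entry> entries, List<DiscoveryNode> removedNodes) {
        if (removedNodes.isEmpty()) {
            return false; // no node left the cluster, nothing to fail over
        }
        final Set<String> removedNodeIds =
            removedNodes.stream().map(DiscoveryNode::id).collect(Collectors.toSet());
        // If at least one shard was still running on a removed node we need to fail it.
        return entries.stream()
            .flatMap(entry -> entry.shards().stream())
            .filter(shard -> shard.completed() == false)
            .map(ShardSnapshotStatus::nodeId)
            .anyMatch(removedNodeIds::contains);
    }
}
```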

* @param failure failure reason or null if snapshot was successful
*/
private void endSnapshot(final SnapshotsInProgress.Entry entry, final String failure) {
private void endSnapshot(final SnapshotsInProgress.Entry entry) {
original-brownbear (Member Author):

Just one private method now; the potential failure message lives in the cluster state.
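A minimal sketch of that idea, with a hypothetical cut-down Entry (the real SnapshotsInProgress.Entry carries far more state): because the failure reason travels inside the cluster-state entry, finalization only needs the entry itself.

```java
final class EndSnapshotSketch {

    // Hypothetical simplified entry; "failure == null" means the snapshot succeeded.
    record Entry(String snapshot, String failure) {
        boolean failed() {
            return failure != null;
        }
    }

    // Single private method: success vs. failure is read off the entry instead
    // of being passed alongside it as a separate argument.
    static void endSnapshot(Entry entry) {
        if (entry.failed()) {
            System.out.println("finalizing failed snapshot " + entry.snapshot() + ": " + entry.failure());
        } else {
            System.out.println("finalizing successful snapshot " + entry.snapshot());
        }
    }
}
```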

SnapshotsStatusResponse status =
client.admin().cluster().prepareSnapshotStatus("repository").setSnapshots("snap").get();
assertThat(status.getSnapshots().iterator().next().getState(), equalTo(State.ABORTED));
} catch (Exception e) {
original-brownbear (Member Author):

This isn't necessary anymore; with this fix we'll never create a broken repository.

@original-brownbear changed the title from "Fix Concurrent Snapshot Ending" to "[WIP] Fix Concurrent Snapshot Ending" on Feb 4, 2019
@@ -156,9 +154,6 @@ public void clusterChanged(ClusterChangedEvent event) {
logger.info("--> got exception from race in master operation retries");
} else {
logger.info("--> got exception from hanged master", ex);
assertThat(cause, instanceOf(MasterNotDiscoveredException.class));
original-brownbear (Member Author):

The timing here has changed and in most cases we now run into

[2019-02-05T09:27:33,492][INFO ][o.e.d.SnapshotDisruptionIT] [testDisruptionOnSnapshotInitialization] --> got exception from hanged master
java.util.concurrent.ExecutionException: RemoteTransportException[[node_tm0][127.0.0.1:46407][cluster:admin/snapshot/create]]; nested: InvalidSnapshotNameException[[test-repo:test-snap-2] Invalid snapshot name [test-snap-2], snapshot with the same name already exists];

from the retries on the hanged master. I relaxed the assertion, as we did elsewhere for this case.
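Not the exact assertion from the test, but a sketch of the kind of relaxation described, assuming the standard Hamcrest matchers and the two Elasticsearch exception types involved:

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.anyOf;
import static org.hamcrest.Matchers.instanceOf;

import org.elasticsearch.discovery.MasterNotDiscoveredException;
import org.elasticsearch.snapshots.InvalidSnapshotNameException;

final class RelaxedCauseAssertionSketch {

    static void assertCause(Throwable cause) {
        // Depending on timing, the retries against the hanged master may surface an
        // InvalidSnapshotNameException instead of a MasterNotDiscoveredException,
        // so accept either one rather than asserting on a single type.
        assertThat(cause, anyOf(
            instanceOf(MasterNotDiscoveredException.class),
            instanceOf(InvalidSnapshotNameException.class)));
    }
}
```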

@original-brownbear changed the title from "[WIP] Fix Concurrent Snapshot Ending" to "Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization" on Feb 5, 2019
@original-brownbear (Member Author):

Jenkins run elasticsearch-ci/2

@original-brownbear (Member Author) commented Feb 5, 2019:

@ywelsch I tried to fix this in a shorter manner (i.e. without having to make wire protocol changes to SnapshotsInProgress), but I eventually decided against it:

  • The resiliency tests could not catch the problem here because it came from two concurrent threads in the snapshot worker pool interfering with each other (and obviously we only execute those sequentially in the deterministic task queue).
  • Simply deduplicating the endSnapshot calls introduces some new concurrency spots to think about, so I figured it's safer to go with this approach (essentially the one you took in your branch as well): it completely eliminates the risk of stuck snapshots where all shards have finished and gets away from the brittle if (newMaster)-style logic. Unfortunately, doing all ending of snapshots via the cluster state requires carrying the failure message in the state ...

Take a look when you have a chance (the diff isn't so large with whitespace ignored :)).

@original-brownbear (Member Author):

test failure is due to #38412

@original-brownbear merged commit 2f6afd2 into elastic:master on Feb 5, 2019
@original-brownbear deleted the fix-concurrent-snapshot-ending branch on February 5, 2019 at 15:44
@original-brownbear (Member Author):

@ywelsch thanks!

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 28, 2019
* Backport of various snapshot stability fixes from `master` to `6.7`
* Includes elastic#38368, elastic#38025 and elastic#37612
original-brownbear added a commit that referenced this pull request Mar 1, 2019
* Snapshot Stability Fixes

* Backport of various snapshot stability fixes from `master` to `6.7`
* Includes #38368, #38025 and #37612
original-brownbear added a commit that referenced this pull request Mar 4, 2019
* Backport of various snapshot stability fixes from `master` to `6.7` making the snapshot logic in `6.7` equivalent to that in `master` functionally
* Includes #38368, #38025 and #37612
kovrus added a commit to crate/crate that referenced this pull request Apr 24, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization
    (elastic/elasticsearch#38368)
kovrus added five more commits to crate/crate referencing this pull request on Apr 25 and Apr 26, 2019, and mergify bot pushed a commit to crate/crate on Apr 26, 2019, all with the same message as the commit above.
Labels
>bug, :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), v6.7.0, v7.0.0-beta1
Development

Successfully merging this pull request may close these issues.

testAbortedSnapshotDuringInitDoesNotStart fails with ClassCastException
4 participants