Skip to content

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Jan 5, 2026

Refactor for SnapshotResiliencyTests to make it easier to extend. The main changes are (1) extract TestCluster related code into its own file and (2) add a separate init phase to the test node so that it can be customized by future subclasses. It also extracts serveral protected methods from existing code so they can be customized in future as well.

Relates: #138333

Also support waiting for a condition before starting the restart node.
@ywangd ywangd requested a review from DaveCTurner January 5, 2026 07:54
@ywangd ywangd added the >test Issues or PRs that are addressing/adding tests label Jan 5, 2026
@ywangd ywangd added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.4.0 labels Jan 5, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jan 5, 2026
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks broadly good to me, just a few questions/nits.

}
return true;
}).sorted(Comparator.comparing(n -> n.node.getName())).toList();
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(ESTestCase.randomFrom(dataNodes));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: slight preference for qualified imports rather than qualifying the method calls

Suggested change
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(ESTestCase.randomFrom(dataNodes));
return dataNodes.isEmpty() ? Optional.empty() : Optional.ofNullable(randomFrom(dataNodes));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure pushed 94a9744

}
}

public static class TestClusterNode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why raise this to a first-level inner static class rather than remaining an inner class of TestClusterNodes? I'm guessing it's so that we can subclass it later? But if you're in a subclass of TestClusterNodes can you not subclass TestClusterNode? Haven't tried it, just curious really.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sorta generally prefer top level classes. But it is also ok to keep it as an inner class of TestClusterNodes. We can still subclass it as you suggested which is what happened for DeterministicQueue#DeterministicThreadPool. So I changed it accordingly in 9d6806d
Since it touches the constructor, the change also moved both nodeSettings() and createPluginsService() methods to TestClusterNodes so that we can remove the this-escape.

Copy link
Member Author

@ywangd ywangd Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you want TestClusterNodes to be put back into SnapshotResiliencyTests. I'd like to still make it a static class in that case. That said, I find it overall more manageable with two separate files since both of them have quite large sizes and we could potentially use the test cluster for other things in future.

Comment on lines 1112 to 1113
actions.put(
TransportClusterHealthAction.TYPE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to register this action now? Seems like a little more than a pure refactor if so. If not, can we delay it until it's needed?

Copy link
Member Author

@ywangd ywangd Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this action registration along with the doInit method in 1dbc330

)
);

doInit(actions, actionFilters);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here, this hook doesn't do anything yet and adding it will be relatively noise-free later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above


private Coordinator coordinator;

@SuppressWarnings("this-escape")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really allow a this to escape, given that we now have a separate init() method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I see now, I thought it picked this up earlier in the precommit run than it did. The issue is the calls to the instance methods nodeSettings() and createPluginsService(), which do not leak this here but might do so in a subclass I guess. But I wonder, do we need these to be overridable by subclasses or could we pass them in as contructor parameters instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented above, it is removed in 9d6806d


runUntil(documentCountVerified::get, TimeUnit.MINUTES.toMillis(5L));
assertNotNull(safeResult(createSnapshotResponseListener));
assertNotNull(safeResult(restoreSnapshotResponseListener));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm do we need to drop this assertion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake. Restored in 8970408 Thanks for catching it.

Comment on lines 934 to 936
.filter(ShardRouting::primary)
.findFirst()
.orElseThrow(() -> new AssertionError("no primary shard found"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference for deferring this change in behaviour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, restored in 266bd8b

null,
emptySet()
);
// TODO: The indexNameExpressionResolver does not use the same threadContext and projectResolver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this important? It sounds important...

Copy link
Member Author

@ywangd ywangd Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Practically it is not important since the tests do not rely on either multi-project or system indices. I didn't touch it because there are other inconsistencies with projectResolver and I was not sure whether it was the right PR to fix them. Since we now have these changes separately. I pushed 4b5d66a to fix them.

@ywangd ywangd requested a review from DaveCTurner January 6, 2026 02:16
@ywangd
Copy link
Member Author

ywangd commented Jan 8, 2026

@DaveCTurner This PR is ready for another look when you have time. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants