Skip to content

Conversation

@DaveCTurner
Copy link
Contributor

Today the background sync of retention leases is a fire-and-forget action, and
therefore so is IndexShard#syncRetentionLeases(). However, in tests sometimes
we want to sync the retention leases and then wait for its completion before
continuing, and today this wait has to be a busy-wait. This change allows us to
listen for completion instead.

Today the background sync of retention leases is a fire-and-forget action, and
therefore so is `IndexShard#syncRetentionLeases()`. However, in tests sometimes
we want to sync the retention leases and then wait for its completion before
continuing, and today this wait has to be a busy-wait. This change allows us to
listen for completion instead.
@DaveCTurner DaveCTurner added >non-issue :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Feb 21, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Feb 21, 2019

This change is groundwork for the introduction of peer-recovery retention leases: if history retention is to be coordinated by the primary we must explicitly sync the leases to release history in a reasonable number of other tests.

I.e. relates #39133.

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a question but this looks good.


private void syncRetentionLeases() {
sync(IndexShard::syncRetentionLeases, "retention lease");
sync(primary -> primary.syncRetentionLeases(wrap(() -> {})), "retention lease");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just use ActionListener.wrap without a static import?

* @param listener the callback when the retention lease is successfully removed and synced to replicas
*/
public void removeRetentionLease(final String id, final ActionListener<ReplicationResponse> listener) {
public void removeRetentionLease(final String id, final ActionListener<Void> listener) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we are not really using the response but is there any reason for this change? Is it okay if we make both sync and backgroundSync return ReplicationResponse instead of Void?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Primarily because I think RetentionLeaseSyncer.EMPTY should notify the listener (the contract for a method that accepts an ActionListener is that it is eventually notified, which today we do not satisfy) and yet that implementation does not have a ReplicationResponse with which to notify it. I could make one up, but it felt wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move RetentionLeaseSyncer.EMPTY to tests and make it notify listeners with an empty ReplicationResponse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I reverted back to using a made-up ReplicationResponse in 50acab7; I think I'll leave RetentionLeaseSyncer.EMPTY where it is.

}
futures.forEach(f -> {
try {
f.get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe use actionGet.

group.addRetentionLease(Integer.toString(i), randomNonNegativeLong(), "test-" + i, addLeaseFuture);
RetentionLeaseSyncAction.Request request = ((SyncRetentionLeasesResponse) addLeaseFuture.actionGet()).syncRequest;
final IndexShard newPrimary = randomFrom(group.getReplicas());
final AtomicReference<RetentionLeases> latestRetentionLeasesOnNewPrimary = new AtomicReference<>(RetentionLeases.EMPTY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test becomes a bit more tangled since we change to a Void listener.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does. I nearly backed out the whole idea because of this test. OTOH it has less casting after this change.

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change to production code to support testing. Sometimes that is the right thing. Yet, I am not following enough the thinking to make this change. Can you explain more why assert busy is not enough for these cases? Since in production these things are only going to eventually happen, I feel like our tests should reflect that rather than making them artificial with an actual sync point that would never exist in production code?

I am marking this as "Request changes" only to be clear that I would like to have a better understanding for this change before it is merged.

@DaveCTurner
Copy link
Contributor Author

We have a number of tests that perform some other "happens eventually" actions in a synchronous fashion (refreshes/flushes) and then assert that history has been discarded or segments have been deleted. Although technically we're asserting that history has been discarded only if the user performed an explicit refresh/flush in those cases, what we're trying to ensure is really that history is discarded eventually even without that explicit action. I think this is a reasonable extrapolation, but if we're not happy with this then we should consider moving these tests to assertBusy() on the refresh/flush having happened too.

I am not a huge fan of assertBusy(). There are places where we have to use it, but it leads to slower tests that rely on timely activity that CI can't always deliver, and can be avoided here with a fairly small exposed-for-tests change to the API of the RetentionLeaseSyncer interface that itself maybe wouldn't exist if we didn't need it for tests.

@DaveCTurner DaveCTurner requested a review from dnhatn February 27, 2019 17:31
@DaveCTurner
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/docbldesx

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thanks @DaveCTurner.

@jasontedor Are you okay with this change?


// now ensure that the retention leases are the same
assertBusy(() -> {
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to call syncRetentionLeases or keep assertBusy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's sufficient not to do so because we are asserting that leases were not renewed, so checking the primary is enough. But it does not cause any harm, so I added an explicit sync in fc4459e.

@DaveCTurner
Copy link
Contributor Author

@jasontedor it'd be good to resolve this one way or another - could you let me know your thoughts?

@jasontedor
Copy link
Member

@DaveCTurner I still feel how I felt when I left my initial comment. That is, I understand this cleans up some assertBusys and makes some tests easier to reason about, but I still have concerns about tests running code that is a little bit different than what production code would do.

As I went back and read this thread again, and read the implementation, I saw a comment that I missed before.

This change is groundwork for the introduction of peer-recovery retention leases: if history retention is to be coordinated by the primary we must explicitly sync the leases to release history in a reasonable number of other tests.

I am not following this, can you elaborate?

@polyfractal polyfractal removed the v7.0.0 label Apr 9, 2019
@colings86 colings86 added v6.7.3 and removed v6.7.2 labels Apr 17, 2019
@DaveCTurner
Copy link
Contributor Author

On reflection this isn't obviously necessary for peer recovery retention leases: I think I can use assertBusy wherever necessary.

@DaveCTurner DaveCTurner deleted the 2019-02-21-sync-retention-leases-with-listeners branch April 25, 2019 14:34
@colings86 colings86 added v6.7.2 and removed v6.7.3 labels Apr 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. >non-issue v6.7.2 v7.2.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants