Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timing out stale remote master history #86936

Merged
merged 9 commits into from
May 24, 2022

Conversation

masseyke
Copy link
Member

This change makes it so that the remote master history that has been fetched will time out after 5 minutes (by default). The MasterHistoryService will return its value as null if it is older than the configured timeout. This way MasterHistoryService clients don't report status based on out of date information.

@masseyke masseyke marked this pull request as ready for review May 19, 2022 18:56
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label May 19, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith.

This generally LGTM - just left a few questions

*/
public static final Setting<TimeValue> MAX_USABLE_REMOTE_HISTORY_AGE_SETTING = Setting.timeSetting(
"master_history.max_usable_remote_history_age",
DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this mandate a min-value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think it ought to be? I thought briefly about it, and then thought maybe there would be some case where we would want to tell a user to set it to 0 or negative to basically skip checking remote master history (probably not, but we might be kicking ourselves later if we put in some arbitrary minimum).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use positiveTimeSetting to at least not worry about negative values?

private static final Logger logger = LogManager.getLogger(MasterHistoryService.class);

private static final TimeValue DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE = new TimeValue(5, TimeUnit.MINUTES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "usable" a bit redundant in the name? Would going with "time to live" be a bit more intuitive? (given the already established understanding in caching technologies) e.g. "remote_history_time_to_live" ?

*/
long acceptableRemoteHistoryTime = currentTimeMillisSupplier.getAsLong() - acceptableRemoteHistoryAge.getMillis();
if (remoteHistoryOrExceptionCopy.creationTimestamp < acceptableRemoteHistoryTime) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we want to also at some point null-ify remoteHistoryOrException ? (we'd likely need an AtomicReference or such to do a compare and swap)

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that there was no harm in holding onto a single one, and it avoids having to synchronize access to it. Synchronizing it would not be super expensive (we'd need to read it, do the time calculation, and set it to null in an atomic action), but it seemed like an unnecessary complication. Is it the memory that you're concerned about? We cap these things to have at most 50 entries, so I'd expect at most a few KB here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mostly worried about the remoteHistoryOrException leaking outside the service by other means than getRemoteMasterHistory or the service evolving and using the remoteHistoryOrException in a private way (assuming it's null when it's stale, which would not be a wild assumption given what the getter returns) - @DaveCTurner would you have a strong opinion here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That second concern would maybe mean having a thread to time this thing out. That seems like something we ought to deal with if we really need it in the future, but seems overly complex for the current situation doesn't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to leave it in place too. You never know, it might even end up being useful in a heap dump. Maybe add a comment to the field indicating that it might be stale.

For future reference it wouldn't be a big deal to clear it after the timeout either:

        transportService.getThreadPool().schedule(() -> {/* clear field*/}, acceptableRemoteHistoryAge, Names.SAME);

(also there's no need for a ThreadPool constructor argument if you have a TransportService)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ let's leave it as is then

@masseyke
Copy link
Member Author

@elasticmachine run elasticsearch-ci/part-1

@masseyke masseyke requested a review from andreidan May 23, 2022 18:15
}
}
);
}

record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception) { // non-private for testing
// non-private for testing
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) {
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimeMillis) {

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a minor rename suggestion

Thanks Keith


public RemoteHistoryOrException {
if (remoteHistory != null && exception != null) {
throw new IllegalArgumentException("Remote history and exception cannot both be non-null");
}
}

RemoteHistoryOrException(List<DiscoveryNode> remoteHistory) {
this(remoteHistory, null);
RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, long creationTimestamp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the creationTimestamp to include the unit everywhere? ( https://github.com/elastic/elasticsearch/pull/86936/files#r880273628 )

@masseyke
Copy link
Member Author

@elasticmachine run elasticsearch-ci/part-1

@masseyke
Copy link
Member Author

@elasticmachine run elasticsearch-ci/part-2

@masseyke masseyke merged commit 3630a14 into elastic:master May 24, 2022
@masseyke masseyke deleted the fix/age-out-remote-master-history branch May 24, 2022 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants