-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timing out stale remote master history #86936
Timing out stale remote master history #86936
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Keith.
This generally LGTM - just left a few questions
*/ | ||
public static final Setting<TimeValue> MAX_USABLE_REMOTE_HISTORY_AGE_SETTING = Setting.timeSetting( | ||
"master_history.max_usable_remote_history_age", | ||
DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this mandate a min-value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think it ought to be? I thought briefly about it, and then thought maybe there would be some case where we would want to tell a user to set it to 0 or negative to basically skip checking remote master history (probably not, but we might be kicking ourselves later if we put in some arbitrary minimum).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use positiveTimeSetting
to at least not worry about negative values?
private static final Logger logger = LogManager.getLogger(MasterHistoryService.class); | ||
|
||
private static final TimeValue DEFAULT_MAX_USABLE_REMOTE_HISTORY_AGE = new TimeValue(5, TimeUnit.MINUTES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "usable" a bit redundant in the name? Would going with "time to live" be a bit more intuitive? (given the already established understanding in caching technologies) e.g. "remote_history_time_to_live" ?
*/ | ||
long acceptableRemoteHistoryTime = currentTimeMillisSupplier.getAsLong() - acceptableRemoteHistoryAge.getMillis(); | ||
if (remoteHistoryOrExceptionCopy.creationTimestamp < acceptableRemoteHistoryTime) { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we want to also at some point null-ify remoteHistoryOrException
? (we'd likely need an AtomicReference or such to do a compare and swap)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that there was no harm in holding onto a single one, and it avoids having to synchronize access to it. Synchronizing it would not be super expensive (we'd need to read it, do the time calculation, and set it to null in an atomic action), but it seemed like an unnecessary complication. Is it the memory that you're concerned about? We cap these things to have at most 50 entries, so I'd expect at most a few KB here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mostly worried about the remoteHistoryOrException
leaking outside the service by other means than getRemoteMasterHistory
or the service evolving and using the remoteHistoryOrException
in a private way (assuming it's null
when it's stale, which would not be a wild assumption given what the getter returns) - @DaveCTurner would you have a strong opinion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That second concern would maybe mean having a thread to time this thing out. That seems like something we ought to deal with if we really need it in the future, but seems overly complex for the current situation doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am inclined to leave it in place too. You never know, it might even end up being useful in a heap dump. Maybe add a comment to the field indicating that it might be stale.
For future reference it wouldn't be a big deal to clear it after the timeout either:
transportService.getThreadPool().schedule(() -> {/* clear field*/}, acceptableRemoteHistoryAge, Names.SAME);
(also there's no need for a ThreadPool
constructor argument if you have a TransportService
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ let's leave it as is then
@elasticmachine run elasticsearch-ci/part-1 |
} | ||
} | ||
); | ||
} | ||
|
||
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception) { // non-private for testing | ||
// non-private for testing | ||
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimestamp) { | |
record RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, Exception exception, long creationTimeMillis) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with a minor rename suggestion
Thanks Keith
|
||
public RemoteHistoryOrException { | ||
if (remoteHistory != null && exception != null) { | ||
throw new IllegalArgumentException("Remote history and exception cannot both be non-null"); | ||
} | ||
} | ||
|
||
RemoteHistoryOrException(List<DiscoveryNode> remoteHistory) { | ||
this(remoteHistory, null); | ||
RemoteHistoryOrException(List<DiscoveryNode> remoteHistory, long creationTimestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename the creationTimestamp to include the unit everywhere? ( https://github.com/elastic/elasticsearch/pull/86936/files#r880273628 )
@elasticmachine run elasticsearch-ci/part-1 |
@elasticmachine run elasticsearch-ci/part-2 |
This change makes it so that the remote master history that has been fetched will time out after 5 minutes (by default). The MasterHistoryService will return its value as
null
if it is older than the configured timeout. This way MasterHistoryService clients don't report status based on out of date information.