Add cluster applier stats #77552
Conversation
Keep track of the time spent in each registered cluster state applier and listener. Adjust the discovery stats to include this time spent, broken down by individual registered appliers and listeners.
Thanks @martijnvg, looks like the right sort of thing. I left a few thoughts.
```java
void removeListener(String name) {
    timeSpentPerListener.remove(name);
}
```
It'd be good to keep track of the accumulated time spent on removed listeners somewhere, just so we can see a bit better what's happening if there's a slow listener/applier which gets added and removed a lot.
So you mean an entry in the stats that represents time spent for all listeners that were removed?
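A minimal sketch of that idea (illustrative only — the `"removed listeners"` key and the `CounterMetric` usage are assumptions, not the PR's final code):

```java
// Hypothetical: fold the accumulated time of a removed listener into one
// shared bucket so it isn't lost when the per-listener entry is dropped.
private static final String REMOVED_LISTENERS_KEY = "removed listeners";

void removeListener(String name) {
    CounterMetric removed = timeSpentPerListener.remove(name);
    if (removed != null && removed.count() > 0) {
        timeSpentPerListener.computeIfAbsent(REMOVED_LISTENERS_KEY, k -> new CounterMetric())
            .inc(removed.count());
    }
}
```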
```java
// Most listeners are added when the node is initialized, however a few
// are added during the lifetime of a node and for these cases we need ConcurrentHashMap
private final Map<String, CounterMetric> timeSpentPerApplier = new ConcurrentHashMap<>();
```
IMO it'd be good to track how often each entry was called as well, since they do get added and removed.
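A sketch of how that could look, assuming Elasticsearch's `MeanMetric`, which tracks an invocation count alongside a running sum:

```java
// Each inc(value) call bumps the count by one and adds `value` to the sum,
// so every entry records both how often it ran and the total time spent.
private final Map<String, MeanMetric> timeSpentPerApplier = new ConcurrentHashMap<>();

void record(String name, long elapsedMillis) {
    timeSpentPerApplier.computeIfAbsent(name, k -> new MeanMetric()).inc(elapsedMillis);
}
```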
```java
Stats getStats() {
    return new Stats(
```
I would prefer it if the stats we returned were from a completed update, so they're all consistent with each other. No need for anything fancy tho, the updates are all on the applier thread and we don't get stats often, so we can just do something simple with `synchronized`, I think.

That said, it's a bit more complicated to return consistent stats if we let other threads add and remove the trackers. Could we instead add trackers as needed during cluster state application, and remove any that didn't execute at the end of a successful update?
> No need for anything fancy tho, the updates are all on the applier thread and we don't get stats often so we can just do something simple with `synchronized`, I think.

👍

> Could we instead add trackers as needed during cluster state application, and remove any that didn't execute at the end of a successful update?

👍
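A rough sketch of that combination — one synchronized bulk update after each application that also prunes entries which didn't run. Method names and the `Tuple` recording format are assumptions; `MeanMetric` and `Tuple` are Elasticsearch utility classes, and the `Stats` constructor is hypothetical:

```java
// Mutation happens once per cluster state application; readers go through the
// same monitor, so a plain HashMap suffices and snapshots are always consistent.
private final Map<String, MeanMetric> recordedActions = new HashMap<>();

synchronized void updateStats(List<Tuple<String, Long>> recordings) {
    Set<String> seen = new HashSet<>();
    for (Tuple<String, Long> recording : recordings) {
        seen.add(recording.v1());
        recordedActions.computeIfAbsent(recording.v1(), k -> new MeanMetric()).inc(recording.v2());
    }
    // remove trackers for actions that did not execute during this update
    recordedActions.keySet().removeIf(name -> seen.contains(name) == false);
}

synchronized Stats getStats() {
    return new Stats(Map.copyOf(recordedActions)); // hypothetical Stats constructor
}
```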
```diff
@@ -527,8 +543,11 @@ private void callClusterStateListener(ClusterChangedEvent clusterChangedEvent, S
 for (ClusterStateListener listener : listeners) {
     try {
         logger.trace("calling [{}] with change to version [{}]", listener, clusterChangedEvent.state().version());
-        try (Releasable ignored = stopWatch.timing("notifying listener [" + listener + "]")) {
+        final String name = listener.toString();
+        try (Releasable ignored = stopWatch.timing("notifying listener [" + name + "]")) {
```
Could we stop using the stopwatch altogether and move the timing into the new tracker? This'd also let us track the other things that aren't listeners/appliers (e.g. connecting to new nodes).
Yes, that makes sense. But it does mean that slow appliers/listeners will not show up in the slow cluster state applier warning log. I think that is ok, since slow appliers/listeners are now visible via the node stats API.
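For a sense of what replacing the `StopWatch` might look like, here is a stripped-down, illustrative `Recorder` (not the PR's exact code) that times named, non-overlapping actions on the applier thread; `Releasable`, `Tuple`, and `TimeValue` are Elasticsearch utility types:

```java
// Hands out a Releasable per named action; closing it records the elapsed time.
static final class Recorder {

    private final List<Tuple<String, Long>> recordings = new ArrayList<>(); // (action name, millis)
    private String currentAction;
    private long startMillis;
    private boolean recording;

    Releasable record(String action) {
        assert recording == false : "already recording [" + currentAction + "]";
        recording = true;
        currentAction = action;
        startMillis = TimeValue.nsecToMSec(System.nanoTime());
        return this::stop;
    }

    private void stop() {
        recordings.add(new Tuple<>(currentAction, TimeValue.nsecToMSec(System.nanoTime()) - startMillis));
        recording = false;
        currentAction = null;
    }

    List<Tuple<String, Long>> getRecordings() {
        return recordings;
    }
}
```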
…avoid confusion with StopWatch.

- Replaced the usage of StopWatch with a Recorder (new class).
- Also keep track of executions other than applying listeners/appliers.
- Bulk-update stats after cluster state application.
- Remove stat entries for actions that were not executed after the bulk update.
@DaveCTurner I've updated the PR and addressed the last 3 comments. Would you like to take another look and see whether the latest commits address these comments? With the latest changes, the node stats output looks like this:

```json
{
"_nodes": {
"total": 1,
"successful": 1,
"failed": 0
},
"cluster_name": "runTask",
"nodes": {
"v1xBnVkoTjSwTIEthcMiPg": {
"timestamp": 1631609769074,
"name": "runTask-0",
"transport_address": "127.0.0.1:9300",
"host": "127.0.0.1",
"ip": "127.0.0.1:9300",
"roles": [
"data",
"data_cold",
"data_content",
"data_frozen",
"data_hot",
"data_warm",
"ingest",
"master",
"ml",
"remote_cluster_client",
"transform"
],
"attributes": {
"ml.machine_memory": "34359738368",
"xpack.installed": "true",
"testattr": "test",
"ml.max_open_jobs": "512",
"ml.max_jvm_size": "536870912"
},
"discovery": {
"cluster_state_queue": {
"total": 0,
"pending": 0,
"committed": 0
},
"published_cluster_states": {
"full_states": 2,
"incompatible_diffs": 0,
"compatible_diffs": 42
},
"cluster_state_update": {
"unchanged": {
"count": 39,
"computation_time_millis": 10,
"notification_time_millis": 1
},
"success": {
"count": 44,
"computation_time_millis": 419,
"publication_time_millis": 4916,
"context_construction_time_millis": 26,
"commit_time_millis": 4087,
"completion_time_millis": 4187,
"master_apply_time_millis": 670,
"notification_time_millis": 7
},
"failure": {
"count": 0,
"computation_time_millis": 0,
"publication_time_millis": 0,
"context_construction_time_millis": 0,
"commit_time_millis": 0,
"completion_time_millis": 0,
"master_apply_time_millis": 0,
"notification_time_millis": 0
}
},
"cluster_applier_stats": {
"recordings": [
{
"name": "org.elasticsearch.xpack.ml.MlIndexTemplateRegistry@10191b7e",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 168
},
{
"name": "org.elasticsearch.indices.cluster.IndicesClusterStateService@3182d3f4",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 120
},
{
"name": "org.elasticsearch.xpack.stack.StackTemplateRegistry@6980990",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 35
},
{
"name": "org.elasticsearch.xpack.watcher.support.WatcherIndexTemplateRegistry@6738ef17",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 30
},
{
"name": "org.elasticsearch.license.LicenseService@656089d5",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 15
},
{
"name": "org.elasticsearch.cluster.InternalClusterInfoService@7ca9f2de",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 12
},
{
"name": "org.elasticsearch.xpack.ilm.IndexLifecycleService@478e3072",
"cumulative_execution_count": 88,
"cumulative_execution_time_millis": 8
},
{
"name": "org.elasticsearch.xpack.ilm.history.ILMHistoryTemplateRegistry@4b8cb07c",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 8
},
{
"name": "org.elasticsearch.snapshots.SnapshotsService@1d0b54",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 4
},
{
"name": "org.elasticsearch.xpack.ml.MlAssignmentNotifier@dc2dfc8",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 3
},
{
"name": "org.elasticsearch.xpack.core.slm.history.SnapshotLifecycleTemplateRegistry@6becae4d",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 3
},
{
"name": "connecting to new nodes",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 2
},
{
"name": "org.elasticsearch.xpack.deprecation.logging.DeprecationIndexingTemplateRegistry@e801088",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 2
},
{
"name": "org.elasticsearch.xpack.slm.SnapshotRetentionService@4fe86fcc",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 2
},
{
"name": "org.elasticsearch.xpack.security.support.SecurityIndexManager@3463fd38",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 2
},
{
"name": "org.elasticsearch.indices.SystemIndexManager@7b6ea2b",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 1
},
{
"name": "org.elasticsearch.xpack.ml.MlInitializationService@240c6c4c",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 1
},
{
"name": "org.elasticsearch.shutdown.PluginShutdownService@5b60b709",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 1
},
{
"name": "org.elasticsearch.xpack.ml.MlAutoUpdateService@61e4fe68",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 1
},
{
"name": "org.elasticsearch.xpack.ml.notifications.AbstractMlAuditor$$Lambda$1452/0x0000000801122928@7a12870e",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 1
},
{
"name": "org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$4886/0x0000000801786210@1f5f9b97",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager@2356a61a",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.slm.SnapshotLifecycleService@67c27b15",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.inference.allocation.TrainedModelAllocationClusterService@228697d4",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.watcher.WatcherLifeCycleService@2c3402df",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.gateway.GatewayService@6c7d4a39",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.core.async.AsyncTaskMaintenanceService@d32546b",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.transform.notifications.TransformAuditor$$Lambda$3214/0x00000008014168a8@7111cf50",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.snapshots.RestoreService@6f52781a",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.enrich.EnrichPolicyMaintenanceService@34fbcfef",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.notifications.AbstractMlAuditor$$Lambda$1452/0x0000000801122928@6ea2e157",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.transform.TransformClusterStateListener@6adce77f",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.MlUpgradeModeActionFilter$$Lambda$3028/0x0000000801307738@7a4f3dbd",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "running task [Publication{term=1, version=44}]",
"cumulative_execution_count": 1,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.watcher.WatcherIndexingListener@4299514b",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.notifications.AbstractMlAuditor$$Lambda$1452/0x0000000801122928@49f843f1",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshots$RepositoryUuidWatcher@af07d91",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.job.task.OpenJobPersistentTasksExecutor$$Lambda$4159/0x0000000801592b68@74556e7d",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.persistent.PersistentTasksClusterService@793999ff",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.fleet.FleetTemplateRegistry@45e07ff6",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.persistent.PersistentTasksNodeService@7a158663",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.autoscaling.MlAutoscalingDeciderService@479a2453",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ccr.action.ShardFollowTaskCleaner@36a2a774",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.indices.store.IndicesStore@19f0938",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.ingest.IngestService@48e80b77",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.security.authc.TokenService$$Lambda$3229/0x000000080141eac0@3d008c47",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.shutdown.NodeSeenService@718bcb05",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.node.ResponseCollectorService@a04029e",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.action.TransportStartDataFrameAnalyticsAction$TaskExecutor$$Lambda$4160/0x0000000801593980@5eef929b",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.process.MlMemoryTracker@55c360e6",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.datafeed.DatafeedRunner$TaskRunner@2062b80f",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$$Lambda$3337/0x0000000801446100@7b12d6fe",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.tasks.TaskManager@637db78f",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.inference.TrainedModelStatsService$$Lambda$3187/0x0000000801405160@2b0e8039",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.notifications.AbstractMlAuditor$$Lambda$1452/0x0000000801122928@6f0ba1ac",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.indices.TimestampFieldMapperService@63b1a2ce",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.inference.allocation.TrainedModelAllocationNodeService@74655368",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.deprecation.logging.DeprecationIndexingComponent@77664245",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService@297446b8",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.repositories.RepositoriesService@2954c3d",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.cluster.metadata.TemplateUpgradeService@4c5dcb2c",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.security.support.SecurityIndexManager@30ea723a",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.action.ingest.IngestActionForwarder@29e42379",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.indices.recovery.PeerRecoverySourceService@2c78b767",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.script.ScriptService@384b34df",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.job.snapshot.upgrader.SnapshotUpgradeTaskExecutor$$Lambda$4161/0x0000000801658318@26402dcc",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.snapshots.SnapshotShardsService@39d12594",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ccr.action.AutoFollowCoordinator@6db0d69d",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.notifications.AbstractMlAuditor$$Lambda$1452/0x0000000801122928@5673db34",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.snapshots.InternalSnapshotsInfoService@1e5eb6e",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.cluster.routing.DelayedAllocationService@6cd08253",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.notifications.AbstractMlAuditor$$Lambda$1452/0x0000000801122928@4a593dd2",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.cluster.metadata.SystemIndexMetadataUpgradeService@96cc42e",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.autoscaling.capacity.memory.AutoscalingMemoryInfoService$$Lambda$3026/0x00000008012cae08@4a05e8b",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
},
{
"name": "org.elasticsearch.xpack.ml.utils.persistence.ResultsPersisterService$$Lambda$3032/0x0000000801398c80@1162635b",
"cumulative_execution_count": 44,
"cumulative_execution_time_millis": 0
}
]
}
}
}
}
}
```
Thanks @martijnvg, this looks really nice. One question about the rendering of the applier/listener names: I'm especially worried that if we don't get consistent strings across nodes/clusters, it becomes hard to build nice monitoring/aggregations/queries on top of these.
```java
Collection<ClusterStateApplier> clusterStateAppliers) {
    for (ClusterStateApplier applier : clusterStateAppliers) {
        logger.trace("calling [{}] with change to version [{}]", applier, clusterChangedEvent.state().version());
        try (Releasable ignored = stopWatch.timing("running applier [" + applier + "]")) {
            final String name = applier.toString();
```
I wonder if we should render nicer names here ... the output as rendered JSON is a little weird with all the `$` in them, isn't it? (It's especially bothersome if the applier is a lambda and wouldn't even render to the same string on every machine.)
Ah sorry, this slipped right down my list, I need to get back to it soon.

We were planning to address the descriptions in a follow-up. I mean, sure, it's a bit weird to see the default `toString()`, but it'd still tell us where to look at least.
Yes, the plan is to add a unique name for each applier/listener in a follow-up PR. I should have added that as a comment to this PR.
```java
Collection<? extends ClusterStateListener> listeners) {
    for (ClusterStateListener listener : listeners) {
        try {
            logger.trace("calling [{}] with change to version [{}]", listener, clusterChangedEvent.state().version());
            try (Releasable ignored = stopWatch.timing("notifying listener [" + listener + "]")) {
                final String name = listener.toString();
```
Same here with the rendering, we have a couple of lambdas here I think.
Oh oops, I merged master for my local review but didn't mean to push it. You'll have to pull that commit, sorry.

No worries. This PR had to be updated with master anyway :)
Implementation looks good, I left a few tiny comments. Could we have some unit tests for `ClusterApplierRecordingService` and a few additions to `rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/nodes.stats/30_discovery.yml` to show that we really do emit these new stats in the API?
```java
private static void toXContentTimeSpent(XContentBuilder builder, Map<String, Recording> timeSpentPerListener) throws IOException {
    timeSpentPerListener.entrySet().stream()
        .sorted((o1, o2) -> -Long.compare(o1.getValue().sum, o2.getValue().sum))
```
I can never remember which way round this'll sort, but then I couldn't get the explicit alternative to work with type inference for some reason, so it looks like this. Yuck either way.

```diff
-        .sorted((o1, o2) -> -Long.compare(o1.getValue().sum, o2.getValue().sum))
+        .sorted(Comparator.<Map.Entry<String, Recording>>comparingLong(o -> o.getValue().sum).reversed())
```
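For what it's worth, the explicit form needs that type witness because Java's target typing doesn't propagate through the chained `.reversed()` call, so without it the compiler cannot infer the lambda parameter's type.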
```java
        builder.humanReadableField(name + "_time_millis", name + "_time", TimeValue.timeValueMillis(entry.getValue().sum));
        builder.endObject();
    } catch (IOException e) {
        throw new UncheckedIOException(e);
```
Don't think I like this, we don't generally check for `UncheckedIOException`. Could we just use a normal loop and let a regular `IOException` propagate?
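A minimal sketch of that alternative, assuming the `Recording` holder exposes `count` and `sum` fields (field names guessed from the output above):

```java
// Sort eagerly, then write with a plain loop so any IOException from the
// XContentBuilder propagates to the caller instead of being wrapped.
private static void toXContentTimeSpent(XContentBuilder builder, Map<String, Recording> timeSpentPerListener) throws IOException {
    List<Map.Entry<String, Recording>> entries = new ArrayList<>(timeSpentPerListener.entrySet());
    entries.sort(Comparator.comparingLong((Map.Entry<String, Recording> e) -> e.getValue().sum).reversed());
    for (Map.Entry<String, Recording> entry : entries) {
        builder.startObject();
        builder.field("name", entry.getKey());
        builder.field("cumulative_execution_count", entry.getValue().count);
        builder.humanReadableField(
            "cumulative_execution_time_millis",
            "cumulative_execution_time",
            TimeValue.timeValueMillis(entry.getValue().sum)
        );
        builder.endObject();
    }
}
```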
```java
this.recording = true;
this.currentAction = action;
this.startTimeNS = System.nanoTime();
```
I think we're ok with milliseconds, so we can use `ThreadPool#rawRelativeTimeInMillis`, which will let us simulate the passage of time in tests.
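A sketch of that change, with the clock injected as a `LongSupplier` so production code can pass `threadPool::rawRelativeTimeInMillis` and tests can pass a fake clock (the constructor wiring is an assumption):

```java
import java.util.function.LongSupplier;

final class Recorder {
    private final LongSupplier currentTimeMillis; // e.g. threadPool::rawRelativeTimeInMillis
    private String currentAction;
    private long startTimeMillis;

    Recorder(LongSupplier currentTimeMillis) {
        this.currentTimeMillis = currentTimeMillis;
    }

    void start(String action) {
        this.currentAction = action;
        this.startTimeMillis = currentTimeMillis.getAsLong(); // a fake supplier advances deterministically in tests
    }

    long stop() {
        return currentTimeMillis.getAsLong() - startTimeMillis;
    }
}
```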
```java
public final class ClusterApplierRecordingService {

    private final Map<String, MeanMetric> recordedActions = new ConcurrentHashMap<>();
```
All the access to this is `synchronized`, so I think we can just use a regular `HashMap`.

Pinging @elastic/es-data-management (Team:Data Management)
I'm mildly worried about adding this level of synchronization to the applier thread, but then again we do worse things on it, I suppose :)

LGTM if David is fine with it too :)
```java
private final Map<String, MeanMetric> recordedActions = new HashMap<>();

synchronized Stats getStats() {
```
This seems mildly scary, though I'm having a hard time guessing the impact. But technically, a barrage of stats requests would cause the CS applier thread to slow down (potentially considerably?)
My thinking is that this should complete in microseconds => we'd need implausibly many concurrent stats requests to cause meaningful starvation here.
True, I guess it also runs on the management pool with somewhat limited concurrency 🤞 :)
LGTM2
Backporting #77552 to the 7.x branch. Keep track of the time spent in each registered cluster state applier / listener and other cluster state applier activities. Also adjusted the discovery stats to include these time-spent stats, broken down by individual registered appliers / listeners and other cluster state applier activities. Also replaced the usage of `StopWatch` in `ClusterStateApplierService` with `ClusterApplierRecordingService.Recorder`. Relates to #77466
Keep track of the time spent in each registered cluster state applier and listener. Adjust the discovery stats to include this time spent, broken down by individual registered appliers and listeners.
Example usage
Request:
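The request itself isn't shown above; presumably it's the discovery metric of the node stats API, along the lines of:

```
GET /_nodes/stats/discovery
```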
Response: see the node stats example earlier in this conversation; the new stats appear under `discovery.cluster_applier_stats`.