
Monitoring indices fails #84041

Closed
HaZet1968 opened this issue Nov 23, 2020 · 11 comments

@HaZet1968

HaZet1968 commented Nov 23, 2020

Kibana version: 7.10

Elasticsearch version: 7.10

Server OS version: Debian Buster

Browser version: Firefox 83.0

Browser OS version: Ubuntu Xenial

Original install method (e.g. download page, yum, from source, etc.): apt

Describe the bug: Trying to view Management -> Stack Monitoring -> Indices fails and produces a lot of entries in the Elasticsearch logs. This started after switching from legacy monitoring to monitoring with Metricbeat (7.9.1).

Steps to reproduce:

  1. Switch Elasticsearch monitoring to Metricbeat
  2. Go to Management -> Stack Monitoring -> Indices
  3. Wait...

After some time I then get the log entries shown below in the Elasticsearch logs (on several nodes).

Expected behavior: Show an overview of the indices.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Output of one elasticsearch log:
[2020-11-23T07:40:17,219][WARN ][o.e.t.InboundHandler ] [node-2-a] handling inbound transport message [InboundMessage{Header{3377158}{7.10.0}{33130325}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [7750ms] which is above the warn threshold of [5000ms]
[2020-11-23T07:41:49,122][INFO ][o.e.m.j.JvmGcMonitorService] [node-2-a] [gc][243275] overhead, spent [339ms] collecting in the last [1s]
[2020-11-23T07:41:55,957][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-2-a] uncaught exception in thread [elasticsearch[node-2-a][search][T#393]]
org.elasticsearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks
at org.elasticsearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:522) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.tasks.TaskManager.registerChildNode(TaskManager.java:213) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.TransportAction.registerChildNode(TransportAction.java:56) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:86) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:75) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:412) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:545) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.TransportMultiSearchAction.executeSearch(TransportMultiSearchAction.java:149) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:172) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.TransportMultiSearchAction$1.onFailure(TransportMultiSearchAction.java:157) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:98) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.ActionListener$5.onFailure(ActionListener.java:258) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:594) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:568) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:100) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.0.jar:7.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
last message repeated about 80-100 times...

Any additional context:
Sometimes the nodes then get disconnected from the cluster. One time I had to do a full restart of the cluster.
The cluster has three nodes, 972 indices, 1,944 shards, and 12 TB of data.

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@chrisronline
Contributor

I'd suggest opening this ticket in the Elasticsearch repo, as these errors are not Kibana-related.

@HaZet1968
Author

Opened a ticket in the Elasticsearch repo:
elastic/elasticsearch#65405

@mayya-sharipova

I would like to reopen this issue to clarify how Elasticsearch monitoring works on the Kibana side, as we have several issues filed in the ES repo (elastic/elasticsearch#65405, https://github.com/elastic/sdh-elasticsearch/issues/3681, https://github.com/elastic/sdh-elasticsearch/issues/3680) and also from external users.

I would appreciate clarification from the Kibana team or the Monitoring team on the following:

  • What kind of request is sent when a user accesses Elasticsearch > Indices in Stack Monitoring? Is it an async search on all monitoring indices? Is it an aggregation? It would be nice to get the full search request. It is possible that we are running a huge aggregation here, and maybe we need to redesign it, for example by reducing the maximum number of buckets in the aggregation (see the illustrative sketch after this list).

  • What is the body of the request .monitoring-es-6-*,.monitoring-es-7-*,metricbeat-*/_search that led to the OOM in https://github.com/elastic/sdh-elasticsearch/issues/3680?

  • What changes are new in the monitoring app in v7.10?
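
For concreteness, here is a purely illustrative sketch of the kind of aggregation-heavy request being asked about. This is not the actual Kibana request body (which is not shown in this thread); the endpoint, field names, and bucket sizes are assumptions.

# Hypothetical example only: endpoint, fields, and sizes are assumptions,
# not the request Kibana actually generates.
curl -s -XPOST 'http://localhost:9200/.monitoring-es-7-*,metricbeat-*/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-1h" } } },
  "aggs": {
    "indices": {
      "terms": { "field": "index_stats.index", "size": 10000 },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "timestamp", "fixed_interval": "30s" }
        }
      }
    }
  }
}'

A terms aggregation sized for thousands of indices, each with its own date_histogram, can produce hundreds of thousands of buckets, which is the kind of load the first question above is getting at.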

@chrisronline
Contributor

After further research, this is most likely related to #76015. In preparation for #73864, we are now reading from metricbeat-* indices in certain spots in the Stack Monitoring UI. This is a new change in 7.10.

My assumption here is that the size of the metricbeat-* indices is affecting the query.

@mayya-sharipova Is there any way to confirm or deny this with the existing use cases?
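
To help answer that, one quick way to gauge how much metricbeat-* data a cluster holds is the cat indices API (the local endpoint below is an assumption; add authentication as needed):

# Assumed local endpoint; lists metricbeat indices by on-disk size, largest first.
curl -s 'http://localhost:9200/_cat/indices/metricbeat-*?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc'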

@chrisronline
Contributor

There is something we can do on our side to help: migrate to server-side pagination, which should greatly improve the load time of this page. I opened #87159 to track this.

FWIW, we made this change for the ES nodes listing and Logstash pipelines listing pages and saw substantial improvements in load time.
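
As a rough illustration of the idea (not Kibana's actual implementation; the index pattern, timestamp field, and page size are assumptions), server-side pagination means asking Elasticsearch for one page of results at a time instead of fetching everything and paginating in the browser:

# Minimal sketch only: fetch a single 20-row page, sorted by timestamp.
curl -s -XPOST 'http://localhost:9200/.monitoring-es-7-*,metricbeat-*/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "from": 0,
  "size": 20,
  "sort": [ { "timestamp": "desc" } ],
  "query": { "range": { "timestamp": { "gte": "now-1h" } } }
}'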

@mayya-sharipova

@chrisronline Thank you for the update. For the OOM case in elastic/sdh-elasticsearch#3680, the request was for the index aem-project-bluesky-errorlog-2020.12.19-000014. My guess is that this is a customer index and not related to Kibana or Monitoring, right?

For https://github.com/elastic/sdh-elasticsearch/issues/3681, I have asked @jasonyoum to confirm the size of metricbeat-* indices.

@chrisronline
Contributor

@mayya-sharipova

My guess is that this is a customer index and not related to Kibana or Monitoring, right?

That's right. Can we also find out the approximate size of any metricbeat-* indices on that cluster? I'm imagining the query against metricbeat-* is adding a ton of pressure on the cluster and the error is manifesting in other search queries.

@mayya-sharipova

mayya-sharipova commented Jan 5, 2021

For the size of metricbeat indices in the cluster of https://github.com/elastic/sdh-elasticsearch/issues/3681, I am copying Jason's response here:

Based on the previously provided diagnostic result, the total shard count for metricbeat-* was 608, of which 196 shards were from the current version, 7.10. All of these shards are managed by ILM, which rolls over at "max_size": 50gb.

$ grep 'metricbeat-' local-diagnostics-20201222-081332/cat/cat_shards.txt  | wc -l
     608
$ grep 'metricbeat-7.10' local-diagnostics-20201222-081332/cat/cat_shards.txt  | wc -l
     196

The size of the metricbeat indices for the elastic/sdh-elasticsearch#3680 cluster is as follows:
there are 46 shards (both primary and replica) in the format metricbeat-20..., and each shard is approximately 10 GB.
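
For reference, the same numbers can be read directly from a running cluster with the cat shards API instead of a diagnostics bundle (local endpoint assumed):

# Lists each metricbeat shard with its on-disk size, largest first.
curl -s 'http://localhost:9200/_cat/shards/metricbeat-*?v&h=index,shard,prirep,store&s=store:desc'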

@chrisronline
Contributor

One thing we can have the customer do as a temporary measure, to test the root cause, is manually configure monitoring.ui.metricbeat.index to .monitoring-es-7-* in Kibana. Ask them to change that config, restart Kibana, and observe whether they are still experiencing the same performance issues.

I wouldn't recommend keeping the config around long-term because it will affect how the stack monitoring UI will work in the near future, but we can at least use that for testing for now.
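
For reference, a sketch of that temporary workaround on a package-based install (the kibana.yml path and service name are assumptions; revert the setting after testing):

# Point the monitoring UI at the legacy indices only, then restart Kibana.
# Paths and service name assume a Debian/Ubuntu package install.
echo 'monitoring.ui.metricbeat.index: ".monitoring-es-7-*"' | sudo tee -a /etc/kibana/kibana.yml
sudo systemctl restart kibana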

@jasonyoum

Thank you for all the updates @chrisronline @mayya-sharipova. The case was closed after the customer agreed to upgrade to the latest version for the fix. I am closing this issue. Thanks again!
