
Monitoring indices fails #84041

Closed
HaZet1968 opened this issue Nov 23, 2020 · 11 comments

@HaZet1968

HaZet1968 commented Nov 23, 2020

Kibana version: 7.10

Elasticsearch version: 7.10

Server OS version: Debian Buster

Browser version: Firefox 83.0

Browser OS version: Ubuntu Xenial

Original install method (e.g. download page, yum, from source, etc.): apt

Describe the bug: Trying to view Management -> Stack Monitoring -> Indices fails and produces a lot of entries in the Elasticsearch logs. This started after switching from legacy monitoring to monitoring with Metricbeat (7.9.1).

Steps to reproduce:

  1. Switch Elasticsearch monitoring to Metricbeat
  2. Go to Management -> Stack Monitoring -> Indices
  3. Wait...

After some time I then get the log entries shown below in the Elasticsearch logs (on several nodes).

Expected behavior: Show an overview of the indices.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Output of one elasticsearch log:
[2020-11-23T07:40:17,219][WARN ][o.e.t.InboundHandler ] [node-2-a] handling inbound transport message [InboundMessage{Header{3377158}{7.10.0}{33130325}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [7750ms] which is above the warn threshold of [5000ms]
[2020-11-23T07:41:49,122][INFO ][o.e.m.j.JvmGcMonitorService] [node-2-a] [gc][243275] overhead, spent [339ms] collecting in the last [1s]
[2020-11-23T07:41:55,957][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-2-a] uncaught exception in thread [elasticsearch[node-2-a][search][T#393]]
org.elasticsearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks
at org.elasticsearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:522) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.tasks.TaskManager.registerChildNode(TaskManager.java:213) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.TransportAction.registerChildNode(TransportAction.java:56) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:86) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:75) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:412) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:545) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.TransportMultiSearchAction.executeSearch(TransportMultiSearchAction.java:149) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:172) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.TransportMultiSearchAction$1.onFailure(TransportMultiSearchAction.java:157) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:98) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.ActionListener$5.onFailure(ActionListener.java:258) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:594) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:568) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:100) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.0.jar:7.10.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.0.jar:7.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
last message repeated about 80-100 times...

Any additional context:
Sometimes the nodes then get disconnected from the cluster. One time I had to do a full restart of the cluster.
The cluster has three nodes, 972 indices, 1,944 shards, and 12 TB of data.

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@chrisronline
Contributor

I'd suggest opening this ticket in the Elasticsearch repo, as these errors are not Kibana-related.

@HaZet1968
Author

Opened a ticket in the Elasticsearch repo:
elastic/elasticsearch#65405

@mayya-sharipova

I would like to reopen this issue to clarify how Elasticsearch monitoring works on the Kibana side, as we have several issues filed in the ES repo (elastic/elasticsearch#65405, https://github.com/elastic/sdh-elasticsearch/issues/3681, https://github.com/elastic/sdh-elasticsearch/issues/3680) and also from external users.

I would appreciate clarification from the Kibana team or the Monitoring team on the following:

  • What kind of request is sent when a user accesses Elasticsearch > Indices in Stack Monitoring? Is it an async search on all monitoring indices? Is it an aggregation? It would be nice to get the full search request. It is possible that we are running a huge aggregation here, and maybe we need to redesign it, for example by reducing the maximum number of buckets in the aggregation (see the illustrative sketch after this list).

  • What is the body of the request .monitoring-es-6-*,.monitoring-es-7-*,metricbeat-*/_search that led to the OOM in https://github.com/elastic/sdh-elasticsearch/issues/3680?

  • What changes are new in the monitoring app in v7.10?
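
For concreteness, here is a purely illustrative sketch of the kind of aggregation-heavy request being asked about. This is not the actual Kibana request body (which is not shown in this thread); the endpoint, field names, and bucket sizes are assumptions.

# Hypothetical example only: endpoint, fields, and sizes are assumptions,
# not the request Kibana actually generates.
curl -s -XPOST 'http://localhost:9200/.monitoring-es-7-*,metricbeat-*/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-1h" } } },
  "aggs": {
    "indices": {
      "terms": { "field": "index_stats.index", "size": 10000 },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "timestamp", "fixed_interval": "30s" }
        }
      }
    }
  }
}'

A terms aggregation sized for thousands of indices, each with its own date_histogram, can produce hundreds of thousands of buckets, which is the kind of load the first question above is getting at.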

@chrisronline
Contributor

After further research, this is most likely related to #76015. In preparation for #73864, we are now reading from metricbeat-* indices in certain spots in the Stack Monitoring UI. This is a new change in 7.10.

My assumption here is that the size of the metricbeat-* indices is affecting the query.

@mayya-sharipova Is there any way to confirm or deny this with the existing use cases?
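
To help answer that, one quick way to gauge how much metricbeat-* data a cluster holds is the cat indices API (the local endpoint below is an assumption; add authentication as needed):

# Assumed local endpoint; lists metricbeat indices by on-disk size, largest first.
curl -s 'http://localhost:9200/_cat/indices/metricbeat-*?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc'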

@chrisronline
Contributor

There is something we can do on our side to help: migrate to server-side pagination, which should greatly improve the load time of this page. I opened #87159 to track this.

FWIW, we made this change for the ES nodes listing and Logstash pipelines listing pages and saw substantial improvements in load time.
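
As a rough illustration of the idea (not Kibana's actual implementation; the index pattern, timestamp field, and page size are assumptions), server-side pagination means asking Elasticsearch for one page of results at a time instead of fetching everything and paginating in the browser:

# Minimal sketch only: fetch a single 20-row page, sorted by timestamp.
curl -s -XPOST 'http://localhost:9200/.monitoring-es-7-*,metricbeat-*/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "from": 0,
  "size": 20,
  "sort": [ { "timestamp": "desc" } ],
  "query": { "range": { "timestamp": { "gte": "now-1h" } } }
}'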

@mayya-sharipova

@chrisronline Thank you for the update. For the OOM case in elastic/sdh-elasticsearch#3680, the request was for the index aem-project-bluesky-errorlog-2020.12.19-000014. My guess is that this is a customer index and not related to Kibana or Monitoring, right?

For https://github.com/elastic/sdh-elasticsearch/issues/3681, I have asked @jasonyoum to confirm the size of metricbeat-* indices.

@chrisronline
Contributor

@mayya-sharipova

My guess is that this is a customer index and not related to Kibana or Monitoring, right?

That's right. Can we also find out the approximate size of any metricbeat-* indices on that cluster? I'm imagining the query against metricbeat-* is adding a ton of pressure on the cluster and the error is manifesting in other search queries.

@mayya-sharipova

mayya-sharipova commented Jan 5, 2021

For the size of metricbeat indices in the cluster of https://github.com/elastic/sdh-elasticsearch/issues/3681, I am copying Jason's response here:

Based on the previously provided diagnostic result, the total shard count for metricbeat-* was 608, of which 196 shards were from the current version, 7.10. All of these shards are managed by ILM, which rolls over at "max_size": 50gb.

$ grep 'metricbeat-' local-diagnostics-20201222-081332/cat/cat_shards.txt  | wc -l
     608
$ grep 'metricbeat-7.10' local-diagnostics-20201222-081332/cat/cat_shards.txt  | wc -l
     196

The size of the metricbeat indices for the elastic/sdh-elasticsearch#3680 cluster is as follows:
there are 46 shards (both primary and replica) in the format metricbeat-20..., and each shard is approximately 10 GB.
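
For reference, the same numbers can be read directly from a running cluster with the cat shards API instead of a diagnostics bundle (local endpoint assumed):

# Lists each metricbeat shard with its on-disk size, largest first.
curl -s 'http://localhost:9200/_cat/shards/metricbeat-*?v&h=index,shard,prirep,store&s=store:desc'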

@chrisronline
Contributor

One thing we can have the customer do as a temporary measure, to test the root cause, is manually configure monitoring.ui.metricbeat.index to .monitoring-es-7-* in Kibana. Ask them to change that config, restart Kibana, and observe whether they are still experiencing the same performance issues.

I wouldn't recommend keeping the config around long-term because it will affect how the stack monitoring UI will work in the near future, but we can at least use that for testing for now.
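
For reference, a sketch of that temporary workaround on a package-based install (the kibana.yml path and service name are assumptions; revert the setting after testing):

# Point the monitoring UI at the legacy indices only, then restart Kibana.
# Paths and service name assume a Debian/Ubuntu package install.
echo 'monitoring.ui.metricbeat.index: ".monitoring-es-7-*"' | sudo tee -a /etc/kibana/kibana.yml
sudo systemctl restart kibana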

@jasonyoum

Thank you for all the updates @chrisronline @mayya-sharipova. The case was closed after the customer agreed to upgrade to the latest version for the fix. I am closing this issue. Thanks again!
